IBM ShopIBM Support Download
HomeNewsProductsservicesSolutionsAbout IBM

Search 

IBM : Developer : Java™ overview : Library - papers


Building High-Performance Applications and Servers in Java: An Experiential Study

Sandeep K. Singhal
IBM T.J. Watson Research Center
Binh Q. Nguyen, Richard Redpath, Michael Fraenkel, Jimmy Nguyen
IBM Software Solutions
July, 1997




Table of Contents

Introduction
Memory Access Patterns
Excessive Synchronization
Excessive Object Creation and Garbage Collection
Excessive Error Checking
General-Purpose Class Library
Conclusion and Future Work
Acknowledgements
References

Introduction

Java is the programming language of choice for dynamic Internet content because of its portability, ease-of-use, and integration with HTML. Java is being used to enable animation on Web pages, to dynamically select and format Web page content at Web servers, and to provide client-side user input checking as a front-end to transaction-oriented applications. Java has also emerged as the language of choice for scripting 3-D motion in Virtual Reality Modeling Language (VRML) [HW96] worlds and for enabling multi-user interactions within these virtual environments [VRML97].

Despite these established uses for Java, the language has rarely been used to develop applications and servers that demand high performance or throughput. A primary reason cited by developers for not using Java in these situations is a perceived inability to write Java code that executes at speeds competitive with traditional compiled languages such as C or C++. Limited experience with garbage collection and JIT (Just-In-Time) compilation in large-scale applications also limit developers' comfort with Java performance.

We have successfully overcome many of these concerns by employing a combination of careful program design and targeted optimization. In this paper, we describe techniques for developing high-throughput applications and servers in Java. We show that Java applications can be easily optimized to achieve performance that is comparable to code written in traditional languages such as C or C++. We base these results on our experience in developing the InVerse (Interactive Universe) server [S+97] in Java.

Our measurements were generated on an unloaded 150MHz Pentium processor machine with 48MB of memory and running Windows 95. We used the Java 1.1.4 compiler and virtual machine provided by Sun with the -O flag to enable optimization. Timings were generated by executing the target code block in a tight loop and by using the System.currentTimeMillis() call to capture (wall clock) timing information. We also executed our tests using a Just-In-Time (JIT) code generator that translates the Java byte codes to native instructions at run time.

The next several sections describe the primary performance issues faced by the Java application developer, along with approaches for optimizing those performance problems: memory access patterns, excessive synchronization, excessive object creation and garbage collection, excessive error checking, and reliance on a general-purpose class library. Although many of these issues are applicable to any object-oriented programming language, they are particularly germane to Java developers. We conclude by describing how optimizing these areas improved the overall performance of the InVerse server application and discussing our future work in this area.

Memory Access Patterns

With the increasing divergence between processor speeds and memory access times, high-performance application developers are acutely aware of the need to minimize memory access. In compiled languages such as C++, the access time to all variables is fairly uniform because their memory locations are pre-computed, or, in the worst case, accessed through one level of indirection. However, Java application performance is noticeably affected by what types of variables are accessed and how they are accessed. For example, while stack variables are directly addressable (and may even be placed in registers), instance variables typically require an extra level of indirection to be accessed.

Table 1: (Normalized) Access Times for Different Variable Types and Modes
  Read Write
  Without JIT
Local method variable
Object instance variable
Superclass instance var.
Supersuperclass inst. var.
Class static variable
1.00
1.92
1.96
1.83
1.23
1.00
2.14
2.14
2.14
1.52
  With JIT
Local method variable
Object instance variable
Superclass instance var.
Supersuperclass inst. var.
Class static variable
1.00
1.64
1.73
1.73
1.23
1.00
1.99
2.14
2.21
1.62

Table 1 illustrates the overhead for accessing local method (stack) variables, instance variables within the current class and superclasses, and static class variables. The table shows that reading and writing local method variables is much faster than any other type of variable, and accessing static variables is faster than accessing instance variables. With JIT, the difference is even more significant. Other access times vary too little to indicate any conclusive pattern.

This memory access pattern implies the potential value of dynamic data location shifting, dynamically changing the storage location of data based on the access patterns.

For example, a data-intensive operation would benefit from first copying instance variables into stack variables, operating on the stack variables, and, finally, copying the stack variables back to the permanent instance variables.

This technique is particularly useful when a method variable is accessed repeatedly within a loop. For example, the common loop construct:

for (int i=0; ++i<=limit; )

can be improved by 25% (5% with the JIT compiler) by rewriting it as

for (int i=limit; --i>=0; )

to reduce the number of accesses to the limit variable. A similar result was reported by [K97].

Excessive Synchronization

In Java, methods and code blocks may be marked with the synchronized keyword. Multiple threads cannot simultaneously execute within any of the synchronized methods on the same class instance, and a synchronized code block is treated like a synchronized method on the object that is provided as an argument. To support this capability, the Java virtual machine links a monitor to each object having synchronized methods. Whenever a thread enters a synchronized block, it must first obtain a lock on the monitor for the associated object.

Table 2: Method Call Time (in Microseconds)With Various Combinations of Synchronization and Thread Contention
  Unsynch. Synch.
  Without JIT
No thread contention
Thread contention
0.54
0.55
3.96
8.87
  With JIT
No thread contention
Thread contention
0.50
0.57
5.55
11.00

As shown in Table 2, synchronized methods can slow application execution by up to a factor of 20. The table shows the time required to call an empty method with various combinations of method synchronization and thread contention. In the case of thread contention, two threads are each simultaneously accessing the object's method in a tight loop.

The table reveals that adding synchronization to a method degrades performance by a factor of between 8 and 11 even if the application is only executing a single thread (i.e. the thread could never block on the monitor). This result indicates that the synchronization overhead is primarily incurred in checking for the lock availability, rather than in actually obtaining the lock.

Notably, the JIT compiler cannot improve the lock access and contention overhead at all because it has no information about the application's structure. Indeed, the performance overhead of the JIT compiler actually slows the program execution. In contrast, the JIT compiler improves performance slightly when the calls are unsynchronized.

Synchronization impacts application performance in subtle ways. For example, the memory allocator is synchronized, meaning that object creation incurs a synchronization penalty.

Many common functions in the Java class library are designed to be thread-safe and are, therefore, synchronized. For example, accessing an indexed element in a Vector requires a synchronized method call, as does calling the nextItem() on an associated Enumeration. In our experience, many collections are only accessed by a single thread, so this synchronization imposes considerable overhead for little benefit. To allow the application to selectively disable this synchronization, for example, we have subclassed the Java Vector class to provide an unsynchronized access method. Similar issues arise in the I/O stream library, where 86% of the overhead of writing formatted data through a DataOutputStream is attributable to the underlying synchronized calls to OutputStream methods. Where we could not eliminate synchronization, we occasionally improved performance by placing a series of synchronized calls inside a single synchronization block. For example, we can replace

for (i=0; ibr>
  x.f(); // synchronized on x

with

synchronized(x) {
  for (i=0; ibr>
    x.f(); // synchronized on x }

The replacement code allows the Java virtual machine to short-circuit the method synchronization. It should only be used if the synchronization is short-lived (because it will otherwise block other threads) or if no thread contention exists. However, Java's caching of locks causes this code to exhibit variable performance.

Excessive Object Creation and Garbage Collection

Armed with Java's built-in garbage collection, programmers find themselves free to no longer worry about how memory for their objects is allocated and whether it is explicitly released. Although it simplifies the task of object-oriented development, garbage collection can also cause significant performance problems. Programmers tend to create objects liberally because they perceive that they are relatively cheap to create.1 In effect, the Java virtual machine masks the costs of memory allocation and deallocation for objects.

Object creation and destruction in Java is extremely expensive. Table 3 shows the execution time for a variety of object creation operations compared with the expense of extracting an item from a synchronized array, as might be done within a free object pool. The table reveals that object creation is considerably more expensive than the alternative of re-using existing objects. The difference is a factor of ten more significant because, in our experience, many transient classes are instantiated at program points that do not require synchronization around the object repository.

Table 3: Execution Time to Create Java Objects (in Microseconds)
  No With
  JIT JIT
Create object (no constructor)
Create object (constructor)
Create subclass (no construct.)
Create subclass (base class constructors)
12.57
12.97
12.79
12.80
12.36
13.45
12.47
12.53
(Synchronized) remove element from array 4.06 3.84

On compute-intensive applications such as servers, object creation has significant effects. The garbage collector is supposed to only execute when no other application threads are executing, thus minimally impacting an application's performance. However, because high-performance applications may not offer such idle periods, garbage collector performance must inherently steal time from the application execution.

To reduce the number of objects created and destroyed, we have employed two techniques:

  • When an object instance is allocated repeatedly, (e.g. within a particular method), declare a static variable to store the object between successive method iterations. Within the method, call a reinitialize() method on the stored object so that it behaves as if newly allocated.
  • When multiple instances of the object are active simultaneously and their lifetimes cannot be localized to a single part of the application, use a Vector or growable array to implement a free instance repository. When an object is no longer needed, a static method adds it to the free object pool. Similarly, to obtain a new instance of the class, another static method releases an available instance from the repository and invokes the reinitialize() method on it. The constructors are declared private so that the application cannot accidentally allocate new instances directly.

Both techniques rely upon a reinitialize() method in the optimized class. On occasion, we have found the need to optimize Java library classes that do not supply this method by implementing subclassed versions.

Excessive Error Checking

The Java virtual machine provides native instructions to support exception handling [LY97], in contrast to C++ which relies on compiler-generated instructions to explicitly store exception information on the execution stack and to locate the appropriate exception handler when an exception is thrown. These native instructions mean that exception handling in Java is relatively fast. As shown in Table 4, entering a try-catch clause imposes negligible overhead. The cost of actually handling a thrown exception is non-trivial, but that expense is only incurred in the exceptional case.

Table 4: Overhead of Java Exception Handling (Microseconds)
  No JIT JIT
Function call
Function call in try-catch (no exception thrown)
Function call in try-catch (exception thrown)
0.2
0.2
191.7
0.2
0.2
186.2
If-test array index bounds
Array dereference
Array dereference throwing bounds exception
0.5
0.4
1613.2
0.5
0.4
1548.9
If-test using instanceof
Object type cast
Object type cast throwing casting exception
0.8
0.7
1581.9
0.7
0.7
1469.8

The near-zero cost of try-catch clauses suggests that we rely on exception handling instead of explicit error checking when errors are expected to be rare (i.e. the exception would rarely be thrown). Consider, for example, the task of indexing into an array:

if ((idx >=0) && (idx < array.length))
  x = array[idx];
else
 // error

An exception-oriented approach leads us to re-write the above code block in the following form:

try {
  x = array[idx]
}
catch (ArrayOutOfBoundsException e) {
  // error
}

The exception-oriented approach is 50% faster than the traditional approach in the (common) case that the index is within range. As long as the common case occurs sufficiently often relative to the exception condition, we achieve a performance improvement by eliminating the error check on each execution. This is particularly true in loops. Similar situations arise with type casting, using the instanceof operator versus catching ClassCastException.

The performance characteristics of try-catch clauses lead us to re-define how exceptions are used in high-performance applications. Instead of simply using exceptions to signal errors, we use exceptions to signal uncommon occurrences, even if those occurrences do not represent error conditions. This approach [C96] is consistent with "making the common case fast."

General-Purpose Class Library

The standard Java class library supports a broad variety of applications, ranging from simple Web applets to more complex systems. As will all software, however, this general-purpose software typically does not provide the best performance possible for specific uses. By recognizing opportunities for specializing the JDK's general constructs, applications can gain considerable performance benefits. These benefits arise from optimizing classes based on the application access patterns and optimizing classes based on their interactions with each other.

Certain methods in the Java libraries are optimized at the expense of others, based on the designers' assumptions about application usage. If the application predominantly calls the unoptimized methods, then performance can be improved by shifting the class implementation to suit the application. For example, we have written a Hashtable with a linked list through the inserted elements to support fast sequential access.

Like any good class library, the JDK relies heavily on interfaces that mask the underlying implementations. This approach provides modularity at the expense of overall library performance. For example, because classes must interact with each other through a general-purpose interface, they cannot take advantage of capabilities that are unique to their particular implementations. In cases where particular classes are interacting through such an interface, performance can be improved by specializing the interaction point. For example, we have merged ObjectOutputStream and ByteArrayOutputStream to provide fast, unsynchronized output to byte arrays.

Conclusion and Future Work

In developing our high-performance server, we have found the need to apply several specific optimizations that account for unique performance characteristics of the Java virtual machine and run-time system:

  • Eliminate synchronization in application code and by re-implementing standard Java classes
  • Develop customized versions of standard Java classes to match the access patterns exhibited by our application
  • Merge Java classes when interactions can be optimized
  • Selectively use exception handling to create a fast-path for common case execution
  • Reuse object instances where possible and eliminate creation of transient instances
  • Cache object variable references in stack variable references where possible
Table 9: Java Optimizations on InVerse Performance (Packets/Second Processing Throughput) and Comparison to Zero-Processing Capabilities
    Before After
InVerse pkts/sec processed:
  1 User
10 Users
100 Users
145
99
N/A
810
505
147
Base limits (Java)
  1 User
10 Users
100 Users
930
93
9
Base limits (C)
  1 User
10 Users
100 Users
1080
108
11

Although we had originally written the InVerse server to be quite efficient, these optimizations enabled us to further improve performance by between a factor of eight (for the least common cases) and a factor of 100 (for the common case code paths), as shown in Table 9. Based on our measurements, we expect that the JIT compiler will further enhance the impact of thse optimizations on the InVerse system performance.

Probably most important, these impressive performance improvements did not require significant changes to the application's basic design. We did not have to compromise the object-oriented structure of our application or destroy the encapsulation of its various classes and components. These optimizations typically involved localized code modifications. It was quite rewarding to find that relatively ,simplistic peephole optimization could have such significant impact. Table 9: Java Optimizations on InVerse Performance (Packets/Second Processing Throughput) and Comparison to Zero-Processing Capabilities.

These optimizations reveal that Java can be used to implement high-throughput servers and high-performance applications. For our server code, we are currently limited more by the throughput of our network hardware than the software processing. Indeed, as Table 9 indicates, InVerse can process packets faster than the network card can deliver and send data, even with a C++ receive loop. Ultimately, therefore, by applying the optimizations, we have effectively eliminated any negative impact of Java.

Although future versions of the Java virtual machine will likely affect our performance results, many of our conclusions cannot easily be optimized without human intervention. For example, static analysis of thread behavior is virtual non-existent even in well-established languages. Though we plan to re-evaluate our results over time, we feel confident that many of them will be of lasting value.

We are continuing to study the performance of Java to find ways for improving application performance. We are beginning to look at how function locality within class files may impact memory layout within the virtual machine and the consequent speed of program and data access. We are also experimenting with how identifier name lengths may impact the caching behavior of byte code instruction access and with additional techniques for improving performance by isolating common code paths from exception code. Finally, we are exploring the feasibility of incorporating these techniques into a performance analysis tool that might statically analyze application source code and determine opportunities for improving performance.

Java is widely regarded as the enabling technology for delivering dynamic content to users over the Internet. We foresee considerable advantages in employing acommon implementation platform for the content and for the servers that deliver that content. The potential advantages of Java-aware servers--including creation and delivery of Java objects to clients, elimination of cross-language calling constructs and Object Request Brokers, and integration with data stored on heterogeneous platforms--are exciting. Our performance analyses indicate that this single language environment can be a reality.


1 On the other hand, C++ programmers actively seek to minimize the costs of malloc() and free() calls on heap memory by maintaining customized memory heaps, replace the system operator new and delete calls to control placement of objects within those heaps, and maintain free object pools that allow objects to be reused without re-invoking a constructor.


Acknowledgements

We would like to thank James Clinton for providing feedback on an earlier draft of this paper.

References

[AG96] K. Arnold and J. Gosling. The Java Programming Language (Reading, Massachusetts: Addison Wesley), 1996.
[C96] D. Cheriton. Object-Oriented Programming from a Modeling and Simulation Perspective. To appear..
[HW96] J. Hartman and J. Wernecke. The VRML 2.0 Handbook: Building Moving Worlds on the Web (Reading, Massachusetts: Addison-Wesley), 1996.
[K97] Eugene Koznetsov. "Optimizing Performance Execution with Java." Java Report 2.6 (June 1997), 49-51.
[LY97] T. Lindholm and F. Yellin. The Java™ Virtual Machine Specification (Reading, Massachusetts: Addison-Wesley), 1997.
[S+97] S. Singhal, B. Nguyen, R. Redpath, J. Nguyen, and M. Fraenkel. "InVerse: Designing an Interactive Universe Architecture for Scalability and Extensbility."
[VRML97] VRML Architecture Group. "Living Worlds: Specification and Comments." February 1997.

Back to Abstract



PrivacyLegalContact---