|
|||
|
|
Table of Contents Introduction IntroductionJava is the programming language of choice for dynamic Internet content because of its portability, ease-of-use, and integration with HTML. Java is being used to enable animation on Web pages, to dynamically select and format Web page content at Web servers, and to provide client-side user input checking as a front-end to transaction-oriented applications. Java has also emerged as the language of choice for scripting 3-D motion in Virtual Reality Modeling Language (VRML) [HW96] worlds and for enabling multi-user interactions within these virtual environments [VRML97]. Despite these established uses for Java, the language has rarely been used to develop applications and servers that demand high performance or throughput. A primary reason cited by developers for not using Java in these situations is a perceived inability to write Java code that executes at speeds competitive with traditional compiled languages such as C or C++. Limited experience with garbage collection and JIT (Just-In-Time) compilation in large-scale applications also limit developers' comfort with Java performance. We have successfully overcome many of these concerns by employing a combination of careful program design and targeted optimization. In this paper, we describe techniques for developing high-throughput applications and servers in Java. We show that Java applications can be easily optimized to achieve performance that is comparable to code written in traditional languages such as C or C++. We base these results on our experience in developing the InVerse (Interactive Universe) server [S+97] in Java. Our measurements were generated on an unloaded 150MHz Pentium processor machine with 48MB of memory and running Windows 95. We used the Java 1.1.4 compiler and virtual machine provided by Sun with the -O flag to enable optimization. Timings were generated by executing the target code block in a tight loop and by using the System.currentTimeMillis() call to capture (wall clock) timing information. We also executed our tests using a Just-In-Time (JIT) code generator that translates the Java byte codes to native instructions at run time. The next several sections describe the primary performance issues faced by the Java application developer, along with approaches for optimizing those performance problems: memory access patterns, excessive synchronization, excessive object creation and garbage collection, excessive error checking, and reliance on a general-purpose class library. Although many of these issues are applicable to any object-oriented programming language, they are particularly germane to Java developers. We conclude by describing how optimizing these areas improved the overall performance of the InVerse server application and discussing our future work in this area. Memory Access PatternsWith the increasing divergence between processor speeds and memory access times, high-performance application developers are acutely aware of the need to minimize memory access. In compiled languages such as C++, the access time to all variables is fairly uniform because their memory locations are pre-computed, or, in the worst case, accessed through one level of indirection. However, Java application performance is noticeably affected by what types of variables are accessed and how they are accessed. For example, while stack variables are directly addressable (and may even be placed in registers), instance variables typically require an extra level of indirection to be accessed.
Table 1 illustrates the overhead for accessing local method (stack) variables, instance variables within the current class and superclasses, and static class variables. The table shows that reading and writing local method variables is much faster than any other type of variable, and accessing static variables is faster than accessing instance variables. With JIT, the difference is even more significant. Other access times vary too little to indicate any conclusive pattern. This memory access pattern implies the potential value of dynamic data location shifting, dynamically changing the storage location of data based on the access patterns. For example, a data-intensive operation would benefit from first copying instance variables into stack variables, operating on the stack variables, and, finally, copying the stack variables back to the permanent instance variables. This technique is particularly useful when a method variable is accessed repeatedly within a loop. For example, the common loop construct:
can be improved by 25% (5% with the JIT compiler) by rewriting it as
to reduce the number of accesses to the limit variable. A similar result was reported by [K97]. Excessive SynchronizationIn Java, methods and code blocks may be marked with the synchronized keyword. Multiple threads cannot simultaneously execute within any of the synchronized methods on the same class instance, and a synchronized code block is treated like a synchronized method on the object that is provided as an argument. To support this capability, the Java virtual machine links a monitor to each object having synchronized methods. Whenever a thread enters a synchronized block, it must first obtain a lock on the monitor for the associated object.
As shown in Table 2, synchronized methods can slow application execution by up to a factor of 20. The table shows the time required to call an empty method with various combinations of method synchronization and thread contention. In the case of thread contention, two threads are each simultaneously accessing the object's method in a tight loop. The table reveals that adding synchronization to a method degrades performance by a factor of between 8 and 11 even if the application is only executing a single thread (i.e. the thread could never block on the monitor). This result indicates that the synchronization overhead is primarily incurred in checking for the lock availability, rather than in actually obtaining the lock. Notably, the JIT compiler cannot improve the lock access and contention overhead at all because it has no information about the application's structure. Indeed, the performance overhead of the JIT compiler actually slows the program execution. In contrast, the JIT compiler improves performance slightly when the calls are unsynchronized. Synchronization impacts application performance in subtle ways. For example, the memory allocator is synchronized, meaning that object creation incurs a synchronization penalty. Many common functions in the Java class library are designed to be thread-safe and are, therefore, synchronized. For example, accessing an indexed element in a Vector requires a synchronized method call, as does calling the nextItem() on an associated Enumeration. In our experience, many collections are only accessed by a single thread, so this synchronization imposes considerable overhead for little benefit. To allow the application to selectively disable this synchronization, for example, we have subclassed the Java Vector class to provide an unsynchronized access method. Similar issues arise in the I/O stream library, where 86% of the overhead of writing formatted data through a DataOutputStream is attributable to the underlying synchronized calls to OutputStream methods. Where we could not eliminate synchronization, we occasionally improved performance by placing a series of synchronized calls inside a single synchronization block. For example, we can replace
with
The replacement code allows the Java virtual machine to short-circuit the method synchronization. It should only be used if the synchronization is short-lived (because it will otherwise block other threads) or if no thread contention exists. However, Java's caching of locks causes this code to exhibit variable performance. Excessive Object Creation and Garbage CollectionArmed with Java's built-in garbage collection, programmers find themselves free to no longer worry about how memory for their objects is allocated and whether it is explicitly released. Although it simplifies the task of object-oriented development, garbage collection can also cause significant performance problems. Programmers tend to create objects liberally because they perceive that they are relatively cheap to create.1 In effect, the Java virtual machine masks the costs of memory allocation and deallocation for objects. Object creation and destruction in Java is extremely expensive. Table 3 shows the execution time for a variety of object creation operations compared with the expense of extracting an item from a synchronized array, as might be done within a free object pool. The table reveals that object creation is considerably more expensive than the alternative of re-using existing objects. The difference is a factor of ten more significant because, in our experience, many transient classes are instantiated at program points that do not require synchronization around the object repository.
On compute-intensive applications such as servers, object creation has significant effects. The garbage collector is supposed to only execute when no other application threads are executing, thus minimally impacting an application's performance. However, because high-performance applications may not offer such idle periods, garbage collector performance must inherently steal time from the application execution. To reduce the number of objects created and destroyed, we have employed two techniques:
Both techniques rely upon a reinitialize() method in the optimized class. On occasion, we have found the need to optimize Java library classes that do not supply this method by implementing subclassed versions. Excessive Error CheckingThe Java virtual machine provides native instructions to support exception handling [LY97], in contrast to C++ which relies on compiler-generated instructions to explicitly store exception information on the execution stack and to locate the appropriate exception handler when an exception is thrown. These native instructions mean that exception handling in Java is relatively fast. As shown in Table 4, entering a try-catch clause imposes negligible overhead. The cost of actually handling a thrown exception is non-trivial, but that expense is only incurred in the exceptional case.
The near-zero cost of try-catch clauses suggests that we rely on exception handling instead of explicit error checking when errors are expected to be rare (i.e. the exception would rarely be thrown). Consider, for example, the task of indexing into an array:
An exception-oriented approach leads us to re-write the above code block in the following form:
The exception-oriented approach is 50% faster than the traditional approach in the (common) case that the index is within range. As long as the common case occurs sufficiently often relative to the exception condition, we achieve a performance improvement by eliminating the error check on each execution. This is particularly true in loops. Similar situations arise with type casting, using the instanceof operator versus catching ClassCastException. The performance characteristics of try-catch clauses lead us to re-define how exceptions are used in high-performance applications. Instead of simply using exceptions to signal errors, we use exceptions to signal uncommon occurrences, even if those occurrences do not represent error conditions. This approach [C96] is consistent with "making the common case fast." General-Purpose Class LibraryThe standard Java class library supports a broad variety of applications, ranging from simple Web applets to more complex systems. As will all software, however, this general-purpose software typically does not provide the best performance possible for specific uses. By recognizing opportunities for specializing the JDK's general constructs, applications can gain considerable performance benefits. These benefits arise from optimizing classes based on the application access patterns and optimizing classes based on their interactions with each other. Certain methods in the Java libraries are optimized at the expense of others, based on the designers' assumptions about application usage. If the application predominantly calls the unoptimized methods, then performance can be improved by shifting the class implementation to suit the application. For example, we have written a Hashtable with a linked list through the inserted elements to support fast sequential access. Like any good class library, the JDK relies heavily on interfaces that mask the underlying implementations. This approach provides modularity at the expense of overall library performance. For example, because classes must interact with each other through a general-purpose interface, they cannot take advantage of capabilities that are unique to their particular implementations. In cases where particular classes are interacting through such an interface, performance can be improved by specializing the interaction point. For example, we have merged ObjectOutputStream and ByteArrayOutputStream to provide fast, unsynchronized output to byte arrays. Conclusion and Future WorkIn developing our high-performance server, we have found the need to apply several specific optimizations that account for unique performance characteristics of the Java virtual machine and run-time system:
Although we had originally written the InVerse server to be quite efficient, these optimizations enabled us to further improve performance by between a factor of eight (for the least common cases) and a factor of 100 (for the common case code paths), as shown in Table 9. Based on our measurements, we expect that the JIT compiler will further enhance the impact of thse optimizations on the InVerse system performance. Probably most important, these impressive performance improvements did not require significant changes to the application's basic design. We did not have to compromise the object-oriented structure of our application or destroy the encapsulation of its various classes and components. These optimizations typically involved localized code modifications. It was quite rewarding to find that relatively ,simplistic peephole optimization could have such significant impact. Table 9: Java Optimizations on InVerse Performance (Packets/Second Processing Throughput) and Comparison to Zero-Processing Capabilities. These optimizations reveal that Java can be used to implement high-throughput servers and high-performance applications. For our server code, we are currently limited more by the throughput of our network hardware than the software processing. Indeed, as Table 9 indicates, InVerse can process packets faster than the network card can deliver and send data, even with a C++ receive loop. Ultimately, therefore, by applying the optimizations, we have effectively eliminated any negative impact of Java. Although future versions of the Java virtual machine will likely affect our performance results, many of our conclusions cannot easily be optimized without human intervention. For example, static analysis of thread behavior is virtual non-existent even in well-established languages. Though we plan to re-evaluate our results over time, we feel confident that many of them will be of lasting value. We are continuing to study the performance of Java to find ways for improving application performance. We are beginning to look at how function locality within class files may impact memory layout within the virtual machine and the consequent speed of program and data access. We are also experimenting with how identifier name lengths may impact the caching behavior of byte code instruction access and with additional techniques for improving performance by isolating common code paths from exception code. Finally, we are exploring the feasibility of incorporating these techniques into a performance analysis tool that might statically analyze application source code and determine opportunities for improving performance. Java is widely regarded as the enabling technology for delivering dynamic content to users over the Internet. We foresee considerable advantages in employing acommon implementation platform for the content and for the servers that deliver that content. The potential advantages of Java-aware servers--including creation and delivery of Java objects to clients, elimination of cross-language calling constructs and Object Request Brokers, and integration with data stored on heterogeneous platforms--are exciting. Our performance analyses indicate that this single language environment can be a reality.
AcknowledgementsWe would like to thank James Clinton for providing feedback on an earlier draft of this paper. References
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||