http://www.developer.com/


Java Language Integrity & Security: Uncovering Bytecodes


March 5, 2007

This series, The Object-Oriented Thought Process, is intended for someone just learning an object-oriented language and who wants to understand the basic concepts before jumping into the code, or someone who wants to understand the infrastructure behind an object-oriented language he or she is already using. These concepts are part of the foundation that any programmer will need to make the paradigm shift from procedural programming to object-oriented programming.


In keeping with the code examples used in the previous articles, Java will be the language used to implement the concepts in code. One of the reasons that I like to use Java is because you can download the Java compiler for personal use at the Sun Microsystems Web site http://java.sun.com/. You can download the standard edition, J2SE 5.0, at http://java.sun.com/j2se/1.5.0/download.jsp to compile and execute these applications. I often reference the Java J2SE 5.0 API documentation and I recommend that you explore the Java API further. Code listings are provided for all examples in this article as well as figures and output (when appropriate). See the first article in this series for detailed descriptions for compiling and running all the code examples.

In the previous column, you explored some of the behaviors of serialization and how it relates to the topics of performance and security. In this article, you will begin an exploration of the bytecodes that are produced when source files are compiled and how this affects performance and security. This path will lead you into some interesting discussions on how the bytecode is interpreted in relation to the Java Virtual Machine (JVM).

The code examples in this series are meant to be a hands-on experience. There are many code listings and figures of the output produced from these code examples. Please boot up your computer and run these exercises as you read through the text.

Inspecting Classes

The bytecode model provides many advantages; however, as always seems to be the case, there are some drawbacks as well. When a compiled language is used, and a statically linked executable is produced, the resulting machine code is quite difficult to reengineer.

Reengineering can mean many things, from re-creating the original design to reproducing the original source code. Although the previous sentence uses the words reengineering, re-create and reproduce, the Java documentation uses another word, disassemble. The Java toolkit actually provides a tool, called javap, for simple disassembly. The term disassemble can raise some eyebrows because at certain levels it is inappropriate; however, you will use the practice here in an instructional sense.

In languages that produce bytecodes, the practice of disassembling code has one goal in mind: Take the bytecodes and reverse-engineer them to produce source code that is effectively identical to the original source code.

Statically Linked Executables

However, decompiling code also has an educational benefit. Please return to the issue of the statically linked language. Languages such as C, C++, and FORTRAN go through a compile/link process that produces what is called a statically linked module. In an MS Windows environment, these modules are sometimes referred to as executables and have an .EXE extension. Figure 1 shows the process by which statically linked executables are produced.




Figure 1: Statically Linked Applications

Note that the link process can accept multiple inputs, not just a single object module. This is what the term link means: an executable module can contain code that is 'linked' together from various places. For example, besides an object module produced from a single file, a developer can 'link' other modules, including those produced by other developers as well as libraries, possibly from third-party vendors.

The other term that is pertinent here is 'statically.' All of the linked modules produce an executable that is static, not dynamic, as are the examples you will explore later. A search on Google finds a definition for statically linked as follows:

Definitions of statically linked on the Web:

Linked as a physical part of an executable file. The linkage between calls and subprograms is completely fixed at link time. See dynamically linked.

techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi

The operative part of this definition is the part that says: The linkage between calls and subprograms is completely fixed at link time. It is also interesting to see that the term dynamically linked is part of the definition, as an opposite. The issue is that in a statically linked executable, everything is pre-determined. This has its advantages; it also has its disadvantages. The primary advantage is that everything you need is always there; the disadvantage is that everything you need is always there.

This issue is at the heart of a major hindrance when it comes to size. As you may imagine, sending a large, statically linked executable over a network poses a significant problem. As an example, if you are loading a module over a network, it may be a good idea to only download the functionality that you need. This is one of the problems with a statically linked executable. If everything is part of the package, including the kitchen sink, what happens when you don't need the kitchen sink? The more basic question is, Why send the kitchen sink over the network if you don't even want it?

As you understand, a statically linked executable, such as a Microsoft Windows .EXE file, can be run only on a Windows platform. This is both an advantage and a significant limitation. The lack of portability causes significant problems when it comes to platform-independent applications such as web pages. In fact, a web developer has no clue as to what platform a user is surfing the web with. Thus, a web application must be able to support several different platforms. However, the browser itself is a statically linked application and must be created on each individual platform.

As you have seen, executables contain the machine language of the host machine. Thus, they are not portable across platforms. As an example, consider the Java compiler itself.

Although Java is not a statically linked language (it is actually a dynamically loaded language), the Java tools provided for specific platforms are statically linked executables. If you take a look at the Java installation directory, you can see that the bin directory contains a lot of Windows executables; one of them is the Java compiler, javac.exe. Figure 2 shows a screen shot of the Java executables contained in this directory. You will recognize many of the tools that are used in application development, such as the compiler (javac.exe), the virtual machine (java.exe), and so forth.




Figure 2: Statically Linked Java Applications

The issue here is that these are Microsoft Windows applications only. You could not copy this version of javac.exe and run it directly on a UNIX machine.

Of course, there is a Java language specification. Although the Windows version of javac.exe will run only on a Windows platform, the Java compiler on UNIX and other platforms must abide by the same Java language specification. Even though each individual platform has its own non-portable Java compiler, they all, theoretically at least, behave in consistent ways. Thus, a Java program compiled on a Windows platform should run unchanged on a UNIX platform, even though the tools themselves were written on different platforms by different developers.

As already stated, the Microsoft javac.exe file contains Microsoft Windows-specific machine code. It is interesting to try and open an executable file with a text editor. This is actually a meaningless operation in this context, except that you will be able to compare it to a file of bytecodes and it provides a baseline for this discussion. When you open the javac.exe file in Notepad, you get the results seen in Figure 3.




Figure 3: javac.exe opened in Notepad.

Obviously, this exercise provides us no useful information. However, I always find it interesting to take a look at this type of output. It also provides students a way to differentiate between character and binary files. Because Notepad is a text editor, the characters displayed are the ASCII representation of the file. Perhaps the most interesting thing about the display of this file is that there are no recognizable words, or at least none that I can determine. That is not the case when you look at a file of bytecodes in the same way.

Dynamically Linked Executables

The bytecode model uses a different approach. In this case, if you don't want the kitchen sink, you won't get it. The corresponding definition of dynamically linked is:

Definitions of dynamically linked on the Web:

Linked in name only, so that the executable file contains only the information needed to locate the code of a procedure—the name of the module that contains it and the name of the entry point. When the executable program is loaded, the module is also loaded, and the linkage between them is fixed in memory only.

techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi

Figure 4 shows the life-cycle of source code in the bytecode model. In this model, instead of creating machine dependent object modules, bytecode is produced. Although there are drawbacks to this model, which you will explore shortly, a primary advantage is that the bytecodes are, again theoretically, platform independent.




Figure 4: Bytecode model
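To make the idea of resolving code by name at run time more concrete, here is a minimal sketch in Java. It is only an illustration under stated assumptions: the class name DynamicLoadSketch and the helper loadAndCreate are hypothetical and do not appear in the article's listings. The sketch asks the JVM to locate a class from nothing more than a string holding its name, which is loosely analogous to the "linked in name only" definition quoted above.

```java
// Hypothetical demo class; the names here are illustrative, not from the article.
class DynamicLoadSketch {

    // Looks up a class by its fully qualified name at run time and
    // creates an instance through its no-argument constructor.
    // The name is just a string, so nothing is bound at compile time.
    static Object loadAndCreate(String className) {
        try {
            Class<?> type = Class.forName(className);
            return type.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not load " + className, e);
        }
    }

    public static void main(String[] args) {
        Object list = loadAndCreate("java.util.ArrayList");
        System.out.println("Loaded: " + list.getClass().getName());
    }
}
```

If the named class cannot be found, the failure shows up when the program runs rather than when it is built, which is the trade-off that comes with this kind of late binding.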

Under The Hood

Perhaps the best way to explain the bytecode model is to look at it directly. You'll design a small Java application for this illustration. In this case, you will create a simple application called Performance presented in Listing 1 and use a class called Employee presented in Listing 2.

Listing 1: The Performance Class

public class Performance {

   public static void main(String[] args) {

      System.out.println("Performance Example");

      Employee joe = new Employee();

   }
}

Listing 2: The Employee Class

public class Employee {

   private int employeeNumber;

}

As I normally do with these examples, I compile them from a batch file as seen in Listing 3.

Listing 3: Compiling the Application

cls

"C:\Program Files\Java\jdk1.5.0_07\bin\javac" -Xlint -classpath . Performance.java

Although I do most of my development with an Integrated Development Environment (IDE), I normally will use batch files like these so that I know my CLASSPATH information is correct. This helps in the instruction phase of programming, and it also assists in the testing of the various versions of the development kits. For example, as was mentioned earlier, a web developer must allow for various platforms while developing and testing. In the same manner, a developer must allow for various versions of a development kit. If Java is the development platform, what version of the SDK should be used? The answer is that all reasonable versions must be tested. This means that multiple versions of the development kit may be installed on a machine at the same time. Therefore, keeping track of the CLASSPATH is problematic.

To deal with this, I like to use batch files to ensure that I am using the version of the development kit that I intend to use. Granted, there are much more sophisticated methods of doing this and there are many development tools available to the professional developer; however, in an academic environment, using a simpler and inexpensive solution is often desirable.
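As a complement to the batch-file approach, a small program can report which Java installation and class path are actually in effect when it runs. The following is a minimal sketch; the class name EnvironmentCheck is hypothetical, while java.version, java.home, and java.class.path are standard system properties.

```java
// Hypothetical helper; prints which Java runtime and class path are in use.
class EnvironmentCheck {

    // Builds a short report from standard system properties.
    static String describe() {
        return "java.version    = " + System.getProperty("java.version") + "\n"
             + "java.home       = " + System.getProperty("java.home") + "\n"
             + "java.class.path = " + System.getProperty("java.class.path");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running this with each installed version of the development kit gives a quick confirmation that the intended one is being picked up.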

When this application is compiled, there are two separate class files produced, Performance.class and Employee.class, as seen in Figure 5.




Figure 5: Application Class Files.

Take another look at Figure 3, when you opened the statically linked javac.exe file with Notepad. Open the Employee.class file and see what you get. The results, using Notepad, can be seen in Figure 6.




Figure 6: Employee.class.

Again, this exercise provides no real benefit from a technical perspective; however, it does provide a window into the structure of the bytecodes. Primarily, you can see that there are some textual components of the file that are recognizable. The word Employee is clearly identifiable in at least a couple of locations. The reason why this is important is because it hints at the possibility of decoding this file and providing much more information about it. Could you potentially even re-create the original source code?
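The same observation can be made without a text editor. The sketch below scans a file's bytes and collects runs of printable ASCII characters, which is roughly what the eye does when looking at the Notepad display. The class name PrintableRuns, the method find, and the minimum run length of 4 are assumptions chosen for illustration; the program also assumes that Employee.class sits in the current directory unless a path is passed on the command line.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; extracts readable text fragments from a binary file.
class PrintableRuns {

    // Collects runs of printable ASCII characters that are at least minLength long.
    static List<String> find(byte[] data, int minLength) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (byte b : data) {
            if (b >= 0x20 && b <= 0x7E) {
                current.append((char) b);
            } else {
                if (current.length() >= minLength) {
                    runs.add(current.toString());
                }
                current.setLength(0);
            }
        }
        // Flush a run that reaches the end of the data.
        if (current.length() >= minLength) {
            runs.add(current.toString());
        }
        return runs;
    }

    public static void main(String[] args) throws Exception {
        // Assumption: Employee.class is in the current directory by default.
        String path = args.length > 0 ? args[0] : "Employee.class";
        for (String run : find(Files.readAllBytes(Paths.get(path)), 4)) {
            System.out.println(run);
        }
    }
}
```

Pointed at a compiled class file, a scan like this should surface names such as the class name, whereas the same scan over a native executable tends to yield far less that is recognizable.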

The thought of re-creating the original source code of the statically linked application is far beyond the reach of most any technology. Yet, is it possible to accomplish this task with bytecodes? The answer to this question is yes, for the most part. There are many technologies that perform the function of re-creating source code from bytecodes, and you will explore this in later articles. For now, you can use something much more accessible, a tool provided by the Java SDK itself: javap.exe.




Figure 7: javap.exe.

The Java documentation identifies javap as The Java Class File Disassembler. Its definition is as follows:

The javap command disassembles a class file. Its output depends on the options used. If no options are used, javap prints out the package, protected, and public fields and methods of the classes passed to it. javap prints its output to stdout.

http://java.sun.com/j2se/1.5.0/docs/tooldocs/windows/javap.html

You can take a look at the various options available for javap by calling javap with the -help option as seen in Figure 8.




Figure 8: The options for javap.exe (using javap -help).

For your examples, you will start with the -private flag to show you all classes and members. The best way to see what javap does is to run the Employee class through it as follows:




Figure 9: Running Employee.class through javap.exe.

It is interesting to look at the original source code and the disassembled source code right next to each other. Take a look at Listing 4 and study the differences.

Listing 4: The Employee Class (original source code and the disassembled source code)

public class Employee {

   private int employeeNumber;

}

public class Employee extends java.lang.Object{
   private int employeeNumber;
   public Employee();
}

There are two obvious differences between the two versions of the code. First, in the disassembled source code, it is apparent that Employee extends the java.lang.Object class. This is expected, because all classes in Java ultimately extend the Object class. However, here is concrete evidence: the extends clause was not in the original source code, yet the compiler has inserted it into the bytecode version. The second obvious difference is that a constructor now appears in the code:

public Employee();

Again, this is as expected. If no constructor is specified in the original code, a default constructor is supposed to be provided for you—and this is exactly what happened here. You can have some fun and see what happens when you do provide a constructor as seen in Listing 5.

Listing 5: The Employee Class (original source code and the disassembled source code)

public class Performance {

   public static void main(String[] args) {

      System.out.println("Performance Example");

      Employee joe = new Employee(1);

   }
}
public class Employee {

   private int employeeNumber;

   public Employee (int a) {

   }

}

Running this code through javap produces the output in Figure 10.




Figure 10: A Non-Default Constructor.

Notice that the default constructor is gone and is replaced by the constructor that you defined. This is exactly what you would have expected. Once again, the value of this exercise is primarily instructional; however, at times it can be a valuable debugging tool. Finally, put in a second constructor.

Listing 6: The Employee Class (original source code and the disassembled source code)

public class Performance {

   public static void main(String[] args) {

      System.out.println("Performance Example");

      Employee joe = new Employee(1);

   }
}
public class Employee {

   private int employeeNumber;

   public Employee (int a) {

   }

   public Employee (float a) {

   }
}

Now, javap produces the output in Figure 11. Notice that the parameter names are not included in the constructor parameter lists; only the datatypes, int and float, are listed. Also note that in the original source code, both of the parameter names are the same. This leads you to an interesting topic that you will cover extensively in a future article as well.




Figure 11: Two Constructors.
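A similar view of a class can be obtained from inside a running program by using reflection. The sketch below is a rough analogue of the javap output, not a description of how javap itself works. The class SignatureSketch and the method constructorSignatures are hypothetical names, and the nested Employee class is a stand-in for the article's Employee class so that the example is self-contained.

```java
import java.lang.reflect.Constructor;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo; lists constructor signatures by parameter type only.
class SignatureSketch {

    // Stand-in for the article's Employee class with two constructors.
    static class Employee {
        private int employeeNumber;

        public Employee(int a) { }

        public Employee(float a) { }
    }

    // Builds one string per declared constructor, showing parameter types
    // but not parameter names.
    static List<String> constructorSignatures(Class<?> type) {
        List<String> signatures = new ArrayList<>();
        for (Constructor<?> c : type.getDeclaredConstructors()) {
            StringBuilder sb = new StringBuilder(type.getSimpleName()).append('(');
            Class<?>[] params = c.getParameterTypes();
            for (int i = 0; i < params.length; i++) {
                if (i > 0) {
                    sb.append(", ");
                }
                sb.append(params[i].getName());
            }
            signatures.add(sb.append(')').toString());
        }
        return signatures;
    }

    public static void main(String[] args) {
        System.out.println("extends " + Employee.class.getSuperclass().getName());
        // The order in which constructors are returned is not guaranteed.
        for (String signature : constructorSignatures(Employee.class)) {
            System.out.println(signature);
        }
    }
}
```

The two constructors are told apart purely by their parameter types, which is why both can use the same parameter name in the source.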

Compiling the Disassembled

One of the interesting questions is whether or not this disassembled code can actually be compiled and used. The easiest way to test this is to use it and compile it. The code, incorporating the resultant code from javap, is shown in Listing 7.

Listing 7: The Employee Class (the disassembled source code)

public class Performance {

   public static void main(String[] args) {

      System.out.println("Performance Example");

      Employee joe = new Employee();

   }
}
public class Employee extends java.lang.Object{
   private int employeeNumber;
   public Employee();
}

The answer to that question is no, at least not directly. The javap application (at least with the -private option) seems to have provided only the signature of the method, not the body.




Figure 12: Compiling the disassembled Employee source.

If you do add a method body as seen in Listing 8, the disassembled code will work; but this is cheating. What is the point of disassembling the code if you can't compile it directly? This is a question you will explore in the next article.

Listing 8: The Employee Class (the disassembled source code)

public class Performance {

   public static void main(String[] args) {

      System.out.println("Performance Example");

      Employee joe = new Employee();

   }
}
public class Employee extends java.lang.Object{
   private int employeeNumber;
   public Employee() { };
}

Conclusion

In this article, you began to explore how a class file is designed and how you can disassemble it. Although there are a few applications of the process that can assist the professional developer, it is often a very good mechanism for instructional purposes. Understanding what goes on under the hood of an application is a beneficial process. At a more detailed level, this exercise lays the groundwork for understanding the Java Virtual Machine.

In next month's article, you will delve more deeply into understanding how the structure of bytecodes can help in the construction and testing phases of the software development process and how it affects the performance and security of an application.


About the Author

Matt Weisfeld is a faculty member at Cuyahoga Community College (Tri-C) in Cleveland, Ohio. Matt is a member of the Information Technology department, teaching programming languages such as C++, Java, C#, and .NET as well as various Web technologies. Prior to joining Tri-C, Matt spent 20 years in the information technology industry gaining experience in software development, project management, business development, corporate training, and part-time teaching. Matt holds an MS in computer science and an MBA in project management. Besides The Object-Oriented Thought Process, which is now in its second edition, Matt has published two other computer books, and more than a dozen articles in magazines and journals such as Dr. Dobb's Journal, The C/C++ Users Journal, Software Development Magazine, Java Report, and the international journal Project Management. Matt has presented at conferences throughout the United States and Canada.
