http://www.developer.com/

Java Language Integrity & Security: Fine Tuning Bytecodes


April 4, 2007

This series, The Object-Oriented Thought Process, is intended for someone just learning an object-oriented language and who wants to understand the basic concepts before jumping into the code, or someone who wants to understand the infrastructure behind an object-oriented language he or she is already using. These concepts are part of the foundation that any programmer will need to make the paradigm shift from procedural programming to object-oriented programming.

In keeping with the code examples used in the previous articles, Java will be the language used to implement the concepts in code. One of the reasons that I like to use Java is that you can download the Java compiler for personal use at the Sun Microsystems Web site http://java.sun.com/. You can download the standard edition, J2SE 5.0, at http://java.sun.com/j2se/1.5.0/download.jsp to compile and execute these applications. I often reference the Java J2SE 5.0 API documentation and I recommend that you explore the Java API further. Code listings are provided for all examples in this article as well as figures and output (when appropriate). See the first article in this series for detailed descriptions for compiling and running all the code examples.

The code examples in this series are meant to be a hands-on experience. There are many code listings and figures of the output produced from these code examples. Please boot up your computer and run these exercises as you read through the text.

Last month, you began an examination of the structure of bytecodes. You explored how a class file is designed and how you can disassemble it. Although there may be very few instances when you would need to disassemble code, it is often a very good mechanism for instructional purposes. Understanding what goes on under the hood of an application is a beneficial process. In this article, you continue the discussion of the structure of bytecodes by exploring how you can process them to improve performance, security, intellectual property protection, and readability issues.

Inspecting Classes

One topic that is quite interesting to investigate is the relationship between source code and the bytecodes that the compiler produces. In fact, you can explore this relationship from various perspectives: performance, security, intellectual property, and even readability. In many cases, one or more of these topics intersect. For example, fine-tuning bytecodes for increased performance can also lead to more secure code. The same goes for intellectual property concerns, which can go hand-in-hand with code performance. It is important, and quite interesting, to understand the effect that fine-tuning bytecodes can have.

When you talk about fine-tuning bytecodes, you are actually changing the bytecodes themselves. This can be problematic because it is normally not a good idea to actually change output from the compiler. Figure 1 illustrates the process by which bytecodes are created.



Figure 1: The Bytecode Update Process (using a bytecode processing tool).

Note that the bytecodes produced by the compiler are fed directly into the virtual machine. This implies that it is possible to alter the bytecode file (the class file) to various ends. In fact, this defines a security threat that the Java virtual machine takes pains to avoid. You certainly do not want any malicious code introduced to your already compiled bytecodes, malicious code that the virtual machine is unaware of. However, this situation allows you to exploit this technique for legitimate and productive purposes.

To understand how you might fine-tune bytecodes, explore how you could do the same thing by hand. This is an interesting approach. The implication is that if you were to fine-tune bytecodes, the same fine-tuning could have (perhaps should have) been done at the source code level.

There is some truth to this; however, not in all cases. There are certainly times when fine-tuning at the bytecode level is directed at undesirable source code, which you might call poorly written source code. Yet, as already mentioned, there are many reasons for fine-tuning bytecodes that have nothing at all to do with poorly written source code. Rather, the updates to the bytecodes result from a desire to address performance, security, intellectual property protection, and readability issues.

The one thing that must be remembered is that humans look at code totally differently than the computer. This may seem like an obvious statement; however, high-level languages were developed for just this reason. Thus, source code and bytecodes are written for completely different audiences. This means that what might be good for one is not necessarily good for the other.

To illustrate what I mean by this, consider the code formatting styles that have been adopted by the software development industry. These rules are in place to make source code easier to develop, whether it is writing, reading or maintaining code. In short, the rules are in place to help humans make sense of the process. Certain developments in source code creation provide absolutely no specific benefit to the way the machine interprets the source code.

The adoption of coding standards is a perfect case in point. Take a look at the code in Listing 1. Just by looking at this code, can you tell what it is doing? There are some clues, but it is not necessarily obvious.

class Example {
   public static void main(String args[]) {
      double x = 0.0;
      double y = 0.0;
      x = 22;
      y = (x * 1.8) +32;
      System.out.println ("x = " + x);
      System.out.println ("y = " + y);
   }
}

Listing 1: Code without descriptive names

Now, look at the code in Listing 2. Even though both applications perform the same calculation and produce the same values, my bet is that you agree Listing 2 is the better-documented code. The only differences are the names of the class and the variables, and the labels printed with the values.

class CelsiusToFahrenheit {
   public static void main(String args[]) {
      double celsius = 0.0;
      double fahrenheit = 0.0;
      celsius = 22;
      fahrenheit = (celsius * 1.8) + 32;
      System.out.println("celsius = " + celsius);
      System.out.println("fahrenheit = " + fahrenheit);
   }
}

Listing 2: Code with descriptive names

There are any number of similar examples that can be presented to illustrate this point, including the use of whitespace and comments. While none of this is exactly rocket science, much of the way we write code today deals with readability issues. Often, these issues make no difference to the machine itself. Still, it is interesting to look at things from the perspective of the machine.

Now, pursue the angle of code performance. There are times when code readability and performance issues do not line up. As demonstrated in Listings 1 and 2, making attribute names meaningful is an important way to make your code easier to develop and maintain. However, is this beneficial from a performance perspective?

Under The Hood Once Again

As a baseline, consider the code presented in Listing 3.

public class Performance {

   public static void main(String args[]) {

      CompanyApp app = new CompanyApp ();

      app.Employee(2001);
      app.Finance(3001.0);

      System.out.println();

   }
}

class CompanyApp {

      private int companyID = 1001;

      public void Employee(int number) {

         int employeeNumber = number;

         System.out.println("\nInside Employee");
         System.out.println("companyID      = " + companyID);
         System.out.println("employeeNumber = " + employeeNumber);

      }

      public void Finance(double bal) {

         double balance = bal;

         System.out.println("\nInside Finance");
         System.out.println("companyID = " + companyID);
         System.out.println("balance   = " + balance);

      }
}

Listing 3: The Example Application

When you run this application, you get the results in Figure 2. It is important to note the output for later comparison purposes.



Figure 2: Example Application Output

The application is named Performance and contains the following constructs.

Class:

CompanyApp

   Class attribute: private int companyID

Methods:

public void Employee(int number)

   Method attribute: int employeeNumber

public void Finance(double bal)

   Method attribute: double balance

Note that the names were selected to provide some meaning to the reader of the code. It is obvious what the attributes companyID and balance are meant to be, at least at a high level. In other words, these names are more descriptive than if you had named them x and y. However, might there be a situation when naming the attributes x and y would be preferable? The answer is yes, at least at the bytecode level.

In Scope

Two of the issues mentioned earlier in the article were performance and the protection of intellectual property. These issues dovetail well into this discussion of code readability. The main point revolves around the earlier statement that creating descriptive attribute names makes code more readable. This in itself probably can't be disputed; however, arguments can be made that creating descriptive attribute names is not necessarily the best for performance and the protection of intellectual property.

It is pretty easy to see that if you have 100 attribute names in your application, and each one averages 7 characters, you have to obtain storage for at least 700 characters. Yet, just by reducing the average number of characters to 3, you only need 300 characters. Obviously, you have a savings of over one half. This may seem trivial, and in many cases it is; however, for hardware with small memory footprints, memory usage like this can add up—in this example, you are only talking about 100 attributes.
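The arithmetic above can be checked with a few lines of Java. This is only a rough sketch of the character count described in the text; the class and method names are made up for illustration, and it does not model how a class file actually stores names.

```java
// Hypothetical helper for illustration: a rough count of the characters
// needed to store identifier names, as described in the text.
class NameStorageEstimate {

    // Total characters for `count` names averaging `avgLength` characters each.
    static int totalChars(int count, int avgLength) {
        return count * avgLength;
    }

    // Percentage saved when shrinking from `before` to `after` characters.
    static double percentSaved(int before, int after) {
        return 100.0 * (before - after) / before;
    }

    public static void main(String[] args) {
        int before = totalChars(100, 7);   // 100 names, 7 characters each
        int after  = totalChars(100, 3);   // 100 names, 3 characters each

        System.out.println("before = " + before);
        System.out.println("after  = " + after);
        System.out.printf("saved  = %.1f%%%n", percentSaved(before, after));
    }
}
```

Going from 700 to 300 characters saves 400 of the original 700, which is a little more than half.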

You can even save more memory space if you take advantage of the concept of scope. Theoretically, from the perspective of memory usage, it would be nice if you could name every single variable as a single character. This has apparent limits. For starters, there are only 26 letters in the alphabet. So, you could not have named all of your 100 attributes using a single letter; you would run out of letters. In this case, you are fortunate because you can take advantage of the fact that only attributes with the same scope are required to have different names.

For example, in the Performance application, there is a class attribute named companyID, the Employee() method has an attribute named employeeNumber, and the Finance() method has an attribute named balance. All three of these attributes have totally separate scope. In fact, you could have named all three of the attributes companyID, or even simply a. Assume that you do name all of these attributes a. Although there would not be any compiler confusion with the attributes in the two methods, the class variable would lose the precedence battle with the tighter scope of the methods. Thus, you must get used to including the this pointer in your code as is done in Listing 4.

Note: The this pointer, unfortunately named perhaps, means that you use the scope of the object. Thus, the code this.a simply means to use the attribute a defined at the class level.

In Listing 4, the this pointer appears in the two lines that print companyID.

public void Employee(int number) {

   int employeeNumber = number;
   System.out.println("\nInside Employee");
   System.out.println("companyID      = " + this.companyID);
   System.out.println("employeeNumber = " + employeeNumber);
}

public void Finance(double bal) {

   double balance = bal;

   System.out.println("\nInside Finance");
   System.out.println("companyID = " + this.companyID);
   System.out.println("balance   = " + balance);
}

Listing 4: The Example Application Using the this Pointer

Using the this pointer makes the behavior of the code a bit more obvious, and it provides the groundwork for the concept you explore next. See how far you can go by renaming everything you can in the application to a.

Obfuscating the Code

One of the ways that you can attempt to protect the code's intellectual property is to make the code harder to read. This is not the same as mangling the code or encrypting the code. Mangling or encryption implies that the code has to be decoded—perhaps with an algorithm. You can explore these concepts later; however, at this point you are just going to take a simple first step by making the code more difficult to follow. The act of making the code more difficult to read, or less clear, is sometimes called obfuscation. You can proceed in three steps.

The first step is to change all the attributes to a. This is accomplished in Listing 5.

public class Performance {

   public static void main(String args[]) {

      CompanyApp app = new CompanyApp ();

      app.Employee(2001);
      app.Finance(3001.0);

      System.out.println();

   }
}

class CompanyApp {

   private int a = 1001;

   public void Employee(int number) {

      int a = number;

      System.out.println("\nInside Employee");
      System.out.println("companyID      = " + this.a);
      System.out.println("employeeNumber = " + a);

   }

   public void Finance(double bal) {

      double a = bal;

      System.out.println("\nInside Finance");
      System.out.println("companyID = " + this.a);
      System.out.println("balance   = " + a);

   }
}

Listing 5: The Example Application, Obfuscating the attribute names.

With this change, not only are all attributes a single character in length, they are all the same character. Besides savings pertaining to the length of the attributes, there are ramifications internally as to how the compiler stores and represents attributes—you will learn about this in later articles. Although this code may not be as human friendly as the previous version, it behaves exactly the same. When you run this example, you will get the output in Figure 2.

One interesting exercise you can perform here pertains to the use of the this pointer. If you take out the this pointer, you will get different results. For example, in the Finance() method, the following code will produce incorrect results because both lines will bind to the method variable—so the output for the companyID is incorrect.

   System.out.println("companyID = " + a);
   System.out.println("balance   = " + a);
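The difference is easy to see side by side. The following is a minimal sketch, not part of the original application: the class ShadowDemo and its two methods are hypothetical names that mirror the Finance() method, once with the this pointer and once without it.

```java
// Hypothetical class for illustration; it mirrors the Finance() method of Listing 5.
class ShadowDemo {

    private int a = 1001;   // class-level attribute (the company ID)

    // Uses this.a, so the class-level attribute is read.
    String withThis(double bal) {
        double a = bal;     // local variable with the same name as the attribute
        return "companyID = " + this.a;
    }

    // Omits this, so the name a binds to the local variable instead.
    String withoutThis(double bal) {
        double a = bal;
        return "companyID = " + a;
    }

    public static void main(String[] args) {
        ShadowDemo demo = new ShadowDemo();
        System.out.println(demo.withThis(3001.0));     // companyID = 1001
        System.out.println(demo.withoutThis(3001.0));  // companyID = 3001.0
    }
}
```

Both methods compile without complaint, which is what makes this mistake easy to overlook.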

As stated earlier, this exercise reinforces the concept of scope quite well. The use of scope is fundamental to object-oriented development, yet it is one of the most difficult concepts for beginning students to grasp. Even advanced developers find the variations are tricky at times. It is fortunate that not only is scope something you must understand; as you are finding, you can use it to your advantage.

The second step in obfuscating this code is changing all the methods to a. By this time, it may seem that you are being a bit boring by changing everything to simply a. However, this is the point: make the code boring and more difficult to read. And, as you are finding, boring code can also provide security and performance advantages.

Rather than provide a complete and separate code listing for both Steps 2 and 3, and in an effort to save some space (your own performance requirement), you will combine the listing for both steps.

Where Step 2 changes all the method names to a, Step 3 changes the class names to a as well. In this case, there is only a single class name to change, CompanyApp. Listing 6 shows what the code looks like with the programmer-defined class, method, and attribute names listed as simply a.

public class Performance {

   public static void main(String args[]) {

      a app = new a ();

      app.a(2001);
      app.a(3001.0);

      System.out.println();

   }
}

class a {

   private int a = 1001;

   public void a(int number) {

      int a = number;

      System.out.println("\nInside Employee");
      System.out.println("companyID      = " + this.a);
      System.out.println("employeeNumber = " + a);

   }

   public void a(double bal) {

      double a = bal;

      System.out.println("\nInside Finance");
      System.out.println("companyID = " + this.a);
      System.out.println("balance   = " + a);

   }
}

Listing 6: The Example Application, Obfuscating the method and class names

Another interesting issue here is that the two methods may have the same name, a; however, they have different signatures.

   public void a(int number)
   public void a(double bal)

Because the first method accepts an int and the second a double, you have the luxury of being able to name them the same. If they both had the same parameter list, they would have to have unique names. There are a lot of subtle issues that you can take advantage of, from both a practical and an academic perspective.
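To see the compiler pick between two methods that share a name, consider this small sketch. The class OverloadDemo is a hypothetical name, and the methods return strings rather than printing so that the choice is easy to inspect.

```java
// Hypothetical class for illustration; both methods share the name a.
class OverloadDemo {

    // Chosen when the argument is an int.
    String a(int number) {
        return "int version: " + number;
    }

    // Chosen when the argument is a double.
    String a(double bal) {
        return "double version: " + bal;
    }

    public static void main(String[] args) {
        OverloadDemo demo = new OverloadDemo();
        System.out.println(demo.a(2001));    // int version: 2001
        System.out.println(demo.a(3001.0));  // double version: 3001.0
    }
}
```

Only the parameter list separates the two methods. Changing just the return type would not be enough; two methods named a that both take a single int cannot coexist in the same class.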

When the Listing 6 application is executed, you get exactly the same results that you obtained with the more readable code, as shown in Figure 3; however, you now have code that is somewhat more difficult to understand and, by the reasoning above, needs less storage for its names.

Obviously, these examples are meant to represent the concepts behind the techniques and for small applications like these, the effect is not that great. Yet, when these techniques are extrapolated for much larger applications, the benefits can be significant.



Figure 3: Example Application Output after Obfuscation (same result)

There is also the issue of the strings in the println() method.

   System.out.println("amount  = " + a);
   System.out.println("balance = " + b);

These strings convey some of the intent of the code. You can even hide their meaning by using attributes for the string descriptions and loading them via parameters, file or database loads, or even user inputs.
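One way to move the descriptive strings out of the code is to load them at run time. The sketch below is only an illustration under stated assumptions: the class LabelDemo and the keys k1 and k2 are made up, and a StringReader stands in for the file or database that a real application might read from.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical class for illustration; the keys k1 and k2 are made-up names.
class LabelDemo {

    // Loads label text from properties-formatted input. A StringReader stands in
    // for a file or database source so that the sketch is self-contained.
    static Properties loadLabels(String source) {
        Properties labels = new Properties();
        try {
            labels.load(new StringReader(source));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return labels;
    }

    public static void main(String[] args) {
        Properties labels = loadLabels("k1=amount\nk2=balance\n");

        double a = 100.0;
        double b = 3001.0;

        // The source code no longer contains the words "amount" or "balance".
        System.out.println(labels.getProperty("k1") + " = " + a);
        System.out.println(labels.getProperty("k2") + " = " + b);
    }
}
```

The labels still exist somewhere, of course, so this only moves the descriptive text out of the class file rather than eliminating it.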

Limits of Scope

There are, of course, limits to the number of user-defined names that you can change to a single character. You saw that in the case of class attributes you had to utilize the this pointer to differentiate scope. At some point, the scope conflicts cannot be resolved by using a single-character name. In the following code, the method a contains two attributes, a and b. Because the attributes a and b are within the same scope, they must have unique names.

public void a(double amt , double bal) {

   double a = amt;
   double b = bal;

   System.out.println("\nInside Finance");
   System.out.println("companyID = " + this.a);
   System.out.println("amount    = " + a);
   System.out.println("balance   = " + b);

}
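The same rule can be tried out in a small, self-contained sketch. The class ScopeLimitDemo and its report() method are hypothetical names; the commented-out line marks the declaration that the compiler would reject.

```java
// Hypothetical class for illustration.
class ScopeLimitDemo {

    private int a = 1001;      // class-level attribute

    String report(double amt, double bal) {
        double a = amt;        // hides the class-level a, which is legal
        // double a = bal;     // would not compile: a is already defined in this scope
        double b = bal;        // so the second local variable needs its own name

        return this.a + " " + a + " " + b;
    }

    public static void main(String[] args) {
        // Prints the class attribute followed by the two local values.
        System.out.println(new ScopeLimitDemo().report(10.5, 3001.0));  // 1001 10.5 3001.0
    }
}
```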

Returning to the example in Listing 6, the last holdouts among the user-defined names are the parameters passed into the two methods. As you saw before when exploring the signatures, each method has a single parameter, and both parameters have user-defined names that are somewhat descriptive.

   public void a(int number)
   public void a(double bal)

It would be nice to define both of the parameters, number and bal, simply as a. But you can't; you have reached another limit of scope.

You run into a problem if you define both of these attributes as a, as in the following code.

   public void a(int a)
   public void a(double a)

In both these cases, you now have a conflict because each of the methods already has an attribute defined as a, and you obviously can't have two attributes with the same name in the same scope. So, what do you do now? You probably guessed it; you use b, of course. When you get to the point where you have used as many as and bs as you can, you use c, and so on. The concept is elegantly simple; however, it gets very confusing because it is very hard to follow all of the as, bs, and cs. Yet, that is the whole point.
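The progression from a to b to c can be generated mechanically. The following is a minimal sketch of one way a tool might hand out the next short name; the class ShortNames and the method nameFor() are hypothetical, and the sketch does not skip Java reserved words such as do and if, which a real tool working at the source level would have to avoid.

```java
// Hypothetical helper for illustration: maps 0, 1, 2, ... to a, b, ..., z, aa, ab, ...
class ShortNames {

    static String nameFor(int index) {
        StringBuilder name = new StringBuilder();
        int n = index;
        do {
            // Prepend the letter for the current position.
            name.insert(0, (char) ('a' + n % 26));
            // Move to the next position; -1 means no more letters are needed.
            n = n / 26 - 1;
        } while (n >= 0);
        return name.toString();
    }

    public static void main(String[] args) {
        System.out.println(nameFor(0));    // a
        System.out.println(nameFor(1));    // b
        System.out.println(nameFor(25));   // z
        System.out.println(nameFor(26));   // aa
    }
}
```

Once the 26 single letters are used up within a scope, the names grow to two characters, which is still far shorter than most descriptive names.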

Thus, to complete this example, you define the parameters as b and you end up with the code in Listing 7.

public class Performance {

   public static void main(String args[]) {

      a app = new a ();

      app.a(2001);
      app.a(3001.0);

      System.out.println();

   }
}

class a {

   private int a = 1001;

   public void a(int b) {

      int a = b;

      System.out.println("\nInside Employee");
      System.out.println("companyID      = " + this.a);
      System.out.println("employeeNumber = " + a);

   }

   public void a(double b) {

      double a = b;

      System.out.println("\nInside Finance");
      System.out.println("companyID = " + this.a);
      System.out.println("balance   = " + a);

   }
}

Listing 7: The Example Application, Obfuscating the parameter names

Part of the Build Process

One question that you might ask is this: If these techniques are meant to make the code so hard to read, and thus to maintain, why in the world would anyone want to employ them, given that you will have to read and maintain the code? The beauty of this approach is that you always operate on the original, well-documented, human-readable source code.

The elegance of this approach is that the techniques (basically a software tool) are always applied to the bytecodes that come out of the compiler and not the source code. Thus, as part of the process illustrated in Figure 1, the programmer always updates the source code, just as expected, and then the tool is applied to the bytecodes. For all intents and purposes, the alteration of the bytecodes is transparent to the programmer. It simply becomes part of the build process.

Conclusion

In this article, you began to explore how you can process the bytecodes produced by the compiler to help you improve performance, security, intellectual property protection, and readability issues. You have only skimmed the surface of these issues. As an academic exercise that can lead to some very good practical applications, you can actually decompile the bytecodes to see what the compiler is actually doing in its own right.

In next month's article, you will do just that and delve more deeply into understanding how the structure of bytecodes can help in the construction and testing phases of the software development process and how it affects the performance and security of an application.

References

  • www.sun.com
  • Fine Tuning Java Applications. Java Report, November 1999. Matt Weisfeld & Gabriel Torok.

About the Author

Matt Weisfeld is a faculty member at Cuyahoga Community College (Tri-C) in Cleveland, Ohio. Matt is a member of the Information Technology department, teaching programming languages such as C++, Java, C#, and .NET as well as various Web technologies. Prior to joining Tri-C, Matt spent 20 years in the information technology industry gaining experience in software development, project management, business development, corporate training, and part-time teaching. Matt holds an MS in computer science and an MBA in project management. Besides The Object-Oriented Thought Process, which is now in its second edition, Matt has published two other computer books, and more than a dozen articles in magazines and journals such as Dr. Dobb's Journal, The C/C++ Users Journal, Software Development Magazine, Java Report, and the international journal Project Management. Matt has presented at conferences throughout the United States and Canada.
