http://www.developer.com/

Back to article

J2ME Game Optimization Secrets


July 14, 2003

This article describes the role that code optimization plays in writing fast games for mobile devices. Using examples I will show how, when and why to optimize your code to squeeze every last drop of performance out of MIDP-compliant handsets. We will discuss why optimization is necessary and why it is often best NOT to optimize. I explain the difference between high-level and low-level optimization and we will see how to use the Profiler utility that ships with the J2ME Wireless Toolkit to discover where to optimize your code. Finally this article reveals lots of techniques for making your MIDlets move.

Why Optimize?

It is possible to divide video games into two broad categories; real-time and input-driven. Input-driven games display the current state of game play and wait indefinitely for user input before moving on. Card games fall into this category, as do most puzzle games, strategy games and text adventures. Real-time games, sometimes referred to as Skill and Action games do not wait for the player - they continue until the game is over.

Skill and Action games are often characterized by a great deal of on-screen movement (think of Galaga or Robotron). Refresh rates must be at least 10fps (frames per second) and there has to be enough action to keep gamers challenged. They require fast reflexes and good hand-eye co-ordination from the player, so compelling S&A games must also be extemely responsive to user input. Providing all that graphical activity at high framerates while responding quickly to key-presses is why code for real-time games has to be fast. The challenges are even greater when developing for J2ME.

Java 2 Micro Edition (J2ME) is a stripped-down version of Java, suitable for small devices with limited capabilities, such as cell phones and PDAs. J2ME devices have:

  • limited input capabilities (no keyboard!)
  • small display sizes
  • restricted storage memory and heap sizes
  • slow CPUs

Writing fast games for the J2ME platform further challenges developers to write code that will perform on CPUs far slower than those found on desktop computers.

When not to Optimize

If you're not writing a Skill and Action game, there's probably no need to optimize. If the player has pondered her next move for several seconds or minutes, she will probably not mind if your game's response to her action takes more than a couple hundred milliseconds. An exception to this rule is if the game needs to perform a great deal of processing in order to determine its next move, such as searching through a million possible combinations of chess pieces. In this case, you might want to optimize your code so that the computer's next move can be calculated in a few seconds, not minutes.

Even if you are writing this type of game, optimization can be perilous. Many of these techniques come with a price tag - they fly in the face of conventional notions of "Good" programming and can make your code harder to read. Some are a trade-off, and require the developer to significantly increase the size of the app in order to gain just a minor improvement in performance. J2ME developers are all too familiar with the challenges of keeping their JAR as small as possible. Here are a few more reasons not to optimize:

  • optimization is a good way to introduce bugs
  • some techniques decrease the portability of your code
  • you can expend a lot of effort for little or no results
  • optimization is hard

That last point needs some illumination. Optimization is a moving target, only more so on the Java platform, and even more so with J2ME because the execution environments vary so greatly. Your optimized code might run faster on an emulator, but slower on the actual device, or vice versa. Optimizing for one handset might actually decrease performance on another.

But there is hope. There are two passes you can make at optimization, a high-level and a low-level. The first is likely to increase execution performance on every platform, and even improve the overall quality of your code. The second pass is the one more likely to cause you headaches, but these low-level techniques are quite simple to introduce, and even simpler to omit if you don't want to use them. At the very least, they're very interesting to look at.

We will also use the System timer to profile your code on the actual device, which can help you to gauge exactly how effective ( or utterly ineffective ) these techniques can be on the hardware you're targeting for deployment.

And one last bullet-point:

  • optimizing is fun

A Bad Example

Let's take a look at a simple application that consists of two classes. First, the MIDlet...

import javax.microedition.midlet.*;import javax.microedition.lcdui.*;public class OptimizeMe extends MIDlet implements CommandListener {  private static final boolean debug = false;  private Display display;  private OCanvas oCanvas;  private Form form;  private StringItem timeItem = new StringItem( "Time: ", "Unknown" );  private StringItem resultItem =                             new StringItem( "Result: ", "No results" );  private Command cmdStart = new Command( "Start", Command.SCREEN, 1 );  private Command cmdExit = new Command( "Exit", Command.EXIT, 2 );  public boolean running = true;  public OptimizeMe() {    display = Display.getDisplay(this);    form = new Form( "Optimize" );    form.append( timeItem );    form.append( resultItem );    form.addCommand( cmdStart );    form.addCommand( cmdExit );    form.setCommandListener( this );    oCanvas = new OCanvas( this );  }  public void startApp() throws MIDletStateChangeException {    running = true;    display.setCurrent( form );  }  public void pauseApp() {    running = false;  }  public void exitCanvas(int status) {    debug( "exitCanvas - status = " + status );    switch (status) {      case OCanvas.USER_EXIT:        timeItem.setText( "Aborted" );        resultItem.setText( "Unknown" );      break;      case OCanvas.EXIT_DONE:        timeItem.setText( oCanvas.elapsed+"ms" );        resultItem.setText( String.valueOf( oCanvas.result ) );      break;    }    display.setCurrent( form );  }  public void destroyApp(boolean unconditional)                           throws MIDletStateChangeException {    oCanvas = null;    display.setCurrent ( null );    display = null;  }  public void commandAction(Command c, Displayable d) {    if ( c == cmdExit ) {      oCanvas = null;      display.setCurrent ( null );      display = null;      notifyDestroyed();    }    else {      running = true;      display.setCurrent( oCanvas );      oCanvas.start();    }  }  public static final void debug( String s ) {    if (debug) System.out.println( s );  }}

Second, the OCanvas class that does most of the work in this example...

import javax.microedition.midlet.*;import javax.microedition.lcdui.*;import java.util.Random;public class OCanvas extends Canvas implements Runnable {  public static final int USER_EXIT = 1;  public static final int EXIT_DONE = 2;  public static final int LOOP_COUNT = 100;  public static final int DRAW_COUNT = 16;  public static final int NUMBER_COUNT = 64;  public static final int DIVISOR_COUNT = 8;  public static final int WAIT_TIME = 50;  public static final int COLOR_BG = 0x00FFFFFF;  public static final int COLOR_FG = 0x00000000;  public long elapsed = 0l;  public int exitStatus;  public int result;  private Thread animationThread;  private OptimizeMe midlet;  private boolean finished;  private long started;  private long frameStarted;  private long frameTime;  private int[] numbers;  private int loopCounter;  private Random random = new Random( System.currentTimeMillis() );  public OCanvas( OptimizeMe _o ) {    midlet = _o;    numbers = new int[ NUMBER_COUNT ];    for ( int i = 0 ; i < numbers.length ; i++ ) {      numbers[i] = i+1;    }  }  public synchronized void start() {    started = frameStarted = System.currentTimeMillis();    loopCounter = result = 0;    finished = false;    exitStatus = EXIT_DONE;    animationThread = new Thread( this );    animationThread.start();  }  public void run() {    Thread currentThread = Thread.currentThread();    try {      while ( animationThread == currentThread && midlet.running                                                && !finished ) {        frameTime = System.currentTimeMillis() - frameStarted;        frameStarted = System.currentTimeMillis();        result += work( numbers );        repaint();        synchronized(this) {          wait( WAIT_TIME );        }        loopCounter++;        finished = ( loopCounter > LOOP_COUNT );      }    }    catch ( InterruptedException ie ) {      OptimizeMe.debug( "interrupted" );    }    elapsed = System.currentTimeMillis() - started;    midlet.exitCanvas( exitStatus );  }  public void paint(Graphics g) {    g.setColor( COLOR_BG );    g.fillRect( 0, 0, getWidth(), getHeight() );    g.setColor( COLOR_FG );    g.setFont( Font.getFont( Font.FACE_PROPORTIONAL,          Font.STYLE_BOLD | Font.STYLE_ITALIC, Font.SIZE_SMALL ) );    for ( int i  = 0 ; i < DRAW_COUNT ; i ++ ) {      g.drawString( frameTime + " ms per frame",                     getRandom( getWidth() ),                     getRandom( getHeight() ),                     Graphics.TOP | Graphics.HCENTER );    }  }  private int divisor;  private int r;  public synchronized int work( int[] n ) {    r = 0;    for ( int j = 0 ; j < DIVISOR_COUNT ; j++ ) {      for ( int i = 0 ; i < n.length ; i++ ) {        divisor = getDivisor(j);        r += workMore( n, i, divisor );      }    }    return r;  }  private int a;  public synchronized int getDivisor( int n ) {    if ( n == 0 ) return 1;    a = 1;    for ( int i = 0 ; i < n ; i++ ) {      a *= 2;    }    return a;  }  public synchronized int workMore( int[] n, int _i, int _d ) {    return n[_i] * n[_i] / _d + n[_i];  }  public void keyReleased(int keyCode) {    if ( System.currentTimeMillis() - started > 1000l ) {      exitStatus = USER_EXIT;      midlet.running = false;    }  }  private int getRandom( int bound )   {  // return a random, positive integer less than bound    return Math.abs( random.nextInt() % bound );  }}

This app is a MIDlet that simulates a simple game loop:

  • work
  • draw
  • poll for user input
  • repeat

For fast games, this loop has to be as tight and fast as possible. Our loop continues for a finite number of iterations (LOOP_COUNT = 100) and uses the System timer to calculate how long in milliseconds the whole exercise took, so we can measure and improve its performance. The time and the results of the work are displayed on a simple Form. Use the Start command to begin the test. Pressing any key will exit the loop prematurely. The Exit command exits the application.

In most games, the work phase of the main game loop involves updating the state of the game world - moving all the actors, testing for and reacting to collisions, updating scores, etc. In this example, we're not doing anything particularly useful. The code simply runs through an array of numbers and performs a few arithmetic operations on each, aggregating the results in a running total.

The run() method also calculates the amount of time it takes to execute each iteration of the loop. Every frame, the OCanvas.paint() method displays this value in milliseconds at 16 random locations on screen. Normally you would be drawing the graphical elements of your game in this method, but our code offers a reasonable facsimile of this process.

Regardless of how pointless this code may seem, it will allow us ample opportunity to improve its performance.

Where to Optimize - the 90/10 rule

In performance-hungry games, 90 percent of a program's execution time is spent running 10 percent of the code. It is in this 10 percent of the code that we should concentrate all of our optimization efforts. We locate that 10 percent using a profiler. To turn on the Profiler Utility in the J2ME Wireless Toolkit, select the Preferences item from the Edit menu. This will bring up the Preferences window. Select the Monitoring tab, check the box marked "Enable Profiling", and click the OK button. Nothing will happen. That's okay - we need to run our program in the emulator and then exit before the Profiler window appears. Do that now.

Figure 1 illustrates how to turn on the Profiler Utility.



Click here for larger image

My emulator (running under Windows XP on a 2.4GHz Intel P4) reports that 100 iterations of the loop took 6,407ms, or just under six and a half seconds. The app reported 62 or 63ms per frame. On the hardware ( a Motorola i85s ) it ran much slower. Time per frame was around 500ms and the whole thing ran in 52460ms. We'll try to improve these figures in the course of this article.

When you exit the application the profiler window will appear and you will see something that resembles a folder browser, with the familiar tree widget displayed in a panel on the left. Method relationships are shown in this hierarchical list. Each folder is a method and opening a methods folder displays the methods called by it. Selecting a method in the tree shows the profiling information for that method and all the methods called by it in the panel on the right. Notice that a percentage is displayed next to each element. This is the percentage of total execution time spent in that particular method. We must navigate through this tree to find where all the time is going, and optimize the methods with the highest percentages, where possible.

Figure 2 - the Profiler Utility call graph.



Click here for larger image

A couple of notes about the profiler. First, your percentages will almost certainly vary from mine, but they will be similarly proportioned - always follow the biggest numbers. My numbers varied every time I ran the app. To keep things as uniform as possible, you might want to close any background applications, like email clients, and keep activity to a minimum while you're running tests. Also, don't obfuscate your code before using the profiler or all your methods will be mysteriously named "b" or "a" or "ff". Finally, the profiler doesn't shed any light on the performance of whatever device you are emulating. The hardware itself is a completely different animal.

Opening the folder with the highest percentages we see that 66.8% of the execution time went to a method named "com.sun.kvem.midp.lcdui.EmulEventHandler$EventLoop.run", which doesn't really help us. Digging down further unfolds a level or two of methods with similarly obscure names. Keep going and you'll trace the fat percentages to serviceRepaints() and finally to our OCanvas.paint() method. Another 30% of the time went into our OCanvas.run() method. It should come as no surprise that both of these methods live inside of our main game loop. We won't be spending any time trying to optimize code in our MIDlet class, just as you shouldn't bother optimizing code in your game that's outside the main game loop. Only optimize where it counts.

The way that these percentages are divided in our example app is not entirely uncharacteristic of how they would be in a real game. You will most likely find that the vast proportion of execution time in a real videogame is spent in the paint() method. Graphics routines take a very long time when compared to non-graphical routines. Unfortunately, our graphics routines are already written for us somewhere below the surface of the J2ME API, and there's not much we can do to improve their performance. What we can do is make smart decisions about which ones we use and how we use them.

High-Level vs. Low-Level optimization

Later in this article we will look at low-level code optimization techniques. You will see they are quite easy to plug into existing code and tend to degrade readability as much as they improve performance. Before you employ those techniques, it is always best to work on the design of your code and its algorithms. This is high-level optimization.

Michael Abrash, one of the developers of id software's "Quake", once wrote, "the best optimizer is between your ears". There's more than one way to do it (TMTOWTDI) and if you take the extra time to think about doing things the right way first, then you will reap the greatest reward. Using the right (i.e. fastest) algorithm will increase performance much more than using low-level techniques to improve a mediocre algorithm. You might shave a few more percentage points off using the low-level stuff, but always start at the top and use your brain (you'll find it between your ears).

So let's look at what we're doing in our paint() method. We're calling Graphics.drawString() 16 times every iteration to paint our message "n ms per frame" onscreen. We don't know anything about the inner workings of drawString, but we can see that it's using up a lot of time, so let's try another approach. Let's draw the String directly onto an Image object once and then draw the Image 16 times.

  public void paint(Graphics g) {    g.setColor( COLOR_BG );    g.fillRect( 0, 0, getWidth(), getHeight() );    Font font = Font.getFont( Font.FACE_PROPORTIONAL,                               Font.STYLE_BOLD | Font.STYLE_ITALIC,                              Font.SIZE_SMALL );    String msMessage = frameTime + "ms per frame";    Image stringImage =          Image.createImage( font.stringWidth( msMessage ),                             font.getBaselinePosition() );    Graphics imageGraphics = stringImage.getGraphics();    imageGraphics.setColor( COLOR_BG );    imageGraphics.fillRect( 0, 0, stringImage.getWidth(),                             stringImage.getHeight() );    imageGraphics.setColor( COLOR_FG );    imageGraphics.setFont( font );    imageGraphics.drawString( msMessage, 0, 0,                               Graphics.TOP | Graphics.LEFT );    for ( int i  = 0 ; i < DRAW_COUNT ; i ++ ) {      g.drawImage( stringImage, getRandom( getWidth() ),                    getRandom( getHeight() ),                    Graphics.VCENTER | Graphics.HCENTER );    }  }

When we run this version of our software, we see that the percentage of time spent in our paint method is a little less. Looking deeper we can see that the drawString method is only being called 101 times, and it is now the drawImage method that sees most of the action, being called 1616 times. Even though we are doing more work, the app runs a little quicker because the graphics calls we are using are faster.

You will probably notice that drawing the string to an image affects the display because J2ME does not support image transparency, so a lot of the background is overwritten. This is a good example of how optimization might cause you to re-evaluate application requirements. If you really needed the text to overlap, you might be forced to deal with the slower execution time.

This code might be slightly better, but it still has a lot of room for improvement. Let's look at our first low-level optimization technique.

Out of the loop?

Code inside a for() loop will be executed as many times as the loop iterates. To improve performance, therefore, we want to leave as much code as possible outside of our loops. We can see from the profiler that our paint() method is being called 101 times, and the loop inside iterates 16 times. What can we leave out of those two loops? Let's start with all those declarations. We are declaring a Font, a String, an Image and a Graphics object every time paint() is called. We'll move these outside of the method and to the top of our class.

public static final Font font =   Font.getFont( Font.FACE_PROPORTIONAL,                 Font.STYLE_BOLD | Font.STYLE_ITALIC,                 Font.SIZE_SMALL);public static final int graphicAnchor =                    Graphics.VCENTER | Graphics.HCENTER;public static final int textAnchor =       Graphics.TOP | Graphics.LEFT;private static final String MESSAGE = " ms per frame";private String msMessage = "000" + MESSAGE;private Image stringImage;private Graphics imageGraphics;private long oldFrameTime;

You'll notice I made the Font object a public constant. This is often a useful thing to do in your apps, as you can gather the font declarations you use most together in one place. I have found the same goes for anchors, so I've done the same with the text and graphic anchors. Pre-calculating these things keeps those calculations, however insignificant, out of our loop.

I've also made the MESSAGE a constant. That's because Java loves to create String objects all over the place. Strings can be a huge memory drain if they're not controlled. Don't take them for granted, or you will probably bleed memory, which can in turn affect performance, especially if the garbage collector is being called too often. Strings create garbage, and garbage is bad. Using a String constant reduces the problem. Later we'll see how to use a StringBuffer to completely stop memory loss from String abuse.

Now that we've made these things instance variables, we need to add this code to the constructor:

stringImage = Image.createImage( font.stringWidth( msMessage ),                                 font.getBaselinePosition() );imageGraphics = stringImage.getGraphics();imageGraphics.setFont( font );

Another cool thing about getting our Graphics object up-front is that we can set the font once and then forget about it instead of setting it every time we iterate through the loop. We still have to wipe the Image object each time using fillRect(). Gung-ho coders might see an opportunity there to create two Graphics objects from the same Image, and pre-set the color of one to COLOR_BG for the call to fillRect() and COLOR_FG on the other for the calls to drawString(). Unfortunately, the behavior of getGraphics() when called multiple times on the same Image is ill-defined in J2ME and differs across platforms so your optimization tweak might work on Motorola but not NOKIA. If in doubt, assume nothing.

There is another way to improve on our paint() method. Using our brain again we realize that we only need to re-draw the string if the frameTime value has changed since the method was last called. That's where our new variable oldFrameTime comes in. Here's the new method:

public void paint(Graphics g) {  g.setColor( COLOR_BG );  g.fillRect( 0, 0, getWidth(), getHeight() );  if ( frameTime != oldFrameTime ) {    msMessage = frameTime + MESSAGE;    imageGraphics.setColor( COLOR_BG );    imageGraphics.fillRect( 0, 0, stringImage.getWidth(),                             stringImage.getHeight() );    imageGraphics.setColor( COLOR_FG );    imageGraphics.drawString( msMessage, 0, 0, textAnchor );  }  for ( int i  = 0 ; i < DRAW_COUNT ; i ++ ) {    g.drawImage( stringImage, getRandom( getWidth() ),                  getRandom( getHeight() ), graphicAnchor );  }  oldFrameTime = frameTime;}

The Profiler now shows that the time spent in OCanvas paint is down to 42.01% of the total. Comparing the frameTime across calls to paint() resulted in drawString() and fillRect() being called 69 times instead of 101. That's a decent savings, and there's not much more to go, but now it's time to get serious. The more you optimize, the harder it gets. Now we are down to scraping out the last pieces of superfluous, cycle-eating code. We're dealing with shaving off very small percentages, or even fractions of percentages now, but if we're lucky, they'll add up to something significant.

Let's start with something easy. Instead of calling getHeight() and getWidth(), let's call those methods once and cache the results outside our loop. Next, we're going to stop using Strings and do everything manually with a StringBuffer. We're then going to shave a little off the calls to drawImage() by restricting the drawing area through calls to Graphics.setClip(). Finally, we're going to avoid making calls to java.util.Random.nextInt() inside our loop.

Here are our new variables...

  private static final String MESSAGE = "ms per frame:";  private int iw, ih, dw, dh;  private StringBuffer stringBuffer;  private int messageLength;  private int stringLength;  private char[] stringChars;  private static final int RANDOMCOUNT = 256;  private int[] randomNumbersX = new int[RANDOMCOUNT];  private int[] randomNumbersY = new int[RANDOMCOUNT];  private int ri;

...and here is the new code for our constructor:

iw = stringImage.getWidth();ih = stringImage.getHeight();dw = getWidth();dh = getHeight();for ( int i = 0 ; i < RANDOMCOUNT ; i++ ) {  randomNumbersX[i] = getRandom( dw );  randomNumbersY[i] = getRandom( dh );}ri = 0;stringBuffer = new StringBuffer( MESSAGE+"000" );messageLength = MESSAGE.length();stringLength = stringBuffer.length();stringChars = new char[stringLength];stringBuffer.getChars( 0, stringLength, stringChars, 0 );

You can see we're pre-calculating Display and Image dimensions. We're also cacheing the results of 512 calls to getRandom(), and we've dispensed with the msMessage String in favor of a StringBuffer. The meat is still in the paint() method, of course:

  public void paint(Graphics g) {    g.setColor( COLOR_BG );    g.fillRect( 0, 0, dw, dh );    if ( frameTime != oldFrameTime ) {      stringBuffer.delete( messageLength, stringLength );      stringBuffer.append( (int)frameTime );      stringLength = stringBuffer.length();      stringBuffer.getChars( messageLength,                              stringLength,                              stringChars,                              messageLength );      iw = font.charsWidth( stringChars, 0, stringLength );      imageGraphics.setColor( COLOR_BG );      imageGraphics.fillRect( 0, 0, iw, ih );      imageGraphics.setColor( COLOR_FG );      imageGraphics.drawChars( stringChars, 0,                                stringLength, 0, 0, textAnchor );    }    for ( int i  = 0 ; i < DRAW_COUNT ; i ++ ) {      g.setClip( randomNumbersX[ri], randomNumbersY[ri], iw, ih );      g.drawImage( stringImage, randomNumbersX[ri],                    randomNumbersY[ri], textAnchor );      ri = (ri+1) % RANDOMCOUNT;    }    oldFrameTime = frameTime;  }

We're using a StringBuffer now to draw the characters of our message. It's easier to append characters to the end of a StringBuffer than to insert them at the beginning, so I've switched our display text around and the frameTime is now at the end of the message, e.g. "ms per frame:120". We're just writing over the last few frameTime characters each time, and leaving the message part intact. Using a StringBuffer explicitly like this saves the system from creating and destroying Strings and StringBuffers each time through our paint() method. It's extra work, but it's worth it. Note that I'm casting frameTime to an int. I found that using append(long) caused a memory leak. I don't know why, but it's a good example of why you should use the utilities to keep an eye on things.

We're also using font.charsWidth() to calculate the width of the message image so that we have to do the minimum of drawing. We've been using a proportional font, so the image for "ms per frame:1" will be smaller than the image for "ms per frame:888", and we're using Graphics.setClip() so we don't have to draw the extra. This also means we only have to fill a rectangle big enough to blank out the area we need. We're hoping that the drawing time we save makes up for the extra time spent calling font.charsWidth().

It may not make much of a difference here, but this is a great technique to use in a game for drawing a player's score on the screen. In that case, there's a big difference between drawing a score of 0 and a score of 150,000,000. This is hampered somewhat by the implementation's incorrect return values for font.getBaselinePosition(), which seems to return the same value as font.getHeight(). Sigh.

Finally, we're just looking up the pre-calculated "random" co-ordinates in our two arrays, which saves us making those calls. Note the use of the modulo operator to implement a circular array. Note also that we're using the textAnchor for drawing both the image and the string now so that setClip() works correctly.

We are now firmly in a gray area with respect to the numbers this version of the code produces. The profiler tells me that the code is spending roughly 7% more time in paint() than without these changes. The call to font.charsWidth() is probably to blame, weighing in at 4.6%. ( That's not great, but it could be reduced. Notice that we're retrieving the width of the MESSAGE string every time. We could easily calculate that ahead of the loop body and simply add it to the frameTime width. ) Also, the new call to setClip() is labelled 0.85%, and seems to significantly increase the percentage of time spent in drawImage ( to 33.94% from 27.58% ).

At this point, it looks like all this extra code must certainly be slowing things down, but the values generated by the application contradict with this assumption. Figures on the emulator fluctuate so much as to be inconclusive without running longer tests but my i85s reports that things are a little faster with the extra clipping code than without, coming in at 37130ms without either call to setClip() or charsWidth(), and coming in at 36540 with both. I ran this test as many times as I had patience for, and the results were solid. This highlights the issues of varying execution environments. Once you get to the point where you're not sure if you're making headway, you might be forced to continue all your testing on the hardware, which would require a lot of installing and uninstalling of JAR files.

So it looks like we've squeezed a lot more performance from our graphical routines. Now it's time to take the same high-level and low-level approaches with our work() method. Let's review that method:

  public synchronized int work( int[] n ) {    r = 0;    for ( int j = 0 ; j < DIVISOR_COUNT ; j++ ) {      for ( int i = 0 ; i < n.length ; i++ ) {        divisor = getDivisor(j);        r += workMore( n, i, divisor );      }    }    return r;  }

Every time through the loop in run() we're passing in our array of numbers. The outer loop in the work() method calculates our divisor, then calls workMore() to actually perform the division. All kinds of things are wrong here, as you can probably tell. For a start, the programmer has put the call to getDivisor() inside the inner loop. Given that the value of j does not change through the inner loop, the divisor is an invariant, and really belongs outside the inner loop.

But let's think about this some more. The call itself is completely unnecessary. This code does the same thing...

  public synchronized int work( int[] n ) {    r = 0;    divisor = 1;    for ( int j = 0 ; j < DIVISOR_COUNT ; j++ ) {      for ( int i = 0 ; i < n.length ; i++ ) {        r += workMore( n, i, divisor );      }      divisor *= 2;    }    return r;  }

...without that call to getDivisor(). Now our profiler is telling me that we are spending 23.72% in our run() method, versus 38.78% before we made these improvements. Always optimize with your head first before messing with low-level optimization tricks. With that said, let's take a look at some of those tricks.

Low-Level Optimization

All programmers are familiar with the concepts of the sub-routine and the function - separate out common code to avoid duplicating it in several places around your app. Unfortunately, this commonly "Good" programming practice can impact performance because method calls involve a certain amount of overhead. The easiest way to reduce the amount of time it takes to call a method is by carefully selecting its declaration modifiers. Our programmer has been very cautious, and has made his work() and workMore() methods synchronized, in case some other thread might call them simultaneously. This is all well and good, but if we're serious about performance, we often have to make sacrifices, and today we will be sacrificing safety.

Well, not really. We know for a fact that no-one else is going to be calling these methods, so we can de-synchronize them without too much worry. What else can we do? Take a look at this list of method types:

  • synchronized methods are the slowest, since an object lock has to be obtained
  • interface methods are the next slowest
  • instance methods are in the middle
  • final methods are faster
  • static methods are fastest

So we definitely shouldn't be making anything synchronized, and it looks like we can even mark work() and workMore() as final static methods. Doing this cut 1% from the time the emulator spent inside run().

Another factor that affects the performance of method calls is the number of parameters passed into the method. We are calling workMore() 51712 times, passing an int array and two ints into the method and returning an int each time. In this, somewhat trivial, example it's going to be easy to collapse the workMore() method into the the body of work() to avoid the call entirely. In the real world, that decision might be harder to make, especially if it means that code will be duplicated around your application. Test with the profiler and on the device to see how much difference it actually makes before taking that step. If you can't remove the method altogether, try to reduce the number of parameters you're passing into it. The more parameters, the greater the overhead.

  public final static int work( int[] n ) {    divisor = 1;    r = 0;    for ( int j = 0 ; j < DIVISOR_COUNT ; j++ ) {      for ( int i = 0 ; i < n.length ; i++ ) {        r += n[i] * n[i] / divisor + n[i];      }      divisor *= 2;    }    return r;  }

Wow! Removing the call to workMore() slashed the amount of time spent in run to 9.96%. From here on it will be uphill all the way. Let's look at two general optimization techniques - strength reduction and loop unrolling.

Strength reduction is where you replace a relatively slow operation with a fast one that does exactly the same thing. The most common example is using the bitshift operators, which are equivalent to multiplication and division by a power of two. For example, x >> 2 is equivalent to x / 4 ( 2 to the power of 2 ) and x << 10 is equivalent to x * 1024 ( 2 to the power of 10 ). By an amazing coincidence, our divisor is always a power of 2 (isn't that lucky!) so we can use bitshifting in place of division.

Unrolling loops reduces the overhead of flow control code by performing more than one operation each time through the loop, executing fewer iterations, or even removing the loop entirely. As our DIVISOR_COUNT is only 8, it's going to be easy to unroll our outer loop:

public final static int work( int[] n ) {  r = 0;  for ( int i = 0 ; i < n.length ; i++ ) { r +=  n[i] * n[i]  + n[i]; }  for ( int i = 0 ; i < n.length ; i++ ) { r +=  (n[i] * n[i] >> 1) + n[i]; }  for ( int i = 0 ; i < n.length ; i++ ) { r +=  (n[i] * n[i] >> 2) + n[i]; }  for ( int i = 0 ; i < n.length ; i++ ) { r +=  (n[i] * n[i] >> 3) + n[i]; }  for ( int i = 0 ; i < n.length ; i++ ) { r +=  (n[i] * n[i] >> 4) + n[i]; }  for ( int i = 0 ; i < n.length ; i++ ) { r +=  (n[i] * n[i] >> 5) + n[i]; }  for ( int i = 0 ; i < n.length ; i++ ) { r +=  (n[i] * n[i] >> 6) + n[i]; }  for ( int i = 0 ; i < n.length ; i++ ) { r +=  (n[i] * n[i] >> 7) + n[i]; }  return r;}

Two important points. First, you'll notice that unrolling our loop requires us to duplicate some code. This is the last thing you want in J2ME, where developers are often fighting JAR bloat, but remember that the JARing process involves compression, and compression works best on repetitive data, so the above code might not make as big an impact on your jar size as you might think. Again, it's all about the trade-off. Your code can be small, fast, easy to read. Pick any two. The second point is that the bitshift operators have a different precedence from / and *, so you will often have to put parentheses around statements that otherwise would not need them.

Unrolling the loop and using the bitshift operators shaved off slightly more than 1%. Not bad. Now let's turn our attention to array access. Arrays are fast data structures in C, but for that reason they can also be dangerous - if your code starts accessing array elements beyond the end of the array, you're over-writing sections of memory you shouldn't be messing with and the consequences will most likely be dire.

In contrast, Java is a very safe language - running off the end of an array like that will simply throw an ArrayIndexOutOfBoundsException. The system checks for an invalid index value every time you access an array, which makes array access slower than in C. Again, there's nothing we can do about the internals of array handling in Java, but we can work around it by making smart decisions. In the code above, for example, we're accessing n[i] 24 times. We can omit a lot of those array accesses by storing the value of n[i] in a variable. A little high-level thought also reveals that we can rearrange things in a much smarter way like this...

  private static int divisor;  private static int r;  private static int ni;  public final static int work( int[] n ) {    r = 0;    for ( int i = 0 ; i < n.length ; i++ )  {      ni = n[i];       r +=  ni * ni + ni;      r +=  (ni * ni >> 1) + ni;      r +=  (ni * ni >> 2) + ni;      r +=  (ni * ni >> 3) + ni;      r +=  (ni * ni >> 4) + ni;      r +=  (ni * ni >> 5) + ni;      r +=  (ni * ni >> 6) + ni;      r +=  (ni * ni >> 7) + ni;    }    return r;  }

We've eliminated a whole lot of looping and array accessing. Let's now eliminate a whole lot of squaring using a technique known as common sub-expression elimination. We're calculating the square of n[i] eight times each time through our loop. We can eliminate those repetitive calculations by working out the square once at the beginning:

  private static int r;  private static int ni;  private static int nis;  public final static int work( int[] n ) {    r = 0;    for ( int i = 0 ; i < n.length ; i++ )  {      ni = n[i];       nis = ni * ni;      r +=  nis  + ni;      r +=  (nis >> 1) + ni;      r +=  (nis >> 2) + ni;      r +=  (nis >> 3) + ni;      r +=  (nis >> 4) + ni;      r +=  (nis >> 5) + ni;      r +=  (nis >> 6) + ni;      r +=  (nis >> 7) + ni;    }    return r;  }

...which cuts down the time spent in run() to 6.18%. Not bad at all. Before we move on, let me say a couple more things about arrays. A little high-level optimization ( i.e. "thought" ) might reveal that an array isn't necessarily the right data structure to use. Think about using a linked list or some other structure if that would increase performance. Secondly, if you are going to use an array and you ever need to copy its contents into another array, always use System.arraycopy(). It's way faster than trying to write your own routine to do the same thing. Finally, arrays are generally higher performance than the java.util.Vector object. If you require the kind of functionality that a Vector offers, think about writing it yourself and run tests to make sure your code is faster.

Okay, we're really running out of things to optimize. We just introduced another couple of variables to cache the array element and the value of that element squared. You may have been wondering why those variables have been declared outside the body of the method declaration. They're outside because even declaring ints takes a little time and we should keep that out of our loop, right? Wrong!

Assume nothing. It's true that we're saving time by avoiding the int declarations, but this code may actually be slower than if those variables were declared locally, inside the method. That's because local variables perform better, as the JVM takes longer to resolve variables declared outside a method. So let's turn them into local variables.

Finally, we can tweak our for() loop slightly. Computers are generally better at comparing a number to zero than to any other non-zero number. That means we can turn our loop upside down and rewrite the method like this so we're comparing against zero:

  public final static int work( int[] n ) {    int r = 0;    int ni;    int nis;    int i;    for ( i = n.length ; --i >= 0 ; ) {      ni = n[i];       nis = ni * ni;      r +=  nis  + ni;      r +=  (nis >> 1) + ni;      r +=  (nis >> 2) + ni;      r +=  (nis >> 3) + ni;      r +=  (nis >> 4) + ni;      r +=  (nis >> 5) + ni;      r +=  (nis >> 6) + ni;      r +=  (nis >> 7) + ni;    }    return r;  }

And that's it! This code may be a little faster, but the profiler results are less conclusive. What is clear is that this method is tougher to read. There may be further room for improvement, but let's instead take another look at our paint() method, to see if any of what we learned can be introduced there.

Remember what we learned about local variables? If you are forced to use an instance variable and you are referencing that variable multiple times inside a method, it might be worth introducing a local variable so that the JVM only has to resolve the reference once. You'll introduce a declaration and an assignment which will slow things down, but as a rule of thumb, we'll use that technique if an instance variable is referenced more than twice.

We can also use strength reduction in our paint() method to replace the modulo operator with a circular counter that uses a bit operator. This is only possible because the length of our random numbers cache array is (amazingly!) a power of two. Finally, we can juggle our comparisons to always compare with zero. Here's our new and improved paint() method:

  public void paint(Graphics g) {    StringBuffer sb = stringBuffer;    Graphics ig = imageGraphics;    char[] sc = stringChars;    int sl;    int ml = messageLength;    int ril = ri;    int iw = 0;    g.setColor( COLOR_BG );    g.fillRect( 0, 0, dw, dh );    if ( frameTime - oldFrameTime != 0  ) {      sb.delete( ml, stringLength );      sb.append( (int)frameTime );      sl = stringLength = sb.length();      sb.getChars( ml, sl, sc, ml );      iw = font.charsWidth( sc, 0, sl );      ig.setColor( COLOR_BG );      ig.fillRect( 0, 0, iw, ih );      ig.setColor( COLOR_FG );      ig.drawChars( sc, 0, sl, 0, 0, textAnchor );    }    for ( int i  = DRAW_COUNT ; --i >=0  ; ) {      g.setClip( randomNumbersX[ril], randomNumbersY[ril], iw, ih );      g.drawImage( stringImage, randomNumbersX[ril],                    randomNumbersY[ril], textAnchor );      ril++;      ril &= 255;    }    ri = ril;    oldFrameTime = frameTime;  }

Again, we are at the end of the road with respect to profiler results. These changes don't affect the number of times the graphics routines are called, so the difference will be minor at best. But when we combine all the changes we made to our work() method and load the new JAR onto the device, the difference is huge. My motorola i85s now completes the test in 14030ms - over twice as fast!

There is one last change to make to the code. I have left it to the end because it uses a method that is not particularly well documented and my experience has been that its behavior varies across implementations. Looking at the start() and run() methods on OCanvas, you can see that I've been using a separate animation thread. This is the traditional way of doing animation in Java. An issue with using this technique for games is that we are forced to pause every time we iterate through the loop to allow system events, such as key presses and command actions to be delivered. We call the wait() method in a synchronized block to make this happen. This is hardly optimized code. After all our hard work optimizing everything else, we're actually stopping to do nothing right in the thick of things. What's worse, it's not easy to arrive at a good figure for WAIT_TIME. If we wait() too long, the game slows down. If we don't wait() for long enough, key presses are missed and the game stops being responsive to user input.

J2ME provides a solution to this problem by using the Display.callSerially() method. The API states that callSerially( Runnable r ) "causes the Runnable object r to have its run() method called later, serialized with the event stream, soon after completion of the repaint cycle". By using callSerially() we can miss out the call to wait() entirely. The system will ensure that our work() and paint() methods are called in synch with the user input routines, so our game will remain responsive. Here are the new methods...

  public void start() {    started = frameStarted = System.currentTimeMillis();    loopCounter = result = 0;    finished = false;    exitStatus = EXIT_DONE;    run();  }  public void run() {    frameTime = System.currentTimeMillis() - frameStarted;    frameStarted = System.currentTimeMillis();    if ( midlet.running && !finished ) {      result += work( numbers );      repaint();      display.callSerially(this);      loopCounter++;      finished = ( loopCounter > LOOP_COUNT );    }    else {      elapsed = System.currentTimeMillis() - started;      midlet.exitCanvas( exitStatus );    }  }

...plus we have to declare and get a handle on the Display:

    Display display = Display.getDisplay( midlet );

Without the call to wait() my i85s now runs this code in 10180ms - around a 40% savings. You might expect a greater increase in performance, given that we just eliminated 100 calls that wait() 50ms, but remember this technique is also about responsiveness to user input.

Again, let me stress that this method of animation should be used with caution. I had trouble getting it to work on NOKIA, and all of their example code (even game code) uses the wait() technique instead. Even on Motorola I had problems using callSerially() if I added Command objects to the animated Canvas. Test carefully before you try this at home.

Other Techniques

One technique I was unable to include in my example code was the optimal use of a switch() statement. Switches are very commonly used to implement Finite State Machines, which are used in game Artificial Intelligence code to control the behavior of non-player actors. When you use a switch, it is good programming practice to write code like this:

  public static final int STATE_RUNNING = 1000;  public static final int STATE_JUMPING = 2000;  public static final int STATE_SHOOTING = 3000;  switch ( n ) {    case STATE_RUNNING:      doRun();    case STATE_JUMPING:      doJump();    case STATE_SHOOTING:      doShoot();  }

There's nothing wrong with this, and the int constants are nice and far apart, in case we might want to stick another constant in between RUNNING and JUMPING, like STATE_DUCKING = 2500. But apparently switch statements can be compiled into one of two byte codes, and the faster of the two is used if the ints used are close together, so this would be better:

  public static final int STATE_RUNNING = 1;  public static final int STATE_JUMPING = 2;  public static final int STATE_SHOOTING = 3;

There are also some optimizations you can perform when using a Fixed Point math library. First, if you're doing a lot of division by a single number, you should instead work out the inverse of that number and perform a multiplication. Multiplication is slightly quicker than division. So instead of...

  int fpP = FP.Div( fpX, fpD );  int fpQ = FP.Div( fpY, fpD );  int fpR = FP.Div( fpZ, fpD );

...you should rewrite it like this:

  int fpID = FP.Div( 1, fpD );  int fpP = FP.Mul( fpX, fpID );  int fpQ = FP.Mul( fpY, fpID );  int fpR = FP.Mul( fpZ, fpID );

If you're performing hundreds of divisions every frame, this will help. Secondly, don't take your FP math library for granted. If you have source for it, open it up and take a look at what's going on in there. Make sure all the methods are declared final static and look for other opportunities to improve the code. For example, you may find that the multiplication method has to cast both ints to longs and then back to an int:

public static final int Mul (int x, int y) {  long z = (long) x * (long) y;  return ((int) (z >> 16));}

Those casts take time. Collision detection using bounding circles or spheres involves adding the squares of ints together. That can generate some big numbers that might overflow the upper bound of your int Fixed Point data type. To avoid this, you could write your own square function that returns a long:

    public static final long Sqr (int x) {      long z = (long) x;      z *= z;      return (z >> 16);    }

This optimized method avoids a couple of casts. If you're doing a great deal of Fixed Point math, you might consider replacing all of the library calls in the main game loop with the long-hand math. That will save a lot of method calls and parameter passing. You may also find that when the math is written out manually you can reduce the number of casts that are required. This is especially true if you are nesting several calls to your library, e.g.

  int fpA = FP.Mul( FP.toInt(5),                     FP.Mul( FP.Div( 1 / fpB ),                    FP.Mul( FP.Div( fpC, fpD ),                    FP.toInt( 13 ) ) ) );

Take the time to unravel nested calls like this and see if you can reduce the amount of casting. Another way to avoid casting to longs is if you know that the numbers involved are small enough that they definitely won't cause an overflow.

To help with high-level optimization, you should look for articles on game programming. A lot of the problems presented by game programming such as fast 3D geometry and collision detection have already been solved very elegantly and efficiently. If you can't find Java source, you will almost certainly find C source or pseudo-code to convert. Bounds checking, for example, is a common technique that we could have used inside our paint() method. Instead of clearing the entire screen every time, we really only need to clear the section of the screen that changes from frame to frame. Because graphics routines are relatively slow you will find that the extra housekeeping required to keep track of which parts of the screen need to be cleared is well worth the effort.

Some phone manufacturers offer proprietary APIs that help programmers get around some of the limitations J2ME presents, such as lack of sound, lack of Image transparency, etc. Motorola, for example, offers a floating point math library that uses floating point math instructions on the chip. This library is much faster than the fastest Fixed Point math library, and a lot more accurate. Using these libraries completely destroys the portability of your code, of course, but they may be an option to consider if deployment on many different handsets is not a concern.

Conclusions

  • Only optimize code if you need to
  • Only optimize where it counts
  • Use the profiler to see where to optimize
  • The profiler won't help you on the device, so use the System timer on the hardware
  • Always study your code and try to improve the algorithms before using low-level techniques
  • Drawing is slow, so use the Graphics calls as sparingly as possible
  • Use setClip() where possible to minimize the drawing area
  • Keep as much stuff as possible out of loops
  • Pre-calculate and cache like crazy
  • Strings create garbage and garbage is bad so use StringBuffers instead
  • Assume nothing
  • Use static final methods where possible and avoid the synchronized modifier
  • Pass as few parameters as possible into frequently-called methods
  • Where possible, remove method calls altogether
  • Unroll loops
  • Use bit shift operators instead of division or multiplication by a power of two
  • You can use bit operators to implement circular loops instead of modulo
  • Try to compare to zero instead of any other number
  • Array access is slower than C, so cache array elements
  • Eliminate common sub-expressions
  • Local variables are faster than instance variables
  • Don't wait() if you can callSerially()
  • Use small, close constants in switch() statements
  • Look inside your Fixed Point math library and optimize it
  • Unravel nested FP calls to reduce casting
  • Division is slower than multiplication, so multiply by the inverse instead of dividing
  • Use tried and tested algorithms
  • Use proprietary high-performance APIs with care to preserve portability

Where to next?

Optimization is a black art. At the heart of any computer lies the CPU and at the heart of Java lies a virtual CPU, the JVM. To squeeze the last ounce of performance from the JVM, you need to know a lot about how it functions beneath the hood. Specifically, you need to know what things the JVM can do fast, and what it does slowly. Look for sites with solid information on the inner workings of Java. You don't necessarily have to learn how to program in byte code, but the more you know, the easier it will be to come up with new ways to optimize your applications for performance.

There's no substitute for experience. In time you will discover your own secrets about the performance characteristics of J2ME and of the handsets you are developing for. Even if you can't code around certain idiosynchrasies, you could design your next game around them. While developing my game I found that calling drawImage() five times to draw five images of 25 pixels each is much slower than calling it once to draw an image five times the size. That knowledge will definitely help shape my next game.

Good luck, and have fun.

Resources:

  1. J2ME's official web site contains the latest on what's happening on this front.
  2. Like wireless games? Read the Wireless Gaming Review.
  3. Discuss J2ME Game Development at j2me.org
  4. A great site on many aspects of Java Optimization
  5. Another great site on Optimization
  6. Many articles on J2ME performance tuning
  7. The amazing Graphics Programming Black Book by Michael Abrash
  8. The Art of Computer Game Design by Chris Crawford

About the Author:

Mike Shivas has been playing video games since before the advent of the 8-bit home microcomputer. He has been programming in Java since 1996, has consulted for MasterCard on wireless solutions and is the published author of several J2ME video games. Readers may contact Mike at mshivas@hotmail.com.

# # #

Sitemap | Contact Us

Thanks for your registration, follow us on our social networks to keep up-to-date