More Game Coding Tidbits and Style That Saved My Butt

  • April 1, 2005
  • By Mike McShaffry

Optimizing Memory Access

Every access to system RAM uses a CPU cache. If the desired memory location is already in the cache, the contents of the memory location are presented to the CPU extremely quickly. If, on the other hand, the memory is not in the cache, a new block of system RAM must be fetched into the cache. This takes a lot longer than you'd think.

A good testbed for this problem uses multidimensional arrays. C++ defines its arrays in row-major order, which puts elements addressed by the rightmost index next to each other in memory.

TestData [0][0][0] and TestData [0][0][1] are stored in adjacent memory locations.
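
Here's a quick sanity check of that claim; this snippet is not from the original article, and the tiny array size is just for illustration:

#include <cassert>

float TestData[2][2][2];   // tiny stand-in for the real test array

int main()
{
   // Row-major order: incrementing the rightmost index advances the
   // address by exactly one float.
   assert(&TestData[0][0][1] == &TestData[0][0][0] + 1);
   return 0;
}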

Gotcha!

Not every language defines arrays in row-major order. Some versions of Pascal define arrays in column-major order. Don't make assumptions unless you like writing slow code.

If you access an array in the wrong order, you create a worst-case CPU cache scenario. Here's an example of two functions that access the same array and perform the same task; one runs much faster than the other:

const int g_n = 250;
float TestData[g_n][g_n][g_n];

// Worst case: the innermost loop walks the leftmost index, so each
// write lands g_n * g_n floats away from the previous one.
inline void column_ordered()
{
   for (int k=0; k<g_n; k++)
      for (int j=0; j<g_n; j++)
         for (int i=0; i<g_n; i++)
            TestData[i][j][k] = 0.0f;
}

// Best case: the innermost loop walks the rightmost index, so writes
// hit consecutive memory locations and stay within cached lines.
inline void row_ordered()
{
   for (int i=0; i<g_n; i++)
      for (int j=0; j<g_n; j++)
         for (int k=0; k<g_n; k++)
            TestData[i][j][k] = 0.0f;
}

The timed output of running both functions on my test machine showed that accessing the array in row order was more than nine times faster:

Column Ordered=2817 ms    Row Ordered=298 ms    Delta=2519 ms
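
The article doesn't show its timing harness; a minimal sketch along these lines (using std::chrono, which postdates the original code) produces the same style of output:

#include <chrono>
#include <cstdio>

// Hypothetical timing helper; returns elapsed wall-clock milliseconds.
template <typename Func>
long long time_ms(Func f)
{
   auto start = std::chrono::steady_clock::now();
   f();
   auto stop = std::chrono::steady_clock::now();
   return std::chrono::duration_cast<std::chrono::milliseconds>
             (stop - start).count();
}

int main()
{
   long long column = time_ms(column_ordered);
   long long row = time_ms(row_ordered);
   printf("Column Ordered=%lld ms    Row Ordered=%lld ms    Delta=%lld ms\n",
          column, row, column - row);
   return 0;
}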

Any code that accesses a large data structure can benefit from this technique. If you have a multistep process that affects a large data set, arrange your code to perform as much work as possible within smaller blocks of memory, as sketched below. You'll optimize the use of the L2 cache and end up with much faster code.
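
For instance, here's a sketch of that idea (not from the original article): two passes over a large array performed block-by-block, with the block size an assumption you'd tune to your target's cache:

// Process a large array in cache-sized blocks so the second pass finds
// each block still resident in the cache. BLOCK_BYTES is an assumption.
const int BLOCK_BYTES = 64 * 1024;
const int BLOCK = BLOCK_BYTES / sizeof(float);

void process_in_blocks(float *data, int count)
{
   for (int base = 0; base < count; base += BLOCK)
   {
      int end = (base + BLOCK < count) ? (base + BLOCK) : count;
      for (int i = base; i < end; i++)   // pass 1: scale
         data[i] *= 2.0f;
      for (int i = base; i < end; i++)   // pass 2: offset
         data[i] += 1.0f;
   }
}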

Memory Alignment

The CPU reads and writes memory-aligned data much faster than other data. Any N-byte data type is memory aligned if its starting address is evenly divisible by N. For example, a 32-bit integer is memory aligned if its starting address is 0x04000000. The same 32-bit integer is unaligned if its starting address is 0x04000002, since that address is not evenly divisible by 4.
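
You can watch this happen by forcing a pointer off its natural boundary and checking the address; this snippet is illustrative and not from the original text:

#include <cstdint>
#include <cstdio>

int main()
{
   alignas(4) char buffer[8];

   // Two bytes into a 4-byte-aligned buffer can never be aligned
   // for a 4-byte int.
   char *p = buffer + 2;
   bool aligned =
      (reinterpret_cast<std::uintptr_t>(p) % sizeof(int)) == 0;
   printf("address %p %s aligned for a 4-byte int\n",
          static_cast<void *>(p), aligned ? "is" : "is not");
   return 0;
}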

You can run a little experiment to see how memory alignment affects access time, using example code like this:

#pragma pack(push, 1)
struct ReallySlowStruct
{
   char c : 6;        // the 6-bit bitfield knocks everything after it off alignment
   __int64 d : 64;
   int b : 32;
   char a : 8;
};

struct SlowStruct
{
   char c;            // the leading char misaligns the __int64 that follows
   __int64 d;
   int b;
   char a;
};

struct FastStruct
{
   __int64 d;         // largest members first, so every member is aligned
   int b;
   char a;
   char c;
   char unused[2];    // pad to an 8-byte boundary for arrays of FastStruct
};
#pragma pack(pop)

I wrote a piece of code to perform some operations on the member variables in each structure. The difference in times is as follows:

Really slow=417 ms
Slow=222 ms
Fast=192 ms

Your penalty for using the SlowStruct over FastStruct is about 14% on my test machine. The penalty for using ReallySlowStruct is code that runs twice as slow.

The first structure isn't even aligned properly on bit boundaries, hence the name ReallySlowStruct. The definition of the 6-bit char variable throws the entire structure out of alignment. The second structure, SlowStruct, is also out of alignment but at least the byte boundaries are aligned. The last structure, FastStruct, is completely aligned for each member. The last member, unused, ensures that the structure fills out to an eight-byte boundary in case someone declares an array of FastStruct.
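
To see the padding for yourself, print each structure's size; this check is not part of the original article and assumes a typical MSVC build:

#include <cstdio>

int main()
{
   // With #pragma pack(push, 1) in force, SlowStruct measures 14 bytes;
   // remove the pragma and the compiler pads it out to 24. FastStruct
   // stays 16 bytes either way because its members are already aligned.
   printf("ReallySlowStruct: %u bytes\n", (unsigned)sizeof(ReallySlowStruct));
   printf("SlowStruct:       %u bytes\n", (unsigned)sizeof(SlowStruct));
   printf("FastStruct:       %u bytes\n", (unsigned)sizeof(FastStruct));
   return 0;
}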

Notice the #pragma pack(push, 1) at the top of the source example? It's accompanied by a #pragma pack(pop) at the bottom. Without them, the compiler, depending on your project settings, will spread out the member variables, placing each one on an optimal byte boundary. When the member variables are spread out like that, the CPU can access each member quickly, but all that unused space adds up. If the compiler were left to optimize SlowStruct by adding unused bytes, each structure would be 24 bytes instead of just 14: seven extra bytes padded after the first char variable, and the remaining bytes added at the end so that the entire structure always starts on an eight-byte boundary. That's about 40 percent wasted space, all due to a careless ordering of member variables.

Don't let the compiler waste precious memory space. Put some of your brain cells to work and align your own member variables. You don't get many opportunities to save memory and optimize CPU at the same time.

Virtual Memory

Virtual memory increases the addressable memory space by caching unused memory blocks to the hard disk. The scheme depends on the fact that even though you might have a 500MB data structure, you aren't going to be playing with the whole thing at the same time. The unused bits are saved off to your hard disk until you need them again. You should be cheering and wincing at the same time: cheering because every programmer likes having a big memory playground, and wincing because anything involving the hard disk wastes a lot of time and, last time I checked, PS2s and GameCubes didn't sport hard drives out of the box.

Just to see how bad it can get, I took the code from the array access example and modified it to iterate through a three-dimensional array 500 elements cubed. The total size of the array would be 476MB (500 x 500 x 500 four-byte floats), much bigger than the installed memory on the test machine. A data structure bigger than available memory is sometimes called out-of-core. I ran the column_ordered() function and went to lunch. When I got back about 30 minutes later, the test program was still chugging away. The hard drive was seeking like mad, and I began to wonder whether my hard disk would give out. I became impatient and re-ran the example, timing just one iteration of the inner loop: it took 379.75 seconds. At that rate the entire run would have taken over 50 hours. I'm glad I didn't wait.

Remember that the original array, 250 elements cubed, ran the test code in 298 ms when the fast row_ordered() function was used. The large array is only eight times bigger, so you'd expect the same code to run in about 2384 ms, just under two and a half seconds.

Compare 2384 ms with 50 hours and you'll see how virtual memory can work against you if your code accesses virtual memory incorrectly.

Gotcha!

Any time a cache is used inefficiently you can degrade the overall performance of your game by many orders of magnitude. This is commonly called "thrashing the cache" and is your worst nightmare. The solution to this problem might be as simple as reordering a few lines of code. It might also be as heinous as reducing your data set.




