dcsimg
December 6, 2016
Hot Topics:

The Persistent Hashtable: A Quick-and-Dirty Database

  • February 14, 2001
  • By Greg Travis
  • Send Email »
  • More Articles »

I think the thing I hate most in programming is writing the same piece of code twice. Even writing something similar to something I've already written bugs me.

Larry Wall, the creator of Perl, once said: "The three principal virtues of a programmer are Laziness, Impatience, and Hubris." The combination of these virtues is what makes it annoying to have to write something more than once. And the only way to remove this annoyance is to make yourself a simple useful library that you can reuse.

This article describes the implementation of a simple library for doing something that many programs have to do: save data. It's not going to solve all of our data-saving needs, because, regardless of what some database vendors might tell you, there is no one piece of software that can properly handle every data-saving situation.

The software we're going to be talking about is called a Persistent Hashtable, and it's going to fill a specific niche in the world of data-saving, which is a very broad world.

At one extreme, you have the filesystem. A filesystem can store just about anything, but you have to make all the decisions about how to save your data by yourself and write the code anew for each application.

At the other extreme, you have your high-end database engines. These, too, can store just about anything, and they take care of many things (data safety, concurrency, efficiency).

But what if you just need to save a few strings or numbers — configuration data, for example? Using the filesystem means you have to invent a data format, and using a database involves big libraries, database engines, and SQL queries. Annoying.

What we'll create is something whose API is so simple, you'll find it to be the perfect answer to the question, "How can I save these bits of data without too much hassle?"

Here's the API in action. To save some data, you do this:

  persistentHash.put( "hostname", "www.developer.com" );

To load some data, you do this:

  String hostname = persistentHash.getString( "hostname" );

And that's almost all you'll have to do.

Semantics

As I mentioned before, there's no such thing as a completely general persistence mechanism. Each persistence mechanism imposes a semantic framework on the data we put into it. So the first thing we have to do is explore the semantic framework that we'll be using in our useful little library.

We're calling it a hashtable, but, technically, this isn't really the right name for it. Technically, a hashtable is a dictionary that uses a hashing function for efficiency, and a dictionary is a data structure that maps strings to objects. Whether or not we use a hashing function to optimize access to our dictionary is not really relevant to the semantics, so a more precise name would be a Persistent Dictionary. But dictionary has enough real-world connotation of its own that hashtable is probably a better name to use.

So what we're creating is really a map from strings to objects. This allows you to store an object under a particular name (e.g., "hostname"), and then to recall that object using the same name.

We'll use the term "key" to refer to the name that we're storing the object under and the term "value" to refer to the object that we're storing.

Our implementation is not going to store just any kind of object — we're going to allow only strings and integers, because to handle arbitrary objects is far more complicated, and requires choices concerning many tradeoffs that are beyond the scope of this article.

To summarize, we are creating a library that lets us store integers and string values under names, and these values will be stored on the disk between invocations of our program.

Usage

Before we can use a PHash, we have to create it. Since the data is ultimately going to be stored in files, we need a place to put them. So we associate a directory with each PHash. Not only does this keep the files in one place, and out of our hair, but it also will allow us to have multiple PHashes without worrying that their disk files will interfere with each other.

Thus, in order to access a PHash, we need to supply a directory name. It doesn't really matter what the directory name is, as long as we use it consistently to access our data.

  PHash ph = new PHash( "datadir" );

Now, we can access the data as we did in our earlier example:

  ph.put( "hostname", "www.developer.com" );
  String hostname = ph.getString( "hostname" );

Let's go through a complete example that uses our library. This example is going to load up a string value, add a character to the end, and store it back to the PHash. Likewise, it's going to load up an integer value, increment it, and store it back to the PHash.

Our test program, Test.java

, goes like this:

    // Open up a PHash with the name "datadir"
    PHash ph = new PHash( "datadir" );

    // Load up a couple of values
    String s = ph.getString( "a string" );
    int i = ph.getInt( "an integer" );

    // Show the values
    System.out.println( "Got "+s+" "+i );

    // Change the values
    s = s + ".";
    i++;

    // Store them back to the PHash
    ph.put( "a string", s );
    ph.put( "an integer", i );

Disk Format

Now that we've explored the semantics of PHash, let's see how we're storing things on disk. It's not something we should be concerned about when we're using it, but from an implementation point of view, it's crucial.

As mentioned earlier, each PHash has its own directory for its files to live in. In the example above, we used a directory called "datadir".

Each key/value pair gets its own file. The key, in modified form, is stored as the name of the file, and the value is stored in the file itself.

The reason we have to modify the key is this: Our semantic specification said that a PHash key can be any string. At the same time, filesystems generally have restrictions on the kinds of characters that can be used in a filename. These semantics do not match, so we need to convert the more general keys format to the more restrictive filename format.

The solution for this is simple. It can safely be assumed that all filesystems that might have Java running off of them allow alphanumeric characters, along with the percent sign ("%"). Thus, we use an encoding not unlike that used for URLs. We leave all alphanumeric characters alone, and we convert everything else to a two-digit hexadecimal number prefixed by a "%".

In our example test program above, we used a directory called "datadir" and two keys, "a string" and "an integer". Here's what the filenames look like:

datadir
datadir/a%20string
datadir/an%20integer

Now that we know what the files are called, we should consider their contents. We use the DataOutputStream and DataInputStream classes to store and retrieve data, because these classes provide encoding for basic data types like integers and strings that are fully cross-platform.

Each file contains first an integer describing the type of value stored in it — 1 for a string and 2 for an integer. Following this integer is the value itself. We don't need to know the actual format used by the Data streams, since the DataInputStream will take care of parsing that for us. We only need to know that each file contains an integer followed by either an integer or a string.

Caching

It's slow to go to disk every time you want to read a piece of data. Since some pieces of data might be read multiple times, it's useful to have a read-cache.

A read-cache keeps the value of a key/value pair in memory after its been read from disk the first time. This way, if the program asks again for the same value, it doesn't have to go to disk.

When the PHash object starts up, it scans its directory for files, and converts their names to keys. It thus acquires a list of all the keys in the PHash.

These keys are then stored in a Hashtable. Note that by this we mean a java.util.Hashtable object, not a PHash. This hashtable acts as the cache.

This hashtable maps each of these keys to a special object called "notLoaded". This object serves to denote that the value has not been read from disk. Here's part of the initialization code:

    // Open the directory
    File dir = new File( dataDirectory );

    // ...

    // Read all the keys in the directory, and store them in the
    // cache with a value indicating that the values aren't yet loaded
    String files[] = dir.list();
    for (int i=0; i<files.length; ++i) {
      cache.put( decodeKey( files[i] ), notLoaded );

When the program asks the PHash for a particular value, it checks the cache to see if there is an entry for the given key:

    // Error if the item isn't there at all
    if (!cache.containsKey( key ))
      throw new IOException( "PHash does not contain \""+key+"\"" );

If not, it's simply an error — you can't read a value that isn't there.

If the key is in fact in the cache, but points to the "notLoaded" object, then the PHash must load the value from disk. This value is stored in the cache so that the next time the value is asked for, it doesn't need to be loaded.

    // Check to see if the value has been loaded
    // If not, read it off disk into the cache
    Object value = cache.get( key );
    if (value == notLoaded) {
      // Go to disk for the value
      value = readCacheElement( key );

      // Store it in the cache for future reads
      cache.put( key, value );
    }

Conclusion

We've described, in detail, the implementation and usage of a simple persistence library. Falling somewhere between a real database engine and custom disk I/O routines, it should help with many applications that need to store data between invocations.

Source Code

About the Author

Greg Travis is a freelance programmer in New York City. His interest in computers can probably be traced back to that episode of "The Bionic Woman" where Jamie runs around trying to escape a building whose lights and doors are controlled by an evil artificial intelligence that mocks her through loudspeakers. He's a devout believer in the religious idea that, when a computer program works, it's a complete coincidence. He can be reached at mito@panix.com.






Comment and Contribute

 


(Maximum characters: 1200). You have characters left.

 

 


Enterprise Development Update

Don't miss an article. Subscribe to our newsletter below.

Sitemap | Contact Us

Thanks for your registration, follow us on our social networks to keep up-to-date
Rocket Fuel