http://www.developer.com/


The Secret of Soundex


November 19, 2002

Readers of my weekly comments section in the VB-World newsletter know I'm a philosophical kinda guy, with a humorous, thought-provoking notion put forward each and every week.

In the past, we've pondered over such puzzlers as:

  • How come wrong numbers are never busy?
  • Why doesn't Tarzan have a beard?
  • If Barbie is so popular, why do you have to buy all her friends?
  • When they invented the clock, how did they know what time to set it to?

Thankfully this article gives me an opportunity to throw a few other, slightly more relevant teasers your way:

  • How do spell checkers figure out which words to use as a replacement?
  • How do the telephone enquiry folk always manage to locate the number you require, even when you aren't overly sure of the name spelling?
  • How does fuzzy, "phonetical" searching work?
  • How do Tic Tac's provide so much freshness in just two calories?

The answer to the first three questions can all be attributed to one very special system we're going to learn in this article. And it goes by the name of Soundex.

That fourth question however... well, I can only say that I have absolutely no idea how the cool, minty flavour of those low-fat fresheners manages to provide such a whoppingly good service to the mouth with so little impact on the waistline!

<Karl's Swiss bank account increases by £10,000 due to his subtle commercial plug>

So let's get on with the show and explore the secret of Soundex...

Now my little Visual Basic pumpkins, it's time for a history lesson. And it'll be about as exciting as my bedside clock, originally sculptured in the shape of a bedside clock.

You see, back in the 19th century, the US National Archive folk were sitting around and realised they had a problem.

The president was breathing down their necks for the latest census reports. But listing every single name on the reports was a bit space consuming and would have required more paper than potentially existed in the Amazon - either the rainforest or the online bookstore.

So they decided to group the many variations of names together depending on how they sounded. So the surname 'Moore' sounds a lot like 'Mower'. So 'Mower' would probably get classified under the same heading as 'Moore'.

And 'Mour', 'Moor' and 'Moooooore' are all similar-sounding variations, strange though they may be. So each would get consolidated under the general heading of 'Moore'.

But the problem these Archive chappies had was how they could clearly define which words matched phonetically.

Yes - Moore, Mower, Mour, Moor and Moooooore - all sound the same. But the boffins wanted a system; a categorical method of determining whether two names actually sound the same.

And hence, Soundex was born.

Soundex is an algorithm that follows a fixed set of rules to produce a four-character code for any word: the word's first letter, followed by three digits. The theory is that two words sounding roughly the same will produce the same four-character code.

So 'Moore' has the code M600. And so does 'Mower'. And 'Mour', 'Moor' and 'Moooooore'. Groovy, eh?

Hold on a minute - hey, are you chewing gum? This is a history lesson! Spit it out and instead, try the new orange flavoured Tic Tac's - more freshness, less fattening. Yummy!

<Karl's bank balance increases by another £10,000>

Now imagine the real world implications of this. If you had a surname field in a database*, you could add another to hold its Soundex code. And that would make 'fuzzy' phonetical searching exceptionally easy.

And if you wanted to create a spell checker, you'd simply need to create a database full of correctly spelled words and their related Soundex codes. If the user taps in a word that isn't in the database, your program would simply need to look up words with the same Soundex code.
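Here's a rough sketch of that spell checker idea, just to make it concrete. It's a guess at one way to do it, not the definitive approach: it assumes a Soundex function like the one built later in this article is in the same project, it swaps the database for a late-bound Scripting.Dictionary (so it needs the Windows Scripting Runtime), and the names LoadWords, Suggest and mdicWordsByCode are all made up for illustration.

```vb
' Hypothetical helper names; assumes the article's Soundex function
' is available in the same form or module.
Private mdicWordsByCode As Object   ' Soundex code -> comma-separated words

' Builds the lookup from a list of correctly spelled words
Private Sub LoadWords(ParamArray Words() As Variant)
    Dim varWord As Variant
    Dim strCode As String

    Set mdicWordsByCode = CreateObject("Scripting.Dictionary")

    For Each varWord In Words
        strCode = Soundex(CStr(varWord))
        If mdicWordsByCode.Exists(strCode) Then
            ' Another word already shares this code, so append to the list
            mdicWordsByCode.Item(strCode) = _
                mdicWordsByCode.Item(strCode) & ", " & CStr(varWord)
        Else
            mdicWordsByCode.Add strCode, CStr(varWord)
        End If
    Next
End Sub

' Returns the known words sharing a Soundex code with Typed,
' or an empty string if there are none
Public Function Suggest(Typed As String) As String
    If mdicWordsByCode Is Nothing Then Exit Function
    If mdicWordsByCode.Exists(Soundex(Typed)) Then
        Suggest = mdicWordsByCode.Item(Soundex(Typed))
    End If
End Function
```

Call LoadWords "Visual", "Basic", "Checker" once at start-up, and Suggest("Vezual") should hand back "Visual". A real spell checker would want to rank the suggestions too, since plenty of unrelated words share a code.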

In short, Soundex allows you to give your applications an in-built intelligence, a knowledge of how words are spoken rather than spelt. Next up, we're going to learn how the Soundex code is generated, as well as taking a peek at a sample VB project...

* Note: It's worth pointing out that SQL Server supports Soundex natively. Here's a sample SQL statement which retrieves all records where the AU_LNAME field sounds like 'Green': SELECT * FROM authors WHERE SOUNDEX(AU_LNAME) = SOUNDEX('Green'). For more information, look up SOUNDEX in SQL Server Books Online.

This section is for all those folks interested in exactly how the four-character Soundex code is generated.

But I admit, most of you probably won't be bothered about the theory.

So if you're just looking for the hands-on stuff, or have patience shorter than a toad's wedding tackle, scroll on down and zip off to the next page. You don't really need to know all this stuff to get our sample VB project working anyhow.

But for the more boringly serious among us, here's how that four-character code is generated:

The Soundex Rules

  • Take the first letter of the word and make it the first letter of the Soundex code
  • For each remaining letter in the word, grab its number from the table below and add it to the Soundex code
  • If two or more letters with the same number appear next to one another, only one of them should be added to the code
  • If the final Soundex code is longer than four characters, trim it down. If it's shorter, append zeros until it has a length of four

The Soundex Table

Letters                     Number
B, P, F, V                  1
C, S, G, J, K, Q, X, Z      2
D, T                        3
L                           4
M, N                        5
R                           6

The letters A,E,I,O,U,Y,H and W, as well as other characters are not counted.

All of these rules have been put together into two main Visual Basic functions. And we're going to take a peek at them, next...

Let's take a look at a Visual Basic algorithm that follows the rules set out on the previous page. The main function is called Soundex, which accepts any word and returns the four-character Soundex code.

Here it is, commented and all:

Public Function Soundex(Word As String) As String

    Dim strCode As String
    Dim strChar As String
    Dim lngWordLength As Long
    Dim strLastCode As String
    Dim i As Long

    ' Grabs the first letter
    strCode = UCase(Mid$(Word, 1, 1))
    strLastCode = GetSoundCodeNumber(strCode)

    ' Stores the word length
    lngWordLength = Len(Word)

    ' Continues the code, starting at the second letter
    For i = 2 To lngWordLength

        strChar = GetSoundCodeNumber(UCase(Mid$(Word, i, 1)))

        ' If adjacent numbers are the same,
        ' only count one of them
        If Len(strChar) > 0 And strLastCode <> strChar Then
            strCode = strCode & strChar
        End If

        strLastCode = strChar

    Next

    ' Trim it down to a maximum of four characters...
    Soundex = Mid$(strCode, 1, 4)

    ' ... but if it's less than four characters, pad
    ' it out with a bunch of zeros...
    If Len(strCode) < 4 Then
        Soundex = Soundex & String(4 - Len(strCode), "0")
    End If

End Function

Private Function GetSoundCodeNumber(Character As String) As String

    ' Accepts a character and returns the
    ' appropriate number from the Soundex table
    ' (an empty string for anything not in the table)
    Select Case Character
        Case "B", "F", "P", "V"
            GetSoundCodeNumber = "1"
        Case "C", "G", "J", "K", "Q", "S", "X", "Z"
            GetSoundCodeNumber = "2"
        Case "D", "T"
            GetSoundCodeNumber = "3"
        Case "L"
            GetSoundCodeNumber = "4"
        Case "M", "N"
            GetSoundCodeNumber = "5"
        Case "R"
            GetSoundCodeNumber = "6"
    End Select

End Function
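Before building a whole form around it, you could give the function a quick once-over from the Immediate window. The little routine below is only a sketch (PrintMooreCodes is a made-up name) and assumes it sits in the same form or module as the two functions above. It prints the code for each of the Moore variations from the history lesson, and if all is well every one of them should come out as M600.

```vb
' Hypothetical test routine; relies on the Soundex function above
Private Sub PrintMooreCodes()
    Dim varName As Variant

    For Each varName In Array("Moore", "Mower", "Mour", "Moor", "Moooooore")
        ' Prints the name and its Soundex code side by side
        Debug.Print varName, Soundex(CStr(varName))
    Next
End Sub
```

The CStr call is there because Soundex takes its argument as a String, and VB won't accept a Variant variable passed straight in.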

To test out the code, let's quickly knock together a sample project.

  • Launch Visual Basic
  • Create a new Standard EXE project
  • Paste the Soundex algorithm behind Form1

This simple project will take the values from two Text Boxes, compare them and state whether or not their Soundex codes match.

  • Add two Text Boxes, a Command Button and a Label to your Form, positioning them a little like this:

Ignore regular naming conventions for now. Go on, I dare you. Hey, you've got to have a bit of fun! We're young, innocent...

  • Add this code behind your Command Button:
Private Sub Command1_Click()

    Text1.Tag = Soundex(Text1.Text)
    Text2.Tag = Soundex(Text2.Text)

    If Text1.Tag = Text2.Tag Then
        Label1.ForeColor = vbRed - 100
        Label1.Caption = "The Soundex codes match!"
    Else
        Label1.ForeColor = vbBlue
        Label1.Caption = "The Soundex codes don't match"
    End If

End Sub

That's it! You've created your simple test application.

  • Press F5 to run

To test the project, enter two values such as 'JON' and 'JOHN' and hit the Command Button. You might also want to experiment with:

  • Visual, Vezual
  • Richard, Ricardo
  • Forrest, Forest, Forrester
  • Checker, Chequer
  • Sideroad, Syde-rowd
  • Coronation, Carnation
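If typing pairs into Text Boxes gets tedious, the same comparison can be wrapped up in a tiny helper and run over a batch of words. This is just one possible sketch: SoundsAlike and CheckPairs are names invented here, and both assume the Soundex function from earlier lives in the same form or module.

```vb
' Hypothetical helper: True when both words share a Soundex code
Public Function SoundsAlike(WordA As String, WordB As String) As Boolean
    SoundsAlike = (Soundex(WordA) = Soundex(WordB))
End Function

' Hypothetical batch check; words are listed as consecutive pairs
Private Sub CheckPairs()
    Dim varWords As Variant
    Dim i As Long

    varWords = Array("Visual", "Vezual", "Richard", "Ricardo", _
                     "Checker", "Chequer", "Coronation", "Carnation")

    ' Step through the array two items at a time
    For i = LBound(varWords) To UBound(varWords) - 1 Step 2
        Debug.Print varWords(i), varWords(i + 1), _
            SoundsAlike(CStr(varWords(i)), CStr(varWords(i + 1)))
    Next
End Sub
```

Run CheckPairs and watch the Immediate window; each line shows the two words and whether their codes match.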

In this article, we took a brief look at Soundex. We found out what it is, uncovered its historic roots, plus figured out how it can be used within modern Visual Basic applications.

Now the power of Soundex is in your hands. Build your own spell checker, add new table fields containing Soundex codes, throw a little extra intelligence into your program. It's up to you.

But whatever you get up to, this is Karl Moore wishing you all the best. From me, goodnight for tonight. Goodnight!

Oh, and don't forget to try out those new Tic Tac's! Minty freshness all day long!

<Kerrching!>
