We have all become spoiled by search engines’ ability to work around things like spelling mistakes, variant spellings of names, or any other situation where the search term should match pages whose authors prefer a different spelling of a word. Adding such features to our own database-driven applications can similarly enrich them. While commercial relational database management system (RDBMS) offerings provide their own fully developed solutions to this problem, the licensing costs of these tools can be out of reach for smaller developers or small software development firms.
One could argue that this could be done using a spell checker instead. However, a spell checker is typically of no use when matching a correct, but alternative, spelling of a name or other word. Matching by sound fills this functional gap. That is the topic of today’s programming tutorial: how to query sounds with Python using Metaphones.
What is Soundex?
Soundex was developed in the early 20th century as a means for the US Census to match names based on how they sound. It was then used by various phone companies to match customer names. It continues to be used for phonetic data matching to this day in spite of it being limited to American English spellings and pronunciations. It is also limited to English letters. Most RDBMS, such as SQL Server and Oracle, along with MySQL and its variants, implement a Soundex function and, in spite of its limitations, it continues to be used to match many non-English words.
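As a rough illustration of how Soundex collapses similar-sounding names, the sketch below implements the classic American Soundex rules in plain Python. The function name `soundex` and the lookup table are our own and are not part of any database; built-in functions such as MySQL's `SOUNDEX()` may differ in details like the length of the returned code.

```python
# Minimal sketch of classic American Soundex (illustrative, not a database's implementation).
SOUNDEX_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit


def soundex(word):
    """Return the four-character Soundex code for a word, or '' if it has no letters."""
    letters = [ch for ch in word.upper() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    previous = SOUNDEX_CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "HW":
            continue  # H and W do not separate letters that share a code
        code = SOUNDEX_CODES.get(letter, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets this, so a repeated code is kept
    return (result + "000")[:4]  # pad with zeros, then trim to four characters


print(soundex("Smith"), soundex("Schmidt"))  # S530 S530
print(soundex("Robert"))                     # R163
```

Both spellings of the surname land on the same code, which is the behavior that makes Soundex useful for name matching, while the single fixed-length code hints at why finer-grained algorithms were developed later.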
What is a Double Metaphone?
The Metaphone algorithm was developed in 1990 and overcomes some of the limitations of Soundex. In 2000, an improved follow-on, Double Metaphone, was developed. Double Metaphone returns a primary and a secondary value, which correspond to two ways a single word could be pronounced. To this day, this algorithm remains one of the better open-source phonetic algorithms. Metaphone 3 was released in 2009 as an improvement to Double Metaphone, but it is a commercial product.
Unfortunately, many of the prominent RDBMS mentioned above do not implement Double Metaphone, and most prominent scripting languages do not provide a supported implementation of it. However, a module that implements Double Metaphone is available for Python.
The examples presented in this Python programming tutorial use MariaDB version 10.5.12 and Python 3.9.2, both running on Kali/Debian Linux.
How to Add Double Metaphone to Python
Like any Python module, Double Metaphone can be installed with the pip tool. The exact syntax depends on your Python installation. A typical Double Metaphone install looks like the following example:
    # Typical if you have only Python 3 installed
    $ pip install doublemetaphone

    # If your system has Python 2 and Python 3 installed
    $ /usr/bin/pip3 install DoubleMetaphone
Note that the difference in capitalization between the two commands is intentional. The following code is an example of how to use Double Metaphone in Python:
    # demo.py
    import sys

    # pip install doublemetaphone
    # /usr/bin/pip3 install DoubleMetaphone
    from doublemetaphone import doublemetaphone


    def main(argv):
        testwords = ["There", "Their", "They're", "George", "Sally", "week", "weak",
                     "phil", "fill", "Smith", "Schmidt"]
        for testword in testwords:
            print(testword + " - ", end="")
            print(doublemetaphone(testword))
        return 0


    if __name__ == "__main__":
        main(sys.argv[1:])

Listing 1 - Demo script to verify functionality
The above Python script gives the following output when run in your integrated development environment (IDE) or code editor:
Figure 1 – Output of Demo Script
As can be seen here, each word has both a primary and a secondary phonetic value. Words that match on both the primary and secondary values are said to be phonetic matches. Words that share at least one phonetic value, or that share the first couple of characters in any phonetic value, are said to be phonetically near to one another.
Most letters displayed correspond to their English pronunciations. X can correspond to KS, SH, or C. 0 corresponds to the th sound in “the” or “there”. Vowels are only matched at the beginning of a word. Because of the countless differences in regional accents, it is not possible to say that two words are objectively an exact match, even if they have the same phonetic values.
Comparing Phonetic Values with Python
There are numerous online resources that describe the full workings of the Double Metaphone algorithm; however, understanding them is not necessary in order to use it, because we are more interested in comparing the calculated values than in calculating them. As stated earlier, if two words have at least one phonetic value in common, the words can be treated as phonetic matches, and words whose phonetic values are merely similar are phonetically close.
Comparing absolute values is easy, but how can strings be determined to be similar? While there are no technical limitations that stop you from comparing multi-word strings, these comparisons are usually unreliable. Stick to comparing single words.
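The "at least one value in common" test can be sketched in a few lines. The helper names `shared_values` and `is_phonetic_match` are hypothetical, the tuples are the values quoted in this tutorial for Smith and Schmidt, and skipping empty strings is a defensive assumption in case an implementation reports a missing secondary value that way.

```python
def shared_values(codes_a, codes_b):
    """Return the non-empty phonetic values that two words have in common."""
    return {code for code in codes_a if code} & {code for code in codes_b if code}


def is_phonetic_match(codes_a, codes_b):
    """True when the two words share at least one phonetic value."""
    return bool(shared_values(codes_a, codes_b))


# (primary, secondary) values as quoted in this tutorial
smith = ("SM0", "XMT")
schmidt = ("XMT", "SMT")

print(shared_values(smith, schmidt))      # {'XMT'}
print(is_phonetic_match(smith, schmidt))  # True
```

Working on the tuples rather than on the words keeps the comparison independent of whichever library produced the values.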
What Are Levenshtein Distances?
The Levenshtein Distance between two strings is the minimum number of single-character edits (insertions, deletions, or substitutions) needed to turn one string into the other. A pair of strings with a lower Levenshtein Distance is more similar than a pair with a higher one. Levenshtein Distance is similar to Hamming Distance, but the latter is limited to strings of the same length. Because Double Metaphone phonetic values can vary in length, it makes more sense to compare them using the Levenshtein Distance.
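To make the difference between the two measures concrete, here is a small pure-Python sketch of both. It is for illustration only; the function names `hamming` and `levenshtein` are our own, and a compiled library will be much faster on real workloads.

```python
def hamming(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


print(hamming("XMT", "SMT"))             # 1
print(levenshtein("XMT", "SMT"))         # 1
print(levenshtein("kitten", "sitting"))  # 3
```

For strings of equal length the two measures often agree, but only the Levenshtein version accepts inputs of different lengths.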
Python Levenshtein Distance Library
Python can be extended to support Levenshtein Distance calculations via a Python Module:
    # If your system has Python 2 and Python 3 installed
    $ /usr/bin/pip3 install python-Levenshtein
Note that, as with the installation of the DoubleMetaphone above, the syntax of the call to pip may vary. The python-Levenshtein module provides far more functionality than just calculations of Levenshtein Distance.
The code below shows a test for Levenshtein Distance calculation in Python:
    # demo.py
    import sys

    # pip install doublemetaphone
    # /usr/bin/pip3 install DoubleMetaphone
    from doublemetaphone import doublemetaphone

    # /usr/bin/pip3 install python-Levenshtein
    # Import the public distance() function rather than the private _levenshtein module.
    from Levenshtein import distance


    def main(argv):
        testwords = ["There", "Their", "They're", "George", "Sally", "week", "weak",
                     "phil", "fill", "Smith", "Schmidt"]
        for testword in testwords:
            print(testword + " - ", end="")
            print(doublemetaphone(testword))
        print("Testing Levenshtein Distance between XMT and SMT - " + str(distance('XMT', 'SMT')))
        return 0


    if __name__ == "__main__":
        main(sys.argv[1:])

Listing 2 - Demo extended to verify Levenshtein Distance calculation functionality
Executing this script gives the following output:
Figure 2 – Output of Levenshtein Distance test
The returned value of 1 indicates that a single one-character edit separates XMT from SMT. In this case, it is the substitution of the first character.
Comparing Double Metaphones in Python
What follows is not the be-all and end-all of phonetic comparisons. It is simply one of many ways to perform such a comparison. To effectively compare the phonetic nearness of any two given strings, each Double Metaphone phonetic value of one string must be compared to the corresponding Double Metaphone phonetic value of the other string. Since both phonetic values of a given string are given equal weight, the average of these comparison values gives a reasonably good approximation of phonetic nearness:
PN = [ Dist(DM1(1), DM2(1)) + Dist(DM1(2), DM2(2)) ] / 2.0
Where:
- DM1(1): First Double Metaphone Value of String 1
- DM1(2): Second Double Metaphone Value of String 1
- DM2(1): First Double Metaphone Value of String 2
- DM2(2): Second Double Metaphone Value of String 2
- PN: Phonetic Nearness, with lower values being nearer than higher values. A value of zero indicates a phonetic match. Because a Levenshtein Distance can never exceed the length of the longer of the two strings being compared, PN is bounded by the lengths of the longer phonetic values.
This formula breaks down in cases like Schmidt (XMT, SMT) and Smith (SM0, XMT) where the first phonetic value of the first string matches the second phonetic value of the second string. In such situations, both Schmidt and Smith can be considered to be phonetically similar because of the shared value. The code for the nearness function should apply the formula above only when all four phonetic values are different. The formula also has weaknesses when comparing strings of differing lengths.
Note that there is no single effective way to compare strings of differing lengths, even though calculating the Levenshtein Distance between two strings factors in differences in string length. A possible workaround would be to compare both strings up to the length of the shorter of the two strings.
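The following sketch works through the formula on phonetic values that are already known, so it runs without any third-party module. The names `levenshtein`, `formula_pn`, and `nearness_from_codes` are hypothetical, the `truncate` option is one reading of the workaround just described (applied to the phonetic values rather than the original words), and ignoring empty values in the shared-value check is an assumption.

```python
def levenshtein(a, b):
    """Plain dynamic-programming Levenshtein Distance (stand-in for a library call)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def formula_pn(codes_a, codes_b, truncate=False):
    """Apply PN = [Dist(DM1(1), DM2(1)) + Dist(DM1(2), DM2(2))] / 2.0 with no special cases."""
    total = 0
    for code_a, code_b in zip(codes_a, codes_b):
        if truncate:
            # Compare only up to the length of the shorter value.
            limit = min(len(code_a), len(code_b))
            code_a, code_b = code_a[:limit], code_b[:limit]
        total += levenshtein(code_a, code_b)
    return total / 2.0


def nearness_from_codes(codes_a, codes_b, truncate=False):
    """Return 0.0 when any non-empty value is shared, else fall back to the formula."""
    if {c for c in codes_a if c} & {c for c in codes_b if c}:
        return 0.0
    return formula_pn(codes_a, codes_b, truncate)


smith, schmidt = ("SM0", "XMT"), ("XMT", "SMT")  # values quoted in this tutorial
print(formula_pn(smith, schmidt))           # 1.5
print(nearness_from_codes(smith, schmidt))  # 0.0

# Made-up values, chosen only to show the effect of truncation
long_codes, short_codes = ("ABCD", "ABCD"), ("AB", "AX")
print(nearness_from_codes(long_codes, short_codes))                 # 2.5
print(nearness_from_codes(long_codes, short_codes, truncate=True))  # 0.5
```

The Smith and Schmidt pair shows why the shared-value check matters: the bare formula scores them 1.5 even though they have XMT in common. The truncated comparison, for its part, makes a short value look very close to any longer value that merely begins the same way, so it is a trade-off rather than a fix.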
Below is an example code snippet that implements the code above, along with some test samples:
    # demo2.py
    import sys

    # pip install doublemetaphone
    # /usr/bin/pip3 install DoubleMetaphone
    from doublemetaphone import doublemetaphone

    # /usr/bin/pip3 install python-Levenshtein
    from Levenshtein import distance


    def Nearness(string1, string2):
        dm1 = doublemetaphone(string1)
        dm2 = doublemetaphone(string2)

        # Any shared phonetic value counts as a match.
        if dm1[0] == dm2[0] or dm1[1] == dm2[1] or dm1[0] == dm2[1] or dm1[1] == dm2[0]:
            return 0.0

        # Otherwise, average the distances between corresponding values.
        distance1 = distance(dm1[0], dm2[0])
        distance2 = distance(dm1[1], dm2[1])
        return (distance1 + distance2) / 2.0


    def main(argv):
        testwords = ["Philippe", "Phillip", "Sallie", "Sally", "week", "weak",
                     "phil", "fill", "Smith", "Schmidt", "Harold", "Herald"]
        for testword in testwords:
            print(testword + " - ", end="")
            print(doublemetaphone(testword))

        print("Testing Levenshtein Distance between XMT and SMT - " + str(distance('XMT', 'SMT')))
        print("Distance between AK and AK - [" + str(distance('AK', 'AK')) + "]")
        print("Comparing week and weak - [" + str(Nearness("week", "weak")) + "]")
        print("Comparing Harold and Herald - [" + str(Nearness("Harold", "Herald")) + "]")
        print("Comparing Smith and Schmidt - [" + str(Nearness("Smith", "Schmidt")) + "]")
        print("Comparing Philippe and Phillip - [" + str(Nearness("Philippe", "Phillip")) + "]")
        print("Comparing Phil and Phillip - [" + str(Nearness("Phil", "Phillip")) + "]")
        print("Comparing Robert and Joseph - [" + str(Nearness("Robert", "Joseph")) + "]")
        print("Comparing Samuel and Elizabeth - [" + str(Nearness("Samuel", "Elizabeth")) + "]")
        return 0


    if __name__ == "__main__":
        main(sys.argv[1:])

Listing 3 - Implementation of the Nearness Algorithm Above
The sample Python code gives the following output:
Figure 3 – Output of the Nearness Algorithm
The sample set confirms the general trend that the greater the differences in words, the higher the output of the Nearness function.
Database Integration in Python
The code above bridges the functional gap between a given RDBMS and a Double Metaphone implementation. On top of this, because the Nearness function is implemented in Python, it is easy to replace should a different comparison algorithm be preferred.
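One way to keep the comparison replaceable is to pass the string-distance function in as a parameter. The sketch below is an illustration under that assumption: `compare_codes`, `mismatch_count`, and `ratio_distance` are hypothetical names, the two scorers are simple stand-ins built on the standard library rather than the Levenshtein module, and the shared-value shortcut is left out to keep the focus on the pluggable part.

```python
from difflib import SequenceMatcher


def mismatch_count(a, b):
    """Crude scorer: positions that differ, plus the difference in length."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def ratio_distance(a, b):
    """Alternative scorer: 0.0 for identical strings, 1.0 for nothing in common."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def compare_codes(codes_a, codes_b, scorer):
    """Average the scorer over corresponding (primary, secondary) phonetic values."""
    return sum(scorer(a, b) for a, b in zip(codes_a, codes_b)) / 2.0


smith, schmidt = ("SM0", "XMT"), ("XMT", "SMT")  # values quoted in this tutorial
print(compare_codes(smith, schmidt, mismatch_count))  # 1.5
print(compare_codes(smith, schmidt, ratio_distance))
```

Because the scorers use different scales, any threshold applied to the result has to be chosen per scorer.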
Consider the following MySQL/MariaDB table:
    create table demo_names (
        record_id int not null auto_increment,
        lastname varchar(100) not null default '',
        firstname varchar(100) not null default '',
        primary key (record_id)
    );

Listing 4 - MySQL/MariaDB CREATE TABLE statement
In most database-driven applications, the middleware composes SQL Statements for managing the data, including inserting it. The following code will insert some sample names into this table, but in practice, any code from a web or desktop application which collects such data could do the same thing.
    # demo3.py
    import sys

    # pip install doublemetaphone
    # /usr/bin/pip3 install DoubleMetaphone
    from doublemetaphone import doublemetaphone

    # /usr/bin/pip3 install python-Levenshtein
    from Levenshtein import distance

    # /usr/bin/pip3 install mysql-connector-python
    import mysql.connector


    def Nearness(string1, string2):
        dm1 = doublemetaphone(string1)
        dm2 = doublemetaphone(string2)

        # Any shared phonetic value counts as a match.
        if dm1[0] == dm2[0] or dm1[1] == dm2[1] or dm1[0] == dm2[1] or dm1[1] == dm2[0]:
            return 0.0

        distance1 = distance(dm1[0], dm2[0])
        distance2 = distance(dm1[1], dm2[1])
        return (distance1 + distance2) / 2.0


    def main(argv):
        testNames = ["Smith, Jane", "Williams, Tim", "Adams, Richard", "Franks, Gertrude",
                     "Smythe, Kim", "Daniels, Imogen", "Nguyen, Nancy", "Lopez, Regina",
                     "Garcia, Roger", "Diaz, Catalina"]
        mydb = mysql.connector.connect(
            host="localhost",
            user="sound_demo_user",
            password="password1",
            database="sound_query_demo")

        # The statement is parameterized, so one cursor can be reused for every row.
        sql = "insert into demo_names (lastname, firstname) values(%s, %s)"
        insertCursor = mydb.cursor()
        for name in testNames:
            nameParts = name.split(',')
            # Normally one should do bounds checking here.
            firstname = nameParts[1].strip()
            lastname = nameParts[0].strip()
            insertCursor.execute(sql, (lastname, firstname))
        mydb.commit()
        insertCursor.close()
        mydb.close()
        return 0


    if __name__ == "__main__":
        main(sys.argv[1:])

Listing 5 - Inserting sample data into a database.
Running this code does not print anything, but it does populate the testing table in the database for the next listing to use. Querying the table directly in the MySQL client can verify that the code above worked:
Figure 4 – The Inserted Table Data
The code below takes a few comparison names and performs a nearness comparison of each one against the table data above:
    # demo4.py
    import sys

    # pip install doublemetaphone
    # /usr/bin/pip3 install DoubleMetaphone
    from doublemetaphone import doublemetaphone

    # /usr/bin/pip3 install python-Levenshtein
    from Levenshtein import distance

    # /usr/bin/pip3 install mysql-connector-python
    import mysql.connector


    def Nearness(string1, string2):
        dm1 = doublemetaphone(string1)
        dm2 = doublemetaphone(string2)

        # Any shared phonetic value counts as a match.
        if dm1[0] == dm2[0] or dm1[1] == dm2[1] or dm1[0] == dm2[1] or dm1[1] == dm2[0]:
            return 0.0

        distance1 = distance(dm1[0], dm2[0])
        distance2 = distance(dm1[1], dm2[1])
        return (distance1 + distance2) / 2.0


    def main(argv):
        comparisonNames = ["Smith, John", "Willard, Tim", "Adamo, Franklin"]
        mydb = mysql.connector.connect(
            host="localhost",
            user="sound_demo_user",
            password="password1",
            database="sound_query_demo")

        # Fetch every stored name once, then compare in Python.
        sql = "select lastname, firstname from demo_names order by lastname, firstname"
        cursor1 = mydb.cursor()
        cursor1.execute(sql)
        results1 = cursor1.fetchall()
        cursor1.close()
        mydb.close()

        for comparisonName in comparisonNames:
            nameParts = comparisonName.split(",")
            firstname = nameParts[1].strip()
            lastname = nameParts[0].strip()
            print("Comparison for " + firstname + " " + lastname + ":")
            for result in results1:
                firstnameNearness = Nearness(firstname, result[1])
                lastnameNearness = Nearness(lastname, result[0])
                print("\t[" + firstname + "] vs [" + result[1] + "] - " + str(firstnameNearness)
                      + ", [" + lastname + "] vs [" + result[0] + "] - " + str(lastnameNearness))
        return 0


    if __name__ == "__main__":
        main(sys.argv[1:])

Listing 6 - Nearness Comparison Demo
Running this code gets us the output below:
Figure 5 – Results of the Nearness Comparison
At this point, it would be up to the developer to decide what the threshold would be for what constitutes a useful comparison. Some of the numbers above may seem unexpected or surprising, but one possible addition to the code might be an IF statement to filter out any comparison value that is greater than 2.
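A threshold filter of the kind described could look like the sketch below. The function name `filter_by_threshold` is hypothetical, and the scores in the sample list are placeholders invented to exercise the filter; they are not output from the listings above.

```python
def filter_by_threshold(scored_names, threshold=2.0):
    """Keep (name, nearness) pairs at or below the threshold, nearest first."""
    kept = [(name, score) for name, score in scored_names if score <= threshold]
    return sorted(kept, key=lambda pair: pair[1])


# Placeholder scores for illustration only
sample_scores = [("candidate A", 3.5), ("candidate B", 0.0),
                 ("candidate C", 2.0), ("candidate D", 1.5)]

for name, score in filter_by_threshold(sample_scores):
    print(name, score)
```

Keeping the threshold as a parameter makes it easy to tune per field, since first names and last names may warrant different cut-offs.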
It may be worth noting that the phonetic values themselves are not stored in the database. They are calculated as part of the Python code, and there is no real need to store them anywhere because they are discarded when the program exits. However, a developer may find value in storing them in the database and then implementing the comparison function within the database as a stored procedure. The one major downside of this approach is a loss of code portability.
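As a rough sketch of the stored-values idea, the example below keeps the two phonetic values of each last name in extra columns and finds matches with plain SQL. Several things here are assumptions: the standard-library `sqlite3` module is swapped in for MariaDB so that the example runs without a server, the column names `lastname_dm1` and `lastname_dm2` and the helper functions are hypothetical, and the phonetic values are passed in by hand where real code would presumably call `doublemetaphone(lastname)`.

```python
import sqlite3


def create_table(conn):
    """Create a variant of demo_names with two extra columns for the phonetic values."""
    conn.execute(
        "create table demo_names ("
        "record_id integer primary key autoincrement, "
        "lastname text not null default '', "
        "firstname text not null default '', "
        "lastname_dm1 text not null default '', "
        "lastname_dm2 text not null default '')")


def insert_name(conn, lastname, firstname, codes):
    """Store a name together with its (primary, secondary) phonetic values."""
    conn.execute(
        "insert into demo_names (lastname, firstname, lastname_dm1, lastname_dm2) "
        "values (?, ?, ?, ?)",
        (lastname, firstname, codes[0], codes[1]))


def find_by_codes(conn, codes):
    """Return (lastname, firstname) rows sharing at least one non-empty phonetic value."""
    wanted = [code for code in codes if code]
    if not wanted:
        return []
    marks = ", ".join("?" for _ in wanted)
    sql = ("select lastname, firstname from demo_names "
           f"where lastname_dm1 in ({marks}) or lastname_dm2 in ({marks}) "
           "order by lastname, firstname")
    return conn.execute(sql, wanted + wanted).fetchall()


conn = sqlite3.connect(":memory:")
create_table(conn)
insert_name(conn, "Smith", "Jane", ("SM0", "XMT"))       # values quoted in this tutorial
insert_name(conn, "Placeholder", "Row", ("ZZZ", "ZZZ"))  # made-up values

# Search using the values quoted in this tutorial for Schmidt
print(find_by_codes(conn, ("XMT", "SMT")))  # [('Smith', 'Jane')]
```

This only covers exact shared values; a distance-based comparison inside the database would still need a stored procedure or an equivalent extension, which is where the portability cost comes in.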
Final Thoughts on Querying Data by Sound with Python
Comparing data by sound does not seem to get the “love” or attention that comparing data by image analysis gets, but if an application has to deal with multiple similar-sounding variants of words in multiple languages, it can be a crucially useful tool. One useful feature of this type of analysis is that a developer need not be a linguistics or phonetics expert in order to make use of these tools. The developer also has great flexibility in defining how such data can be compared; the comparisons can be tweaked based on the application or business logic needs.
Hopefully, this field of study will get more attention in the research sphere and there will be more capable and robust analysis tools going forward.