Wednesday, November 4, 2020

Using Soundex and Levenshtein-Distance in Python



In this post we will review usage of the Soundex and the Levenshtein-Distance algorithms to check words similarities. Our goal is to implement a method to decide if the a given word is similar to one of a list of a predefined well known words that we have.

For example, we could have the following list of predefined words:


predefined_words = [
"user",
"account",
"address",
"name",
"firstname",
"lastname",
"surname",
"credit",
"card",
"password",
"pass",
]


Given a new word such as "uzer", we would like to find if it match any of our predefined words.


Soundex

One method is to use the Soundex function. 

The Soundex function, creates a string of 4 characters representing the phonetic sound of the the word. The following code random words pairs, and prints the similar words:


import random
import string

import jellyfish


def match_soundex(str1, str2):
sound_ex1 = jellyfish.soundex(str1)
sound_ex2 = jellyfish.soundex(str2)
return sound_ex1 == sound_ex2


def random_word():
word_len = random.randint(4, 8)
word = ""
for _ in range(word_len):
word += random.choice(string.ascii_letters)
return word.lower()


for i in range(10000):
w1 = random_word()
w2 = random_word()
if match_soundex(w1, w2):
print(w1, w2)


and the output is:


ylqhso yloja
wpppw wbuihu
doyk dhyazgg
vvzbzpam vskpakt
gxtjh gxdzu
pgpeg pspqnug
xahbfhs xvex


Levenshtein Distance

Another method is the Levenshtein-Distance.

The Levenshtein-Distance calculates using dynamic programming, how many changes should be done in order to change one string into a second string.

The following code random words pairs, and prints the similar words. For our purpose, 2 word will be similar if less than 20% of the characters were changed.


import random
import string

import jellyfish


def match_levenshtein(str1, str2):
distance = jellyfish.levenshtein_distance(str1, str2)
min_len = min(len(str1), len(str2))
return distance / min_len < 0.2


def random_word():
word_len = random.randint(4, 8)
word = ""
for _ in range(word_len):
word += random.choice(string.ascii_letters)
return word.lower()


for i in range(10000):
w1 = random_word()
w2 = random_word()
if match_levenshtein(w1, w2):
print(w1, w2)



and the output is:

wyqg wxeo
khuqw kqosz
wvhy weve
yqspuzc ycpg
rgvo rkwgpo
nhgxbag njqvxk
woebbbkf wvkpfyf


The Lookup Implementation

Now we can use a combination of these two functions to look for similar words. if soundex or the levenshtein-distance return that we have a match, we will declare that the string if found.


for i in range(1000):
w1 = random_word()
for predefined in predefined_words:
if match_soundex(w1, predefined) or match_levenshtein(w1, predefined):
print(w1, predefined)


Final Note


In this post we have presented a method for finding words similarity.

Per my tests, the soundex function find similarities even for words that do not look similar. This is due to the default of using a 4 characters string to represent the sound of the word. The jellyfish python library (that we've used) does not allow changing the default length. For production usages, I recommend using a library that does allow the default change.

2 comments:

  1. Soundex is English only algorithm while levinstein distance is a cross language algorithm.

    ReplyDelete
  2. Correct, though you may find other variations, for example:

    Spanish:
    https://www.researchgate.net/publication/285589803_Comparison_of_a_Modified_Spanish_Phonetic_Soundex_and_Phonex_coding_functions_during_data_matching_process

    Russian:
    https://github.com/roddar92/russian_soundex

    ReplyDelete