In this post we will review usage of the Soundex and the Levenshtein-Distance algorithms to check words similarities. Our goal is to implement a method to decide if the a given word is similar to one of a list of a predefined well known words that we have.
For example, we could have the following list of predefined words:
predefined_words = [
"user",
"account",
"address",
"name",
"firstname",
"lastname",
"surname",
"credit",
"card",
"password",
"pass",
]
Given a new word such as "uzer", we would like to find if it match any of our predefined words.
Soundex
One method is to use the Soundex function.
The Soundex function, creates a string of 4 characters representing the phonetic sound of the the word. The following code random words pairs, and prints the similar words:
import random
import string
import jellyfish
def match_soundex(str1, str2):
sound_ex1 = jellyfish.soundex(str1)
sound_ex2 = jellyfish.soundex(str2)
return sound_ex1 == sound_ex2
def random_word():
word_len = random.randint(4, 8)
word = ""
for _ in range(word_len):
word += random.choice(string.ascii_letters)
return word.lower()
for i in range(10000):
w1 = random_word()
w2 = random_word()
if match_soundex(w1, w2):
print(w1, w2)
and the output is:
ylqhso yloja wpppw wbuihu doyk dhyazgg vvzbzpam vskpakt gxtjh gxdzu pgpeg pspqnug xahbfhs xvex
Levenshtein Distance
Another method is the Levenshtein-Distance.
The Levenshtein-Distance calculates using dynamic programming, how many changes should be done in order to change one string into a second string.
The following code random words pairs, and prints the similar words. For our purpose, 2 word will be similar if less than 20% of the characters were changed.
import random
import string
import jellyfish
def match_levenshtein(str1, str2):
distance = jellyfish.levenshtein_distance(str1, str2)
min_len = min(len(str1), len(str2))
return distance / min_len < 0.2
def random_word():
word_len = random.randint(4, 8)
word = ""
for _ in range(word_len):
word += random.choice(string.ascii_letters)
return word.lower()
for i in range(10000):
w1 = random_word()
w2 = random_word()
if match_levenshtein(w1, w2):
print(w1, w2)
and the output is:
wyqg wxeo khuqw kqosz wvhy weve yqspuzc ycpg rgvo rkwgpo nhgxbag njqvxk woebbbkf wvkpfyf
The Lookup Implementation
Now we can use a combination of these two functions to look for similar words. if soundex or the levenshtein-distance return that we have a match, we will declare that the string if found.
for i in range(1000):
w1 = random_word()
for predefined in predefined_words:
if match_soundex(w1, predefined) or match_levenshtein(w1, predefined):
print(w1, predefined)
Soundex is English only algorithm while levinstein distance is a cross language algorithm.
ReplyDeleteCorrect, though you may find other variations, for example:
ReplyDeleteSpanish:
https://www.researchgate.net/publication/285589803_Comparison_of_a_Modified_Spanish_Phonetic_Soundex_and_Phonex_coding_functions_during_data_matching_process
Russian:
https://github.com/roddar92/russian_soundex