In this post we will review usage of the Soundex and the Levenshtein-Distance algorithms to check words similarities. Our goal is to implement a method to decide if the a given word is similar to one of a list of a predefined well known words that we have.

For example, we could have the following list of predefined words:

predefined_words = [
    "user",
    "account",
    "address",
    "name",
    "firstname",
    "lastname",
    "surname",
    "credit",
    "card",
    "password",
    "pass",
]

Given a new word such as "uzer", we would like to find if it match any of our predefined words.

Soundex

One method is to use the Soundex function.

The Soundex function, creates a string of 4 characters representing the phonetic sound of the the word. The following code random words pairs, and prints the similar words:

import random
import string

import jellyfish


def match_soundex(str1, str2):
    sound_ex1 = jellyfish.soundex(str1)
    sound_ex2 = jellyfish.soundex(str2)
    return sound_ex1 == sound_ex2


def random_word():
    word_len = random.randint(4, 8)
    word = ""
    for _ in range(word_len):
        word += random.choice(string.ascii_letters)
    return word.lower()


for i in range(10000):
    w1 = random_word()
    w2 = random_word()
    if match_soundex(w1, w2):
        print(w1, w2)

and the output is:

ylqhso yloja
wpppw wbuihu
doyk dhyazgg
vvzbzpam vskpakt
gxtjh gxdzu
pgpeg pspqnug
xahbfhs xvex

Levenshtein Distance

Another method is the Levenshtein-Distance.

The Levenshtein-Distance calculates using dynamic programming, how many changes should be done in order to change one string into a second string.

The following code random words pairs, and prints the similar words. For our purpose, 2 word will be similar if less than 20% of the characters were changed.

import random
import string

import jellyfish


def match_levenshtein(str1, str2):
    distance = jellyfish.levenshtein_distance(str1, str2)
    min_len = min(len(str1), len(str2))
    return distance / min_len < 0.2


def random_word():
    word_len = random.randint(4, 8)
    word = ""
    for _ in range(word_len):
        word += random.choice(string.ascii_letters)
    return word.lower()


for i in range(10000):
    w1 = random_word()
    w2 = random_word()
    if match_levenshtein(w1, w2):
        print(w1, w2)

and the output is:

wyqg wxeo
khuqw kqosz
wvhy weve
yqspuzc ycpg
rgvo rkwgpo
nhgxbag njqvxk
woebbbkf wvkpfyf

The Lookup Implementation

Now we can use a combination of these two functions to look for similar words. if soundex or the levenshtein-distance return that we have a match, we will declare that the string if found.

for i in range(1000):
    w1 = random_word()
    for predefined in predefined_words:
        if match_soundex(w1, predefined) or match_levenshtein(w1, predefined):
            print(w1, predefined)

Final Note

In this post we have presented a method for finding words similarity.

Per my tests, the soundex function find similarities even for words that do not look similar. This is due to the default of using a 4 characters string to represent the sound of the word. The jellyfish python library (that we've used) does not allow changing the default length. For production usages, I recommend using a library that does allow the default change.

2 comments:

baldermanNovember 4, 2020 at 12:04 PM
Soundex is English only algorithm while levinstein distance is a cross language algorithm.
AlonNovember 4, 2020 at 8:26 PM
Correct, though you may find other variations, for example:

Spanish:
https://www.researchgate.net/publication/285589803_Comparison_of_a_Modified_Spanish_Phonetic_Soundex_and_Phonex_coding_functions_during_data_matching_process

Russian:
https://github.com/roddar92/russian_soundex

Full Blog TOC

Full Blog Table Of Content with Keywords Available HERE

Wednesday, November 4, 2020

Using Soundex and Levenshtein-Distance in Python

Soundex

Levenshtein Distance

The Lookup Implementation

Final Note

2 comments: