Using Algorithms to Match Customer Data

May 16, 2025

by Steven Benitez • Partner / Software Engineer

Inconsistent data across systems—it’s a tale as old as IT.

Recently, we needed to correlate customer information between two systems to identify whether users in one system were likely to be the same person as their associated customer in a Customer Information System.

Importantly, we’re matching the names of people we already know are connected: each pair is either the same person, family members, or people with some other relationship to each other.

In a perfect world, we could simply check whether the names matched exactly, but unfortunately, things aren’t that easy. Data inconsistencies meant that some records had suffixes placed in the last name field, so the last name on one side might be "Doe," while on the other side, it would be "Doe Jr." or "Doe III."

Further, people may have used a shortened form of their name or nickname in one system. Think "Clay" vs. "Clayton," "Bob" vs. "Robert," or "Steve" vs. "Steven."

We wound up sorting names into four categories:

Exact or near exact matches — Both first and last names were exact matches when normalized.
Close matches — Both first and last names were close matches using algorithms to identify similar names.
Possible matches — Either the first or last name was a close match, but not both. These may represent family members or people who got married or divorced.
Not matches — Neither name matched.
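As a rough illustration of how these four categories fit together, here is a minimal Python sketch. The names `MatchCategory`, `classify`, `demo_is_exact`, and `demo_is_close` are hypothetical and not from our actual code. The sketch also assumes that an exact match on a name counts as a close match for that name, which the category definitions above leave open. The two `demo_` comparison functions are crude placeholders so the example runs on its own; real comparisons would use the normalization and similarity checks described in the following sections.

```python
from enum import Enum
from typing import Callable

# A comparer takes two names and says whether they match (hypothetical type alias).
NameComparer = Callable[[str, str], bool]


class MatchCategory(Enum):
    EXACT = "exact or near exact match"
    CLOSE = "close match"
    POSSIBLE = "possible match"
    NOT_A_MATCH = "not a match"


def classify(first_a: str, last_a: str, first_b: str, last_b: str,
             is_exact: NameComparer, is_close: NameComparer) -> MatchCategory:
    """Sort a pair of (first, last) names into one of the four categories."""
    first_exact = is_exact(first_a, first_b)
    last_exact = is_exact(last_a, last_b)
    if first_exact and last_exact:
        return MatchCategory.EXACT

    # Assumption: an exact match on one name also counts as a close match.
    first_close = first_exact or is_close(first_a, first_b)
    last_close = last_exact or is_close(last_a, last_b)
    if first_close and last_close:
        return MatchCategory.CLOSE
    if first_close or last_close:
        return MatchCategory.POSSIBLE
    return MatchCategory.NOT_A_MATCH


# Placeholder comparers for demonstration only; they are deliberately simplistic.
def demo_is_exact(a: str, b: str) -> bool:
    return a.strip().lower() == b.strip().lower()


def demo_is_close(a: str, b: str) -> bool:
    return a.strip().lower()[:3] == b.strip().lower()[:3]


if __name__ == "__main__":
    print(classify("Steve", "Doe", "Steven", "Doe Jr.", demo_is_exact, demo_is_close))
    print(classify("Jane", "Doe", "John", "Doe", demo_is_exact, demo_is_close))
```

Passing the comparison functions in as arguments keeps the category logic separate from whichever normalization and similarity rules are in use.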

Exact or near exact names

For this category, we took the following steps to normalize names:

Trim whitespace from the beginning and end of the name
Remove diacritics (accents)
Convert the name to lowercase

Any names that matched in this category were considered exact matches.
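The three normalization steps above could be sketched in Python roughly as follows. The function name `normalize_name` is hypothetical, and using Unicode decomposition (NFD) to strip accents is one common approach rather than necessarily the one we used. Note that letters without a decomposed form, such as "ø", pass through unchanged.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Trim whitespace, remove diacritics, and lowercase a name."""
    trimmed = name.strip()
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", trimmed)
    # Drop the combining marks, keeping only the base letters.
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_marks.lower()


if __name__ == "__main__":
    # Both sides normalize to the same string, so they compare equal.
    print(normalize_name("  Jos\u00e9 ") == normalize_name("jose"))
```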

Close matches

For this category, we took the following steps to determine similar names:

Generate the Soundex for both names
Determine the Jaro-Winkler similarity score for both names

Any names that had identical Soundex codes or a Jaro-Winkler score of 80% or better were considered close matches.

Soundex

Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English.¹ It converts a name into a letter and three digits. For example, the names "Steven" and "Stephen" both yield "S315".
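For readers who want to see the mechanics, here is a minimal sketch of the common American Soundex rules in Python. It is an illustration, not the implementation we used; in practice a database function or library would likely do this for you. The function name `soundex` and the choice to return an empty string for input with no letters are assumptions of this sketch. It only considers the letters a to z, so names should have accents removed first.

```python
# Map each coded consonant to its Soundex digit.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _letter in _letters:
        _CODES[_letter] = _digit


def soundex(name: str) -> str:
    """Return the four-character Soundex code for a name."""
    letters = [ch for ch in name.lower() if "a" <= ch <= "z"]
    if not letters:
        return ""  # assumption: no letters means no code

    first = letters[0]
    digits = []
    previous = _CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            # h and w are skipped and do not separate repeated codes.
            continue
        code = _CODES.get(ch, "")
        if code and code != previous:
            digits.append(code)
        # Vowels have no code, so they reset `previous` and allow repeats.
        previous = code

    # Keep the first letter, then pad with zeros or truncate to three digits.
    return (first.upper() + "".join(digits) + "000")[:4]


if __name__ == "__main__":
    print(soundex("Steven"), soundex("Stephen"))
```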

Jaro-Winkler Similarity

The Jaro–Winkler similarity is a string metric measuring an edit distance between two sequences.² It gives more favorable ratings to strings that match from the beginning. The names "Steven" and "Stephen" yield a score of 0.894, or 89.4% similar.
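The calculation can be sketched in Python as below. This is a textbook-style illustration under a few assumptions, not the code we ran: the function names `jaro` and `jaro_winkler` are hypothetical, the prefix scaling factor is the conventional 0.1, and the shared prefix is capped at four characters. The comparison is case-sensitive, so names should be normalized first.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: 0.0 means no similarity, 1.0 means identical."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters only count as matching if they are within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        start = max(0, i - window)
        end = min(len2, i + window + 1)
        for j in range(start, end):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2

    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Boost the Jaro score for strings that share a prefix (up to 4 chars)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


if __name__ == "__main__":
    print(round(jaro_winkler("steven", "stephen"), 3))  # 0.894
```

For "steven" and "stephen", five characters match with no transpositions, giving a Jaro score of about 0.849. The shared three-letter prefix "ste" then lifts the result to about 0.894.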

Possible matches

Results in this category might be the same person, but they need human review or some other processing before we can be certain.

Not a match

Results in this category could be safely assumed to not be the same person.

Results

Luckily, the vast majority of results fell into the "exact match" or "not a match" categories, leaving much smaller sets of "close matches" and "possible matches" to be reviewed. This turned an overwhelming data association effort into a much more manageable one.
