Using Algorithms to Match Customer Data
by Steven Benitez • Partner / Software Engineer
Inconsistent data across systems—it’s a tale as old as IT.
Recently, we needed to correlate customer information between two systems to determine whether each user in one system was likely the same person as the associated customer in a Customer Information System.
Importantly, we were matching the names of people we already knew to be the same person, family members, or otherwise related to each other.
In a perfect world, we could simply check whether the names matched exactly, but unfortunately, things aren’t that easy. Data inconsistencies meant that some records had a suffix in the last name field, so the last name on one side might be "Doe," while on the other side, it would be "Doe Jr." or "Doe III."
Further, people may have used a shortened form of their name or nickname in one system. Think "Clay" vs. "Clayton," "Bob" vs. "Robert," or "Steve" vs. "Steven."
We wound up sorting names into four categories:
- Exact or near-exact matches: Both first and last names were exact matches when normalized.
- Close matches: Both first and last names were close matches, using algorithms to identify similar names.
- Possible matches: Either the first or last name was a close match, but not both. These may represent family members or people who got married or divorced.
- Not matches: Neither the first nor the last name was an exact or close match.
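The four categories above can be sketched as a small decision function. This is only an illustration, not the code we ran: the `NameMatch` enum and `categorize` function are hypothetical names, and the sketch assumes that a pair with one exact name and one close name counts as a close match, and that an exact match on only one name counts as a possible match.

```python
from enum import Enum


class NameMatch(Enum):
    """How well a single name (first or last) matched. Hypothetical helper type."""
    EXACT = "exact"
    CLOSE = "close"
    NONE = "none"


def categorize(first: NameMatch, last: NameMatch) -> str:
    """Combine the first-name and last-name results into one of four categories."""
    if first is NameMatch.EXACT and last is NameMatch.EXACT:
        return "exact or near-exact match"
    # Assumption: exact + close on the two names is treated as a close match.
    if first is not NameMatch.NONE and last is not NameMatch.NONE:
        return "close match"
    # Only one of the two names matched in some way.
    if first is not NameMatch.NONE or last is not NameMatch.NONE:
        return "possible match"
    return "not a match"


print(categorize(NameMatch.CLOSE, NameMatch.NONE))  # possible match
```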
Exact or near-exact matches
For this category, we took the following steps to normalize names:
- Trim whitespace from the beginning and end of the name
- Remove diacritics (accents)
- Convert the name to lowercase
Names that were identical after normalization were considered exact matches.
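As a rough sketch, the three normalization steps might look like this in Python. The function name `normalize_name` is made up for illustration, and the diacritic removal assumes that Unicode decomposition is good enough for the data at hand.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Trim whitespace, remove diacritics, and lowercase a name."""
    trimmed = name.strip()
    # NFKD splits an accented character into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", trimmed)
    # Drop the combining marks, keeping only the base letters.
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_accents.lower()


print(normalize_name("  José ") == normalize_name("jose"))  # True
```

Letters that have no decomposed form, such as "ø" or "ł", pass through this approach unchanged, so data containing them may need an extra mapping step.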
Close matches
For this category, we took the following steps to determine similar names:
- Generate the Soundex code for each name
- Compute the Jaro-Winkler (JW) similarity score between the two names
Names with identical Soundex codes or a JW score of 0.80 (80%) or higher were considered close matches.
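The rule above is an "either/or" test, which a short sketch makes explicit. The function `is_close_match` is a hypothetical helper that takes Soundex codes and a similarity score that have already been computed elsewhere; the default threshold is the 80% cutoff mentioned above.

```python
def is_close_match(
    soundex_a: str, soundex_b: str, jw_score: float, threshold: float = 0.80
) -> bool:
    """Return True if the names sound alike OR are spelled similarly enough."""
    # An empty code (e.g., from an empty name) should never count as a match.
    same_sound = bool(soundex_a) and soundex_a == soundex_b
    return same_sound or jw_score >= threshold


# "Steven" vs. "Stephen": both encode to S315 and score 0.894.
print(is_close_match("S315", "S315", 0.894))  # True
```

Because the two checks are combined with "or", a pair only needs to pass one of them: names that sound alike but are spelled quite differently still qualify, as do similarly spelled names whose Soundex codes differ.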
Soundex
Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English.[1] It converts a name into a four-character code: a letter followed by three digits. For example, the names "Steven" and "Stephen" both yield "S315".
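For readers who want to see the mechanics, here is a minimal sketch of the common American Soundex rules. It is not necessarily the implementation we used (many languages and databases ship their own), and it assumes the input has already been normalized to plain letters.

```python
# Map each consonant to its Soundex digit; vowels, y, h, and w have no digit.
CODES = {}
for letters, digit in (
    ("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
    ("l", "4"), ("mn", "5"), ("r", "6"),
):
    for letter in letters:
        CODES[letter] = digit


def soundex(name: str) -> str:
    """Return the four-character Soundex code for a name ('' if no letters)."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev_code = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters that share a code
        code = CODES.get(ch, "")
        if code and code != prev_code:
            result += code
        # A vowel resets prev_code, so the same code may repeat after it.
        prev_code = code
    # Pad with zeros and keep one letter plus three digits.
    return (result + "000")[:4]


print(soundex("Steven"), soundex("Stephen"))  # S315 S315
```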
Jaro-Winkler Similarity
The Jaro-Winkler similarity is a string metric measuring an edit distance between two sequences.[2] Scores range from 0 (no similarity) to 1 (identical), and strings that share a common prefix receive higher scores. The names "Steven" and "Stephen" yield a score of 0.894, or 89.4% similar.
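The calculation can be sketched in plain Python as follows. This is an illustrative implementation of the standard formula, not necessarily the library we used; the function names and the `prefix_scale` and `max_prefix` parameters are chosen for this example, with the conventional defaults of 0.1 and 4.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: based on matching characters and transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that line up in a different order are transpositions.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (
        matches / len1 + matches / len2 + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(
    s1: str, s2: str, prefix_scale: float = 0.1, max_prefix: int = 4
) -> float:
    """Boost the Jaro score for strings that share a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


print(round(jaro_winkler("steven", "stephen"), 3))  # 0.894
```

The comparison is case-sensitive, so names should be normalized before scoring.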
Possible matches
Results in this category might be the same person, but confirming that would require manual review or some additional processing.
Not a match
Results in this category could be safely assumed to not be the same person.
Results
Luckily, the vast majority of results fell into the "exact match" or "not a match" categories, leaving much smaller sets of "close matches" and "possible matches" to review. This turned an overwhelming data association effort into a much more manageable one.