Are you having fun with fuzzy when looking for duplicate records?
Thursday, 24th July 2014
Categories:
Data
Poor contact data that includes duplicate records costs companies and organisations thousands of pounds every year.
Duplicate records come from many sources, including:
- Poor data entry
- Merging data from different parts of the business process
- Mergers and Acquisitions over many years
- No single individual responsible for data integrity
- Lack of knowledge of how to deal with the problem
Exact duplicates are relatively easy to identify and deal with, but represent only a small number of possible duplicates in your data.
It’s the non-exact matches you need to worry about, and this is where fuzzy matching can help.
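To illustrate why exact duplicates are the easy part, here is a minimal Python sketch. It assumes each contact is a simple (name, address) pair; the sample records and the function names `normalise` and `exact_duplicates` are made up for illustration and do not come from any particular product.

```python
from collections import defaultdict

def normalise(record):
    """Lower-case each field and collapse whitespace so trivial differences vanish."""
    return tuple(" ".join(field.lower().split()) for field in record)

def exact_duplicates(records):
    """Group the indexes of records that share the same normalised key."""
    groups = defaultdict(list)
    for index, record in enumerate(records):
        groups[normalise(record)].append(index)
    return [indexes for indexes in groups.values() if len(indexes) > 1]

# Hypothetical contact records: (name, address line)
contacts = [
    ("Jon Smith", "12 High Street"),
    ("jon smith ", "12  High Street"),   # same record with different case and spacing
    ("John Smyth", "12 High St"),        # probably the same person, but not an exact match
]

duplicate_groups = exact_duplicates(contacts)
print(duplicate_groups)
```

The first two records are grouped together once case and spacing are tidied up. The third is left alone, even though a person reading it would suspect a duplicate, and that is the gap an exact comparison cannot close.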
Fuzzy matching is at the heart of fuzzy deduplication. It uses advanced mathematical processes to measure how similar two records are; the goal is not an exact match but to identify possible matches. Several different algorithms can be used when dealing with contact names and addresses, including:
- Phonetic Matching
- N-gram or Q-gram based algorithms
- Jaro-Winkler algorithm
- Containment, Frequency, Fast Near, Accurate Near, Frequency Near, Vowels Only, Consonants Only, Alphas Only and Numeric Only algorithms
Whether you use some or all of these depends on what you need to achieve and on the current state of your data.
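As a rough illustration of two of the approaches listed above, the Python sketch below scores a pair of strings with a bigram (2-gram) similarity and with Jaro-Winkler. These are simplified textbook versions written for this post, not the implementation used by any specific deduplication tool; the function names, the choice of the Dice coefficient for the n-gram score and the 0.1 prefix weight are assumptions.

```python
def bigrams(text):
    """Return the set of 2-character grams in a lower-cased, space-collapsed string."""
    cleaned = " ".join(text.lower().split())
    return {cleaned[i:i + 2] for i in range(len(cleaned) - 1)}

def bigram_similarity(a, b):
    """Dice coefficient on bigram sets: 1.0 means identical sets, 0.0 means no overlap."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    if not grams_a and not grams_b:
        # Both strings are too short to have bigrams; fall back to a plain comparison.
        return 1.0 if a.strip().lower() == b.strip().lower() else 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))

def jaro(a, b):
    """Jaro similarity between two strings (case-sensitive), from 0.0 to 1.0."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    # Characters only count as matching if they are within this distance of each other.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    matched_a = [False] * len(a)
    matched_b = [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not matched_b[j] and b[j] == ch:
                matched_a[i] = matched_b[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as half a transposition each.
    seq_a = [ch for ch, hit in zip(a, matched_a) if hit]
    seq_b = [ch for ch, hit in zip(b, matched_b) if hit]
    transpositions = sum(x != y for x, y in zip(seq_a, seq_b)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3

def jaro_winkler(a, b, prefix_weight=0.1):
    """Jaro score boosted for a shared prefix of up to four characters."""
    score = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:4], b[:4]):
        if x != y:
            break
        prefix += 1
    return score + prefix * prefix_weight * (1 - score)

# Hypothetical name pair; lower-case first because jaro() is case-sensitive.
name_a, name_b = "jon smith", "john smyth"
print(round(bigram_similarity(name_a, name_b), 3))
print(round(jaro_winkler(name_a, name_b), 3))
```

In practice a score like this is compared against a threshold to flag pairs for review. Where to set that threshold is a judgment call that depends on the data, which is one reason the choice of algorithm matters.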
The reason you need to make sure your data is accurate is not just the cost of printing and mailing more than you need to. Operationally, for example, accurate records mean that items are delivered to the right place and that performance monitoring within the business is more reliable.
In addition, customer relations are enhanced - it’s embarrassing when you send out badly addressed items, and insensitive when items are addressed to people who have recently died.
Author: Gamze Bilgili