Duplicate Record Detection

How are others doing duplicate record detection? I’m not finding many solutions or methods online. I found one called SimString, but not much else. Similar to suggested items, I would like to show records that may match or have similar attributes.

validates_uniqueness_of might be of help?

That would help slow the duplication, but if someone fills out a form with first name, last name, and SSN, and mistypes the SSN, I would still have a duplicate. I would like to show admin users that a record has a related or similar record, much like how Google finds duplicates in your contacts and merges them.

"slow the duplication"?? No, ensuring that the SSNs *are unique* via validations and unique indexes would prevent duplicates, period.

What is it about that approach that you don't like?

I don't believe you're going to find a magic formula for what you're suggesting. The problem you describe for SSNs applies just as much to first and last names. What if somebody misspells Smith as Smit, for example? But worse yet, what if it is not a misspelling and the Smit is actually Smit? The same is true for SSNs: two records whose last 2 digits are swapped are not necessarily a typo, it could just be that 2 different people have the same name and very similar SSNs. You have to draw a line somewhere, I think.

You could use auto-complete fields and then provide options based on records found using the 'LIKE' option in the where clause using the information currently being entered. That might help but I think you'll find it's not worth the effort.

It would stop duplication of that unique string, but not fix a typo like transposed numbers. Besides, that was a simple example, not meant to be challenged. It’s the process of detecting duplicates I’m looking for. I know how to validate and key tables. Maybe another example is loading a million records from an external source, where you need to find duplicates. I’m just asking if anyone has seen APIs or Ruby utilities that perform this function. SimString, for example, seems to compare how close two strings are to matching.
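
For the transposed-numbers case specifically, a small check like the following might be a starting point. This is only a sketch: the method name `adjacent_transposition?` and the sample values are made up for illustration, and it only catches a single swap of two neighbouring characters, nothing more general.

```ruby
# Returns true when b can be produced from a by swapping exactly one
# pair of adjacent characters (for example, a transposed-digit typo).
def adjacent_transposition?(a, b)
  return false unless a.length == b.length

  # Positions at which the two strings differ.
  diffs = (0...a.length).select { |i| a[i] != b[i] }
  return false unless diffs.length == 2

  i, j = diffs
  j == i + 1 && a[i] == b[j] && a[j] == b[i]
end

# Hypothetical sample values, not real SSNs.
known   = ["123-45-6789", "987-65-4321"]
entered = "123-45-6798"

suspects = known.select { |ssn| adjacent_transposition?(ssn, entered) }
puts suspects.inspect
```

A hit here only means "worth showing to an admin", since two different people can legitimately have values that differ this way.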

Yes, this is all very true. I was thinking that comparing multiple attributes would help when just one of them, such as a name, is wrong. I’m not looking for magic, just wondering how others find duplicated records. I could see this being used to detect data that links or is similar in nature.
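
The multi-attribute idea could be sketched roughly as below: count how many fields two records agree on and flag pairs above a threshold. The field names, the `threshold` default, and the sample records are all assumptions for illustration, and the pairwise comparison is quadratic, so it would not scale to a million records without first narrowing the candidates.

```ruby
# Hypothetical field names; adjust to the real schema.
FIELDS = [:first_name, :last_name, :birth_date, :ssn].freeze

# Normalize a value so trivial differences (case, whitespace) do not count.
def normalize(value)
  value.to_s.strip.downcase
end

# Number of fields on which two records agree after normalization.
# Blank values never count as a match.
def match_score(a, b)
  FIELDS.count { |f| !normalize(a[f]).empty? && normalize(a[f]) == normalize(b[f]) }
end

# All pairs of records that agree on at least `threshold` fields.
def possible_duplicates(records, threshold: 3)
  records.combination(2).select { |a, b| match_score(a, b) >= threshold }
end

# Made-up sample data.
records = [
  { id: 1, first_name: "John", last_name: "Smith",  birth_date: "1980-01-02", ssn: "123-45-6789" },
  { id: 2, first_name: "john", last_name: "Smith ", birth_date: "1980-01-02", ssn: "123-45-6798" },
  { id: 3, first_name: "Jane", last_name: "Smit",   birth_date: "1975-06-30", ssn: "555-12-3456" }
]

possible_duplicates(records).each do |a, b|
  puts "Records #{a[:id]} and #{b[:id]} agree on #{match_score(a, b)} of #{FIELDS.size} fields"
end
```

With this data, records 1 and 2 agree on everything except the SSN, so they are the only pair reported.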

Found this as well.


Whenever I have worked on similar projects, it ended up being the customer's idea of a "close approximation" that defined a possible duplicate. It was usually something like:

  same birth date
  same last name
  same city (optional)
  same state (optional)
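
Rules like these (same birth date, same last name, optionally same city or state) can be expressed as a grouping key. The following is just a sketch, assuming records are plain hashes with hypothetical keys `:birth_date`, `:last_name`, `:city` and `:state`; the method name `candidate_groups` is also made up.

```ruby
# Groups records by the required fields (birth date and last name) plus
# any optional fields passed in `extra`, and returns only the groups
# that contain more than one record.
def candidate_groups(people, extra: [])
  keyed = people.group_by do |p|
    [p[:birth_date], p[:last_name].to_s.strip.downcase] +
      extra.map { |f| p[f].to_s.strip.downcase }
  end
  keyed.values.select { |group| group.size > 1 }
end

# Made-up sample data.
people = [
  { id: 1, last_name: "Smith", birth_date: "1980-01-02", city: "Austin", state: "TX" },
  { id: 2, last_name: "smith", birth_date: "1980-01-02", city: "Dallas", state: "TX" },
  { id: 3, last_name: "Jones", birth_date: "1980-01-02", city: "Austin", state: "TX" }
]

candidate_groups(people).each do |group|
  puts "Possible duplicates: #{group.map { |p| p[:id] }.join(', ')}"
end
```

Because this makes a single pass with `group_by`, it should cope with a large import much better than comparing every record against every other one. Passing `extra: [:city]` tightens the rule, and in this sample data it would leave no candidates.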

Since there is no tried-and-true definition of what a duplicate record is, I believe you'll just need to do some manual work. My advice would be to ask your customer or boss what the rules are.

Whenever I have worked on similar projects, it ended up being the customer's idea of a "close approximation" that defined a possible duplicate.

Exactly -- if they're not *identical* they're not "duplicates".

On the other hand, if you define "similarity" to some degree, you can use e.g. the Levenshtein gem to measure how "different" two given fields are.

Levenshtein.distance("Hassan Schroeder", "Hassan A. Schroeder")

=> 3


Yes, this is very nice.
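
For anyone who wants to try the edit-distance idea without installing a gem, here is a minimal pure-Ruby sketch. It is not the Levenshtein gem's implementation; `edit_distance` and `similarity` are names made up here, and dividing by the longer string's length is just one possible way to turn the distance into a score.

```ruby
# Classic dynamic-programming edit distance (insert, delete, substitute),
# keeping only the previous row of the table in memory.
def edit_distance(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Scales the distance to the range 0.0..1.0, where 1.0 means identical.
def similarity(a, b)
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?

  1.0 - edit_distance(a, b).to_f / longest
end

puts edit_distance("Hassan Schroeder", "Hassan A. Schroeder") # prints 3
puts similarity("Smith", "Smit")
```

A cutoff on `similarity` (say, flag anything above some value the customer agrees on) is still a business decision rather than something the code can decide.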