Duplicate Record Detection

Justin_Stanczak · July 10, 2011, 10:21pm

How are others doing duplicate record detection? I’m not finding many solutions or methods online. I found one called SimString, but not much else. I was wondering how others are detecting duplicates. Similar to suggested items, I would like to show records that may match or have similar attributes.

pepe1 · July 11, 2011, 4:25pm

validates_uniqueness_of might be of help?

Justin_Stanczak · July 11, 2011, 4:42pm

That would help slow the duplication, but if someone fills out a form and submits fname, lname, ss#, and they typo the ss# I would have a duplicate. I would like to display to admin users that this record has a related link, or is similar. Similar to how Google finds duplicates in your contacts and merges them.

hassan · July 11, 2011, 8:02pm

"slow the duplication"?? No, ensuring that the SSNs *are unique* via validations and unique indexes would prevent duplicates, period.

What is it about that you don't like?

pepe1 · July 11, 2011, 8:17pm

I don't believe you're going to find a magic formula for what you're suggesting. What you say could happen with SSNs could just as well happen with first or last names. What if somebody misspells Smith as Smit, for example? But worse yet, what if it is not a misspelling and the Smit is actually Smit? The same is true for SSNs: two records whose last 2 digits are swapped are not necessarily a "misspell"; it could just be that 2 different people have the same name and very similar SSNs. You have to draw a line somewhere, I think.

You could use auto-complete fields and then provide options based on records found using the 'LIKE' option in the where clause using the information currently being entered. That might help but I think you'll find it's not worth the effort.

Justin_Stanczak · July 11, 2011, 8:17pm

It would stop duplication of that unique string, but not fix a typo like transposed numbers. Besides, that was a simple example, not meant to be challenged. It’s the process of detecting duplicates I’m looking for. I know how to validate and key tables. Maybe another example is loading a million records from an external source, where you need to find duplicates. I’m just asking if anyone has seen APIs or Ruby utilities that perform this function. SimString, for example, seems to compare how close two strings are to matching.
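To make "how close two strings are" concrete, here is a minimal sketch of one common measure: Jaccard similarity over character bigrams. It is not SimString's actual implementation, and the helper names `ngrams` and `ngram_similarity` are made up for illustration.

```ruby
require 'set'

# Split a string into overlapping character n-grams (bigrams by default).
def ngrams(str, n = 2)
  s = str.downcase.strip
  return Set[s] if s.length < n
  Set.new((0..s.length - n).map { |i| s[i, n] })
end

# Jaccard similarity: shared n-grams divided by all distinct n-grams.
# 1.0 means identical n-gram sets, 0.0 means nothing in common.
def ngram_similarity(a, b, n = 2)
  x = ngrams(a, n)
  y = ngrams(b, n)
  union = (x | y).size
  return 1.0 if union.zero?
  (x & y).size.to_f / union
end

puts ngram_similarity("Smith", "Smit")
```

A score near 1.0 suggests a likely typo of the same value; where to put the cutoff is a judgment call for your data.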

Justin_Stanczak · July 11, 2011, 8:29pm

Yes, this is all very true. I was thinking if a comparison was done on multiple attributes that would help with just one name being wrong. I’m not looking for magic, just wondering how others find duplicated records. I could see this being used to detect data that links or is similar in nature.
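A rough sketch of that multi-attribute idea: count how many fields agree between two records and flag the pair when most of them do, so a single typo'd field does not hide the match. The field names, the hash-based records, and the threshold of 3 are all assumptions for illustration.

```ruby
# Hypothetical field list; use whatever attributes your records carry.
FIELDS = [:fname, :lname, :ssn, :birth_date].freeze

# Normalize a value so case, spacing, and punctuation do not count as differences.
def normalize(value)
  value.to_s.downcase.gsub(/[^a-z0-9]/, '')
end

# Count how many non-blank fields agree between two records (hashes).
def matching_fields(a, b, fields = FIELDS)
  fields.count { |f| !normalize(a[f]).empty? && normalize(a[f]) == normalize(b[f]) }
end

# Flag a pair as a possible duplicate when enough fields agree.
def possible_duplicate?(a, b, min_matches = 3)
  matching_fields(a, b) >= min_matches
end
```

With four fields and a threshold of three, two records that differ only in a mistyped SSN would still be flagged for an admin to review.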

Found this as well.

http://en.wikipedia.org/wiki/User:Ipeirotis/Duplicate_Record_Detection

pepe1 · July 11, 2011, 8:56pm

Whenever I have worked on similar projects, it ended up being the customer's idea of what a "close approximation" was that made a possible duplicate. It was usually something like: same birth date, same last name, same city (optional), same state (optional).

Since there is no such thing as a tried and true method for deciding what a duplicate record is, I believe you'll just need to do some manual work. My advice would be to ask your customer/boss what the rules are.
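Once the rules are agreed, a rule set like "same birth date, same last name" can be applied by grouping records on those fields and reviewing only the groups with more than one member. This is a sketch under assumptions: records are plain hashes, and the keys `:birth_date`, `:lname`, and `:id` are hypothetical.

```ruby
# Build a grouping key from the required rule fields (birth date and last name).
def block_key(record)
  [record[:birth_date].to_s.strip, record[:lname].to_s.strip.downcase]
end

# Group records by key and keep only groups with more than one member.
def candidate_groups(records)
  records.group_by { |r| block_key(r) }.values.select { |group| group.size > 1 }
end
```

Because records are only compared within a group, this avoids checking every pair when the data set is large. The optional fields (city, state) could then be used to rank the candidates inside each group.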

hassan · July 11, 2011, 9:04pm

Whenever I have worked on similar projects, it ended up being the customer's idea of what a "close approximation" was that made a possible duplicate.

Exactly -- if they're not *identical* they're not "duplicates".

On the other hand if you define "similarity" to some degree you can use e.g. the Levenshtein gem to measure how "different" 2 given fields are.

require 'levenshtein'

Levenshtein.distance("Hassan Schroeder", "Hassan A. Schroeder")

=> 3
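If you'd rather not depend on a gem, the same distance can be sketched in plain Ruby. This is a minimal, unoptimized version, and the method names `edit_distance` and `similarity` are made up for illustration. The second method scales the distance to a 0.0..1.0 score so that fields of different lengths can share one threshold.

```ruby
# Classic dynamic-programming edit distance (insertions, deletions, substitutions).
def edit_distance(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Scale the distance to a 0.0..1.0 similarity; 1.0 means the strings are identical.
def similarity(a, b)
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?
  1.0 - edit_distance(a, b).to_f / longest
end

puts edit_distance("Hassan Schroeder", "Hassan A. Schroeder")
puts edit_distance("123-45-6789", "123-45-6798")
```

Note that two transposed digits count as 2 edits under plain Levenshtein, since a swap is treated as two substitutions.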

HTH!

Justin_Stanczak · July 11, 2011, 9:12pm

Yes, this is very nice.
