Database cleaning

Although i'm using validations in my models, i've managed to get a few duplicate entries on my site. I've got roughly 300,000 entries, and my administration only allows long running rake tasks of up to 5 hours. I need to remove all entries that have the same :word and :language.

Is there an easy way to iterate through all entries, check to see if any other entries have the same :word and :language, and delete those from the DB?

Richard Schneeman wrote:

Although i'm using validations in my models, i've managed to get a few duplicate entries on my site. I've got roughly 300,000 entries, and my administration only allows long running rake tasks of up to 5 hours. I need to remove all entries that have the same :word and :language.

Is there an easy way to iterate through all entries, check to see if any other entries have the same :word and :language, and delete those from the DB?

Long term, I recommend adding a UNIQUE INDEX constraint to the database table (preferably via a migration) in addition to a validates_uniqueness_of in the model. The ActiveRecord validations are susceptible to race conditions, but the database constraint should not be.

Before you can add this index you'll need to remove the duplicates. There are a couple ways to do this, and the best choice depends on how many dupes you have within the entire dataset. In most cases there are relatively few dupes (say less than 1000 in your 300K), and here's my preference (performed in a rake task or a migration, or manually using the SQL tool of your choice, whichever way fits your workflow best):

1. Fetch the set of distinct tuples that are duplicated (word + language in your case). Something like this:

  select word, language from entries group by word, language having count(*) > 1

2. Once you know what your duplicates are, iterate over them and delete all but one of the records:

  dupes.each do |dupe|     # this algorithm retains only the most recently added dupe, modify as needed     # make sure you have an index on (word, language), not yet unique     entries = Entry.all(         :conditions => {             :word => dupe.word,             :language => dupe.language},         :order => 'id desc')

    entries.each_with_index do |e, index|       e.destroy if index > 0     end   end

3. Re-run the query and confirm that you don't have any lingering duplicates.

4. Add the UNIQUE INDEX constraint to prevent more dupes from being created. Make sure you have a validates_uniqueness_of in the model too.

Hope that helps!

Jeremy http://jeronrails.blogspot.com