Working with Large Data Collections and Performance


I am currently reimplementing my project’s data collection. The project uses Rails, and we collect a lot of data from APIs and then store it in the database. We process tens of thousands of data sets every couple of minutes. The challenge is that each model has many associations with unique constraints to represent each set of data.

To work around the unique constraints, I have been using find_or_initialize_by on the unique indices and then updating, but this results in many database calls. I have tried the upsert gem, but it has problems with hstore and array data, and it does not support associations or callbacks, so I end up juggling lots of hashes and arrays to rebuild the associations. I have also tried hand-written PostgreSQL queries to insert HABTM associations for a whole batch in a single database transaction.

All of these attempts seem to result in code that is not maintainable, even though some of them reduce the number of database calls. I would like to gather the Rails community’s thoughts on how to maximize performance, the Rails way, when upserting large amounts of data.


Why? What problem are you trying to solve?