see a puffin eat a fish

The hottest thing I found this week (and it’s only monday)

without comments

sad groucho marx bear

Lately i’ve been doing data imports with rails where I am running into a number of concurrency, race and timeout conditions. I have too many potential points of failure (network latency, database timeouts, etc.) that I cannot fully control but must account for. Worst case scenario I end up with missing or duplicate records when an import job dies or overlaps with another one.

The Rails way to address the data duplication problem is something like this in the controller:

validates_uniqueness_of :guid

This is a promise from Rails that it will not attempt to insert a record with the same guid as an existing record. What that really means is that if you are inserting 1,000 records, Rails will generate 1,000 selects to verify non-duplicate guids. That doesn’t really scale. I got around this by commenting out the model’s validation and creating a unique index straight up on the database.

add_index(:my_table, [:guid], :unique => true, :name => 'guids_must_be_unique')

Then instead of doing a select/insert cycle for each record to test for duplication (as the model would have done) I just do an insert and wrap it in a begin/rescue clause. If the insert fails because the unique index is violated, mysql throws an exception but execution continues:

Mysql::Error
or: Duplicate entry '17-24484' for key 2: INSERT INTO my_table ('guid') VALUES (24484)

ZoneTag Photo Saturday 4:07 pm 2/9/08

This reduces two calls to one (from 2000 selects/inserts to 1000 inserts) but doesn’t change the fact that 1000 inserts take a really long time. My scaling issues haven’t gone away. Enter the hottest thing I found this week, ar-extensions. It’s a bunch of really useful extensions to ActiveRecord which I have yet to fully explore but the one that has saved the day for me is the bulk import functionality. Ar-extensions lets me build up a giant array of arrays which it then proceeds to insert in bulk, as many as the underlying database will allow. My system went from 1,000 (or even 10,000 SQL calls) down to one. Routines that would have run all day finish in seconds. Let me show you have easy it is to create a preferences record for all your users (special thanks to this post for the helpful examples):

require 'ar_extensions'
# columns you want to import
@columns = [:guid, :contact_me_via_email]
# assuming a User model
@values = User.find(:all).map { |u| u.guid, u.contact_me_via_email }
# then just run an import
Preferences.import @columns, @values

Run the code, tail your log and watch the magic. What’s even more useful is the poorly documented :o n_duplicate_key_update option which translates into a bulk update (as opposed to insert).

Preferences.import @columns, @values, :o n_duplicate_key_update=>[:contact_me_via_email]

if an existing guid is found (assuming a unique index exists), a new record will not be inserted. Instead, the contact_me_via_email field will be updated.

Written by mb

February 11th, 2008 at 6:06 pm

Posted in Code

Leave a Reply

IMPORTANT! To be able to proceed, you need to solve the following simple math (so we know that you are a human) :-)

What is 5 + 9 ?
Please leave these two fields as-is: