Importing Legacy Data into Your Rails App
Sometimes you have to write a new Rails app for old data. One approach is to keep the old format and carefully craft your ActiveRecord models to work with the existing schema. But, odds are the old schema has become a mess overtime and wasn’t particularly Rails friendly to start with. When time and resources allow, I prefer a do-over.
That means Extracting, Transforming, and Loading (ETL) the old data. And for that, I like a Rake task. First you need to choose the correct gem to connect to your database, hopefully MySQL or PostgreSQL, and, if you aren’t already using it, add it to your Gemfile:
gem 'mysql2'
# or
gem 'pg'
Then you need to define the database connection. Since I already have
a process for managing the database.yml
file, I like to put it there.
legacy_production:
encoding: utf8
database: production
username: production
password: 123456
host: production-db.example.com
(The connection name isn’t important.) You can then pull it back with:
Rails.configuration.database_configuration['legacy_production']
This works well if you are using the mysql2 gem as it can directly pass the hash to it. The pg gem use a slightly different format, so you might need to tweak the config from the standard database.yml format.
For the sake of brevity, I’m going to use MySQL examples, PostgreSQL has a different syntax, but all the same functionality, see the docs.
Create a new file lib/tasks/import_legacy.rb
(or get all technical
and call it etl.rb
)
Require your gem at the top :
require 'mysql2' # or require 'pg'
and create a helper function to connect to the legacy production database at the bottom:
def legacy_database
@client ||= Mysql2::Client.new(Rails.configuration.database_configuration['legacy_production'])
end
Now we need to name space our tasks:
namespace(:db) do
namespace(:import_legacy) do
# Our code will go here
end
end
This names our tasks like db:import_legacy:whatever
. I like to
put them under db so it’s clear they mess with data, but call them
what makes sense to you and yours.
Now we need an import task:
desc "import legacy customers"
task :customers => :environment do
customers = legacy_database.query("SELECT * FROM customers")
name = Namae.parse(customer['name'])[0]
Customer.create!( id: customer['id'],
first_name: name.given,
last_name: name.family,
active: (customer['enabled'] == 'Y')
)
end
Let’s break this down.
desc "import legacy customers"
labels the task. It’s optional, but
without it, the task will not be listed when you run rake -T
.
task :customers => :environment do
defines a task
db:import_legacy:customer
. The => :environment
syntax says “call the
‘environment’ task before running. ‘environment’ is a build in task
that loads Rails. Omitting it makes tasks that don’t need Rails, say
clearing out temp files, run much faster, however we need it here to
access our new models.
customers = legacy_database.query("SELECT * FROM customers")
selects the
customer data from the legacy database. For PostgreSQL you would use
exec
instead of query
.
name = Namae.parse(customer['name'])[0]
is a sample transformation,
handling a common case were the legacy schema stored names in a single
column and we now want first and last
names. Namae is a gem that
does very smart parsing of names into the component parts, handling
things like titles, appellation, and suffixes.
Finally, I create the new customer. This is normal ActiveRecord stuff except that I’m passing in the id. Typically, you want to preserve the original id so that you don’t have to rebuild relationships. For example, our next task might import orders:
desc "import legacy orders"
task :orders => :environment do
orders = legacy_database.query("SELECT * FROM orders")
orders.each do |order|
if customer = Customer.find(order['customer_id'])
# munge order here
new_order = customer.orders.create(id: order['id'], # other fields)
order.items.each do |item|
# munge item
new_order.items.create(item['id'])
end
else
# handle missing customers.
end
end
end
This presumes that both the legacy and new schemas have the same sort of has_many/belongs_to relationships. In real life, you won’t be so lucky and the legacy schema will be much hairier. One to one matches of legacy tables to models are great when you have them, but more often than not you won’t.
Legacy data often has errors, which is why this code doesn’t assume it will always find a matching user. Anywhere you have a relationship, you’ll want to make sure it’s valid.
The above gives you a framework for “E” and “L” however, at the end of the day, the hard part is “T - transformation”. You’ll need to work the details out yourself, but hey, that’s why you get the big bucks.
Comments