3 minute read

Sometimes you have to write a new Rails app for old data. One approach is to keep the old format and carefully craft your ActiveRecord models to work with the existing schema. But, odds are the old schema has become a mess overtime and wasn’t particularly Rails friendly to start with. When time and resources allow, I prefer a do-over.

That means Extracting, Transforming, and Loading (ETL) the old data. And for that, I like a Rake task. First you need to choose the correct gem to connect to your database, hopefully MySQL or PostgreSQL, and, if you aren’t already using it, add it to your Gemfile:

gem 'mysql2'

# or

gem 'pg'

Then you need to define the database connection. Since I already have a process for managing the database.yml file, I like to put it there.

legacy_production:
  encoding: utf8
  database: production
  username: production
  password: 123456
  host: production-db.example.com

(The connection name isn’t important.) You can then pull it back with:

Rails.configuration.database_configuration['legacy_production']

This works well if you are using the mysql2 gem as it can directly pass the hash to it. The pg gem use a slightly different format, so you might need to tweak the config from the standard database.yml format.

For the sake of brevity, I’m going to use MySQL examples, PostgreSQL has a different syntax, but all the same functionality, see the docs.

Create a new file lib/tasks/import_legacy.rb (or get all technical and call it etl.rb)

Require your gem at the top :

require 'mysql2' # or require 'pg'

and create a helper function to connect to the legacy production database at the bottom:

def legacy_database
  @client ||= Mysql2::Client.new(Rails.configuration.database_configuration['legacy_production'])
end

Now we need to name space our tasks:

namespace(:db) do
  namespace(:import_legacy) do
    # Our code will go here
  end
end

This names our tasks like db:import_legacy:whatever. I like to put them under db so it’s clear they mess with data, but call them what makes sense to you and yours.

Now we need an import task:

desc "import legacy customers"
task :customers => :environment do
  customers = legacy_database.query("SELECT * FROM customers")
  name = Namae.parse(customer['name'])[0]
  Customer.create!(        id: customer['id'],
              first_name: name.given,
               last_name: name.family,
                  active: (customer['enabled'] == 'Y')
              )
end

Let’s break this down.

desc "import legacy customers" labels the task. It’s optional, but without it, the task will not be listed when you run rake -T.

task :customers => :environment do defines a task db:import_legacy:customer. The => :environment syntax says “call the ‘environment’ task before running. ‘environment’ is a build in task that loads Rails. Omitting it makes tasks that don’t need Rails, say clearing out temp files, run much faster, however we need it here to access our new models.

customers = legacy_database.query("SELECT * FROM customers") selects the customer data from the legacy database. For PostgreSQL you would use exec instead of query.

name = Namae.parse(customer['name'])[0] is a sample transformation, handling a common case were the legacy schema stored names in a single column and we now want first and last names. Namae is a gem that does very smart parsing of names into the component parts, handling things like titles, appellation, and suffixes.

Finally, I create the new customer. This is normal ActiveRecord stuff except that I’m passing in the id. Typically, you want to preserve the original id so that you don’t have to rebuild relationships. For example, our next task might import orders:

desc "import legacy orders"
task :orders => :environment do
  orders = legacy_database.query("SELECT * FROM orders")
  orders.each do |order|
    if customer = Customer.find(order['customer_id'])
      # munge order here
      new_order = customer.orders.create(id: order['id'], # other fields)
      order.items.each do |item|
        # munge item
        new_order.items.create(item['id'])
      end
    else
      # handle missing customers.
    end
  end
end

This presumes that both the legacy and new schemas have the same sort of has_many/belongs_to relationships. In real life, you won’t be so lucky and the legacy schema will be much hairier. One to one matches of legacy tables to models are great when you have them, but more often than not you won’t.

Legacy data often has errors, which is why this code doesn’t assume it will always find a matching user. Anywhere you have a relationship, you’ll want to make sure it’s valid.

The above gives you a framework for “E” and “L” however, at the end of the day, the hard part is “T - transformation”. You’ll need to work the details out yourself, but hey, that’s why you get the big bucks.

Tags:

Updated:

Comments