Getting Uploaded Data Out of AWS S3
OK, so you know how to get data into AWS S3, but what about getting it out? Previously, we uploaded entries from an imagined photo contest into a bucket, sending a pair of files for each entry: a JSON file with the form data and the image itself. Let’s presume there’s a Rails app (its details don’t matter) with a model ContestEntry that we want to populate from the S3 data. We’re going to write a script to do the import. When a script needs to load Rails, you do something like:
#!/usr/bin/env ruby
require File.expand_path('../../config/environment', __FILE__)
The exact path for config/environment will depend on where the script is; in this case I’m presuming the script lives in a subdirectory under Rails.root.
Loading Rails gives us the model. Now we need the S3 files. As before, we use the aws-sdk gem, which should be in your Gemfile.
I cover the basics of authenticating to S3 here. The code below assumes credentials are coming from the environment (or an AWS credential file in dev).
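If you want to be explicit about it, you can hand the SDK credentials yourself. Here’s a minimal sketch that assumes the standard AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables are set (if they are, the SDK will actually find them without any of this):
# Explicit credentials pulled from the environment; the SDK does this
# on its own when these variables are set.
Aws.config.update(
  credentials: Aws::Credentials.new(ENV['AWS_ACCESS_KEY_ID'],
                                    ENV['AWS_SECRET_ACCESS_KEY'])
)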
Getting our bucket is easy:
s3 = Aws::S3::Resource.new(region: 'us-west-1')
bucket = s3.bucket('bucket42')
As is getting the files (objects) in the bucket.
bucket.objects.each do |obj|
# Do something.
end
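For instance, just to see what’s there, you could print each object’s key and size (a quick sketch):
bucket.objects.each do |obj|
  puts "#{obj.key} (#{obj.size} bytes)"
end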
But from there it gets a little convoluted. The object is actually an Aws::S3::ObjectSummary, which has metadata about the object and can perform operations like moving, copying, or deleting it, but isn’t the S3 object itself. To fetch the actual object, you have to call #get on the ObjectSummary:
object = obj.get
Once you have the actual object (really the Ruby object that wraps the HTTP calls that access the S3 object), you can get its data from #body, which is actually a StringIO object. Confused? Code brings clarity.
We’ll find all of the JSON objects in the bucket:
json_files = bucket.objects.select {|o| o.key =~ /\.json$/}
Grab the first one, using get to fetch the actual object:
file = json_files.first
s3_object = file.get
Then get its contents from body with read (since it’s an IO-like object):
json = s3_object.body.read
Finally, we parse that JSON and get a hash:
form_data = JSON.parse(json)
I find that interface a little funky, but Bam! Now we have our form data, which we can save in our model:
entry = ContestEntry.new(form_data)
(You’re going to validate that data and not accept it blindly, right?)
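At a minimum you might whitelist the keys you expect before handing them to the model. A sketch, with attribute names that are purely made up for illustration:
# Hypothetical attributes; use whatever your form actually sends.
permitted = form_data.slice('name', 'email', 'caption')
entry = ContestEntry.new(permitted)
entry.save!  # model validations still run here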
We saved the ObjectSummary object so we could get the JSON file’s name, which is our original UUID:
uuid = File.basename(file.key,'.json')
And with that we can find the photo we uploaded:
photo_file = bucket.objects.detect {|o| o.key =~ /#{uuid}.*(?<!\.json)$/}
Note the switch to detect (There can be only one.) and the lovely negated lookbehind regexp! Again, we need to get the actual S3 object:
photo_object = photo_file.get
photo_object.content_type # => "image/jpeg"
Which we could save locally with something like:
File.open(photo_file.key, 'wb') {|f| f.write(photo_object.body.read) }
Or process it with CarrierWave or Paperclip or even leave it in S3 and serve it directly from there.
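For instance, if ContestEntry mounts a CarrierWave uploader (say mount_uploader :photo, PhotoUploader, which is purely an assumption on my part), you could hand it the local copy we just wrote:
# Assumes ContestEntry declares: mount_uploader :photo, PhotoUploader
entry.photo = File.open(photo_file.key)  # the local copy saved above
entry.save!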
We have the data in our app and can do whatever it was we wanted with it. All that remains is to mark the entries as having been processed; we don’t want to import them multiple times. The simplest way to do that is to delete them, which can be done by calling #delete on the Aws::S3::ObjectSummary object:
file.delete
photo_file.delete
If you’d rather back up the data instead of throwing it away, you can rename the files with #move_to. The simplest way to use move_to is to pass in a string in the form of “target-bucket-name/target-key”:
file.move_to("completed-bucket/#{file.key}")
photo_file.move_to("completed-bucket/#{photo_file.key}")
If you don’t want to use a separate bucket, you could put the files in a “subfolder” instead:
file.move_to("#{file.bucket.name}/completed/#{file.key}")
But keep in mind that “folders” in S3 are an illusion; they are really just part of the file’s name. As a result, bucket.objects will return all the files no matter how deeply nested. You can filter by using the “prefix” option:
bucket.objects(prefix: 'completed') # => completed/*
With this approach you’d modify the form to upload with a prefix, say pending. In our original JavaScript, it would be as simple as changing:
var bucket = 'https://s3.amazonaws.com/bucket.example.com/';
to:
var bucket = 'https://s3.amazonaws.com/bucket.example.com/pending/';
and then filter the initial select:
json_files = bucket.objects(prefix: 'pending').select {|o| o.key =~ /\.json$/}
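Putting the pieces together, the whole import might end up looking something like this. It’s just a sketch: it assumes the pending/ prefix scheme, a completed-bucket bucket for processed files, and made-up attribute names for the whitelist.
#!/usr/bin/env ruby
require File.expand_path('../../config/environment', __FILE__)

s3     = Aws::S3::Resource.new(region: 'us-west-1')
bucket = s3.bucket('bucket42')

bucket.objects(prefix: 'pending').select {|o| o.key =~ /\.json$/}.each do |file|
  form_data = JSON.parse(file.get.body.read)
  uuid      = File.basename(file.key, '.json')

  # Find the photo uploaded alongside this JSON file.
  photo_file = bucket.objects(prefix: 'pending')
                     .detect {|o| o.key =~ /#{uuid}.*(?<!\.json)$/}

  entry = ContestEntry.new(form_data.slice('name', 'email', 'caption'))
  next unless entry.save  # leave anything that fails validation in place

  # (Attach or copy the photo here however you like.)

  # Archive the processed files so we never import them twice.
  file.move_to("completed-bucket/#{file.key}")
  photo_file.move_to("completed-bucket/#{photo_file.key}") if photo_file
end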
And with that, I end the S3 upload series. You now have the tools to use S3 as a job queue. Use them wisely.