Importing Data to a Database in Rails 50 Times Faster Than Normal
Have you ever needed to save lots of records with unique attributes to a database using ActiveRecord? Have you noticed how time-consuming such an import can be? I would like to show you how to do it quicker and with less effort. Let’s go!
I've prepared a simple users table with the columns first_name, last_name, and uid. The uid value needs to be unique.
class CreateUsers < ActiveRecord::Migration
  def change
    create_table :users do |t|
      t.string :first_name
      t.string :last_name
      t.integer :uid
    end

    add_index :users, [:uid], unique: true
  end
end
In the User model I added just one validation:
class User < ActiveRecord::Base
  validates :uid, uniqueness: true
end
I will test importing data from a CSV file. I’ve prepared a file with 100,000 records. There are already 10,000 of those records in the database, just to make things more realistic. A small preview of the data:
...
Nicholas,Bates,252
Leona,Ryan,253
Mark,Bass,254
Eliza,Keller,255
Scott,Lloyd,256
Rosalie,McCormick,257
Derrick,Armstrong,258
Gilbert,Green,259
Olivia,Daniels,260
Ella,Newton,261
Devin,Vaughn,262
Cody,Newton,263
...
For measuring purposes I will use the standard Ruby module called Benchmark. I will measure only the time of saving records to the database, without the processing of the CSV file. Each method will be wrapped in a benchmark service:
require 'csv'
require 'benchmark'

class SaveMethodBenchmark < BaseService
  CSV_FILE = "#{Rails.root}/tmp/users.csv".freeze

  def call
    # load users data
    users = CSV.read(CSV_FILE)
    # clear the database and save 10% of the users
    PrepareDatabase.call(users)
    # measure only the time of saving the records
    puts Benchmark.measure { User.save_method(users) }
  end
end
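The PrepareDatabase service is not shown in the article; below is a minimal sketch of what it might do, based on the comment in the benchmark (the class body is my assumption, not the original implementation):

class PrepareDatabase
  # Hypothetical sketch: wipe the users table and pre-load the first 10% of the
  # CSV rows, so that some uids already exist before the measured import runs.
  def self.call(users)
    User.delete_all
    users.first(users.size / 10).each do |first_name, last_name, uid|
      User.create(first_name: first_name, last_name: last_name, uid: uid)
    end
  end
end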
WARNING!
In my examples I only want to measure the time of saving data into the database. I don’t want to show how to import data from a CSV file to a database; I use CSV files simply as a source of data for my examples.
Be careful: loading all the data from a CSV file at once with CSV.read(CSV_FILE) can be dangerous, as your server could run out of memory if the file is too big. You should load the data in small chunks.
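If the file were really big, the loading could look roughly like the sketch below. CSV.foreach streams the file row by row; the batch size of 1,000 and the save_method placeholder (standing for whichever method is being measured) are my assumptions, not part of the original benchmark:

require 'csv'

# Stream the CSV instead of loading it all into memory,
# handing the records over in batches of 1,000 rows.
CSV.foreach("#{Rails.root}/tmp/users.csv").each_slice(1000) do |batch|
  User.save_method(batch)
end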
1. The basic method
In the basic method I use pure ActiveRecord. I load all the data from the file and convert it to an array of hashes:
[
  { first_name: "Lida", last_name: "Atkins", uid: 1 },
  { first_name: "Jorge", last_name: "Carson", uid: 2 },
  ...
]
...and pass the array to the User.create method.
class User < ActiveRecord::Base
  class << self
    def basic_method(users)
      create users
    end
  end
end
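The conversion from raw CSV rows (arrays of strings) to hashes isn't shown in the article; assuming the columns come in the order first_name, last_name, uid, it could look roughly like this:

# Hypothetical mapping from CSV rows to the hashes expected by the basic method.
rows = CSV.read("#{Rails.root}/tmp/users.csv")
users = rows.map do |first_name, last_name, uid|
  { first_name: first_name, last_name: last_name, uid: uid.to_i }
end

User.basic_method(users)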
For this method I received the following result:
basic_method 210.824696s real time
210.82s - is that a lot or not? It’s more than 3 minutes. Can we do it faster? Let's look closely at how this method works. For each record we get the following queries:
(0.2ms) BEGIN
User Exists (3.6ms) SELECT 1 AS one FROM "users" WHERE "users"."uid" = 1 LIMIT 1
SQL (7.7ms) INSERT INTO "users" ("first_name", "last_name", "uid") VALUES ($1, $2, $3) RETURNING "id" [["first_name", "Lida"], ["last_name", "Atkins"], ["uid", 1]]
(6.4ms) COMMIT
The SQL code above exposes 3 main disadvantages of the basic method:

1. We create a separate SQL transaction for each record - the BEGIN … COMMIT block. Rails has to open and commit a transaction against the database for every record, so we end up with 100,000 transactions.
2. We want to check whether each user has a unique uid, so we execute a SELECT query for each record - that gives us 100,000 queries.
3. If the uid is unique, we save the record into the database with the next query, INSERT INTO - another 100,000 queries.

100,000 SQL transactions and 200,000 SQL queries - this certainly affects the importing time of our data.
2. The basic method wrapped in one transaction
First we will try to decrease the number of transactions by wrapping all of our queries in one transaction.
def basic_method_with_transaction(users)
  transaction do
    create users
  end
end
Now we have 148.876s. It’s about 1 minute faster than our first attempt. A small step forward...
basic_method_with_transaction 148.876361s real time
Now we can see that we still have a lot of queries, but we no longer commit a transaction after each record:
(0.3ms) BEGIN
User Exists (0.4ms) SELECT 1 AS one FROM "users" WHERE "users"."uid" = 1 LIMIT 1
SQL (6.9ms) INSERT INTO "users" ("first_name", "last_name", "uid") VALUES ($1, $2, $3) RETURNING "id" [["first_name", "Lida"], ["last_name", "Atkins"], ["uid", 1]]
User Exists (0.5ms) SELECT 1 AS one FROM "users" WHERE "users"."uid" = 2 LIMIT 1
SQL (0.3ms) INSERT INTO "users" ("first_name", "last_name", "uid") VALUES ($1, $2, $3) RETURNING "id" [["first_name", "Jorge"], ["last_name", "Carson"], ["uid", 2]]
...
3. Bulk insert
Let’s decrease the number of SQL queries. In SQL we can pass multiple values to a single INSERT INTO query, so why does ActiveRecord (in our example) create a new query for each value? Unfortunately I don’t know why, but if we want to save all of our data in one query we need an extra gem: activerecord-import. It adds an import method to our models which takes an array of column names and an array of arrays (an array with our data).
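Assuming a standard Bundler setup, the gem just needs to be added to the Gemfile and installed with bundle install:

# Gemfile
gem 'activerecord-import'

With the gem in place, the import call can be wrapped in the same kind of class method as before: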
def import_users(users)
  transaction do
    import([:first_name, :last_name, :uid], users)
  end
end
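Note that, unlike the basic method which took hashes, import expects the values as an array of arrays - which is exactly the shape that CSV.read returns. A small usage sketch with made-up data:

# Each inner array holds the values in the same order as the column list.
users = [
  ['Lida', 'Atkins', 1],
  ['Jorge', 'Carson', 2]
]

User.import_users(users)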
This method took 80.004s. It’s almost twice as fast as the previous attempt. It looks impressive; it seems that the number of queries was the primary cause of the long runtime of the previous methods.
import_users 80.004497s real time
Below you can see an example of the SQL code generated by this method for two records. All queries are wrapped in one transaction, for each record there is a SELECT query which checks whether the uid is unique, and there is one INSERT INTO query with all the records at the end.
(0.1ms) BEGIN
User Exists (11.7ms) SELECT 1 AS one FROM "users" WHERE "users"."uid" = 1 LIMIT 1
User Exists (0.3ms) SELECT 1 AS one FROM "users" WHERE "users"."uid" = 2 LIMIT 1
Class Create Many Without Validations Or Callbacks (5.8ms) INSERT INTO "users" ("first_name","last_name","uid") VALUES ('Lida','Atkins',1),('Jorge','Carson',2)RETURNING id
(5.1ms) COMMIT
4. Bulk insert with validation on database side
There is one more thing we can change: decreasing the number of SELECT queries, which check whether the users’ uids are unique. I have one idea how to get rid of those tons of SELECT queries: we can skip the Rails validation that generates them and rely on the SQL unique index instead. Remember the CreateUsers migration? Particularly the following line:
add_index :users, [:uid], unique: true
This method is based on that line. PostgreSQL recently introduced INSERT ... ON CONFLICT functionality (available in version 9.5 and higher). It allows us to define what should happen when a record is not unique. With activerecord-import we will simply ignore duplicated keys.
def import_users_without_validations(users)
  transaction do
    import([:first_name, :last_name, :uid], users, validate: false, on_duplicate_key_ignore: true)
  end
end
Yes, we got it! We've saved all the records in 4.213s. It’s almost twenty times faster than the previous method and fifty times faster than the first one. I’m proud of it.
import_users_with_sql_validation 4.212681s real time
In the simplest case, when we have two users to save and one of them doesn’t have a unique uid, the generated SQL would look like the code below. We have only one INSERT INTO query, and that’s all. It’s remarkable that there will always be just 1 SQL query and 1 transaction, no matter how many records we save; with the first method there were 200,000 queries and 100,000 transactions.
(0.7ms) BEGIN
Class Create Many Without Validations Or Callbacks (0.7ms) INSERT INTO "users" ("first_name","last_name","uid") VALUES ('Lida','Atkins',1),('Jorge','Carson',2) ON CONFLICT DO NOTHING RETURNING id
(2.0ms) COMMIT
CONCLUSION
I know Ruby on Rails is a high-level web framework, and I love Rails for that. But it’s important to know how things work at a deeper level. When we learn how the Rails and ActiveRecord magic works, we can speed up importing records to the database a lot.
OK, but my engineering soul has one more question: will the last method always be the best choice? What if we have fewer than 100,000 records? I checked it:
| records | 10 | 100 | 1,000 | 10,000 | 100,000 |
|---|---|---|---|---|---|
| basic method | 0.099 s | 0.277 s | 2.614 s | 23.608 s | 210.825 s |
| basic method with transaction | 0.022 s | 0.207 s | 1.907 s | 14.326 s | 148.876 s |
| import users | 0.073 s | 0.160 s | 1.200 s | 7.443 s | 80.004 s |
| import users with sql validations | 0.002 s | 0.005 s | 0.048 s | 0.376 s | 4.213 s |
The table above shows that the best approach is using ActiveRecord::Import with validations on the database side.
I know it’s a really simple case, without any other validations, but you can treat it as a hint on how to use SQL-level validations to speed up saving data to the database.