Importing Data to a Database in Rails 50 Times Faster Than Normal
Have you ever needed to save lots of records with unique attributes to a database using ActiveRecord? Have you noticed how time-consuming such an import can be? I would like to show you how to do it quicker and with less effort. Let’s go!
I've prepared a simple users table with the columns first_name, last_name, and uid. The uid value needs to be unique.
class CreateUsers < ActiveRecord::Migration
  def change
    create_table :users do |t|
      t.string :first_name
      t.string :last_name
      t.integer :uid
    end

    add_index :users, [:uid], unique: true
  end
end
In the User model I added just one validation:
class User < ActiveRecord::Base
  validates :uid, uniqueness: true
end
I will test importing data from a CSV file. I’ve prepared a file with 100,000 records. There are already 10,000 of those records in the database, just to make things more realistic. A small preview of the data:
...
Nicholas,Bates,252
Leona,Ryan,253
Mark,Bass,254
Eliza,Keller,255
Scott,Lloyd,256
Rosalie,McCormick,257
Derrick,Armstrong,258
Gilbert,Green,259
Olivia,Daniels,260
Ella,Newton,261
Devin,Vaughn,262
Cody,Newton,263
...
For measuring purposes I will use the standard Ruby module called Benchmark. I will measure only the time of saving records to the database, without the processing of the CSV file. Each method will be wrapped in a benchmark service:
require 'csv'
require 'benchmark'

class SaveMethodBenchmark < BaseService
  CSV_FILE = "#{Rails.root}/tmp/users.csv".freeze

  def call
    # load users data
    users = CSV.read(CSV_FILE)
    # clear the database and save 10% of the users
    PrepareDatabase.call(users)
    # measure only the time of saving the records
    puts Benchmark.measure { User.save_method(users) }
  end
end
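The PrepareDatabase service is not shown in the article; below is a minimal sketch of what it might do, based on the comment in the benchmark (the class body is my assumption, not the original implementation):

class PrepareDatabase
  # Hypothetical sketch: wipe the users table and pre-load the first 10% of the
  # CSV rows, so that some uids already exist before the measured import runs.
  def self.call(users)
    User.delete_all
    users.first(users.size / 10).each do |first_name, last_name, uid|
      User.create(first_name: first_name, last_name: last_name, uid: uid)
    end
  end
end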
WARNING!
In my examples I only want to measure the time of saving data into the database. I don’t want to show how to import data from a CSV file to a database; I use CSV files simply as a source of data for my examples.
Be careful: loading all the data from a CSV file at once with CSV.read(CSV_FILE) can be dangerous, as your server could run out of memory if the file is too big. You should load the data in small chunks.
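If the file were really big, the loading could look roughly like the sketch below. CSV.foreach streams the file row by row; the batch size of 1,000 and the save_method placeholder (standing for whichever method is being measured) are my assumptions, not part of the original benchmark:

require 'csv'

# Stream the CSV instead of loading it all into memory,
# handing the records over in batches of 1,000 rows.
CSV.foreach("#{Rails.root}/tmp/users.csv").each_slice(1000) do |batch|
  User.save_method(batch)
end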
1. The basic method
In the basic method I use pure ActiveRecord. I load all the data from the file and convert it to an array of hashes:
[
  { first_name: "Lida", last_name: "Atkins", uid: 1 },
  { first_name: "Jorge", last_name: "Carson", uid: 2 },
  ...
]
...and pass the array to the User.create method.
class User < ActiveRecord::Base
  class << self
    def basic_method(users)
      create users
    end
  end
end
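The conversion from raw CSV rows (arrays of strings) to hashes isn't shown in the article; assuming the columns come in the order first_name, last_name, uid, it could look roughly like this:

# Hypothetical mapping from CSV rows to the hashes expected by the basic method.
rows = CSV.read("#{Rails.root}/tmp/users.csv")
users = rows.map do |first_name, last_name, uid|
  { first_name: first_name, last_name: last_name, uid: uid.to_i }
end

User.basic_method(users)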
For this method I received the following result:
basic_method 210.824696s real time
210.82s - is that a lot or not? It’s more than 3 minutes. Can we do it faster? Let's look closely at how this method works. For each record we get the following queries:
(0.2ms) BEGIN
User Exists (3.6ms) SELECT 1 AS one FROM "users" WHERE "users"."uid" = 1 LIMIT 1
SQL (7.7ms) INSERT INTO "users" ("first_name", "last_name", "uid") VALUES ($1, $2, $3) RETURNING "id" [["first_name", "Lida"], ["last_name", "Atkins"], ["uid", 1]]
(6.4ms) COMMIT
The SQL code above exposes 3 main disadvantages of the basic method:

1. We create a separate SQL transaction for each record - the BEGIN … COMMIT block. Rails has to open and commit a transaction against the database for every record, so we end up with 100,000 transactions.
2. We want to check whether each user has a unique uid, so we execute a SELECT query for each record - that gives us 100,000 queries.
3. If the uid is unique, we save the record into the database with the next query, INSERT INTO - another 100,000 queries.

100,000 SQL transactions and 200,000 SQL queries - this certainly affects the importing time of our data.
2. The basic method wrapped in one transaction
First we will try to decrease the number of transactions by wrapping all of our queries in one transaction.
def basic_method_with_transaction(users)
  transaction do
    create users
  end
end
Now we have 148.876s. It’s about 1 minute faster than our first attempt. A small step forward...
basic_method_with_transaction 148.876361s real time
Now we can see that we still have a lot of queries, but we no longer commit a transaction after each record:
(0.3ms) BEGIN
User Exists (0.4ms) SELECT 1 AS one FROM "users" WHERE "users"."uid" = 1 LIMIT 1
SQL (6.9ms) INSERT INTO "users" ("first_name", "last_name", "uid") VALUES ($1, $2, $3) RETURNING "id" [["first_name", "Lida"], ["last_name", "Atkins"], ["uid", 1]]
User Exists (0.5ms) SELECT 1 AS one FROM "users" WHERE "users"."uid" = 2 LIMIT 1
SQL (0.3ms) INSERT INTO "users" ("first_name", "last_name", "uid") VALUES ($1, $2, $3) RETURNING "id" [["first_name", "Jorge"], ["last_name", "Carson"], ["uid", 2]]
...
3. Bulk insert
Let’s decrease the number of SQL queries. In SQL we can pass multiple values to a single INSERT INTO query, so why does ActiveRecord (in our example) create a new query for each value? Unfortunately I don’t know why, but if we want to save all of our data in one query we need an extra gem: activerecord-import. It adds an import method to our models which takes an array of column names and an array of arrays (an array with our data).
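Assuming a standard Bundler setup, the gem just needs to be added to the Gemfile and installed with bundle install:

# Gemfile
gem 'activerecord-import'

With the gem in place, the import call can be wrapped in the same kind of class method as before: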
def import_users(users)
  transaction do
    import([:first_name, :last_name, :uid], users)
  end
end
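Note that, unlike the basic method which took hashes, import expects the values as an array of arrays - which is exactly the shape that CSV.read returns. A small usage sketch with made-up data:

# Each inner array holds the values in the same order as the column list.
users = [
  ['Lida', 'Atkins', 1],
  ['Jorge', 'Carson', 2]
]

User.import_users(users)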
This method took 80.004s. It’s almost twice as fast as the previous attempt. It looks impressive; it seems that the number of queries was the primary cause of the long runtime of the previous methods.
import_users 80.004497s real time
Below you can see an example of the SQL code generated by this method for two records. All queries are wrapped in one transaction, for each record there is a SELECT query which checks whether the uid is unique, and there is one INSERT INTO query with all the records at the end.
(0.1ms) BEGIN
User Exists (11.7ms) SELECT 1 AS one FROM "users" WHERE "users"."uid" = 1 LIMIT 1
User Exists (0.3ms) SELECT 1 AS one FROM "users" WHERE "users"."uid" = 2 LIMIT 1
Class Create Many Without Validations Or Callbacks (5.8ms) INSERT INTO "users" ("first_name","last_name","uid") VALUES ('Lida','Atkins',1),('Jorge','Carson',2)RETURNING id
(5.1ms) COMMIT
4. Bulk insert with validation on database side
There is one more thing we can change: decreasing the number of SELECT queries, which check whether the users’ uids are unique. I have one idea how to get rid of those tons of SELECT queries: we can skip the Rails validation that generates them and rely on the SQL unique index instead. Remember the CreateUsers migration? Particularly the following line:
add_index :users, [:uid], unique: true
This method is based on that line. PostgreSQL recently introduced INSERT ... ON CONFLICT functionality (available in version 9.5 and higher). It allows us to define what should happen when a record is not unique. With activerecord-import we will simply ignore duplicated keys.
def import_users_without_validations(users)
  transaction do
    import([:first_name, :last_name, :uid], users, validate: false, on_duplicate_key_ignore: true)
  end
end
Yes, we got it! We've saved all the records in 4.213s. It’s almost twenty times faster than the previous method and fifty times faster than the first one. I’m proud of it.
import_users_with_sql_validation 4.212681s real time
In the simplest case, when we have two users to save and one of them doesn’t have a unique uid, the generated SQL would look like the code below. We have only one INSERT INTO query, and that’s all. It’s remarkable that there will always be just 1 SQL query and 1 transaction, no matter how many records we save; with the first method there were 200,000 queries and 100,000 transactions.
(0.7ms) BEGIN
Class Create Many Without Validations Or Callbacks (0.7ms) INSERT INTO "users" ("first_name","last_name","uid") VALUES ('Lida','Atkins',1),('Jorge','Carson',2) ON CONFLICT DO NOTHING RETURNING id
(2.0ms) COMMIT
CONCLUSION
I know Ruby on Rails is a high-level web framework, and I love Rails for that. But it’s important to know how things work at a deeper level. When we learn how the Rails and ActiveRecord magic works, we can speed up importing records to the database a lot.
OK, but my engineering soul has one more question: will the last method always be the best choice? What if we have fewer than 100,000 records? I checked it:
| records | 10 | 100 | 1,000 | 10,000 | 100,000 |
|---|---|---|---|---|---|
| basic method | 0.099 s | 0.277 s | 2.614 s | 23.608 s | 210.825 s |
| basic method with transaction | 0.022 s | 0.207 s | 1.907 s | 14.326 s | 148.876 s |
| import users | 0.073 s | 0.160 s | 1.200 s | 7.443 s | 80.004 s |
| import users with sql validations | 0.002 s | 0.005 s | 0.048 s | 0.376 s | 4.213 s |
The table above shows that the best approach is using ActiveRecord::Import with validations on the database side.
I know it’s a really simple case, without any other validations, but you can treat it as a hint on how to use SQL-level validations to speed up saving data to the database.