Monday, February 22, 2016

Concurrent transaction performance in RocksDB

Support for transactions was added to RocksDB last year. Here I explain performance for the pessimistic transaction API with a concurrent workload. I used the db_bench client rather than MySQL but MyRocks reuses the RocksDB transaction API and gets the performance benefits. For pessimistic transactions:
  • there is not much overhead from the transaction API
  • throughput improves as the work done per transaction increases
  • throughput improves with the concurrent memtable
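
For readers who have not used the API, here is a minimal sketch of a pessimistic transaction in RocksDB. It is illustrative only (the path and keys are made up) and is not the exact code used by db_bench, but it shows the read-modify-write pattern the randomtransaction benchmark exercises:

    #include <string>
    #include "rocksdb/utilities/transaction.h"
    #include "rocksdb/utilities/transaction_db.h"

    int main() {
      rocksdb::Options options;
      options.create_if_missing = true;
      rocksdb::TransactionDBOptions txn_db_options;
      rocksdb::TransactionDB* db = nullptr;

      // Open a TransactionDB to get the pessimistic transaction API.
      rocksdb::Status s =
          rocksdb::TransactionDB::Open(options, txn_db_options, "/tmp/txndb", &db);
      if (!s.ok()) return 1;

      // Begin a transaction, lock and read a key, update it, then commit.
      rocksdb::Transaction* txn = db->BeginTransaction(rocksdb::WriteOptions());
      std::string value;
      // GetForUpdate takes a lock on the key so concurrent writers wait.
      txn->GetForUpdate(rocksdb::ReadOptions(), "key1", &value);
      txn->Put("key1", "new-value");
      s = txn->Commit();

      delete txn;
      delete db;
      return s.ok() ? 0 : 1;
    }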

Configuration


The test server has 2 sockets, 24 CPU cores and 48 HW threads. The database used fast storage (tmpfs). The benchmark is db_bench --benchmarks=randomtransaction via this test script. It was run for 1 to 48 concurrent clients to understand the impact of concurrency. It was run without transactions and with pessimistic transactions to understand the impact of the transaction API. It was run with --batch_size=1 and --batch_size=4 to understand the impact of doing more work per transaction. The test database was cached by RocksDB.

Results


The first two graphs show transaction throughput with (concurMT=1) and without (concurMT=0) the concurrent memtable enabled. Throughput is much higher with the concurrent memtable enabled. The benefit is larger with batch_size=4 than with batch_size=1 because a larger batch_size does more work per transaction, so there is more mutex contention to avoid. Absolute throughput is higher with batch_size=1 because batch_size=4 does 4X more work per transaction.



The next two graphs show the efficiency of the transaction API. They plot the ratio of throughput with pessimistic transactions to throughput without transactions. When the value is 1.0, throughput with transactions matches throughput without transactions. The graphs below show that efficiency is better with batch_size=1 than with batch_size=4 and that it improves with concurrency.

Data for the graphs is here:

Wednesday, February 17, 2016

Less slow versus faster

I describe my work as making things less slow rather than making things faster. While making something less slow tends to make it faster, I think these are two different tasks and both are important:
  • By making things faster I mean a faster response time and more throughput for a single-threaded workload. For example, cache-friendly memory allocation might increase throughput from 100 QPS to 125 QPS and decrease response time from 10ms to 8ms.
  • By making things less slow I mean that the throughput gap - the difference between linear scaling and actual throughput - has been reduced. Assume the server can do 100 QPS at one thread and with linear scaling it can do 1000 QPS at 10 threads. But linear scaling is hard to achieve and the server might be limited to 400 QPS on a real workload. Eventually someone has time for performance debugging and finds things to make better with PMP and we get 500 QPS at 10 threads. This is making something less slow.
The USL (Universal Scalability Law) provides a formal model for reasoning about making software less slow on concurrent workloads. I learned about Dr. Gunther's work on the USL thanks to Baron.
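
For those who haven't seen it, the USL models relative capacity at N threads with a contention term and a coherency term. Quoting the usual form of the formula (my notation, from memory):

    C(N) = N / (1 + σ(N - 1) + κN(N - 1))

Here σ is the serialization (contention) penalty and κ is the coherency (crosstalk) penalty. With σ = κ = 0 you get linear scaling, and the throughput gap described above is whatever those two terms take away.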

Friday, February 12, 2016

RocksDB, InnoDB and TokuDB vs sysbench

This compares RocksDB and InnoDB storage engines using sysbench and extends the results I shared yesterday. I continue to get great performance from MyRocks and this is a bonus as my motivation for using it is better storage efficiency.

tl;dr - MyRocks is usually equal to or faster than InnoDB on these tests

Configuration


The test server for InnoDB and RocksDB has 2 sockets, 24 cores (48 HW threads), 256G of RAM and a fast SSD. I also ran tests for TokuDB on an older server, but that hardware is much slower so I won't share those results here. From past experience, TokuDB does poorly on these tests.

For all tests I disabled the binlog and sync-on-commit. The my.cnf files were similar to my previous post but included extra settings inherited from production.

I used my sysbench branch in three configurations:
  • single-threaded & CPU-bound - engines used a buffer pool large enough to cache the database. Writes were fast thanks to the fast SSD. Tests were run for one database client, sysbench process & database table. The table had 10M rows.
  • multi-threaded & CPU-bound - engines used a buffer pool large enough to cache the database. Writes were fast thanks to the fast SSD. Tests were run for 8, 16, 24, 32, 40 and 48 concurrent clients. Tests used 8 sysbench processes and 8 tables (table per process). Each table had 10M rows.
  • multi-threaded & IO-bound - engines used a 1G buffer pool and buffered IO so reads were fast from the OS page cache. Writes were fast thanks to the fast SSD. Tests were run for 8, 16, 24, 32, 40 and 48 concurrent clients. Tests used 8 sysbench processes and 8 tables (table per process). Each table had 10M rows.
Four tests were run for each configuration:
  • point-query - fetch all columns of one row by primary key lookup (command line). Performance is measured as queries per second (QPS) and each query is an auto-commit SELECT statement.
  • read-only - the sysbench read only transaction (command line). Performance is measured as transactions per second (TPS).
  • read-write - the sysbench read write transaction (command line). Performance is measured as transactions per second (TPS).
  • increment - increment a non-indexed column of one row by primary key lookup (command line). Performance is measured as queries per second (QPS) and each query is an auto-commit UPDATE statement. The BEGIN and COMMIT statements are not used.
Finally, I tested the following binaries:
  • My56.Rocks - Facebook MySQL 5.6, RocksDB engine with zlib compression
  • My56.Inno - MySQL 5.6.29, InnoDB
  • My56.Inno.zlib - MySQL 5.6.29, InnoDB with zlib compression
  • My57.Inno - MySQL 5.7.10, InnoDB
  • My57.Inno.zlib - MySQL 5.7.10, InnoDB with zlib compression

Single-threaded & CPU-bound


The first graph is for QPS on the point-query and increment tests. The second graph is for TPS on the read-only and read-write tests. Data for the graphs is here.

Performance summary:
  • RocksDB is comparable with InnoDB for point queries, increment and read-only.
  • RocksDB is comparable with uncompressed InnoDB for read-write. Compressed InnoDB is slower for read-write and I didn't try to explain it.
  • Compressed InnoDB matches uncompressed InnoDB on the point-query test because the compression code isn't used during the test: the buffer pool is large enough to cache all pages in uncompressed form.
  • Compressed InnoDB is slower on increment because some (de)compression operations are done in the foreground and increase response time.

Multi-threaded & CPU-bound


Tests were run for 8 to 48 concurrent clients using 8 sysbench processes and a different table per process. Data for the graphs is here.

Performance summary:

  • All engines were great on the point-query and read-only workloads but MySQL 5.6 is faster than 5.7. The difference for point-query is ~12% at 8 clients and ~4% at 48 clients. The difference for read-only is ~12% at 8 clients and ~10% at 48 clients.
  • InnoDB in MySQL 5.6 suffers on the increment workload. I did not debug it. InnoDB in MySQL 5.7 was ~15% faster than MyRocks in 5.6.
  • Compressed InnoDB in MySQL 5.6 and 5.7 suffers on the read-write workload. Given that MyRocks matched InnoDB in MySQL 5.6 I wonder if the 5.7 improvements are engine independent.


Multi-threaded & IO-bound


All engines used a 1G buffer pool for these tests with the database cached by the OS. Tests were run for 8 to 48 concurrent clients using 8 sysbench processes and a different table per process. All data for the graphs is here.

Performance summary:
  • For point-query and read-only, uncompressed InnoDB did best, MyRocks was next best and compressed InnoDB was the worst.
  • For read-write, uncompressed InnoDB in MySQL 5.7 and MyRocks were best. The other engines were similar to each other and not as good.
  • For increment, MyRocks was the best, followed by InnoDB in MySQL 5.7 and then InnoDB in MySQL 5.6.





Thursday, February 11, 2016

RocksDB vs the world for loading Linkbench tables

I like RocksDB, MyRocks and MongoRocks because they are IO efficient thanks to excellent compression and low write-amplification. It is a bonus when I get better throughput with the RocksDB family. Here I show that RocksDB has excellent load performance and, to avoid benchmarketing, the context is loading Linkbench tables when all data fits in cache. This workload should be CPU bound and can be mutex bound. I compare engines for MySQL (RocksDB, TokuDB, InnoDB) and MongoDB (RocksDB, WiredTiger) using the same hardware.

Configuration


I used Linkbench for MySQL and LinkbenchX for MongoDB. The load test was run with generate_nodes=false to limit the load to the count and link tables/collections. The load was repeated with 1 and 8 users (loaders=1 & maxid1=1M, loaders=8 & maxid1=4M). The test server has 144G RAM, 2 sockets, 12 CPU cores (24 HW-threads) and a 400G Intel s3700 SSD.

The oplog, binlog and sync-on-commit were disabled. I did that to remove bottlenecks that might hide stalls in the database engine.

I tested the following binaries:

Single-threaded


This shows the average insert rate during the single-threaded load. The performance summary is:
  • MySQL gets more throughput than MongoDB for the same workload because MongoDB uses more CPU. MongoDB also suffers from using more indexes on the Linkbench tables (the extra indexes are internal) as explained previously. I hope they fix this.
  • For MongoDB, data is loaded faster with WiredTiger than RocksDB because RocksDB uses more CPU.
  • MySQL+RocksDB and MySQL+TokuDB were by far the fastest for the single-threaded load if uncompressed InnoDB is ignored, and I ignore it because I require compression. Compressed InnoDB suffers from the overhead of doing some (de)compression operations in the foreground.
  • Uncompressed InnoDB is about 10% slower in MySQL 5.7 than in 5.6 for the single-threaded load. Low-concurrency performance regressions in MySQL are a known problem.

This shows the value of (CPU utilization * 1000) / insert_rate. That includes CPU consumed in the foreground by the threads processing user requests and CPU consumed in the background. The value is inversely correlated with the insert rate. The value is larger for MongoDB than for MySQL, which explains why the insert rate is larger for MySQL.
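
As a made-up example of the metric: if the CPU utilization is 48% and the insert rate is 60,000 inserts/second, then the value is (48 * 1000) / 60000 = 0.8. A larger value means more CPU is spent per insert.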

Multi-threaded


This shows the average insert rate during the load with 8 user threads. The performance summary is:
  • MySQL+RocksDB is a lot faster than everything else except uncompressed InnoDB and I ignore that because I require compression.
  • Compressed InnoDB suffers from mutex contention on the pessimistic code path. See this.
  • TokuDB suffers from mutex contention and is slower here than at 1 thread. See this.
  • Mongo+RocksDB suffers from mutex contention and the difference between it and WiredTiger is larger here than for the single-threaded load.


This is the rate of context switches per insert. The context switch rate was measured by vmstat. Several of the engines suffer from too much mutex contention.

Tuesday, February 9, 2016

Compaction priority in RocksDB

Compaction priority is an option in RocksDB that determines the next file to select for compaction. Here I show the impact of this option on the amount of storage writes while running Linkbench. The right value can reduce the amount of data written to storage which improves storage efficiency and can make an SSD device last longer.

The Linkbench schema has 3 tables with 4 indexes. Each of the indexes uses a separate column family with MyRocks. There is one column family for each of the primary key indexes on the link, count and node tables (link_pk, count_pk, node_pk) and another column family for the secondary index on the link table (link_id1_type). The schema for MyRocks is here.

Configuration


I ran Linkbench with maxid1=1B and requesters=20 for 3 days on a server with 24 CPU cores, 256 GB of RAM and an MLC SSD. I then looked at the compaction IO statistics via SHOW ENGINE ROCKSDB STATUS to determine the amount of storage writes per column family. The graphs below label each configuration with the integer value of compaction_pri; the mapping from integer to enum name is sketched below.
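
For reference, here is roughly how those integers map to enum values in the RocksDB options header for the version I tested. This is an illustrative sketch (MakeOptions is a made-up helper); check options.h in your release because the enum can change:

    #include "rocksdb/options.h"

    // compaction_pri values accepted by rocksdb::Options:
    //   0 = kByCompensatedSize      (the default)
    //   1 = kOldestLargestSeqFirst
    //   2 = kOldestSmallestSeqFirst
    rocksdb::Options MakeOptions() {
      rocksdb::Options options;
      options.compaction_pri = rocksdb::CompactionPri::kOldestSmallestSeqFirst;  // = 2
      return options;
    }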

This graph displays the total amount written per column family for each of the compaction_pri options. The least data is written with kOldestSmallestSeqFirst and the most data is written with kOldestLargestSeqFirst. The difference is about 1.34X. I have been using the default value, kByCompensatedSize, for all of my tests prior to this result.
The next graph shows the amount of data written into RocksDB by the application (MyRocks) per column family. I call this the ingest. The storage writes done by RocksDB include the ingest plus the writes done for compaction in the background. As expected, the ingest is similar for each of the compaction_pri values.
The final result is the write-amplification, which is the ratio total-writes / ingest. A smaller write-amplification is usually better. Because the total-writes (the first graph) are largest for compaction_pri=1, it has the largest write-amplification.
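
As a made-up example of the write-amplification arithmetic: if a column family has an ingest of 100 GB and RocksDB writes 700 GB to storage in total for that column family, the write-amplification is 700 / 100 = 7.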

Monday, February 8, 2016

Concurrent inserts and the RocksDB memtable

RocksDB started as a fork of LevelDB. While it will always be a fork, we describe it as derived from LevelDB given how much code has changed. It inherited a memtable that did not allow concurrent inserts courtesy of a mutex: readers ignored the mutex but writers had to lock it. Writers waiting for the mutex linked their updates together and the thread at the head of the convoy applied all of their changes after getting the mutex.

Insert performance for RocksDB and LevelDB did not improve beyond one thread when sync on commit was disabled because of contention on the mutex. It could improve with more threads when sync on commit was enabled courtesy of group commit. Write-only workloads are rare for a DBMS that uses a B-Tree because read-modify-write of leaf nodes is done for updates and inserts. Write-only is more common with RocksDB thanks to the merge operator and because non-unique secondary index maintenance doesn't need RMW. We expect to make this more common via SQL features in MyRocks.
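
To illustrate why the merge operator enables write-only workloads, here is a rough sketch of a counter maintained with Merge rather than read-modify-write. It is only meant to show the shape of the API (I believe RocksDB ships a similar uint64 add operator in its utilities), not code we use in production:

    #include <cstdint>
    #include <cstring>
    #include <string>
    #include "rocksdb/merge_operator.h"
    #include "rocksdb/slice.h"

    // Adds 64-bit counters without the writer reading the old value first.
    class UInt64AddOperator : public rocksdb::AssociativeMergeOperator {
     public:
      bool Merge(const rocksdb::Slice& /*key*/,
                 const rocksdb::Slice* existing_value,
                 const rocksdb::Slice& value, std::string* new_value,
                 rocksdb::Logger* /*logger*/) const override {
        uint64_t existing = 0;
        if (existing_value != nullptr &&
            existing_value->size() == sizeof(uint64_t)) {
          std::memcpy(&existing, existing_value->data(), sizeof(existing));
        }
        uint64_t operand = 0;
        if (value.size() == sizeof(uint64_t)) {
          std::memcpy(&operand, value.data(), sizeof(operand));
        }
        const uint64_t result = existing + operand;
        new_value->assign(reinterpret_cast<const char*>(&result), sizeof(result));
        return true;
      }
      const char* Name() const override { return "UInt64AddOperator"; }
    };

    // Usage sketch: register the operator on Options::merge_operator, then an
    // increment is a blind write - db->Merge(WriteOptions(), key, operand) -
    // and no read of the existing value is needed at write time.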

We continue to make RocksDB better and today I show how the concurrent insert rate improves by 3X with sync on commit disabled and 2X with it enabled. Hopefully we will get a wiki entry to explain the code changes; until then, thank you Nathan. The new behavior is enabled by setting two configuration options to true: allow_concurrent_memtable_write and enable_write_thread_adaptive_yield.
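
A minimal sketch of enabling the new behavior via the C++ API (I believe db_bench exposes flags with the same names):

    #include "rocksdb/db.h"
    #include "rocksdb/options.h"

    int main() {
      rocksdb::Options options;
      options.create_if_missing = true;
      // Allow multiple writer threads to insert into the memtable in parallel.
      // Only the default skiplist memtable supports this.
      options.allow_concurrent_memtable_write = true;
      // Let writers spin briefly before blocking so the write group stays busy
      // at high concurrency.
      options.enable_write_thread_adaptive_yield = true;

      rocksdb::DB* db = nullptr;
      rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/testdb", &db);
      delete db;
      return s.ok() ? 0 : 1;
    }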

These tests were done using a server with 24 cores (48 HW threads) and enough RAM to cache the database. The test script is here. The workload is write-only with threads doing Put operations on randomly selected keys.

Results with sync on commit disabled


This section has results with sync on commit disabled. The first graph shows the rate in inserts/second for RocksDB with the concurrent memtable enabled (New) and disabled (Old). The second graph shows the ratio of (New/Old) and the speedup is about 3X at high concurrency.



Results with sync on commit enabled


This section has results with sync on commit enabled. The first graph shows the rate in inserts/second for RocksDB with the concurrent memtable enabled (New) and disabled (Old). The second graph shows the ratio of (New/Old) and the speedup is about 2X at high concurrency.




Speedup for MyRocks


I used a smaller server (20 cores instead of 24) to measure the benefit from this change when loading Linkbench tables in parallel. The Linkbench load step was run with generate_nodes=false to disable the single-threaded load of the node table and with maxid1=30M. The database is small enough to fit in cache. The speedup is 1.4X at high concurrency. I will be vague, but there is another configuration change that improves the speedup to 2.4X.