Tuesday, September 20, 2016

MyRocks and InnoDB with large objects and compression

I ran tests to explain the difference between MyRocks and InnoDB when storing large objects and data with varying amounts of compressibility.

Compression in MyRocks is simpler than in InnoDB. You should expect the database to use about 1.1X times the size of the compressed output. When rows compress to 60% of their original size and are 10kb before compression, then each row should use about 6.6kb in the database. The 1.1X adjustment is for space-amplification from leveled compaction.

Predicting the space used for InnoDB is harder. First, large LOB column are not stored inline and overflow pages are not shared. Second, disk pages have a fixed size and you risk using too much space or getting too many page splits when searching for a good value for key_block size. More details are here.

I ran two tests for two types of data. The first test is an insert only workload in PK-order for data with varying amounts of compressibility. The second test determined how fast point queries could be done on that data while rate-limited inserts were in progress. By varying amounts of compressibility I mean that there was one large varchar column per row and that 20%, 45%, 75% or 95% of the data in the column was random and the remainder was constant and easily compressed. Both tests used one connection for inserts. The query test also used one connection for queries.

The test pattern was run twice. In both cases the large column was a varchar. In the first case it had a length between 10,000 and 20,000 characters. In the second case it had a length between 100 and 1000 characters. The database block size was 16kb for MyRocks and InnoDB.

Insert only

For the insert-only workload the space used for MyRocks can be predicted from the compressibility of the data. That is much less true for InnoDB. For example compressed InnoDB uses about the same amount of space for pctRand in 20, 45 and 75.

MyRocks used the least amount of space. InnoDB used much more space when the column was larger (10,000 to 20,000 vs 100 to 1000). Overflow pages are the root cause.

The insert rates are better for MyRocks than for InnoDB. They were also stable for MyRocks and uncompressed InnoDB independent of the compressibility. Rates for uncompressed InnoDB are better than compressed InnoDB. While this wasn't a performance benchmark, it matches many other results I get. It is hard to get performance and compression from InnoDB.  The CPU overhead per insert was similar between MyRocks and uncompressed InnoDB. CPU overheads were mostly larger for compressed InnoDB.

Legend for the data:
  • ips - inserts per second
  • size - database size in GB at test end
  • Mcpu - microseconds of CPU per insert
  • pctRand - percentage of random data in large column
  • engine - rx.zlib-6 is MyRocks with zlib level 6 compression. i6n is InnoDB in MySQL 5.6.26 without compression. i6c is InnoDB in MySQL 5.6.26 with compression.

column up to 20,000       column up to 1000
ips     size    Mcpu      ips     size    Mcpu    pctRand engine
5489      7.7    1090     34468   11      151     20      rx.zlib-6
5540     16      1127     34824   19      149     45
5532     24      1307     34517   27      166     75
5523     30      1467     34701   33      160     95

ips     size    Mcpu      ips     size    Mcpu    pctRand engine
3995     87       933     23470   66      173     20      i6n
3981     87       928     23704   66      174     45
3981     86       917     23487   66`     175     75
3995     88       914     23658   66      176     95

ips     size    Mcpu      ips     size    Mcpu    pctRand engine
3339     36      1064     13429   33      262     20      i6c
2779     32      1278     13124   33      271     45
2133     35      1750      8767   30      392     75
1757     50      2061      7228   38      461     95

Point queries

MyRocks provides the best compression, the best query throughput, and the east CPU overhead per query. My conclusions for InnoDB space consumption are similar to the results from the insert-only workload.

Legend for the data:
  • qps - queries per second
  • size - database size in GB at test end
  • Mcpu - microseconds of CPU per query
  • pctRand - percentage of random data in large column
  • engine - rx.zlib-6 is MyRocks with zlib level 6 compression. i6n is InnoDB in MySQL 5.6.26 without compression. i6c is InnoDB in MySQL 5.6.26 with compression.

qps     size    Mcpu      qps     size    Mcpu    pctRand engine
 984      9.3    4308     2214    11      1585    20      rx.zlib-6
 910     19      4532     2113    19      1627    45
 846     30      4952     2102    27      1601    75
 795     37      5598     2051    33      1691    95

qps     size    Mcpu      qps     size    Mcpu    pctRand engine
 628    113      6240     1302    62      2527    20      i6n
 624    110      6226     1300    63      2501    45
 624    114      6312     1302    63      2536    75
 628    115      6218     1305    66      2474    95

qps     size    Mcpu      qps     size    Mcpu    pctRand engine
 708     38      5560      770    34      4450    20      i6c
 629     39      6643      687    34      4895    45
 513     44      8494      589    30      6046    75
 418     57     10619      576    39      6599    95

Thursday, September 15, 2016

Peak benchmarketing season for MySQL

Maybe this is my XKCD week. With Oracle Open World and Percona Live Amsterdam we are approaching peak benchmarketing season for MySQL. I still remember when MySQL 4.0 was limited to about 10k QPS on 4 and 8 core servers back around 2005, so the 1M QPS results we see today are a reminder of the great progress that has been made thanks to investments by upstream and the community.

In General

But getting 1.5M QPS today compared to 1M QPS last year isn't at the top of the list for many (potential) users of MySQL. I use performance, usability, mangeability, availability and efficiency to explain what matters for web-scale DBMS users. My joke is that each of these makes a different group happy: performance -> marketing, usability -> developers, manageability -> operations, availability -> end users, efficiency -> management.

The benchmarketing results mostly focus on performance. Whether InnoDB does a bit more QPS than Amazon Aurora isn't going to make Aurora less popular. Aurora might have excellent performance but I assume people are deploying it for other reasons. I hope we make it easier to market usability, manageability, availability and efficiency in the MySQL community. MongoDB has gone a long way by marketing and then delivering usability and manageability.

Even when limited to performance we need to share more than peak QPS. Efficiency and quality-of-service (QoS) are equally important. QPS without regard to response time is frequently a bogus metric. I get more IOPs from a disk by using a too large queue depth. But more IOPs at the cost of 100 millisecond disk read response times is an expensive compromise. Even when great QPS is accompanied by a good average response time I want to know if there is lousy QoS from frequent stalls leading to lousy 99th percentile response times. Percona has built their business in part by being excellent at documenting and reducing stalls in InnoDB that occur on benchmarks and real workloads.

I have been guilty of sharing too many benchmark reports in the past that ignored efficiency and QoS. I have been trying to change that this year and hope that other providers of MySQL performance results do the same. This is an example of a result that includes performance, efficiency and QoS.

MyRocks and RocksDB

A lot of the RocksDB marketing message has been about performance. Database access is faster with an embedded database than client/server because you avoid network latency. The MyRocks message has been about efficiency. The target has been better compression and less write amplification than InnoDB so you can use less SSD and lower-endurance SSD. For a workload I care about we see 2X better compression and 1/10 the write rate to storage. This is a big deal.

When starting the project we had many discussions about the amount of performance loss (reduced QPS, higher response time) we could tolerate to get more efficiency. While we were vague the initial goal was to get similar QPS and response time to InnoDB for real workloads, but we were willing to accept some regressions. It turned out that there was no regression and similar performance with much better efficiency is a big deal.

But benchmarks aren't real workloads and there will soon be more benchmark results. Some of these will repeat what I have claimed, others will not. I don't expect to respond to every result that doesn't match my expectations. I will consult when possible.

One last disclaimer. If you care about read-mostly/in-memory workloads then InnoDB is probably an excellent choice. MyRocks can still be faster than InnoDB for in-memory workloads. That is more likely when the bottleneck for InnoDB is page write-back performance. So write-heavy/in-memory can still be a winner for MyRocks.

Seriously, this is the last disclaimer. While we are bickering about benchmark results others are focusing on usability and manageability and getting all of the new deployments.

Wednesday, September 14, 2016

Excited about Percona Live Amsterdam

I look forward to attending and speaking at Percona Live Amsterdam. There are presentations and people with expertise on MySQL, MongoDB and PostgreSQL. This is a great chance to learn from people with different experience and different expertise. I hope you make the most of it.

A great way to learn is to ask questions like:
  • Why are you using X?
  • What problems are you trying to solve?

A great way to end the conversation early and avoid learning anything is to ask questions like:
  • Why aren't you using Y? 
  • You should be using Z! (this isn't a question)

zlib vs zstd for MyRocks running Linkbench

I used an IO-heavy configuration to determine the impact of zstandard vs zlib compression for MyRocks. There was about 1 read from SSD per transaction and decompression is done after each page read from the OS page cache and storage.

The results are impressive. Zstandard compresses like zlib level 1 but uses much less CPU.
  • zstandard reduces CPU by 45% vs zlib level 1 for the load test
  • zstandard reduces CPU by 11% vs zlib level 1 for the query test
  • zstandard gets 8% more TPS vs zlib level 1 for the query test


Configuration for MyRocks is still complex. The templates for the MyRocks my.cnf files for Linkbench and general usage are explained on the wiki. I used no compression for L0, L1, L2, then lz4 for all but the max level and then one of zlib level 1, zlib level 6 or zstd for the max level. The tests used an Aug5 build of MyRocks, so this used kZSTDNotFinalCompression as the build preceded the 1.0 release of zstandard.

The test host has 50G of RAM available to userland, fast storage (5TB of NVMe MLC) and 24 CPU cores with 48 HW threads. The RocksDB block cache was set to 10G, the binlog was disabled but sync-on-commit was disabled for the binlog and RocksDB. Linkbench is run with maxid1=1B, the load test uses 2 clients and the query tests use 16 clients. Query tests are run as 24 1-hour loops and I report metrics from the 24th hour. I used my branch of linkbench and support scripts.


The results for zstandard are impressive. I look forward to using this in production. Thanks Yann.

  • ips/tps - inserts & transactions per second
  • r/i, r/t - iostat reads per insert and per transaction
  • wKB/i, wKB/t - iostat KB written per insert and per transaction
  • Mcpu/i, Mcpu/t - usecs of CPU time per insert and per transaction
  • size - database size in GB
  • rss - mysqld RSS size in GB
  • un, gn, ul, gl - p99 response time in milliseconds for the most frequent transactions (Update Node, Get Node, Update Link, Get Link List)

Results for the load

ips     r/i     rKB/i   wKB/i   Mcpu/i  size    rss     engine
61543   0       0       0.98     81     324     3.1     zstd
61504   0       0       0.98    146     331     2.0     zlib-1
61457   0       0       0.97    153     312     2.2     zlib-6

Results for the 24th hour of the query test

tps    r/t   rKB/t   wKB/t  Mcpu/t  size  rss   un    gn   ul  gl   engine
39366  1.00  10.38   2.36    878    377   12.2  0.6   0.6  1   0.8  zstd
36524  1.00  10.47   2.45    992    381   12.1  0.7   0.6  1   0.9  zlib-1
37233  0.97   9.76   2.30   1002    360   12.0  0.7   0.7  1   0.9  zlib-6

Wednesday, September 7, 2016

Tuning the RocksDB block cache

I spent many years using InnoDB with direct IO and I didn't worry about buffered IO performance. Well, I didn't worry until Domas told me to worry. My focus has switched to RocksDB and now I worry about buffered IO performance. Fortunately, another co-worker (Jens Axboe) promises to make buffered writeback much better.

With direct IO, InnoDB stores compressed and uncompressed pages in the InnoDB buffer pool. It has a clever algorithm to determine how much memory to use for each based on whether the workload appears to be IO or CPU bound. My vague memory is that we tune my.cnf to keep it from being too clever.

With buffered IO, RocksDB manages a block cache for uncompressed blocks and then depends on the OS page cache for compressed blocks. While I think there is an opportunity to be more efficient in that area, that is not the topic for today.

The question today is how to divide memory between the RocksDB block cache and the OS page cache. I have read tuning advice for other buffered IO databases that suggest giving as much RAM as possible to the database. I disagree and my advice is:
  1. If the uncompressed working set fits in the RocksDB block cache then give as much RAM as possible to the block cache.
  2. Else if the compressed working set fits in the OS page cache then give most RAM to the OS page cache by using a small RocksDB block cache.
  3. Else give the RocksDB block cache about 20% of host RAM.
This is a rule of thumb. Sometimes in rule 3 I suggest giving 25% or 30% to the block cache, but I hope you get the point. The goal is to avoid reads from storage by caching more data in RAM. I assume that decompressing a block is much faster than reading it from storage which is more likely when you use zstandard.

This isn't proven unless you accept proof by anecdote. I ran a test with Linkbench on a host with 50G of RAM and a ~350G database. The test was repeated with the RocksDB block cache set to 5G, 10G, 20G and 35G. Using a smaller block cache reduced the storage read cost per transaction by between 10% and 20% using iostat r/s and iostat rKB/s. My advice might not work for you, but might help you to consider your choices before following tuning advice you read on the web.