Wednesday, May 31, 2017

Sysbench, IO-bound & Intel NUC for MySQL 5.6, 5.7 & 8

In this post I share results from sysbench with an IO-bound workload on Intel NUC servers. My previous post shared results from an in-memory workload. This is part of my work from bug 86215 as I identify CPU performance regressions from MySQL 5.6 to 5.7 and 8.

tl;dr

  • The results here are similar to, but not as bad as, the results from the in-memory workload. This is reasonable because CPU regressions can be hidden by IO-bound tests. But if you scroll to the bottom of this page and look at the graph for the point-query test you will see there are problems even for IO-bound tests.
  • For the i5 NUC most of the regression is from 5.6 to 5.7
  • For the i3 NUC, MySQL 5.7 did better especially on range scans but there is still a regression from 5.6 to 8. From many tests I have run it looks like someone did great work in MySQL 5.7 to make range scans more efficient in InnoDB.
  • For long scans the overhead from the default charset/collation in MySQL 8 is significant compared to latin1/latin1_swedish_ci.
Configuration

I tested MySQL using upstream 5.6.35, 5.7.17 and 8.0.1. For 8.0.1 I repeated the test: first with the default charset/collation and then with latin1/latin1_swedish_ci.

My servers are described here. The charts below refer to the servers as i3 NUC and i5 NUC. Compared to the i3 NUC, the i5 NUC is newer, has a CPU that is 2X faster, 2X more RAM and storage that is more than 2X faster.

My usage of sysbench, including the tests run, and the my.cnf used for each database engine is described here. The my.cnf files I share were for the i3 NUC. For the i5 NUC the InnoDB buffer pool and IO capacity options were increased using these values. I use the same server for mysqld and the sysbench clients. The binlog is enabled but sync-on-commit is disabled. Sysbench is run with 4 tables and 40M rows per table. The database is larger than RAM, although in future tests I might need to use larger tables for the i5 NUC, which has twice the memory of the i3 NUC.

Results

I first ran these tests on the i3 NUC and then was curious about performance on a more modern CPU so I set up a cluster of new i5 NUC servers. I am a big fan of Intel NUC and have 6 at home and have yet to trip a circuit breaker, overheat my office or suffer from too much fan noise. The results on the i5 NUC are different than on the i3 NUC. On the i3 NUC MySQL 5.7 has much less of a regression from 5.6, while on the i5 NUC most of the regression is from MySQL 5.6 to 5.7. Perhaps one day I will have time to explain that, but I am too busy right now.

The results are in the order in which tests are run.

There is a small regression from 5.6 to 5.7/8 for update-index on the i5 NUC. This test requires secondary index maintenance as part of the update (read-modify-write of secondary index leaf pages). 
The regression for i5 NUC for update-nonindex is larger than for update-index. This test is less IO-bound than update-index because secondary index maintenance is not needed.
The regression for delete is worse on the i5 NUC. In fact, there isn't a regression on the i3 NUC.

The regression for write-only is worse on the i5 NUC and there isn't a regression on the i3 NUC.
For read-write the regression is worse on the i5 NUC. MySQL 5.7 does better than 5.6 on the i3 NUC and that difference is larger for the larger range scan (10,000 row) than the smaller (100 row).
For read-only using latin1/latin1_swedish_ci improves QPS at 1000 and 10,000 row range scans with MySQL 8 compared to the default charset/collation. Again, MySQL 5.7 does great on the i3 NUC and does better as the range scan increases. MySQL 8 has regressions on both the i3 and i5 NUC and that isn't explained by charset.



For point-query the regression on the i3 NUC starts with MySQL 8 and on the i5 NUC starts with MySQL 5.7.
For insert the regression is larger on the i5 NUC.

Tuesday, May 30, 2017

Sysbench, in-memory & Intel NUC

This continues my series on low-concurrency performance regressions. Here I share results for in-memory sysbench using my home Intel NUC servers. The results were first shared in bug 86215 and the numbers for the charts in this post are here.

tl;dr
  • For the i5 NUC most of the regression is from MySQL 5.6 to 5.7
  • For the i3 NUC MySQL 5.7.17 matches 5.6.35 in many tests and does much better on range scans. Alas I have yet to debug this. But MySQL 8.0.1 is much worse than 5.6.35. At first I only had results for the i3 NUC and assumed the regression was from 5.7 to 8 but now I think that most of the regression comes from MySQL 5.6 to 5.7. Soon I will repeat these tests on servers I use at work that have a better CPU than what I get in the i5 NUC.
  • For long range scans there is more overhead with the default charset/collation in MySQL 8.0.1 and switching back to latin1/latin1_swedish_ci improves QPS.

Configuration

I tested MySQL using upstream 5.6.35, 5.7.17 and 8.0.1. For 8.0.1 I repeated the test: first with the default charset/collation and then with latin1/latin1_swedish_ci.

My servers are described here. The charts below refer to the servers as i3 NUC and i5 NUC. Compared to the i3 NUC, the i5 NUC is newer, has a CPU that is 2X faster, 2X more RAM and storage that is more than 2X faster.

My usage of sysbench, including the tests run, and the my.cnf used for each database engine is described here. The my.cnf files I share were for the i3 NUC. For the i5 NUC the InnoDB buffer pool and IO capacity options were increased using these values. I use the same server for mysqld and the sysbench clients. The binlog is enabled but sync-on-commit is disabled. Sysbench is run with 4 tables and 1M rows per table. The database fits in the InnoDB buffer pool.

Results

I first ran these tests on the i3 NUC and assumed that the regression was mostly from MySQL 5.6 to MySQL 8. Then I acquired the i5 NUC servers and repeated tests and it looks like more of the regression is from MySQL 5.6 to 5.7. Regardless, I hope we can make this better.

The results are in the order in which the tests are run. For several tests MySQL 5.7.17 does better on the i3 NUC than the i5 NUC, relative to MySQL 8. That is most apparent on the tests that do longer range scans. I thought that was a mistake but the result is repeatable. From past tests there appear to be improvements to range scan performance for InnoDB in MySQL 5.7. Perhaps this is an artifact of that improvement.

The results for update-index are typical. The regression is worse on the i5 NUC.
The regression for update-nonindex is worse than for update-index and worse on the i5 NUC.
The regression for delete is worse on the i5 NUC.
The regression for write-only is worse on the i5 NUC.
For read-write with a 100 row range scan MySQL 5.7.17 does better on the i3 NUC. I didn't debug that. Otherwise the regressions are similar between the i3 NUC and i5 NUC. Switching the charset for MySQL 8.0.1 from utf to latin1 has a small impact.
For read-write with a 10,000 row range scan MySQL 5.7.17 does better on the i3 NUC. I didn't debug that. Otherwise the regressions are similar between the i3 NUC and i5 NUC. Switching the charset for MySQL 8.0.1 from utf to latin1 has a big impact.
For the read-only tests using latin1/latin1_swedish_ci improves QPS at 1000 and 10,000 row range scans. MySQL 5.7.17 does great on the i3 NUC but has regressions on the i5 NUC. MySQL 8.0.1 has regressions on the i3 and i5 NUC.
For point-query MySQL 5.7.17 does better on the i3 NUC. Otherwise the regressions from MySQL 5.6 to newer releases are similar.
For insert MySQL 5.7.17 does better on the i3 NUC. Otherwise the regressions from MySQL 5.6 to newer releases are similar.

Short guide on using performance_schema for user & table stats

It took me too long to figure this out while reading the chapter on PS in the MySQL manual. Hopefully this saves me time the next time I need to figure it out. I think it gives me the equivalent of the data I get from IS.user_statistics and IS.table_statistics when using FB MySQL. From a few tests I ran the overhead from the PS was small, maybe less than 5%, while collecting this data.
  1. Add performance_schema=1 to my.cnf
  2. For table stats: select * from table_io_waits_summary_by_table
  3. For user stats: select * from events_statements_summary_by_account_by_event_name
Update - great advice from Mark Leith for MySQL 5.7 and newer
For table stats see the docs and run: SELECT * FROM sys.schema_table_statistics
For user stats see the docs and run: SELECT * FROM sys.user_summary

Monday, May 29, 2017

The history of low-concurrency performance regressions in MySQL 5.6, 5.7 and 5.8

What is a reasonable goal for performance regressions between major releases of MySQL (5.6, 5.7, 5.8)? Some regressions are expected because more features means longer code paths. But what if MySQL 5.7 and 5.8 only get 2/3 of the QPS versus MySQL 5.6 at low concurrency?

The low-concurrency performance regressions from MySQL 5.6 to 5.7 continue in MySQL 8. That isn't a surprise for two reasons. First, more features usually make a DBMS slower. Second, all of the performance marketing for new MySQL releases focuses on high-concurrency workloads. One example of the regression occurs for in-memory sysbench where MySQL 5.7 and MySQL 8 get 60% to 70% of the QPS compared to 5.6, which is the same as writing that 5.6 gets 1.43X to 1.67X the QPS of MySQL 5.7 and MySQL 8.
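The two ways of stating the regression are reciprocals of each other. Here is a minimal sketch of that arithmetic, using only the 60% and 70% figures from the paragraph above. The function name is mine and is not part of any benchmark tool.

```python
def baseline_speedup(relative_qps):
    """Given new_qps / old_qps, return old_qps / new_qps."""
    if relative_qps <= 0:
        raise ValueError("relative QPS must be positive")
    return 1.0 / relative_qps


# 60% and 70% are the relative QPS figures quoted for 5.7 and 8 versus 5.6.
for rel in (0.60, 0.70):
    print(f"{rel:.0%} of the 5.6 QPS means 5.6 gets {baseline_speedup(rel):.2f}X the QPS")
```

A release that gets 60% of the QPS corresponds to 1.67X for MySQL 5.6, and 70% corresponds to 1.43X.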

A deployment that I care about uses MySQL 5.6 and we want to use MySQL 8. Alas, we need to reduce the performance regressions for that to happen. I am excited that the community has another chance to make MySQL better because I don't think this will be fixed without us. I have been writing about this problem since 2013. I am also happy to note that the performance schema isn't the problem. I rediscovered this problem when using MySQL 8 on my home test servers and filed bug 86215. While I have published results showing the regression from earlier releases to MySQL 5.6 -- at this point I just want to get back to the low-concurrency QPS we get from MySQL 5.6. I have good memories from MySQL 4.0, 5.0 and 5.1, but I can do without the excitement of running those releases at web-scale.

I first wrote about this problem in 2013 when comparing MySQL 5.6 to previous releases and filed bugs 68825 and 69236. My favorite low-concurrency performance regression will always be bug 29921. I still remember debugging that over a weekend. Parsers are hard to profile when code is generated and everything is in one function. Fortunately I was able to use rdtsc.

Bugs for low-concurrency performance regressions

Bugs filed by me:
  • 68825 - April 2013 -- performance regressions for single-threaded workloads
  • 69236 - May 2013 -- performance regressions for single-threaded workloads, part 2
  • 74325 - October 2014 (fixed) -- updates to indexed column much slower in 5.7.5
  • 74342 - October 2014 -- InnoDB disk reads at 1 thread much slower in 5.7.5
  • 86215 - May 2017 -- MySQL is much slower in 5.7 than 5.6
Bugs not filed by me:
  • 71130 - December 2013 -- 5.6 SQL thread is much slower than 5.1/5.5
  • 78176 - August 2015 -- 5.6.20 is almost twice as slow as 5.0.96
Finally, there are two interesting bugs for high-concurrency regressions in InnoDB that I want to revisit: bug 74280 (open) and bug 74283 (fixed). I wrote about this in a post on range scan performance.

Content for low-concurrency performance regressions

Reports from me
  • March 2013 - MySQL 5.6: single-threaded, read-only
  • April 2013 - MySQL 5.6: single-thread, update-only
  • May 2013 - MySQL 5.6 versus 4.0 for a read-only workload
  • May 2013 - MySQL 5.6: single-threaded performance regressions
  • September 2013 - MySQL 5.7.2 single threaded performance needs improvement
  • September 2013 - MySQL 4.1 forever
  • October 2014 - Single thread performance in MySQL 5.7.5 versus older releases via sql-bench
  • October 2014 - Low-concurrency performance for point lookups: MySQL 5.7.5 vs previous releases
  • October 2014 - Low-concurrence performance regressions for range queries: MySQL 5.7 vs previous releases
  • October 2014 - Low-concurrency performance for updates and the Heap engine: MySQL 5.7 vs previous releases
  • October 2014 - Low-concurrency performance for updates with InnoDB: MySQL 5.7 vs previous releases
  • October 2014 - Page read performance: MySQL 5.7 vs previous releases
  • October 2014 - Sysbench cached updates: MySQL 5.7 vs previous releases
  • October 2014 - Sysbench IO-bound updates: MySQL 5.7 vs previous releases
  • October 2014 - Updates with secondary index maintenance: 5.7 vs previous releases
  • August, 2015 - Single-threaded linkbench performance for MySQL 5.7, 5.6, WebScale and MyRocks
  • February 2017 - Using modern sysbench to compare MyRocks and InnoDB on a small server
  • February 2017 - Part 2 - sysbench, MyRocks, InnoDB and a small server
  • May 2017 - Sysbench, in-memory & Intel NUC
  • May 2017 - Sysbench, IO-bound & Intel NUC for MySQL 5.6, 5.7 & 8
  • June 2017 - Insert benchmark, in-memory & Intel NUC for MySQL 5.6, 5.7 and 8
  • June 2017 - Insert benchmark, IO-bound & Intel NUC for MySQL 5.6, 5.7 & 8
  • June 2017 - Linkbench, in-memory & Intel NUC for MySQL 5.6, 5.7 & 8
  • June 2017 - Linkbench, IO-bound & Intel NUC for MySQL 5.6, 5.7 & 8
  • June 2017 - Sysbench for MySQL 5.0, 5.1, 5.5, 5.6, 5.7 and 8
  • June 2017 - Impact of perf schema on sysbench at low concurrency
  • June 2017 - Linux perf and the CPU regression in MySQL 5.7

Reports from others:
  • March 2013 - Why MySQL Performance at Low Concurrency is Important
  • December 2013 - Single thread performance regression in 5.6 - Replication

Tuesday, May 23, 2017

MyRocks in MariaDB 10.2.6

MariaDB 10.2.6 is GA. It includes MyRocks, which isn't GA yet but is much easier to use thanks to the hard work of the MariaDB team. This is a big deal if you want performant efficiency or efficient performance. For some workloads it provides better performance than InnoDB. For most (approximately all) workloads it uses much less space and writes much more efficiently. We continue to make RocksDB and MyRocks better and there are many interesting projects in progress. It runs in production, so Domas has begun finding even more things to improve. My contribution is performance evaluation and while I have been busy running tests I have fallen behind on sharing the results. I will do better over the next few months.

Saturday, May 20, 2017

Small servers for database performance tests

I use Intel NUC servers at home to test open source databases. I like them because they are small, quiet and don't use much power. For about 2 years I have been using NUC5i3ryh servers with a 5th gen core i3 CPU, 8gb of RAM, 2.5" SATA disk for the OS and 120gb Samsung 850 EVO m.2 for the database. I used this so much that I replaced the SSD devices last year after one reached the endurance limit.

I am upgrading to a new setup using NUC7i5bnh. This has a 7th gen core i5, 16gb of RAM, 2.5" SATA SSD (Samsung 850 EVO) for the OS and m.2 SSD (Samsung 960 EVO) for the database. It has twice the RAM, twice the CPU and more than twice the IOPs of my old setup. The old and new setups use Ubuntu 16.04 server.

The install was easy with one exception. The old setup used wired networking. This time I enabled wireless after the install finished and that took a few hours to figure out. The important steps are:
  1. Install the HWE enabled kernel to get the drivers that support Intel wireless HW in this server. I didn't do this at first and dmesg | grep iwl showed nothing even though the firmware for that Intel HW was installed. With the HWE kernel I see this in dmesg output Detected Intel(R) Dual Band Wireless AC 8265. The HWE kernel can be selected at the GRUB menu during the install. I assume this step won't be needed once the NUC7i5bnh HW becomes less new.
  2. After the install finishes, install wireless-tools via sudo apt-get install wireless-tools. Without this ifup -v wlan0 failed.
  3. Edit /etc/network/interfaces. This assumes you are using an unsecured network. See below.
I changed /etc/network/interfaces to enable wireless and disable wired using the following contents. After editing the file I tested my changes via sudo ifup -v wlp58s0. If you get it wrong this will take a few minutes to fail. Note that $name is the name for your wireless network and that this works when you are running an unprotected network.
    # The loopback network interface
    auto lo
    iface lo inet loopback
    # Wired networking is not started automatically
    auto eno1
    iface eno1 inet manual
    #iface eno1 inet dhcp
    # Wireless networking is started automatically
    auto wlp58s0
    iface wlp58s0 inet dhcp
    wireless-essid $name
    wireless-mode Managed
First performance comparison is "make -j4" for MySQL 8.0.1 - 1307 seconds for old NUC, 684 seconds for new NUC.

Friday, May 12, 2017

Database IO performance tests

I work with InnoDB and RocksDB storage engines for MySQL and do performance tests to compare both storage engines and storage devices. I have expertise in MySQL and storage engines but not so much in storage devices, so I don't mind running MySQL. Other people have expertise in the layers under MySQL (Linux, storage) and might mind running MySQL. Fortunately, we have benchmark clients for them.

Obviously there is fio and it is my first choice. The fio team even added support for coordinated omission when I asked for it. Alas it can't generate all of the IO patterns that I need.

It would be great to share a link to io.go here if Domas were to publish that code.

I wrote innosim many years ago to simulate InnoDB IO patterns. Docs are here and a helper script to run a sequence of tests is here.

Finally there is db_bench for RocksDB. One challenge with RocksDB is tuning, so I have a script to help with that and use good options to run a sequence of tests in a special pattern. Well, it works as long as I keep the script current and I just updated it today. It runs these benchmarks in order:
  1. fillseq - Put N key-value pairs in key order
  2. overwrite - Put N key-value pairs in random order. Queries done after a key-order load avoid a few sources of overhead that usually should not be avoided, so this shuffles the database.
  3. overwrite - Put key-value pairs in random order. Runs for K seconds.
  4. updaterandom - do read-modify-write in random order. Runs for K seconds.
  5. readwhilewriting - 1 rate-limited writer and T threads doing Get.
  6. seekrandomwhilewriting - 1 rate-limited writer and T threads doing range scans.
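The sequence above can be sketched as a list of db_bench command lines. This is not my actual script, which also sets many tuning options. The database path, the default write rate limit and the choice of flags (--num, --duration, --threads, --use_existing_db, --benchmark_write_rate_limit) are assumptions made for illustration, so check them against your db_bench build.

```python
import shlex


def db_bench_plan(num_keys, seconds, threads,
                  db_dir="/path/to/db", write_rate_bytes=2 * 1024 * 1024):
    """Return one db_bench command line per step, in the order listed above.

    db_dir and write_rate_bytes are placeholder values, not tuned settings.
    """
    common = ["db_bench", f"--db={db_dir}", f"--num={num_keys}"]
    timed = ["--use_existing_db=1", f"--duration={seconds}"]
    concurrent = timed + [f"--threads={threads}",
                          f"--benchmark_write_rate_limit={write_rate_bytes}"]
    steps = [
        ("fillseq", ["--use_existing_db=0"]),    # load N pairs in key order
        ("overwrite", ["--use_existing_db=1"]),  # shuffle with N random Puts
        ("overwrite", timed),                    # random Puts for K seconds
        ("updaterandom", timed),                 # read-modify-write for K seconds
        ("readwhilewriting", concurrent),        # T readers plus a limited writer
        ("seekrandomwhilewriting", concurrent),  # T scanners plus a limited writer
    ]
    return [shlex.join(common + [f"--benchmarks={name}"] + extra)
            for name, extra in steps]


for line in db_bench_plan(num_keys=1_000_000, seconds=600, threads=8):
    print(line)
```

Each step after the first reuses the database left by the previous step, which is why the order matters.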

Thursday, May 11, 2017

InnoDB, MyRocks and TokuDB on the insert benchmark

This post shows some of the improvements we recently made to RocksDB to reduce response time variance for write-heavy workloads. This work helps RocksDB, MyRocks and MongoRocks.
This also extends the result I shared for the impact of the InnoDB redo log size on insert benchmark load throughput. Here I add results for MyRocks and TokuDB. In case you haven't heard, the goal for MyRocks is to provide similar performance to InnoDB with much better space and write efficiency. We have real workloads where InnoDB uses 4X more space than MyRocks.
tl;dr
  • Sorry for overusing the word much.
  • Write response time variance is much better in recent RocksDB builds
  • All engines have room for improvement to reduce write response time variance
  • The MyRocks advantage increases when moving from faster to slower storage
  • MyRocks insert rates are not slowed when compression is enabled.
  • While the write rates I demonstrate here for all engines are impressive, the rates might not be sustainable if I expect the SSD devices to last for more than one year. Of course, better write-efficiency from MyRocks helps a lot with endurance.
Making RocksDB better

My summary of the recent improvements to RocksDB is too vague. I hope the authors write real posts soon to explain their contributions. The work includes:
  • perform L0 to L0 compaction to reduce the number of L0 files when the L1 is busy
  • add compaction_pri=kMinOverlappingRatio to be more clever about the key on which SST files start (or stop) to reduce write-amplification during compaction
  • increase delayed_write_rate from 2MB/second to 16MB/second to smooth the impact from throttling
Configuration

While I usually run the insert benchmark in 3 steps (insert only, insert & queries, insert & queries) I only share results for the insert only step here. Be wary about drawing too many conclusions from an insert-only workload. I will soon share results for the insert & queries steps.

What I wrote in the Configuration section of my previous post is still valid. I tested InnoDB, MyRocks and TokuDB and the following names describe the engine and configuration:
  • Rocks.def10.kmin.mar - MyRocks from March 17 using the def10 my.cnf without compression and with kMinOverlappingRatio
  • Rocks.def10.nokmin.mar - MyRocks from March 17 using the def10 my.cnf without compression and without kMinOverlappingRatio.
  • Rocks.def10.kmin.apr - MyRocks from April 14 using the def10 my.cnf without compression and with kMinOverlappingRatio.
  • Rocks.def10.nokmin.apr - MyRocks from April 14 using the def10 my.cnf without compression and without kMinOverlappingRatio.
  • Rocks.def10.kmin.apr.zstd - MyRocks from April 14 using the def10 my.cnf with zstd compression and with kMinOverlappingRatio.
  • Rocks.def10.nokmin.apr.zstd - MyRocks from April 14 using the def10 my.cnf with zstd compression and without kMinOverlappingRatio.
  • InnoDB-5.7.17.redoX - InnoDB from MySQL 5.7.17 with an Xgb redo log.
  • InnoDB-5.6.35.redoX - InnoDB from MySQL 5.6.35 with an Xgb redo log.
  • TokuDB-5.7.17-none - TokuDB without compression from Percona Server for MySQL 5.7.17
  • TokuDB-5.7.17-zlib - TokuDB with zlib compression from Percona Server for MySQL 5.7.17
I compiled MySQL from source for MyRocks (FB MySQL) and InnoDB (upstream MySQL). I used Percona Server for TokuDB. The TokuDB block cache was 150gb for in-memory workloads and 10gb for IO-bound workloads and the my.cnf is here. The InnoDB buffer pool was 180gb for in-memory workloads and 35gb for IO-bound workloads. The my.cnf for InnoDB is here for 5.6.35 and here for 5.7.17. The base my.cnf for MyRocks is here. It needs to be edited (grep for TODO) because the MyRocks my.cnf files listed above differ in a few ways:
  1. The value for RocksDB block_cache_size (150gb for in-memory, 10gb for IO-bound).
  2. Whether compaction_pri=kMinOverlappingRatio was set
  3. Whether compression was enabled. With compression enabled I used none for L0, L1 & L2, then LZ4 starting at L3 and finally bottommost_compression=kZSTDCompression.
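The per-level compression choice in item 3 can be written as a RocksDB option string. The sketch below builds one, assuming the default of 7 levels and assuming the result is pasted into the column family options in my.cnf. The function name and the exact string layout are mine, so compare the output with the my.cnf linked above before using it.

```python
def compression_options(num_levels=7, first_compressed_level=3):
    """Build a column-family option fragment for per-level compression.

    Levels below first_compressed_level are uncompressed, the remaining
    levels use LZ4 and the bottommost level is overridden to use ZSTD.
    num_levels=7 is the RocksDB default and is an assumption here.
    """
    per_level = [
        "kNoCompression" if level < first_compressed_level else "kLZ4Compression"
        for level in range(num_levels)
    ]
    return ("compression_per_level=" + ":".join(per_level)
            + ";bottommost_compression=kZSTDCompression")


print(compression_options())
```

Leaving L0, L1 and L2 uncompressed avoids spending CPU on data that will soon be compacted again, while most of the data lives in the larger levels where compression saves the most space.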
In-memory load

The database fits in the database cache for the in-memory load. There should be no reads to storage and many writes to storage.

For fast SSD InnoDB in MySQL 5.7.17 has the best throughput and is much faster than in 5.6.35. A larger InnoDB redo log improves throughput. TokuDB is faster than MyRocks, but that changes when the workload switches from insert-only to inserts and queries.

Results are different for the server with slow SSD. InnoDB depends on fast random writes and slow SSD provides less of that than fast SSD. Here MyRocks is faster than InnoDB in 5.7.17 except when a large redo log (32gb) is used. It is faster than InnoDB in 5.6.35 in all cases. I did not test TokuDB on this hardware.

From the In-memory load metrics section below, InnoDB writes more to storage per insert (see wKB/i), uses more CPU per insert (see Mcpu/i) and uses more space than MyRocks (see size) with or without compression. A larger redo log reduces the InnoDB write rate per insert by more than half. The kMinOverlappingRatio option in MyRocks greatly reduces the write rate to storage.

Compared to MyRocks, TokuDB uses more than 2X the disk space in the database directory (see size) and uses more CPU/insert (see Mcpu/i). TokuDB might be trading more space amplification to get less write amplification and faster inserts as explained by the RUM Conjecture. MyRocks doesn't make that trade with leveled compaction. It does with universal compaction, which I did not test.

In-memory load metrics

See the In-memory load metrics section in my previous post for the legend for the tables below.

Fast SSD

ips.av  ips.99  wkb/i   Mcpu/i  size    rss  wmb/s   cpu  engine
120250   4386   0.78    157      61     2.0  186.1  18.9  Rocks.def10.kmin.mar
132979   7786   0.78    156      63    22.0  206.7  20.8  Rocks.def10.kmin.apr
232666   9925   5.31    176      97   104.4 1235.1  41.0  InnoDB-5.7.17.redo4
295683  18009   1.66    154      97   104.1  490.6  45.5  InnoDB-5.7.17.redo32
 92404   3504   5.11    201      97   106.2  472.5  18.6  InnoDB-5.6.35.redo4
171116  10165   1.74    175      97   106.2  297.9  30.0  InnoDB-5.6.35.redo32
207555  10852   0.78    212     146   176.3  162.3  44.0  TokuDB-5.7.17-none

Slow SSD

ips.av ips.99  wkb/i Mcpu/i size    rss   wmb/s   cpu  engine
115607   4315  0.72     131   61    1.9   163.2   15.2 Rocks.def10.kmin.mar
123793   5834  1.16     168   64    1.8   285.5   20.8 Rocks.def10.nokmin.mar
130548   7724  0.72     130   61   21.8   184.2   17.0 Rocks.def10.kmin.apr
128833   7555  1.09     154   69   20.4   280.9   19.8 Rocks.def10.nokmin.apr
 68672   3277  4.95     165   97  104.4   339.9   11.3 InnoDB-5.7.17.redo4
177179   7231  0.85     130   97  104.3   151.3   23.0 InnoDB-5.7.17.redo32
 38058   1691  5.01     176   97  106.3   190.6    6.7 InnoDB-5.6.35.redo4
 71317   2914  1.26     145   97  106.3    89.5   10.3 InnoDB-5.6.35.redo32

IO-bound load

Things are different for the IO-bound load compared to the in-memory load. MyRocks is strictly faster than InnoDB for the IO-bound load and the redo log size doesn't make a big difference for InnoDB. InnoDB in 5.7.17 does better than in 5.6.35. MyRocks insert rates don't drop when compression is enabled, while they do for TokuDB.

Secondary index maintenance for InnoDB is read-modify-write. Whether the reads are done as part of the insert or deferred to the change buffer, eventually those random reads must get done and they use IO capacity that is then not available for random writes. Statements can also stall on page writeback when the buffer pool is full and pages at the tail of the LRU are dirty (insert joke about single page flush here).

Results for fast SSD and slow SSD are similar but the difference between MyRocks and InnoDB is larger on slow SSD because InnoDB depends more on random IO performance. The slow SSD results also show the benefit from using kMinOverlappingRatio when compression is enabled. The IO-bound load metrics section shows the benefit from kMinOverlappingRatio -- wKB/i and rKB/i are about 2/3 the rate compared to MyRocks without that option.

TokuDB has the fastest insert rate but it also uses much more space (see size) and CPU (see Mcpu/i) compared to MyRocks. I suspect it is trading more space amplification to get less write amplification as explained by the RUM Conjecture.

Compared to MyRocks, InnoDB writes more to storage per insert (see wKB/i), uses more CPU per insert (see Mcpu/i) and uses more space (see size) even compared to uncompressed MyRocks.
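To put numbers on that comparison, here is a small sketch that divides one InnoDB row by one MyRocks row from the fast SSD table in the IO-bound load metrics section. Picking Rocks.def10.kmin.apr (uncompressed) and InnoDB-5.7.17.redo32 is my choice for illustration, and other pairs give different ratios.

```python
# Values copied from the fast SSD table in the IO-bound load metrics section.
ROWS = {
    "Rocks.def10.kmin.apr": {"wkb/i": 1.98, "Mcpu/i": 174, "size": 225},
    "InnoDB-5.7.17.redo32": {"wkb/i": 10.11, "Mcpu/i": 535, "size": 402},
}


def ratios(engine, baseline, rows=ROWS):
    """Return engine / baseline for every metric in the baseline row."""
    return {metric: rows[engine][metric] / rows[baseline][metric]
            for metric in rows[baseline]}


for metric, value in ratios("InnoDB-5.7.17.redo32", "Rocks.def10.kmin.apr").items():
    print(f"{metric}: InnoDB is {value:.1f}X MyRocks")
```

For this pair InnoDB writes about 5.1X more per insert, uses about 3.1X more CPU per insert and needs about 1.8X the space.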


IO-bound load metrics

See the IO-bound load metrics section in my previous post for the legend for the tables below.

Fast SSD

ips.av ips.99  r/i   rkb/i wkb/i Mcpu/i size  rss   r/s  rmb/s  wmb/s   cpu  engine
126727   4671  0     0.23  2.07  175    226   4.9   266   29.6  523.2  22.2  Rocks.def10.kmin.mar
129879   7638  0     0.34  1.98  174    225  12.8   383   43.6  513.3  22.7  Rocks.def10.kmin.apr
128816   7687  0     0.13  1.64  209    101  14.1   147   16.6  421.0  27.0  Rocks.def10.kmin.apr.zstd
 61711   3353  0.09  1.48 14.65  562    402   4.0  5698   91.2  904.3  34.7  InnoDB-5.7.17.redo4
 63959   3419  0.09  1.43 10.11  535    402   4.0  5732   91.7  646.6  34.2  InnoDB-5.7.17.redo32
 45874   1790  0.11  1.71 11.25  585    402  43.2  4915   78.6  516.0  26.8  InnoDB-5.6.35.redo4
 59312   2473  0.10  1.56  8.36  593    403  43.2  5776   92.4  495.7  35.2  InnoDB-5.6.35.redo32
156250   4150  0.01  0.82  2.97  340    400  11.2  1854  128.1  464.5  53.1  TokuDB-5.7.17-none
123259   1535  0.01  0.25  1.00  540    143  11.2   854   30.4  123.5  66.6  TokuDB-5.7.17-zlib

Slow SSD

ips.av ips.99  r/i   rkb/i wkb/i Mcpu/i size  rss   r/s  rmb/s  wmb/s   cpu  engine
122963   5280  0.01  1.46  1.55  192    232  12.8  1443  179.6  380.0  23.7  Rocks.def10.nokmin.apr
127316   7283  0     0.32  0.91  148    225  12.9   334   40.8  230.1  18.8  Rocks.def10.kmin.apr
125842   7420  0     0.11  0.73  177    101  14.1   112   13.4  182.7  22.3  Rocks.def10.kmin.apr.zstd
 82771   1980  0.01  0.66  1.11  259    114  13.7   445   54.5  182.7  21.4  Rocks.def10.nokmin.apr.zstd
 34101   1975  0.10  1.60 10.91  402    400  39.7  3412   54.6  372.0  13.7  InnoDB-5.7.17.redo4
 49244   1750  0.10  1.56  6.39  454    403  40.0  4803   76.9  314.5  22.4  InnoDB-5.7.17.redo32
 17654    605  0.18  2.79 12.22  448    398  43.2  3083   49.3  215.7   7.9  InnoDB-5.6.35.redo4
 30734   1276  0.09  1.50  5.01  407    403  43.2  2882   46.1  153.9  12.5  InnoDB-5.6.35.redo32

In-memory throughput over time

Fast SSD

InnoDB variance and throughput are much worse with a 4gb vs 32gb redo log. MyRocks variance has improved significantly from the March 17 to the April 14 build thanks to support for L0 to L0 compaction. TokuDB has a few stalls, similar to the March 17 build of MyRocks.



Slow SSD

InnoDB suffers from too much variance regardless of redo log size. It needs random write IOPs and the slow SSD has less of that than the fast SSD. MyRocks with the April 14 build is much better than the March 17 build because of the change to perform L0 to L0 compaction to reduce the chance of stalls.





IO-bound throughput over time

Fast SSD

This is similar to the in-memory results. InnoDB variance and throughput are much worse with a 4gb vs 32gb redo log. MyRocks variance has improved significantly from the March 17 to the April 14 build thanks to support for L0 to L0 compaction. TokuDB has a few stalls.




Slow SSD

InnoDB has too much variance. Redo log size isn't significant but 5.7.17 is better than 5.6.35. For MyRocks the April 14 build is better than March 17 and the kMinOverlappingRatio feature helps.




Tuesday, May 9, 2017

Benchmark(et)ing with InnoDB redo log size

What is a typical InnoDB redo log size? By size I mean the product of innodb_log_file_size and innodb_log_files_in_group. A typical setup for me is 4gb via 2gb/file and 2 files.
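As a minimal sketch of that definition, the helper below multiplies the two settings. The function name is mine. Only the 2gb/file and 2 files example comes from the text above.

```python
GB = 1024 ** 3


def redo_log_size(innodb_log_file_size, innodb_log_files_in_group):
    """Total redo log size in bytes: size per file times number of files."""
    return innodb_log_file_size * innodb_log_files_in_group


# My typical setup: 2gb per file and 2 files.
print(redo_log_size(2 * GB, 2) // GB, "gb")
```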

Using a large InnoDB redo log can help performance but it comes at a cost. I have been using a 4gb redo log in my tests because that is what we frequently use in production. Dimitri mentioned using a 32gb redo log to show great throughput for InnoDB in recent releases. Here I share results from the insert benchmark with a 4gb, 8gb, 16gb and 32gb redo log.

tl;dr - conclusions specific to my test
  1. A larger redo log improves throughput
  2. A larger redo log helps more with slower storage than with faster storage because page writeback is more of a bottleneck with slower storage and a larger redo log reduces writeback.
  3. A larger redo log can help more when the working set is cached because there are no stalls from storage reads and storage writes are more likely to be a bottleneck.
  4. InnoDB in MySQL 5.7.17 is much faster than 5.6.35 in all cases except IO-bound + fast SSD
A larger redo log means that less checkpoint IO will be done and less IO is good. But there are costs with a larger redo log. It uses more storage and you might not want to spend 32gb of SSD for the redo log. It makes crash recovery slower. With buffered IO you can have 32gb of redo log competing to stay in the OS page cache and waste memory. Finally when using a disk-array, the read-before-write that occurs with buffered IO means that redo log writes may be wasting too much IO.

Configuration

I used my insert benchmark client with two test servers -- one with slow SSD that does ~10k IOPs and one with fast SSD that does more than 100k IOPs. Both servers have 24 cores, 48 HW-threads and 256gb of RAM. Tests were repeated for an in-memory workload (insert 500m rows) and an IO-bound workload (limit server RAM to 50gb and insert 2b rows). The test used 16 tables and 16 clients, and each client inserted rows into a different table. Typical command lines were:
    # insert 500m rows
    bash iq.sh innodb "" /path/to/bin/mysql /path/to/datadir md2 \
        1 16 no no no 0 no 500000000
    # insert 2b rows
    bash iq.sh innodb "" /path/to/bin/mysql /path/to/datadir md2 \
        1 16 no no no 0 no 2000000000

Tests were run with upstream MySQL 5.6.35 and 5.7.17. I used these my.cnf files for 5.7.17 and 5.6.35. I compiled MySQL from source and used jemalloc. The benchmark client ran on the same host as mysqld to remove variance from the network. The binlog is enabled but sync-on-commit is disabled for both InnoDB and the binlog to support faster insert rates.

In-memory load

For the in-memory load the redo log size has more impact with slow SSD than with fast SSD. From the In-memory load metrics section below you can see the benefit from a larger redo log. It reduces the amount of data written to storage per insert by more than half (see the wKB/i column) when going from a 4gb to a 32gb redo log. For fast SSD with 5.7.17, InnoDB writes to storage 5.31 KB/insert with a 4gb redo log versus 1.66 KB/insert with a 32gb redo log. Similar reductions occur for slow SSD and for 5.6.35. The larger redo log helps slow SSD more than fast SSD based on the reduction in wKB/i.

In-memory load metrics

Legend:
  • ips.av, ips.99 - average and p99 insert rates. The 99th percentile is computed from the per-interval rates where the interval is ~10 seconds.
  • wKB/i, rKB/i - KB written to and KB read from storage per inserted row. Measured by iostat.
  • r/i - storage reads per inserted row. Measured by iostat.
  • Mcpu/i - relative CPU overhead per inserted row. Measured by vmstat us and sy columns.
  • size - database size in GB at test end
  • rss - mysqld RSS in GB at test end
  • wMB/s, rMB/s - average MB/second written to and read from storage. Measured by iostat.
  • r/s - average storage reads/second. Measured by iostat.
  • cpu - average CPU utilization. Measured by vmstat us and sy columns.
  • engine - database engine. none means that no compression was used. 150g and 35g are the size of the InnoDB buffer pool. redoX is the size of the InnoDB redo log.
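The legend does not spell out how the per-insert columns are derived. A minimal sketch follows, assuming each per-insert value is the matching per-second rate divided by ips.av. The helper name and the scale factors (1000 for MB to KB, 1,000,000 for Mcpu/i) are my assumptions. They reproduce the printed values for the first Fast SSD in-memory row below.

```python
def per_insert(rate_per_sec: float, ips: float, scale: float = 1.0) -> float:
    """Convert a per-second rate into a per-inserted-row value."""
    return rate_per_sec * scale / ips


# First Fast SSD in-memory row: inno5717.none.150g.redo4
ips_av, wmb_per_sec, cpu = 232666, 1235.1, 41.0

wkb_i = per_insert(wmb_per_sec, ips_av, scale=1000)   # MB/s -> KB per insert
mcpu_i = per_insert(cpu, ips_av, scale=1_000_000)     # CPU utilization per insert, scaled

print(round(wkb_i, 2), round(mcpu_i))  # matches the wKB/i and Mcpu/i columns for that row
```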

Fast SSD

ips.av  ips.99  wKB/i   Mcpu/i  size    rss  wMB/s   cpu  engine
232666   9925   5.31    176      97   104.4 1235.1  41.0  inno5717.none.150g.redo4
286041  16559   3.46    162      97   104.1  989.0  46.4  inno5717.none.150g.redo8
292740  17587   2.41    157      97   104.3  706.5  46.0  inno5717.none.150g.redo16
295683  18009   1.66    154      97   104.1  490.6  45.5  inno5717.none.150g.redo32
 92404   3504   5.11    201      97   106.2  472.5  18.6  inno5635.none.150g.redo4
134844   5987   2.97    192      97   106.1  400.8  25.8  inno5635.none.150g.redo8
169319   9247   2.02    176      97   106.1  342.5  29.8  inno5635.none.150g.redo16
171116  10165   1.74    175      97   106.2  297.9  30.0  inno5635.none.150g.redo32

Slow SSD

ips.av  ips.99  wKB/i  Mcpu/i  size    rss  wMB/s   cpu  engine
 68672    3277   4.95     165    97  104.4  339.9  11.3  inno5717.none.150g.redo2
 95896    4518   2.67     154    97  104.4  256.4  14.7  inno5717.none.150g.redo4
131787    5868   1.47     135    97  104.4  194.2  17.8  inno5717.none.150g.redo8
177179    7231   0.85     130    97  104.3  151.3  23.0  inno5717.none.150g.redo16
 38058    1691   5.01     176    97  106.3  190.6   6.7  inno5635.none.150g.redo2
 52493    2537   2.74     156    97  106.3  144.0   8.2  inno5635.none.150g.redo4
 61043    2660   1.64     151    97  106.3  100.2   9.2  inno5635.none.150g.redo8
 71317    2914   1.26     145    97  106.3   89.5  10.3  inno5635.none.150g.redo16
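
The wKB/i reductions for the in-memory load can be recomputed from the two tables above. This is a small sketch that uses the wKB/i values for the smallest and largest redo log in each table. The dictionary layout and function name are mine.

```python
# (wKB/i with the smallest redo log, wKB/i with the largest redo log),
# copied from the in-memory Fast SSD and Slow SSD tables above.
wkb_per_insert = {
    ("fast SSD", "5.7.17"): (5.31, 1.66),
    ("fast SSD", "5.6.35"): (5.11, 1.74),
    ("slow SSD", "5.7.17"): (4.95, 0.85),
    ("slow SSD", "5.6.35"): (5.01, 1.26),
}


def reduction_pct(before: float, after: float) -> float:
    """Percent reduction in KB written per insert."""
    return 100.0 * (before - after) / before


reductions = {key: reduction_pct(*pair) for key, pair in wkb_per_insert.items()}
for (storage, version), pct in reductions.items():
    print(f"{storage} {version}: {pct:.1f}% less written per insert")
```

For both versions the reduction is larger on slow SSD than on fast SSD, and every case is above 50%.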

IO-bound load

For the IO-bound load and fast SSD the redo log size has little impact for 5.7.17 and some impact for 5.6.35. For slow SSD the redo log size has more impact. But compared to the in-memory results above, the impact from redo log size is much smaller for the IO-bound workload. From the IO-bound load metrics section below, the reduction in wKB/i is smaller for IO-bound than for in-memory workloads, but increasing the redo log from 4gb to 32gb still cuts the per-insert write rate roughly in half for slow SSD (see the wKB/i column).

Per the ips.av column InnoDB with 5.7.17 isn't much faster than 5.6.35 for the fast SSD. It is still much faster for the slow SSD.

IO-bound load metrics

Fast SSD

ips.av ips.99  r/i   rKB/i wKB/i Mcpu/i size  rss   r/s  rMB/s  wMB/s   cpu  engine
 61711   3353  0.09  1.48 14.65  562    402   4.0  5698   91.2  904.3  34.7  inno5717.none.35g.redo2
 62929   3471  0.09  1.47 11.39  546    402   4.0  5771   92.3  717.0  34.4  inno5717.none.35g.redo4
 63593   2542  0.10  1.54 11.01  554    404   4.0  6133   98.1  700.2  35.2  inno5717.none.35g.redo8
 63959   3419  0.09  1.43 10.11  535    402   4.0  5732   91.7  646.6  34.2  inno5717.none.35g.redo16
 45874   1790  0.11  1.71 11.25  585    402  43.2  4915   78.6  516.0  26.8  inno5635.none.35g.redo2
 58682   2500  0.10  1.55  8.57  589    403  43.2  5667   90.7  502.8  34.6  inno5635.none.35g.redo4
 59179   2285  0.10  1.52  8.30  581    402  43.2  5607   89.7  491.2  34.4  inno5635.none.35g.redo8
 59312   2473  0.10  1.56  8.36  593    403  43.2  5776   92.4  495.7  35.2  inno5635.none.35g.redo16

Slow SSD

ips.av ips.99  r/i   rKB/i wKB/i Mcpu/i size  rss   r/s  rMB/s  wMB/s   cpu  engine
 34101   1975  0.10  1.60 10.91  402    400  39.7  3412   54.6  372.0  13.7  inno5717.none.35g.redo2
 46712   1837  0.09  1.49  7.29  433    403  39.9  4362   69.8  340.7  20.2  inno5717.none.35g.redo4
 45231   1649  0.10  1.54  5.34  420    404  40.0  4355   69.7  241.5  19.0  inno5717.none.35g.redo8
 49244   1750  0.10  1.56  6.39  454    403  40.0  4803   76.9  314.5  22.4  inno5717.none.35g.redo16
 17654    605  0.18  2.79 12.22  448    398  43.2  3083   49.3  215.7   7.9  inno5635.none.35g.redo2
 26607    863  0.12  1.86  7.56  438    402  43.2  3099   49.6  201.2  11.7  inno5635.none.35g.redo4
 28069   1143  0.09  1.51  5.50  398    403  43.2  2650   42.4  154.2  11.2  inno5635.none.35g.redo8
 30734   1276  0.09  1.50  5.01  407    403  43.2  2882   46.1  153.9  12.5  inno5635.none.35g.redo16
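
To compare the two versions for the IO-bound load, this sketch computes the ratio of ips.av for 5.7.17 to ips.av for 5.6.35 at each redo log size. The values are copied from the two IO-bound tables above. The variable and function names are mine.

```python
# ips.av in table order (redo2, redo4, redo8, redo16) from the IO-bound tables above.
labels = ["redo2", "redo4", "redo8", "redo16"]
ips_av = {
    "fast SSD": {"5.7.17": [61711, 62929, 63593, 63959],
                 "5.6.35": [45874, 58682, 59179, 59312]},
    "slow SSD": {"5.7.17": [34101, 46712, 45231, 49244],
                 "5.6.35": [17654, 26607, 28069, 30734]},
}


def speedups(storage: str) -> list:
    """Ratio of 5.7.17 to 5.6.35 average insert rate for each redo log size."""
    new, old = ips_av[storage]["5.7.17"], ips_av[storage]["5.6.35"]
    return [n / o for n, o in zip(new, old)]


for storage in ips_av:
    row = ", ".join(f"{label}={s:.2f}x" for label, s in zip(labels, speedups(storage)))
    print(f"{storage}: {row}")
```

On slow SSD the ratio stays above 1.5x for every redo log size. On fast SSD it is below 1.1x for every size except the smallest.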

In-memory throughput over time

The results above show average throughput and that hides a lot of interesting behavior. We expect throughput over time to not suffer from variance -- for both InnoDB and for MyRocks. For many of the results below there is a lot of variance (jitter).
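
One way to put numbers on jitter and on throughput that drops over time is sketched below. It assumes the per-interval insert rates are available as a list. The sample lists are hypothetical and are not taken from these tests. The two metrics (coefficient of variation, and the ratio of second-half to first-half average) are my choice, not something the graphs report.

```python
from statistics import mean, pstdev


def coefficient_of_variation(rates):
    """Standard deviation divided by the mean; larger means more jitter."""
    return pstdev(rates) / mean(rates)


def late_to_early_ratio(rates):
    """Average rate in the second half divided by the average in the first half."""
    half = len(rates) // 2
    return mean(rates[half:]) / mean(rates[:half])


# Hypothetical per-interval insert rates, for illustration only.
steady = [100_000, 101_000, 99_000, 100_500, 99_500, 100_000]
jittery = [150_000, 40_000, 160_000, 30_000, 155_000, 65_000]
declining = [140_000, 130_000, 110_000, 90_000, 70_000, 60_000]

for name, rates in [("steady", steady), ("jittery", jittery), ("declining", declining)]:
    print(f"{name}: cv={coefficient_of_variation(rates):.3f} "
          f"late/early={late_to_early_ratio(rates):.2f}")
```

A ratio well below 1.0 suggests throughput that drops over time, while a high coefficient of variation with a ratio near 1.0 suggests jitter around a stable average.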

Fast SSD

InnoDB suffers from jitter with 4gb and 8gb redo logs and the problem is much worse for 5.6.35. Another problem occurs with a 4gb redo log -- throughput drops over time. That problem repeats for an 8gb redo log with 5.6.35.






Slow SSD

With slow SSD, jitter is much worse for 5.7.17 than for 5.6.35 and the difference is more significant with the larger redo log. Throughput also drops over time for 5.7.17. This is a surprising result.






IO-bound throughput over time

Fast SSD

MySQL 5.6.35 suffers from jitter while 5.7.17 suffers from throughput dropping over time. The drop in 5.7 might be reasonable, though, if it occurs because the working set doesn't fit in cache and storage reads must be done during index maintenance. It is also interesting that 5.7.17 becomes as slow as 5.6.35 over time.






Slow SSD

Both 5.6.35 and 5.7.17 suffer from jitter.