Small Datum: May 2017

Wednesday, May 31, 2017

Sysbench, IO-bound & Intel NUC for MySQL 5.6, 5.7 & 8

Update - the regression isn't as bad as I have been reporting. Read this post to understand why.

In this post I share results from sysbench with an IO-bound workload on Intel NUC servers. My previous post shared results from an in-memory workload. This is part of my work from bug 86215 as I identify CPU performance regressions from MySQL 5.6 to 5.7 and 8.

tl;dr

The results here are similar to, but not as bad as, the results from the in-memory workload. This is reasonable because CPU regressions can be hidden by IO-bound tests. But if you scroll to the bottom of this page and look at the graph for the point-query test you will see there are problems even for IO-bound tests.
For the i5 NUC most of the regression is from 5.6 to 5.7
For the i3 NUC, MySQL 5.7 did better especially on range scans but there is still a regression from 5.6 to 8. From many tests I have run it looks like someone did great work in MySQL 5.7 to make range scans more efficient in InnoDB.
For long scans the overhead from the default charset/collation in MySQL 8 is significant compared to latin1/latin1_swedish_ci.

Configuration

I tested MySQL using upstream 5.6.35, 5.7.17 and 8.0.1. For 8.0.1 I repeated the test: first with the default charset/collation and then with latin1/latin1_swedish_ci.

My servers are described here. The charts below refer to the servers as i3 NUC and i5 NUC. Compared to the i3 NUC, the i5 NUC is newer, has a CPU that is 2X faster, 2X more RAM and storage that is more than 2X faster.

My usage of sysbench, including the tests run, and the my.cnf used for each database engine is described here. The my.cnf files I share were for the i3 NUC. For the i5 NUC the InnoDB buffer pool and IO capacity options were increased using these values. I use the same server for mysqld and the sysbench clients. The binlog is enabled but sync-on-commit is disabled. Sysbench is run with 4 tables and 40M rows per table. The database is larger than RAM, although in future tests I might need to use larger tables for the i5 NUC, which has twice the memory of the i3 NUC.

Results

I first ran these tests on the i3 NUC and then was curious about performance on a more modern CPU so I set up a cluster of new i5 NUC servers. I am a big fan of Intel NUC and have 6 at home and have yet to trip a circuit breaker, overheat my office or suffer from too much fan noise. The results on the i5 NUC are different than on the i3 NUC. On the i3 NUC MySQL 5.7 has much less of a regression from 5.6, while on the i5 NUC most of the regression is from MySQL 5.6 to 5.7. Perhaps one day I will have time to explain that, but I am too busy right now.

The results are in the order in which tests are run.

There is a small regression from 5.6 to 5.7/8 for update-index on the i5 NUC. This test requires secondary index maintenance as part of the update (read-modify-write of secondary index leaf pages).

The regression for i5 NUC for update-nonindex is larger than for update-index. This test is less IO-bound than update-index because secondary index maintenance is not needed.

The regression for delete is worse on the i5 NUC. In fact, there isn't a regression on the i3 NUC.

The regression for write-only is worse on the i5 NUC and there isn't a regression on the i3 NUC.

For read-write the regression is worse on the i5 NUC. MySQL 5.7 does better than 5.6 on the i3 NUC and that difference is larger for the larger range scan (10,000 row) than the smaller (100 row).

For read-only using latin1/latin1_swedish_ci improves QPS at 1000 and 10,000 row range scans with MySQL 8 compared to the default charset/collation. Again, MySQL 5.7 does great on the i3 NUC and does better as the range scan increases. MySQL 8 has regressions on both the i3 and i5 NUC and that isn't explained by charset.

For point-query the regression on the i3 NUC starts with MySQL 8 and on the i5 NUC starts with MySQL 5.7.

For insert the regression is larger on the i5 NUC.

Tuesday, May 30, 2017

Sysbench, in-memory & Intel NUC

Update - the regression isn't as bad as I have been reporting. Read this post to understand why.

This continues my series on low-concurrency performance regressions. Here I share results for in-memory sysbench using my home Intel NUC servers. The results were first shared in bug 86215 and the numbers for the charts in this post are here.

tl;dr

For the i5 NUC most of the regression is from MySQL 5.6 to 5.7
For the i3 NUC MySQL 5.7.17 matches 5.6.35 in many tests and does much better on range scans. Alas I have yet to debug this. But MySQL 8.0.1 is much worse than 5.6.35. At first I only had results for the i3 NUC and assumed the regression was from 5.7 to 8 but now I think that most of the regression comes from MySQL 5.6 to 5.7. Soon I will repeat these tests on servers I use at work that have a better CPU than what I get in the i5 NUC.
For long range scans there is more overhead with the default charset/collation in MySQL 8.0.1 and switching back to latin1/latin1_swedish_ci improves QPS.

Configuration

I tested MySQL using upstream 5.6.35, 5.7.17 and 8.0.1. For 8.0.1 I repeated the test: first with the default charset/collation and then with latin1/latin1_swedish_ci.

My servers are described here. The charts below refer to the servers as i3 NUC and i5 NUC. Compared to the i3 NUC, the i5 NUC is newer, has a CPU that is 2X faster, 2X more RAM and storage that is more than 2X faster.

My usage of sysbench, including the tests run, and the my.cnf used for each database engine is described here. The my.cnf files I share were for the i3 NUC. For the i5 NUC the InnoDB buffer pool and IO capacity options were increased using these values. I use the same server for mysqld and the sysbench clients. The binlog is enabled but sync-on-commit is disabled. Sysbench is run with 4 tables and 1M rows per table. The database fits in the InnoDB buffer pool.

Results

I first ran these tests on the i3 NUC and assumed that the regression was mostly from MySQL 5.6 to MySQL 8. Then I acquired the i5 NUC servers and repeated tests and it looks like more of the regression is from MySQL 5.6 to 5.7. Regardless, I hope we can make this better.

The results are in the order in which the tests are run. For several tests MySQL 5.7.17 does better on the i3 NUC than the i5 NUC, relative to MySQL 8. That is most apparent on the tests that do longer range scans. I thought that was a mistake but the result is repeatable. From past tests there appear to be improvements to range scan performance for InnoDB in MySQL 5.7. Perhaps this is an artifact of that improvement.

The results for update-index are typical. The regression is worse on the i5 NUC.

The regression for update-nonindex is worse than for update-index and worse on the i5 NUC.

The regression for delete is worse on the i5 NUC.

The regression for write-only is worse on the i5 NUC.

For read-write with a 100 row range scan MySQL 5.7.17 does better on the i3 NUC. I didn't debug that. Otherwise the regressions are similar between the i3 NUC and i5 NUC. Switching the charset for MySQL 8.0.1 from utf8mb4 to latin1 has a small impact.

For read-write with a 10,000 row range scan MySQL 5.7.17 does better on the i3 NUC. I didn't debug that. Otherwise the regressions are similar between the i3 NUC and i5 NUC. Switching the charset for MySQL 8.0.1 from utf8mb4 to latin1 has a big impact.

For the read-only tests using latin1/latin1_swedish_ci improves QPS at 1000 and 10,000 row range scans. MySQL 5.7.17 does great on the i3 NUC but has regressions on the i5 NUC. MySQL 8.0.1 has regressions on the i3 and i5 NUC.

For point-query MySQL 5.7.17 does better on the i3 NUC. Otherwise the regressions from MySQL 5.6 to newer releases are similar.

For insert MySQL 5.7.17 does better on the i3 NUC. Otherwise the regressions from MySQL 5.6 to newer releases are similar.

Short guide on using performance_schema for user & table stats

It took me too long to figure this out while reading the chapter on PS in the MySQL manual. Hopefully this saves me time the next time I need to figure it out. I think it gives me the equivalent of the data I get from IS.user_statistics and IS.table_statistics when using FB MySQL. From a few tests I ran the overhead from the PS was small, maybe less than 5%, while collecting this data.

Add performance_schema=1 to my.cnf
For table stats: select * from performance_schema.table_io_waits_summary_by_table
For user stats: select * from performance_schema.events_statements_summary_by_account_by_event_name

Update - great advice from Mark Leith for MySQL 5.7 and newer
For table stats see the docs and run: SELECT * FROM sys.schema_table_statistics

For user stats see the docs and run: SELECT * FROM sys.user_summary

Monday, May 29, 2017

The history of low-concurrency performance regressions in MySQL 5.6, 5.7 and 5.8

Update - the regression isn't as bad as I have been reporting. Read this post to understand why.

What is a reasonable goal for performance regressions between major releases of MySQL (5.6, 5.7, 5.8)? Some regressions are expected because more features means longer code paths. But what if MySQL 5.7 and 5.8 only get 2/3 of the QPS versus MySQL 5.6 at low concurrency?

The low-concurrency performance regressions from MySQL 5.6 to 5.7 continue in MySQL 8. That isn't a surprise for two reasons. First, more features usually make a DBMS slower. Second, all of the performance marketing for new MySQL releases focuses on high-concurrency workloads. One example of the regression occurs for in-memory sysbench where MySQL 5.7 and MySQL 8 get 60% to 70% of the QPS compared to 5.6, which is the same as writing that 5.6 gets 1.43X to 1.67X the QPS of MySQL 5.7 and MySQL 8.
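The conversion between the two ways of stating the regression is a reciprocal. A minimal sketch of the arithmetic follows; the function name is only for illustration.

```python
def baseline_multiple(fraction_of_baseline: float) -> float:
    """Given the fraction of the old release's QPS that a new release
    achieves, return the old release's QPS as a multiple of the new one."""
    if fraction_of_baseline <= 0:
        raise ValueError("fraction must be positive")
    return 1.0 / fraction_of_baseline


# 70% and 60% are the bounds quoted in the text for in-memory sysbench.
for fraction in (0.70, 0.60):
    multiple = baseline_multiple(fraction)
    print(f"new release at {fraction:.0%} of 5.6 -> 5.6 gets {multiple:.2f}X the QPS")
```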

A deployment that I care about uses MySQL 5.6 and we want to use MySQL 8. Alas, we need to reduce the performance regressions for that to happen. I am excited that the community has another chance to make MySQL better because I don't think this will be fixed without us. I have been writing about this problem since 2013. I am also happy to note that the performance schema isn't the problem. I rediscovered this problem when using MySQL 8 on my home test servers and filed bug 86215. While I have published results showing the regression from earlier releases to MySQL 5.6 -- at this point I just want to get back to the low-concurrency QPS we get from MySQL 5.6. I have good memories from MySQL 4.0, 5.0 and 5.1, but I can do without the excitement of running those releases at web-scale.

I first wrote about this problem in 2013 when comparing MySQL 5.6 to previous releases and filed bugs 68825 and 69236. My favorite low-concurrency performance regression will always be bug 29921. I still remember debugging that over a weekend. Parsers are hard to profile when code is generated and everything is in one function. Fortunately I was able to use rdtsc.

Bugs for low-concurrency performance regressions

Bugs filed by me:

68825 - April 2013 -- performance regressions for single-threaded workloads
69236 - May 2013 -- performance regressions for single-threaded workloads, part 2
74325 - October 2014 (fixed) -- updates to indexed column much slower in 5.7.5
74342 - October 2014 -- InnoDB disk reads at 1 thread much slower in 5.7.5
86215 - May 2017 -- MySQL is much slower in 5.7 than 5.6

Bugs not filed by me:

71130 - December 2013 -- 5.6 SQL thread is much slower than 5.1/5.5
78176 - August 2015 -- 5.6.20 is almost twice as slow as 5.0.96

Finally, there are two interesting bugs for high-concurrency regressions in InnoDB that I want to revisit: bug 74280 (open) and bug 74283 (fixed). I wrote about this in a post on range scan performance.

Content for low-concurrency performance regressions

Reports from me

Reports from others:

March 2013 - Why MySQL Performance at Low Concurrency is Important
December 2013 - Single thread performance regression in 5.6 - Replication

Tuesday, May 23, 2017

MyRocks in MariaDB 10.2.6

MariaDB 10.2.6 is GA. It includes MyRocks, which isn't GA yet but is much easier to use thanks to the hard work of the MariaDB team. This is a big deal if you want performant efficiency or efficient performance. For some workloads it provides better performance than InnoDB. For most (approximately all) workloads it uses much less space and writes much more efficiently. We continue to make RocksDB and MyRocks better and there are many interesting projects in progress. It runs in production, so Domas has begun finding even more things to improve. My contribution is performance evaluation and while I have been busy running tests I have fallen behind on sharing the results. I will do better over the next few months.

Saturday, May 20, 2017

Small servers for database performance tests

I use Intel NUC servers at home to test open source databases. I like them because they are small, quiet and don't use much power. For about 2 years I have been using NUC5i3ryh servers with a 5th gen core i3 CPU, 8gb of RAM, 2.5" SATA disk for the OS and 120gb Samsung 850 EVO m.2 for the database. I used this so much that I replaced the SSD devices last year after one reached the endurance limit.

I am upgrading to a new setup using NUC7i5bnh. This has a 7th gen core i5, 16gb of RAM, 2.5" SATA SSD (Samsung 850 EVO) for the OS and m.2 SSD (Samsung 960 EVO) for the database. It has twice the RAM, twice the CPU and more than twice the IOPs of my old setup. The old and new setups use Ubuntu 16.04 server.

First performance comparison is "make -j4" for MySQL 8.0.1 - 1307 seconds for old NUC, 684 seconds for new NUC.

BIOS

I disabled turbo mode in the BIOS on the NUC7i5bnh. There is no turbo mode on the NUC5i3ryh. This was done to avoid frequent variance in performance -- CPU goes into turbo mode, then gets too hot and disables it, repeat. Perhaps if I had these devices in a cold room this would not be a problem.

Storage

I use two storage devices: one handles the OS install while the other is used for perf tests. The perf test SSD wears out and gets replaced, and I don't want to lose my OS install when that happens. The perf test device is mounted at /data and the entry in /etc/fstab is listed below. I use noauto to avoid problems when the device has failed (because it is worn out) and mount fails during boot. That means the device must be mounted manually after each reboot, and sometimes I forget to do that. I prefer XFS to ext4 for database workloads.

UUID=... /data xfs noatime,nodiratime,discard,noauto 0 1
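Because noauto leaves /data unmounted after a reboot, a test driver could refuse to start until the device is mounted. This is a minimal sketch of such a guard, not something from the original setup. The function names are hypothetical, and it only relies on os.path.ismount from the Python standard library.

```python
import os


def is_mounted(path: str, ismount=os.path.ismount) -> bool:
    """Return True when path is a mount point.

    The ismount parameter exists only so the check can be exercised
    without a real device.
    """
    return bool(ismount(path))


def mount_hint(path: str = "/data", ismount=os.path.ismount) -> str:
    """Return an empty string when path is mounted, else a reminder."""
    if is_mounted(path, ismount):
        return ""
    # With an /etc/fstab entry present, "mount <dir>" is enough.
    return f"{path} is not mounted; run: sudo mount {path}"
```

Calling mount_hint() at the start of a benchmark run and stopping when it returns a non-empty string avoids loading a database onto the OS device by accident.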

This post explains how to get device endurance stats from the SSD.

Debugging

Ubuntu default security options get in the way of PMP. This fixes that:

sudo sh -c "echo -1 > /proc/sys/kernel/perf_event_paranoid"
sudo sh -c "echo 0 > /proc/sys/kernel/yama/ptrace_scope"
sudo sh -c "echo 0 > /proc/sys/kernel/kptr_restrict"

Huge Pages

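For some engines I disable transparent huge pages. This script takes the desired setting (for example never or always) as its argument, must be run as root, and makes that easy: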

echo $1 > /sys/kernel/mm/transparent_hugepage/defrag
echo $1 > /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/defrag
cat /sys/kernel/mm/transparent_hugepage/enabled
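The two cat commands print every allowed value with the active one in square brackets, for example "always madvise [never]". A minimal sketch of pulling out the active value, in case a test script wants to record it with the results; the function names are hypothetical.

```python
import re

THP_DIR = "/sys/kernel/mm/transparent_hugepage"


def active_thp_setting(text: str) -> str:
    """Return the bracketed (active) value from a THP sysfs file's content."""
    match = re.search(r"\[([^\]]+)\]", text)
    if match is None:
        raise ValueError(f"no active setting found in {text!r}")
    return match.group(1)


def read_thp_setting(name: str, base: str = THP_DIR) -> str:
    """Read base/name (name is 'enabled' or 'defrag') and return the active value."""
    with open(f"{base}/{name}") as handle:
        return active_thp_setting(handle.read())
```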

Another option is to add this to /etc/default/grub and then run update-grub:

GRUB_CMDLINE_LINUX_DEFAULT="transparent_hugepage=never"

NUC7 networking

The install was easy with one exception. The old setup used wired networking. This time I enabled wireless after the install finished and that took a few hours to figure out. The important steps are:

Install the HWE enabled kernel to get the drivers that support Intel wireless HW in this server. I didn't do this at first and dmesg | grep iwl showed nothing even though the firmware for that Intel HW was installed. With the HWE kernel I see this in dmesg output: Detected Intel(R) Dual Band Wireless AC 8265. The HWE kernel can be selected at the GRUB menu during the install. I assume this step won't be needed once the NUC7i5bnh HW becomes less new.
After the install finishes, install wireless-tools via sudo apt-get install wireless-tools. Without this ifup -v wlan0 failed.
Edit /etc/network/interfaces. This assumes you are using an unsecured network. See below.

I changed /etc/network/interfaces to enable wireless and disable wired using the following contents. After editing the file I tested my changes via sudo ifup -v wlp58s0. If you get it wrong this will take a few minutes to fail. Note that $name is the name for your wireless network and that this works when you are running an open network. Below is the content for /etc/network/interfaces.

# The loopback network interface
auto lo
iface lo inet loopback

# Wired networking is not started automatically
auto eno1
iface eno1 inet manual
#iface eno1 inet dhcp

# Wireless networking is started automatically
auto wlp58s0
iface wlp58s0 inet dhcp
wireless-essid $name
wireless-mode Managed

Wireless is then stopped and started via sudo ifdown wlp58s0; sudo ifup -v wlp58s0. There has been a bug that causes wireless to stop working. Fixing that requires me to connect directly via a console and restart it. Maybe I could have added a cron job to check for that, but I never got around to it.

NUC5 networking

This is similar to NUC7 networking above but the interface name is wlan0 rather than wlp58s0 and wireless is stopped and started via sudo ifdown wlan0; sudo ifup -v wlan0. Below are the contents for /etc/network/interfaces:

auto lo
iface lo inet loopback
# The primary network interface
auto eth0
iface eth0 inet manual
#iface eth0 inet dhcp
# wireless
auto wlan0
iface wlan0 inet dhcp
wireless-essid $name
wireless-mode Managed

Friday, May 12, 2017

Database IO performance tests

I work with InnoDB and RocksDB storage engines for MySQL and do performance tests to compare both storage engines and storage devices. I have expertise in MySQL and storage engines but not so much in storage devices, so I don't mind running MySQL. Other people have expertise in the layers under MySQL (Linux, storage) and might mind running MySQL. Fortunately, we have benchmark clients for them.

Obviously there is fio and it is my first choice. The fio team even added support for coordinated omission when I asked for it. Alas it can't generate all of the IO patterns that I need.

It would be great to share a link to io.go here were Domas to publish that code.

I wrote innosim many years ago to simulate InnoDB IO patterns. Docs are here and a helper script to run a sequence of tests is here.

Finally there is db_bench for RocksDB. One challenge with RocksDB is tuning, so I have a script to help with that and use good options to run a sequence of tests in a special pattern. Well, it works as long as I keep the script current and I just updated it today. It runs these benchmarks in order:

fillseq - Put N key-value pairs in key order
overwrite - Put N key-value pairs in random order. Queries done after a key-order load avoid a few sources of overhead that usually should not be avoided, so this shuffles the database.
overwrite - Put key-value pairs in random order. Runs for K seconds.
updaterandom - Do read-modify-write in random order. Runs for K seconds.
readwhilewriting - 1 rate-limited writer and T threads doing Get.
seekrandomwhilewriting - 1 rate-limited writer and T threads doing range scans.
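The sequence above can be sketched as a list of db_bench command lines. This is only an illustration and not the script mentioned in the post, which sets many more tuning options. The binary path and function name are hypothetical, and passing --num on every step to keep the key space fixed is an assumption. The flags used (--benchmarks, --num, --duration, --threads, --use_existing_db, --benchmark_write_rate_limit) are standard db_bench options.

```python
def db_bench_commands(n_keys, k_seconds, t_threads, writer_bytes_per_sec,
                      binary="./db_bench"):
    """Return one command line per step, in the order listed above."""
    keep = {"num": n_keys, "use_existing_db": 1}  # reuse the loaded database
    mixed = {**keep, "threads": t_threads,
             "benchmark_write_rate_limit": writer_bytes_per_sec}
    steps = [
        ("fillseq", {"num": n_keys, "use_existing_db": 0}),   # load in key order
        ("overwrite", keep),                                  # shuffle the database
        ("overwrite", {**keep, "duration": k_seconds}),       # timed random puts
        ("updaterandom", {**keep, "duration": k_seconds}),    # timed read-modify-write
        ("readwhilewriting", mixed),         # T readers plus a rate-limited writer
        ("seekrandomwhilewriting", mixed),   # T scanners plus a rate-limited writer
    ]
    commands = []
    for name, flags in steps:
        args = " ".join(f"--{key}={value}" for key, value in flags.items())
        commands.append(f"{binary} --benchmarks={name} {args}")
    return commands


for command in db_bench_commands(1_000_000, 600, 8, 2 * 1024 * 1024):
    print(command)
```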

Thursday, May 11, 2017

InnoDB, MyRocks and TokuDB on the insert benchmark

This post shows some of the improvements we recently made to RocksDB to reduce response time variance for write-heavy workloads. This work helps RocksDB, MyRocks and MongoRocks.
This also extends the result I shared for the impact of the InnoDB redo log size on insert benchmark load throughput. Here I add results for MyRocks and TokuDB. In case you haven't heard, the goal for MyRocks is to provide similar performance to InnoDB with much better space and write efficiency. We have real workloads where InnoDB uses 4X more space than MyRocks.

tl;dr

Sorry for overusing the word much.
Write response time variance is much better in recent RocksDB builds
All engines have room for improvement to reduce write response time variance
The MyRocks advantage increases when moving from faster to slower storage
MyRocks insert rates are not slowed when compression is enabled.
While the write rates I demonstrate here for all engines are impressive, the rates might not be sustainable if I expect the SSD devices to last for more than one year. Of course, better write-efficiency from MyRocks helps a lot with endurance.

Making RocksDB better

My summary of the recent improvements to RocksDB is too vague. I hope the authors write real posts soon to explain their contributions. The work includes:

perform L0 to L0 compaction to reduce the number of L0 files when the L1 is busy
add compaction_pri=kMinOverlappingRatio to be more clever about the key on which SST files start (or stop) to reduce write-amplification during compaction
increase delayed_write_rate from 2MB to 16MB/second to smooth the impact from throttling

Configuration

While I usually run the insert benchmark in 3 steps (insert only, insert & queries, insert & queries) I only share results for the insert only step here. Be wary about drawing too many conclusions from an insert-only workload. I will soon share results for the insert & queries steps.

What I wrote in the Configuration section of my previous post is still valid. I tested InnoDB, MyRocks and TokuDB and the following names describe the engine and configuration:

Rocks.def10.kmin.mar - MyRocks from March 17 using the def10 my.cnf without compression and with kMinOverlappingRatio
Rocks.def10.nokmin.mar - MyRocks from March 17 using the def10 my.cnf without compression and without kMinOverlappingRatio.
Rocks.def10.kmin.apr - MyRocks from April 14 using the def10 my.cnf without compression and with kMinOverlappingRatio.
Rocks.def10.nokmin.apr - MyRocks from April 14 using the def10 my.cnf without compression and without kMinOverlappingRatio.
Rocks.def10.kmin.apr.zstd - MyRocks from April 14 using the def10 my.cnf with zstd compression and with kMinOverlappingRatio.
Rocks.def10.nokmin.apr.zstd - MyRocks from April 14 using the def10 my.cnf with zstd compression and without kMinOverlappingRatio.
InnoDB-5.7.17.redoX - InnoDB from MySQL 5.7.17 with an Xgb redo log.
InnoDB-5.6.35.redoX - InnoDB from MySQL 5.6.35 with an Xgb redo log.
TokuDB-5.7.17-none - TokuDB without compression from Percona Server for MySQL 5.7.17
TokuDB-5.7.17-zlib - TokuDB with zlib compression from Percona Server for MySQL 5.7.17

I compiled MySQL from source for MyRocks (FB MySQL) and InnoDB (upstream MySQL). I used Percona Server for TokuDB. The TokuDB block cache was 150gb for in-memory workloads and 10gb for IO-bound workloads and the my.cnf is here. The InnoDB buffer pool was 180gb for in-memory workloads and 35gb for IO-bound workloads. The my.cnf for InnoDB is here for 5.6.35 and here for 5.7.17. The base my.cnf for MyRocks is here. It needs to be edited (grep for TODO) because the MyRocks my.cnf files listed above differ in a few ways:

The value for RocksDB block_cache_size (150gb for in-memory, 10gb for IO-bound).
Whether compaction_pri=kMinOverlappingRatio was set
Whether compression was enabled. With compression enabled I used none for L0, L1 & L2, then LZ4 starting at L3 and finally bottommost_compression=kZSTDCompression.
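The per-level compression choice can be written as a RocksDB column family option string. The sketch below builds one under the assumption that num_levels is 7 (the RocksDB default), which the post does not state, and the function name is hypothetical. The option names compression_per_level and bottommost_compression and the k*Compression values are real RocksDB identifiers.

```python
def compression_options(num_levels: int = 7, first_lz4_level: int = 3) -> str:
    """Build an option string: no compression below first_lz4_level,
    LZ4 from there on, and zstd for the bottommost level."""
    per_level = [
        "kNoCompression" if level < first_lz4_level else "kLZ4Compression"
        for level in range(num_levels)
    ]
    return ("compression_per_level=" + ":".join(per_level)
            + ";bottommost_compression=kZSTDCompression")


print(compression_options())
```

bottommost_compression overrides the per-level setting for whichever level currently holds the oldest data, which is why the last entry in the list can stay as LZ4.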

In-memory load

The database fits in the database cache for the in-memory load. There should be no reads to storage and many writes to storage.

For fast SSD, InnoDB in MySQL 5.7.17 has the best throughput and is much faster than in 5.6.35. A larger InnoDB redo log improves throughput. TokuDB is faster than MyRocks, but that changes when the workload switches from insert-only to inserts and queries.

Results are different for the server with slow SSD. InnoDB depends on fast random writes and slow SSD provides less of that than fast SSD. Here MyRocks is faster than InnoDB in 5.7.17 except when a large redo log (32gb) is used. It is faster than InnoDB in 5.6.35 in all cases. I did not test TokuDB on this hardware.

From the In-memory load metrics section below, InnoDB writes more to storage per insert (see wKB/i), uses more CPU per insert (see Mcpu/i) and uses more space than MyRocks (see size) with or without compression. A larger redo log reduces the InnoDB write rate per insert by more than half. The kMinOverlappingRatio option in MyRocks greatly reduces the write rate to storage.

Compared to MyRocks, TokuDB uses more than 2X the disk space in the database directory (see size) and uses more CPU/insert (see Mcpu/i). TokuDB might be trading more space amplification to get less write amplification and faster inserts as explained by the RUM Conjecture. MyRocks doesn't make that trade with leveled compaction. It does with universal compaction, which I did not test.

In-memory load metrics

See the In-memory load metrics section in my previous post for the legend for the tables below.

Fast SSD

ips.av ips.99 wKB/i Mcpu/i size rss wMB/s cpu engine
120250 4386 0.78 157 61 2.0 186.1 18.9 Rocks.def10.kmin.mar
132979 7786 0.78 156 63 22.0 206.7 20.8 Rocks.def10.kmin.apr
232666 9925 5.31 176 97 104.4 1235.1 41.0 InnoDB-5.7.17.redo4
295683 18009 1.66 154 97 104.1 490.6 45.5 InnoDB-5.7.17.redo32
92404 3504 5.11 201 97 106.2 472.5 18.6 InnoDB-5.6.35.redo4
171116 10165 1.74 175 97 106.2 297.9 30.0 InnoDB-5.6.35.redo32
207555 10852 0.78 212 146 176.3 162.3 44.0 TokuDB-5.7.17-none

Slow SSD

ips.av ips.99 wKB/i Mcpu/i size rss wMB/s cpu engine
115607 4315 0.72 131 61 1.9 163.2 15.2 Rocks.def10.kmin.mar
123793 5834 1.16 168 64 1.8 285.5 20.8 Rocks.def10.nokmin.mar
130548 7724 0.72 130 61 21.8 184.2 17.0 Rocks.def10.kmin.apr
128833 7555 1.09 154 69 20.4 280.9 19.8 Rocks.def10.nokmin.apr
68672 3277 4.95 165 97 104.4 339.9 11.3 InnoDB-5.7.17.redo4
177179 7231 0.85 130 97 104.3 151.3 23.0 InnoDB-5.7.17.redo32
38058 1691 5.01 176 97 106.3 190.6 6.7 InnoDB-5.6.35.redo4
71317 2914 1.26 145 97 106.3 89.5 10.3 InnoDB-5.6.35.redo32

IO-bound load

Things are different for the IO-bound load compared to the in-memory load. MyRocks is strictly faster than InnoDB for the IO-bound load and the redo log size doesn't make a big difference for InnoDB. InnoDB in 5.7.17 does better than in 5.6.35. MyRocks insert rates don't drop when compression is enabled, while they do for TokuDB.

Secondary index maintenance for InnoDB is read-modify-write. Whether the reads are done as part of the insert or deferred to the change buffer, eventually those random reads must get done and they use IO capacity that is then not available for random writes. Statements can also stall on page writeback when the buffer pool is full and pages at the tail of the LRU are dirty (insert joke about single page flush here).

Results for fast SSD and slow SSD are similar but the difference between MyRocks and InnoDB is larger on slow SSD because InnoDB depends more on random IO performance. The slow SSD results also show the benefit from using kMinOverlappingRatio when compression is enabled. The IO-bound load metrics section shows the benefit from kMinOverlappingRatio -- wKB/i and rKB/i are about 2/3 the rate compared to MyRocks without that option.

TokuDB has the fastest insert rate but it also uses much more space (see size) and CPU (see Mcpu/i) compared to MyRocks. I suspect it is trading more space amplification to get less write amplification as explained by the RUM Conjecture.

Compared to MyRocks, InnoDB writes more to storage per insert (see wKB/i), uses more CPU per insert (see Mcpu/i) and uses more space (see size) even compared to uncompressed MyRocks.

IO-bound load metrics

See the IO-bound load metrics section in my previous post for the legend for the tables below.

Fast SSD

ips.av ips.99 r/i rKB/i wKB/i Mcpu/i size rss r/s rMB/s wMB/s cpu engine
126727 4671 0 0.23 2.07 175 226 4.9 266 29.6 523.2 22.2 Rocks.def10.kmin.mar
129879 7638 0 0.34 1.98 174 225 12.8 383 43.6 513.3 22.7 Rocks.def10.kmin.apr
128816 7687 0 0.13 1.64 209 101 14.1 147 16.6 421.0 27.0 Rocks.def10.kmin.apr.zstd
61711 3353 0.09 1.48 14.65 562 402 4.0 5698 91.2 904.3 34.7 InnoDB-5.7.17.redo4
63959 3419 0.09 1.43 10.11 535 402 4.0 5732 91.7 646.6 34.2 InnoDB-5.7.17.redo32
45874 1790 0.11 1.71 11.25 585 402 43.2 4915 78.6 516.0 26.8 InnoDB-5.6.35.redo4
59312 2473 0.10 1.56 8.36 593 403 43.2 5776 92.4 495.7 35.2 InnoDB-5.6.35.redo32
156250 4150 0.01 0.82 2.97 340 400 11.2 1854 128.1 464.5 53.1 TokuDB-5.7.17-none
123259 1535 0.01 0.25 1.00 540 143 11.2 854 30.4 123.5 66.6 TokuDB-5.7.17-zlib

Slow SSD

ips.av ips.99 r/i rKB/i wKB/i Mcpu/i size rss r/s rMB/s wMB/s cpu engine
122963 5280 0.01 1.46 1.55 192 232 12.8 1443 179.6 380.0 23.7 Rocks.def10.nokmin.apr
127316 7283 0 0.32 0.91 148 225 12.9 334 40.8 230.1 18.8 Rocks.def10.kmin.apr
125842 7420 0 0.11 0.73 177 101 14.1 112 13.4 182.7 22.3 Rocks.def10.kmin.apr.zstd
82771 1980 0.01 0.66 1.11 259 114 13.7 445 54.5 182.7 21.4 Rocks.def10.nokmin.apr.zstd
34101 1975 0.10 1.60 10.91 402 400 39.7 3412 54.6 372.0 13.7 InnoDB-5.7.17.redo4
49244 1750 0.10 1.56 6.39 454 403 40.0 4803 76.9 314.5 22.4 InnoDB-5.7.17.redo32
17654 605 0.18 2.79 12.22 448 398 43.2 3083 49.3 215.7 7.9 InnoDB-5.6.35.redo4
30734 1276 0.09 1.50 5.01 407 403 43.2 2882 46.1 153.9 12.5 InnoDB-5.6.35.redo32

In-memory throughput over time

Fast SSD

InnoDB variance and throughput are much worse with a 4gb vs 32gb redo log. MyRocks variance has improved significantly from the March 17 to the April 14 build thanks to support for L0 to L0 compaction. TokuDB has a few stalls, similar to the March 17 build of MyRocks.

Slow SSD

InnoDB suffers from too much variance regardless of redo log size. It needs random write IOPs and the slow SSD has less of that than the fast SSD. MyRocks with the April 14 build is much better than the March 17 build because of the change to perform L0 to L0 compaction to reduce the chance of stalls.

IO-bound throughput over time

Fast SSD

This is similar to the in-memory results. InnoDB variance and throughput are much worse with a 4gb vs 32gb redo log. MyRocks variance has improved significantly from the March 17 to the April 14 build thanks to support for L0 to L0 compaction. TokuDB has a few stalls.

Slow SSD

InnoDB has too much variance. Redo log size isn't significant, but 5.7.17 is better than 5.6.35. For MyRocks the April 14 build is better than March 17 and the kMinOverlappingRatio feature helps.

Tuesday, May 9, 2017

Benchmark(et)ing with InnoDB redo log size

What is a typical InnoDB redo log size? By size I mean the product of innodb_log_file_size and innodb_log_files_in_group. A typical setup for me is 4gb via 2gb/file and 2 files.
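As a quick check of that definition, here is a minimal sketch that multiplies the two settings. The function name is only for illustration, and the values are the 2gb/file and 2 files setup described above.

```python
GB = 1024 ** 3


def redo_log_bytes(innodb_log_file_size: int, innodb_log_files_in_group: int) -> int:
    """Total redo log size is the per-file size times the number of files."""
    return innodb_log_file_size * innodb_log_files_in_group


total = redo_log_bytes(2 * GB, 2)
print(f"redo log size: {total / GB:.0f}gb")
```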

Using a large InnoDB redo log can help performance but it comes at a cost. I have been using a 4gb redo log in my tests because that is what we frequently use in production. Dimitri mentioned using a 32gb redo log to show great throughput for InnoDB in recent releases. Here I share results from the insert benchmark with a 4gb, 8gb, 16gb and 32gb redo log.

tl;dr - conclusions specific to my test

A larger redo log improves throughput
A larger redo log helps more with slower storage than with faster storage because page writeback is more of a bottleneck with slower storage and a larger redo log reduces writeback.
A larger redo log can help more when the working set is cached because there are no stalls from storage reads and storage writes are more likely to be a bottleneck.
InnoDB in MySQL 5.7.17 is much faster than 5.6.35 in all cases except IO-bound + fast SSD

A larger redo log means that less checkpoint IO will be done and less IO is good. But there are costs with a larger redo log. It uses more storage and you might not want to spend 32gb of SSD for the redo log. It makes crash recovery slower. With buffered IO you can have 32gb of redo log competing to stay in the OS page cache and waste memory. Finally when using a disk-array, the read-before-write that occurs with buffered IO means that redo log writes may be wasting too much IO.

Configuration

I used my insert benchmark client with two test servers -- one with slow SSD that does ~10k IOPs and one with fast SSD that does more than 100k IOPs. Both servers have 24 cores, 48 HW-threads and 256gb of RAM. Tests were repeated for an in-memory workload (insert 500m rows) and an IO-bound workload (limit server RAM to 50gb and insert 2b rows). The test used 16 tables, 16 clients and each client inserted rows to a different table. Typical command lines were:
# insert 500m rows
bash iq.sh innodb "" /path/to/bin/mysql /path/to/datadir md2 \
1 16 no no no 0 no 500000000

# insert 2b rows
bash iq.sh innodb "" /path/to/bin/mysql /path/to/datadir md2 \
1 16 no no no 0 no 2000000000

Tests were run with upstream MySQL 5.6.35 and 5.7.17. I used these my.cnf files for 5.7.17 and 5.6.35. I compiled MySQL from source and used jemalloc. The benchmark client ran on the same host as mysqld to remove variance from the network. The binlog is enabled but sync on commit is disabled for both InnoDB and the binlog to support faster insert rates.

In-memory load

For the in-memory load the redo log size has more impact with slow SSD than with fast SSD. The In-memory load metrics section below shows the benefit from a larger redo log: going from a 4gb to a 32gb redo log reduces the amount of data written to storage per insert by more than half (see the wKB/i column). For fast SSD with 5.7.17, InnoDB writes 5.31 KB/insert to storage with a 4gb redo log versus 1.66 KB/insert with a 32gb redo log. Similar reductions occur for slow SSD and for 5.6.35. Based on the reduction in wKB/i, the larger redo log helps slow SSD more than fast SSD.

In-memory load metrics

Legend:

ips.av, ips.99 - average and p99 insert rates. The 99th percentile is computed from the per-interval rates where the interval is ~10 seconds.
wKB/i, rKB/i - KB written to and KB read from storage per inserted row. Measured by iostat.
r/i - storage reads per inserted row. Measured by iostat.
Mcpu/i - relative CPU overhead per inserted row. Measured by vmstat us and sy columns.
size - database size in GB at test end
rss - mysqld RSS in GB at test end
wMB/s, rMB/s - average MB/second written to and read from storage. Measured by iostat.
r/s - average storage reads/second. Measured by iostat.
cpu - average CPU utilization. Measured by vmstat us and sy columns.
engine - database engine. none means that no compression was used. 150g and 35g are the size of the InnoDB buffer pool. redoX is the size of the InnoDB redo log.
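
The per-insert columns appear to be the per-second columns divided by ips.av. That relationship is inferred from the published numbers rather than taken from the benchmark scripts: for the rows checked, the values match wKB/i = 1000 * wMB/s / ips.av and Mcpu/i = 1,000,000 * cpu / ips.av after rounding. A sketch under those assumptions, with a hypothetical function name:

```python
def per_insert_metrics(ips_av, wmb_per_s, cpu_pct, r_per_s=None, rmb_per_s=None):
    """Derive per-insert columns from per-second averages.

    Assumes 1 MB = 1000 KB and Mcpu/i = cpu * 1e6 / ips.av. Both are
    inferred from the tables, not taken from the benchmark scripts.
    """
    out = {
        "wKB/i": round(wmb_per_s * 1000 / ips_av, 2),
        "Mcpu/i": round(cpu_pct * 1_000_000 / ips_av),
    }
    if r_per_s is not None:
        out["r/i"] = round(r_per_s / ips_av, 2)
    if rmb_per_s is not None:
        out["rKB/i"] = round(rmb_per_s * 1000 / ips_av, 2)
    return out

# inno5717.none.150g.redo4 on fast SSD (in-memory load)
print(per_insert_metrics(232666, 1235.1, 41.0))
# inno5717.none.35g.redo2 on fast SSD (IO-bound load)
print(per_insert_metrics(61711, 904.3, 34.7, r_per_s=5698, rmb_per_s=91.2))
```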

Fast SSD

ips.av ips.99 wKB/i Mcpu/i size rss wMB/s cpu engine
232666 9925 5.31 176 97 104.4 1235.1 41.0 inno5717.none.150g.redo4
286041 16559 3.46 162 97 104.1 989.0 46.4 inno5717.none.150g.redo8
292740 17587 2.41 157 97 104.3 706.5 46.0 inno5717.none.150g.redo16
295683 18009 1.66 154 97 104.1 490.6 45.5 inno5717.none.150g.redo32
92404 3504 5.11 201 97 106.2 472.5 18.6 inno5635.none.150g.redo4
134844 5987 2.97 192 97 106.1 400.8 25.8 inno5635.none.150g.redo8
169319 9247 2.02 176 97 106.1 342.5 29.8 inno5635.none.150g.redo16
171116 10165 1.74 175 97 106.2 297.9 30.0 inno5635.none.150g.redo32

Slow SSD

ips.av ips.99 wKB/i Mcpu/i size rss wMB/s cpu engine
68672 3277 4.95 165 97 104.4 339.9 11.3 inno5717.none.150g.redo2
95896 4518 2.67 154 97 104.4 256.4 14.7 inno5717.none.150g.redo4
131787 5868 1.47 135 97 104.4 194.2 17.8 inno5717.none.150g.redo8
177179 7231 0.85 130 97 104.3 151.3 23.0 inno5717.none.150g.redo16
38058 1691 5.01 176 97 106.3 190.6 6.7 inno5635.none.150g.redo2
52493 2537 2.74 156 97 106.3 144.0 8.2 inno5635.none.150g.redo4
61043 2660 1.64 151 97 106.3 100.2 9.2 inno5635.none.150g.redo8
71317 2914 1.26 145 97 106.3 89.5 10.3 inno5635.none.150g.redo16
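
To make the comparison between fast and slow SSD concrete, this sketch computes the percent reduction in wKB/i between the first and last redo log size listed for each server and release in the two tables above. The values are copied from those tables and the function name is illustrative.

```python
# wKB/i for the smallest and largest redo log in each in-memory table above
wkb_per_insert = {
    ("fast SSD", "5.7.17"): (5.31, 1.66),
    ("fast SSD", "5.6.35"): (5.11, 1.74),
    ("slow SSD", "5.7.17"): (4.95, 0.85),
    ("slow SSD", "5.6.35"): (5.01, 1.26),
}

def reduction_pct(smallest_log, largest_log):
    """Percent reduction in KB written to storage per insert."""
    return round(100 * (1 - largest_log / smallest_log), 1)

for (storage, version), (small, large) in wkb_per_insert.items():
    print(f"{storage} {version}: {reduction_pct(small, large)}% less written per insert")
```

With these inputs the reduction is about 69% and 66% on fast SSD versus about 83% and 75% on slow SSD, which is consistent with the larger redo log helping slow SSD more.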

IO-bound load

For the IO-bound load and fast SSD the redo log size has little impact for 5.7.17 and some impact for 5.6.35. For slow SSD the redo log size has more impact. But compared to the in-memory results above, the impact from redo log size is much smaller for the IO-bound workload. Per the IO-bound load metrics section below, the reduction in wKB/i is smaller for IO-bound than for in-memory workloads, but increasing the redo log from 4gb to 32gb still cuts the data written per insert roughly in half for slow SSD per the wKB/i column.

Per the ips.av column InnoDB with 5.7.17 isn't much faster than 5.6.35 for the fast SSD. It is still much faster for the slow SSD.

IO-bound load metrics

Fast SSD

ips.av ips.99 r/i rKB/i wKB/i Mcpu/i size rss r/s rMB/s wMB/s cpu engine
61711 3353 0.09 1.48 14.65 562 402 4.0 5698 91.2 904.3 34.7 inno5717.none.35g.redo2
62929 3471 0.09 1.47 11.39 546 402 4.0 5771 92.3 717.0 34.4 inno5717.none.35g.redo4
63593 2542 0.10 1.54 11.01 554 404 4.0 6133 98.1 700.2 35.2 inno5717.none.35g.redo8
63959 3419 0.09 1.43 10.11 535 402 4.0 5732 91.7 646.6 34.2 inno5717.none.35g.redo16
45874 1790 0.11 1.71 11.25 585 402 43.2 4915 78.6 516.0 26.8 inno5635.none.35g.redo2
58682 2500 0.10 1.55 8.57 589 403 43.2 5667 90.7 502.8 34.6 inno5635.none.35g.redo4
59179 2285 0.10 1.52 8.30 581 402 43.2 5607 89.7 491.2 34.4 inno5635.none.35g.redo8
59312 2473 0.10 1.56 8.36 593 403 43.2 5776 92.4 495.7 35.2 inno5635.none.35g.redo16

Slow SSD

ips.av ips.99 r/i rKB/i wKB/i Mcpu/i size rss r/s rMB/s wMB/s cpu engine
34101 1975 0.10 1.60 10.91 402 400 39.7 3412 54.6 372.0 13.7 inno5717.none.35g.redo2
46712 1837 0.09 1.49 7.29 433 403 39.9 4362 69.8 340.7 20.2 inno5717.none.35g.redo4
45231 1649 0.10 1.54 5.34 420 404 40.0 4355 69.7 241.5 19.0 inno5717.none.35g.redo8
49244 1750 0.10 1.56 6.39 454 403 40.0 4803 76.9 314.5 22.4 inno5717.none.35g.redo16
17654 605 0.18 2.79 12.22 448 398 43.2 3083 49.3 215.7 7.9 inno5635.none.35g.redo2
26607 863 0.12 1.86 7.56 438 402 43.2 3099 49.6 201.2 11.7 inno5635.none.35g.redo4
28069 1143 0.09 1.51 5.50 398 403 43.2 2650 42.4 154.2 11.2 inno5635.none.35g.redo8
30734 1276 0.09 1.50 5.01 407 403 43.2 2882 46.1 153.9 12.5 inno5635.none.35g.redo16
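
The comparison between 5.7.17 and 5.6.35 can be read off the ips.av column as a ratio. This sketch uses the values from the two IO-bound tables above; the dictionary layout and function name are illustrative.

```python
# ips.av from the IO-bound tables above, ordered redo2, redo4, redo8, redo16
ips_av = {
    "fast SSD": {"5.7.17": [61711, 62929, 63593, 63959],
                 "5.6.35": [45874, 58682, 59179, 59312]},
    "slow SSD": {"5.7.17": [34101, 46712, 45231, 49244],
                 "5.6.35": [17654, 26607, 28069, 30734]},
}

def speedups(storage):
    """Ratio of 5.7.17 to 5.6.35 average insert rate per redo log size."""
    new, old = ips_av[storage]["5.7.17"], ips_av[storage]["5.6.35"]
    return [round(n / o, 2) for n, o in zip(new, old)]

for storage in ips_av:
    print(storage, speedups(storage))
```

On fast SSD the ratio is about 1.07 to 1.08 for every row after the first, while on slow SSD it stays between about 1.6 and 1.9.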

In-memory throughput over time

The results above show average throughput and that hides a lot of interesting behavior. We expect throughput over time to not suffer from variance -- for both InnoDB and for MyRocks. For many of the results below there is a lot of variance (jitter).
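
One way to put numbers on jitter and on throughput dropping over time is sketched below. It is not how the results in this post were produced. The function name and the sample rates are hypothetical, and the choice of coefficient of variation and a first-half versus second-half comparison is an assumption.

```python
import statistics

def jitter_summary(rates):
    """Summarize variance in per-interval insert rates.

    cv is the coefficient of variation (population stdev / mean).
    drop_pct compares the mean of the second half of the run with
    the mean of the first half.
    """
    mean = statistics.mean(rates)
    cv = statistics.pstdev(rates) / mean
    half = len(rates) // 2
    first = statistics.mean(rates[:half])
    last = statistics.mean(rates[half:])
    return {"mean": mean, "cv": round(cv, 3),
            "drop_pct": round(100 * (1 - last / first), 1)}

# Hypothetical per-interval rates, not measurements from this test.
steady = [100_000, 101_000, 99_000, 100_500, 99_500, 100_000]
jittery = [150_000, 40_000, 160_000, 30_000, 120_000, 20_000]
print(jitter_summary(steady))
print(jitter_summary(jittery))
```

A small cv means the per-interval rate stays close to the average, and a large positive drop_pct means the second half of the run was slower than the first.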

Fast SSD

InnoDB suffers from jitter with 4gb and 8gb redo logs and the problem is much worse for 5.6.35. Another problem occurs with a 4gb redo log -- throughput drops over time. That problem repeats for an 8gb redo log with 5.6.35.

Slow SSD

With slow SSD, jitter is much worse for 5.7.17 than for 5.6.35 and the difference is more significant with the larger redo log. Throughput also drops over time for 5.7.17. This is a surprising result.

IO-bound throughput over time

Fast SSD

MySQL 5.6.35 suffers from jitter while 5.7.17 suffers from throughput dropping over time. The drop in 5.7.17 might be reasonable, though, if it occurs because the working set doesn't fit in cache and storage reads must be done during index maintenance. It is also interesting that 5.7.17 becomes as slow as 5.6.35 over time.

Slow SSD

Both 5.6.35 and 5.7.17 suffer from jitter.