Friday, December 30, 2016

The MyRocks community in 2016

The MyRocks community grew in 2016 with significant contributions from early evaluators. I appreciate their help and have been busy documenting the problems that they reported. I look forward to more contributions in 2017.

MariaDB Corporation and Percona announced plans to include MyRocks in their distributions, so you will be able to get MyRocks with expert support. Equally significant have been the contributions from Sergey Petrunya to MyRocks.

The MySQL team at Alibaba reported poor performance from range scans with sysbench. The problem was memory-system contention on a performance counter, and the fix made long range scans 2X faster on concurrent workloads.
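The usual remedy for contention on a hot shared counter is to stripe it so that concurrent threads update different memory locations. A minimal Python sketch of that idea (the class, names, and stripe count are illustrative, not the actual MyRocks/RocksDB patch):

```python
import threading

class StripedCounter:
    """Spread increments across several stripes so concurrent threads
    rarely touch the same counter. In a C++ implementation each stripe
    would also be padded onto its own cache line."""

    def __init__(self, nstripes=8):
        self.stripes = [0] * nstripes
        self.locks = [threading.Lock() for _ in range(nstripes)]

    def add(self, n=1):
        # Hash the calling thread onto a stripe so different threads
        # usually increment different counters.
        i = threading.get_ident() % len(self.stripes)
        with self.locks[i]:
            self.stripes[i] += n

    def value(self):
        # Reads of a perf counter are rare, so summing on read is fine.
        return sum(self.stripes)

c = StripedCounter()
threads = [threading.Thread(target=lambda: [c.add() for _ in range(1000)])
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(c.value())  # 4000
```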

I have been using sysbench more and created a script to run a sequence of tests with it. During my tests I noticed too much variance in the read-only tests, and the cause is non-determinism in the state of the memtable and level 0 of the LSM tree. The overhead for a search of the LSM tree depends on the amount of data in the memtable and the number of files in level 0, and this state is not deterministic when a read-only test follows a read-write test. I filed issue 427 to see whether RocksDB can adapt and flush data from the memtable and level 0 earlier than normal when a workload becomes read-only.
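The variance is easy to see with a simplified cost model: a point lookup may probe the memtable, every level-0 file (L0 files can overlap), and one sorted run per deeper level. This sketch is my own back-of-the-envelope model, ignoring bloom filters and caching, not RocksDB code:

```python
def sorted_runs_to_check(memtable_nonempty, l0_files, deeper_levels):
    """Worst-case number of sorted runs a point lookup may probe in an
    LSM tree: the memtable, each overlapping L0 file, and one sorted
    run per deeper level."""
    return (1 if memtable_nonempty else 0) + l0_files + deeper_levels

# Read-only test starting right after a read-write test: the memtable
# has data and several L0 files have not been compacted yet.
print(sorted_runs_to_check(True, 4, 5))   # 10

# After flushing the memtable and compacting L0 (what issue 427 asks
# RocksDB to do when the workload turns read-only):
print(sorted_runs_to_check(False, 0, 5))  # 5
```

The first run of a read-only test can therefore do roughly twice the probes of a later run, which shows up as variance in sysbench results.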

Sysbench is extremely useful even though its workloads are synthetic: the database is small (multi-GB), some tests are read-only, and most tests use a uniform distribution for keys. There is a talk on sysbench at FOSDEM 2017 and I look forward to using it more next year.

Justin Swanhart pointed out the lousy performance MyRocks provided with the default configuration. As a result we improved the defaults by increasing the memtable to 64M and the block cache to 512M, and I will share results at FOSDEM. See issues 369, 375, and 441.
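The new defaults correspond to something like the following my.cnf fragment. The variable names are the MyRocks server variables as I understand them; treat the exact syntax as an assumption and check your build's documentation:

```ini
[mysqld]
# 512M block cache (the new default)
rocksdb_block_cache_size=512M
# 64M memtable via the default column family options
rocksdb_default_cf_options=write_buffer_size=64m
```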

Justin Swanhart and Rick Pizzi reported problems with too-large transactions leading to lousy performance and OOM. MyRocks buffers all changes from a transaction in memory until commit, and at commit time the changes are briefly double-buffered when copied into the memtable. Until we make this better the workaround is to use rocksdb_commit_in_the_middle for bulk loads, and even after we make this better that option will remain useful for bulk loads. Today the only limit on the max size of a transaction is rocksdb_max_row_locks, which caps the number of locked rows rather than the memory used by the transaction. Even worse, the default is 1B. We are changing the limit to be on the amount of memory used by a transaction. Issues for this include 341, 347, 348 and 365.
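The difference between the two limits can be sketched in a few lines. This is a toy model of a transaction's write buffer, with made-up names, not MyRocks code: a row-count cap (like rocksdb_max_row_locks) lets one row with a huge value blow up memory, while a byte cap (the planned replacement) does not:

```python
class ToyWriteBatch:
    """Toy model of a transaction's in-memory write buffer with two
    possible limits: a locked-row count and a cap on buffered bytes."""

    def __init__(self, max_row_locks=1_000_000_000, max_bytes=None):
        self.max_row_locks = max_row_locks
        self.max_bytes = max_bytes
        self.rows = 0
        self.bytes = 0

    def put(self, key, value):
        self.rows += 1
        self.bytes += len(key) + len(value)
        if self.rows > self.max_row_locks:
            raise MemoryError("too many locked rows")
        if self.max_bytes is not None and self.bytes > self.max_bytes:
            raise MemoryError("transaction buffer too large")

# With only the 1B row-count default, one row can buffer 16MB unchecked:
b = ToyWriteBatch()
b.put(b"k", b"x" * (16 * 1024 * 1024))  # accepted

# A byte limit catches the same transaction:
b2 = ToyWriteBatch(max_bytes=1024 * 1024)
try:
    b2.put(b"k", b"x" * (16 * 1024 * 1024))
except MemoryError as e:
    print(e)  # transaction buffer too large
```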

The MySQL team at Alibaba reported a performance regression from MyRocks group commit when running sysbench. They also reported a correctness problem that I ran into with sysbench, and auditing the code found another bug. A related problem was that rocksdb_disable_2pc was enabled by default, meaning crash safety between the engine and the binlog wasn't guaranteed. We have changed the default configuration to be crash safe and renamed the option to avoid double negatives. We also used YCSB and sysbench to reproduce the performance regression and have work in progress to make this better. That requires collaboration between the MyRocks and RocksDB teams, and it is fun to watch the teams solve problems. Issues for this include 474, 481, 482 and 488.
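If you are on an older build where the double-negative option still exists, the crash-safe setting looks something like this. I am assuming the renamed variable is rocksdb_enable_2pc; verify the name against your build before relying on it:

```ini
[mysqld]
# Keep the binlog and the engine in sync across a crash
# (old builds: rocksdb_disable_2pc=OFF; new builds, assumed name:)
rocksdb_enable_2pc=ON
```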