Small Datum: July 2014

In the beginning there was LevelDB. It is wonderful to see so much functionality provided by a small amount of clean code. The benchmark results show very high performance for a small, cached database without concurrency. AFAIK, this is a great solution for at least one widely-used and important workload. But it isn't a server DBMS and I don't think the project owners want it to become one as that will make the code more complex and even add a few bugs.

Editorial #1

While LevelDB isn't a server DBMS, it is a good basis for one. So we have forks or projects based on it like RocksDB and HyperLevelDB and we are starting to get benchmark results that compare the forks with each other and with non-LevelDB DBMS products. Where there is benchmarking there might be benchmarketing so it helps to understand the client, db_bench, that is frequently used to generate benchmark results.

For starters, my faith in a particular benchmark result is inversely related to the number of products evaluated in the test. It is hard to find someone with expertise in multiple systems to vouch that the result for a specific product wasn't lousy because of misconfiguration or misuse. Results should be explained and that takes more time as more systems are included in a result. While recent changes have made RocksDB better, it is much harder to configure than LevelDB so I won't be surprised when lousy results are caused by misconfiguration. Fortunately, it is easy to share db_bench command lines.

Editorial #2

The RocksDB team has done a lot of great work to improve in-memory performance. When I was helping (a long time ago), we spent a lot of time on IO-bound & write-heavy workloads. Prior to the addition of universal compaction by Dhruba we added monitoring and learned that if you have a write-heavy workload then you are not going to have a good time with LevelDB. We aren't unique in learning that.

These problems described below become visible when you run a random-write workload on a large database and get the LSM into a steady state. It is possible to not reproduce the problem by avoiding some of these steps.

LevelDB suffers from high write-amplification. Add 1 for the redo log, 1 for the memtable flush to the L0, 1 for L0 -> L1 compaction assuming sizeof(L0) == sizeof(L1) and then ~10 for each level beyond the L1 assuming there is no skew in keys to be written (uniform, not zipfian distribution). The above means you might end up with a write-amplification of 50 for a large database. I don't believe a clever choice of compaction inputs leads to a significant reduction in write-amplification in the steady state unless you ignore space-amplification. And if we ignore space-amplification then disabling compaction is the best strategy as data is written once to the L0 and optionally once to the redo log and then we are done. RocksDB has monitoring to report the value and sources of write-amplification.
LevelDB has but one thread to do compaction in the background. This thread does read, decompress, merge keys, write, compress in a loop. It is unlikely to keep up with a few or even one thread on a write-intensive workload. When write-amplification is 50 it must repeat that loop 50 times for each byte put into the memtable. It can also stall on disk reads and writes. RocksDB supports many background compaction threads which is a big improvement from LevelDB and universal compaction to reduce write-amplification.
Storage can also fall behind when write-amplification is large. If storage can sustain 200MB/second of writes and write-amplification is 50 then the peak ingest rate is 4 MB/second (peak-write-rate / write-amplification) and that assumes storage reads are free or not done because the database is cached. A b-tree that does random writes might be able to beat that.
The result of the above is that L0 -> L1 compaction will fall behind and likely trigger many stalls on Put operations. Note that LevelDB doesn't have features to stall Put operations when levels beyond L0 (L1 -> Lmax) get too much data. All of this leads to more Put stalls and an LSM tree with too much data on the wrong levels. RocksDB has an option to stall writes when levels L1 -> Lmax-1 have too much data. But the best solution is to match compaction algorithm to the workload -- universal compaction has less write-amplification.

Notes on db_bench

With all of that said, here is a short and incomplete guide to understanding the LevelDB benchmark client, db_bench. I am a big fan of db_bench. We use it a lot to understand and improve RocksDB. Some of the problems described below have been fixed in RocksDB but others remain.

The db_bench client doesn't report performance metrics per time interval. This has been fixed in RocksDB. LevelDB's db_bench reports average throughput at the end of a test run. It can report percentiles and histograms for response times. For some tests I want to confirm that response time and throughput were stable and did not degrade continuously or intermittently over the duration of the test. That is hard to judge unless throughput and response time are also reported per time interval as the test runs. This is enabled in RocksDB via the --statistics, --stats_per_interval and --stats_interval options.
Client threads reuse the same RNG seed. See the ThreadState constructor in LevelDB's db_bench. There is a --seed option for RocksDB's db_bench to prevent this. The problem is that a given thread generates the same sequence of keys each time the test is run. While that might help with reproducibility it can also ruin results for IO-bound tests in a few cases. By IO-bound I mean the database is much larger than RAM. When read performance is tested for an IO-bound database if the db_bench client is run for a short period of time and then restarted then during the second run all of the read data might still be in the OS filesystem cache. I use iostat during benchmarks to debug this problem and others. When write performance is tested for an IO-bound database and db_bench is run for a short time and then restarted, the same keys are written for each run and more likely to fit in the upper levels of the LSM leading to less write-amplification and better performance than might occur in production. I added compaction statistics to RocksDB to report how much IO is done (bytes read & written, read latencies) per level with leveled compaction and this data can debug the problem.
The db_bench RNG has a range from 0 to 2^31-1. If your db_bench test database has more than 2B distinct key values then you are not going to have a good time. This has been fixed in RocksDB's db_bench.
db_bench doesn't have an option to enable checksum verification on reads from storage. This excludes a CPU overhead that is likely to be enabled on production deployments. See the constructor for ReadOptions which sets verify_checksum to false. Given the use of the default constructor elsewhere in LevelDB I wonder if there are places where it never verifies checksums on read. That would be a bug and I think Table::Open is one example of the bug when ReadBlock is called for the index block. Possible bugs from use of the default constructor have been fixed in RocksDB. RocksDB's db_bench has the --verify_checksum option as well. I didn't ask the LevelDB team about this until I wrote this post.
db_bench doesn't have an option to set the block size for database files, so the default of 4KB is used. RocksDB's db_bench has an option to set the size via the --block_size option.
The statistics timer for Put operations is stopped before DB::Write is done. The timer is stopped in LevelDB's db_bench after the operation has been added to the write batch. It should be stopped after the write batch has been added to the memtable and optionally written to the redo log. See the call to FinishedOps in DoWrite. This means that response times don't measure the time to write the database. This was recently fixed in RocksDB. I didn't ask the LevelDB team about this until I wrote this post.
Was a good malloc library used? I have measured significant differences in the overhead of db_bench source code between not using and using jemalloc.
There is no client-side processing of fetched data after a read. The read loop is very simple -- generate a key, call read, don't look at returned value, repeat. To make sure that the fetched value was really read by the client I think db_bench should compute the sum of the bytes in the returned value if only to generate a bit more load on the memory system. Otherwise a DBMS using mmap reads might avoid copying or reading the values. This has yet to be fixed in RocksDB. See the ReadRandom method in db_bench.
LevelDB and RocksDB use buffered IO. A too large value for readahead (read_ahead_kb) can ruin your results on an IO-bound test. RocksDB's db_bench has the --advise_random_on_open option to use POSIX_FADV_RANDOM for non-compaction reads. It also has the --compaction_fadvice option to use POSIX_FADV_SEQUENTIAL on compaction reads & writes.
When compression is enabled all files are compressed. I am not sure memtable flushes should be compressed because that can use a lot of CPU, especially with zlib, without providing a significant reduction in database size. Beyond the CPU overhead, compressing a memtable flush increases the chance that a Put operation will stall waiting for a memtable flush to finish (flush of the old memtable is done by a background thread when the current memtable is being written). There is an option in RocksDB to limit compression to the lower/larger levels of the LSM with leveled compaction. There is a similar option that works with universal compaction (called size-tiered by Cassandra). For a good overview of leveled and size-tiered compaction see the DataStax posts. The db_bench options are --universal_compression_size_percent and --min_level_to_compress.
Sequential insert tests can be very fast with LevelDB and with RocksDB when using leveled compaction because there is an optimization that avoids the need for compaction. Thus write-amplification is one (or two if the redo log is enabled). The optimization lets memtable flush output get pushed to a lower level of the LSM rather than first being written to the L0 and then slowly compacted from Ln to Ln+1. Querying a database after a sequential load can also benefit from this optimization. RocksDB (and I assume LevelDB) avoids checking a file for data when the desired key is less than the min key or greater than the max key stored in the file. This pruning is done before the bloom filter check. Thus is is likely that at most one file will be checked on a point query after a sequential load. Without the sequential load optimization one file per level would be checked.
A random write test that puts 1M keys into an empty database is unlikely to result in a database with 1M distinct keys. If you do a random query test after this then some of the reads will be very fast assuming the bloom filter works because the desired data doesn't exist.
A short random write test done to a database after a sequential load might not get the database into a steady state. By steady state I mean that compaction should be in progress as it would be for a sustained write-heavy workload. For write-heavy tests I usually do a sequential load of X million rows, then do X million random writes, then consider the database to be ready for real tests.
There are different types of writes. A write-heavy workload with db_bench usually means blind writes but we added the --benchmarks=updaterandom option to db_bench to match the read-modify-write pattern that is common with a SQL DBMS, except db_bench ignores the overhead from making the read-modify-write sequence atomic.
Read QPS is also inflated in db_bench when the memtable is empty as no time must be spent checking an empty memtable. So the pattern -- run db_bench to populate the database, re-run db_bench using a read-only test -- will get QPS much higher than is likely to occur in the real world. And if the load was done using leveled compaction (style used by LevelDB) then the read QPS will be inflated again (see point 11 above).
When RocksDB universal compaction is used (called size-tiered by Cassandra) and a sequential insert is done, then read QPS after the load can also be inflated because of key range pruning. The files after the load will be range partitioned by the index key, so at most one file must be read, and possibly at most one bloom filter check must be done, per query.

Small Datum

Thursday, July 10, 2014

How to answer questions

Preferred version

Less preferred variant #1

Less preferred variant #2

Monday, July 7, 2014

Benchmarking the leveldb family