Monday, April 27, 2015

Comparing LevelDB and RocksDB, take 2

I previously explained problems to avoid when comparing RocksDB and LevelDB. I am back with more details and results because someone is wrong on the Internet. The purpose of the test was to determine whether we had created any regressions, after reading a comparison published by someone else in which RocksDB had some problems. Note that the LevelDB and RocksDB projects have different goals. I expect RocksDB to be faster, but that comes at a cost in code and configuration complexity. I am also reluctant to compare different projects in public. The good news is that I didn't find any performance regression in RocksDB. It is faster as expected, but the overhead from performance monitoring needs to be reduced.

I made a few changes to LevelDB before running tests. My changes are on GitHub and the commit message has the details. Adding the --seed option for read-heavy tests is important, otherwise LevelDB can overstate QPS. The next step was to use the same compiler toolchain for RocksDB and LevelDB. I won't share the diff to the Makefile as that is specific to my work environment.

I used the following pattern for tests. The pattern was repeated for N=1M, 10M, 100M and 1000M keys with 800-byte values and a 50% compression rate. The database sizes were approximately 512M, 5G, 50G and 500G. The test server has 40 hyperthread cores, 144G of RAM and fast storage.
  1. fillseq to load a database with N keys
  2. overwrite with 1 thread to randomize the database
  3. readwhilewriting with 1 reader thread and the writer limited to 1000 Puts/second. The rate limit is important to avoid starving the reader (a sketch of such a rate-limited writer follows this list). Read performance is better when the memtable is empty and when queries are done immediately after fillseq, but for most workloads those are not realistic conditions, so overwrite was done before this test.
  4. readwhilewriting with 16 reader threads and the writer limited to 1000 Puts/second
  5. readrandom with 1 thread
  6. readrandom with 16 threads
  7. overwrite with 1 thread
  8. overwrite with 16 threads
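
For reference, a writer limited to a fixed number of Puts/second, as in steps 3 and 4, can be approximated with a sleep-based loop. This is a minimal sketch, not the db_bench implementation; the Put callback and the rate are placeholders.

#include <atomic>
#include <chrono>
#include <functional>
#include <thread>

// Minimal sketch of a writer thread limited to a fixed number of Puts/second,
// similar in spirit to the writer used by readwhilewriting. Not db_bench code.
void RateLimitedWriter(const std::function<void()>& do_put,
                       int puts_per_second,
                       const std::atomic<bool>& done) {
  using clock = std::chrono::steady_clock;
  const auto interval = std::chrono::microseconds(1000000 / puts_per_second);
  auto next = clock::now();
  while (!done.load()) {
    do_put();                             // one Put into the database
    next += interval;                     // schedule the next Put 1/rate later
    std::this_thread::sleep_until(next);  // sleep off any remaining time
  }
}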
Results

I ran the RocksDB tests twice, with statistics enabled and disabled. We added a lot of monitoring in RocksDB to make it easier to explain performance. But some of that monitoring needs to be more efficient for workloads with high throughput and high concurrency. I have a task open to make this better.
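
For context, enabling statistics in RocksDB is a one-line change to the options. A minimal sketch (the database path is hypothetical):

#include <cstdio>
#include <rocksdb/db.h>
#include <rocksdb/options.h>
#include <rocksdb/statistics.h>

// Sketch: open a database with statistics enabled, run a workload, then dump
// the counters and histograms. The statistics object is what adds overhead.
int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.statistics = rocksdb::CreateDBStatistics();

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/stats_example", &db);
  if (!s.ok()) return 1;

  // ... run the workload here ...

  printf("%s\n", options.statistics->ToString().c_str());
  delete db;
  return 0;
}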

These are the command lines for LevelDB, for RocksDB with stats and for RocksDB without stats with 1M keys. There are many more options in the RocksDB command lines. Once we decide on better defaults for RocksDB the number of options can be reduced. This post has more details on the differences in options between LevelDB and RocksDB. There are some differences between LevelDB and RocksDB that I did not try to avoid.
  • LevelDB uses 2MB files and I chose not to change that in source before compiling. It tries to limit the LSM to 10M in L1, 100M in L2, 1000M in L3, etc. It also uses a 2M write buffer, which makes sense given that L0->L1 compaction is triggered when there are 4 files in L0. I configured RocksDB to use a 128M write buffer and to limit levels to 1G in L1, 8G in L2, 64G in L3, etc. (see the options sketch after this list).
  • For the 100M and 1000M key tests the value of --open_files wasn't large enough in LevelDB to cache all files in the database.
  • Statistics reporting was enabled for RocksDB. This data has been invaluable for explaining good and bad performance. That feature isn't in LevelDB. This is an example of the compaction IO stats we provide in RocksDB.
  • Flushing memtables and compaction is multithreaded in RocksDB. It was configured to use 7 threads for flushing memtables and 16 threads for background compaction. This is very important when the background work is slowed by IO and compression latency, and compression latency can be very high with zlib although these tests used snappy. A smaller number would have been sufficient, but one thread would have been too little as seen in the LevelDB results. Even with many threads there were stalls in RocksDB. In this output from the overwrite test with 16 threads, look at the Stall(cnt) column for L0 and then the Stalls(count) line. The stalls occur because there were too many L0 files. It is a challenge to move data from the memtable to L2 with leveled compaction because L0->L1 compaction is single threaded and usually cannot run concurrently with L1->L2 compaction. We have work in progress to make L0->L1 compaction much faster.
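
Roughly, the RocksDB configuration described above corresponds to options like the following. This is a sketch rather than the exact db_bench command line; the sizes and thread counts are the ones mentioned in the list.

#include <rocksdb/db.h>
#include <rocksdb/env.h>
#include <rocksdb/options.h>

// Sketch of the RocksDB options described above: 128M write buffer, ~1G in L1
// with an 8x per-level multiplier, snappy compression, and multiple background
// threads for memtable flushes and compaction.
rocksdb::Options MakeBenchmarkOptions() {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.write_buffer_size = 128 << 20;               // 128M memtable
  options.max_bytes_for_level_base = 1ull << 30;       // ~1G in L1
  options.max_bytes_for_level_multiplier = 8;          // 8G in L2, 64G in L3, ...
  options.compression = rocksdb::kSnappyCompression;   // snappy, as in the tests
  options.max_background_flushes = 7;                  // memtable flush threads
  options.max_background_compactions = 16;             // compaction threads
  // The env thread pools must be large enough for the limits above.
  options.env->SetBackgroundThreads(7, rocksdb::Env::HIGH);
  options.env->SetBackgroundThreads(16, rocksdb::Env::LOW);
  return options;
}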
Details

The data below shows the QPS (ops/sec) and for some tests also shows the ingest rate (MB/sec). I like to explain performance results but the lack of monitoring in LevelDB makes that difficult. My experience in the past is that LevelDB suffers from not having concurrent threads for compaction and memtable flushing, especially when the database doesn't fit in RAM, because compaction will get more stalls from disk reads.

My conclusions are:
  • read throughput is a bit higher with RocksDB
  • write throughput is a lot higher with RocksDB and the advantage increases as the database size increases
  • worst-case overhead for stats in RocksDB is about 10% at high concurrency. It is much less at low concurrency.
--- 1M keys, ~512M of data

  RocksDB.stats  :  RocksDB.nostats  :     LevelDB
ops/sec  MB/sec  :  ops/sec  MB/sec  :   ops/sec  MB/sec  : test
 231641   181.1  :   243161   190.2  :    156299   121.6  : fillseq
 145352   113.7  :   157914   123.5  :     21344    16.6  : overwrite, 1 thread
 113814          :   116339          :     73062          : readwhilewriting, 1 thread
 850609          :   891225          :    535906          : readwhilewriting, 16 threads
 186651          :   192948          :    117716          : readrandom, 1 thread
 771182          :   803999          :    686341          : readrandom, 16 threads
 148254   115.9  :   152709   119.4  :     24396    19.0  : overwrite, 1 thread
 109678    85.8  :   110883    86.7  :     18517    14.4  : overwrite, 16 threads

--- 10M keys, ~5G of data

  RocksDB.stats  :  RocksDB.nostats  :     LevelDB
ops/sec  MB/sec  :  ops/sec  MB/sec  :   ops/sec  MB/sec  : test
 226324   177.0  :   242528   189.7  :   140095   109.0   : fillseq
  86170    67.4  :    86120    67.3  :    12281     9.6   : overwrite, 1 thread
 102422          :    95775          :    54696           : readwhilewriting, 1 thread
 687739          :   727981          :   513395           : readwhilewriting, 16 threads
 143811          :   143809          :    95057           : readrandom, 1 thread
 604278          :   676858          :   646517           : readrandom, 16 threads
  83208    65.1  :    85342    66.7  :    13220    10.3   : overwrite, 1 thread
  82685    64.7  :    83576    65.4  :    11421     8.9   : overwrite, 16 threads

--- 100M keys, ~50G of data

  RocksDB.stats  :  RocksDB.nostats  :     LevelDB
ops/sec  MB/sec  :  ops/sec  MB/sec  :   ops/sec  MB/sec  : test
 227738   178.1  :   238645   186.6  :    64599    50.3   : fillseq
  72139    56.4  :    73602    57.6  :     6235     4.9   : overwrite, 1 thread
  45467          :    47663          :    12981           : readwhilewriting, 1 thread
 501563          :   509846          :   173531           : readwhilewriting, 16 threads
  54345          :    57677          :    21743           : readrandom, 1 thread
 572986          :   585050          :   339314           : readrandom, 16 threads
  74292    56.7  :    72860    57.0  :     7026     5.5   : overwrite, 1 thread
  74382    58.2  :    75865    59.3  :     5603     4.4   : overwrite, 16 threads

--- 1000M keys, ~500G of data

Tests are taking a long time...

  RocksDB.stats  :    LevelDB
ops/sec  MB/sec  :  ops/sec  MB/sec  : test
 233126   182.3  :     7054     5.5  : fillseq
  65169    51.0  :                   : overwrite, 1 thread
   6790          :                   : readwhilewriting, 1 thread
  72670          :                   : readwhilewriting, 16 threads

Wednesday, April 22, 2015

Much ado about nothing

Kyle does amazing work with Jepsen and I am happy that he devotes some of his skill and time to making MongoDB better. This week a new problem was reported: stale reads despite the use of the majority write concern. Let me compress the bug report and blog post for you, but first, this isn't a code bug; it is expected behavior for asynchronous replication.

  1. MongoDB implements asynchronous master-slave replication
  2. Commits can be visible to others on the master before oplog entries are sent to the slave
The problem occurs when transaction 1 commits a change on the master, transaction 2 views that change on the master, then the master disappears and a slave that never received the oplog entries from transaction 1 is promoted to be the new master. At that point transaction 1 didn't happen and won't happen on the remaining members of the replica set, yet transaction 2 viewed that change. A visible commit has been lost.

When reading MongoDB source I noticed this in early 2014. See my post on when MongoDB makes a transaction visible. I even included a request to update the docs for write concerns and included this statement:
I assume the race exists in that case too, meaning the update is visible on the primary to others before a slave ack has been received.
This isn't a bug, this is async replication. You can fix it by adding support for sync replication. The majority write concern doesn't fix it because that only determines when to acknowledge the commit, not when to make the commit visible to others. For now the problem, if there is one, is in the documentation if it wasn't clear about this behavior. The majority write concern is a lot like semisync replication in MySQL. Then clever people added lossless semisync replication so that commits aren't visible on the master until they have been received by a slave. Finally, really clever people got lossless semisync replication running in production and we were much happier.

Wednesday, April 15, 2015

Big numbers for web scale MySQL

It is conference time. I haven't been at the MySQL UC because I am at another conference (ICDE), so I missed the talks, but the tweets make strong claims about a successful MySQL deployment.

Monday, April 13, 2015

How to win at IO-bound benchmarks

I spend time evaluating database performance on IO-bound workloads and have learned a few things that can help you get much better results on benchmarks than you will see in production. But this is really about benchmarketing.

The tips are:
  • Load the database in key order to avoid a fragmented B-Tree index or a randomized LSM tree. If there is a secondary index then create it after the load to avoid fragmentation there. This makes the database smaller and can make searches faster.
  • Only use 10% of the storage device. This reduces the average seek latency for disks. This also reduces the overhead from flash GC because there is less live data to copy out during erase block cleaning. This also makes it less likely that erase block cleaning will be needed during the test. Erase-block cleaning can also be avoided by starting with a freshly setup or long-idle SSD.
  • Don't run tests for too long. This reduces the chance that a B-Tree index will become fragmented, that an LSM will get slower from running compaction and that erase block cleaning will be started for an SSD. 
  • Use a fixed number of user threads. When there is a stall there will be at most N(threads) operations that are stalled. In the real world the request arrival rate is usually steady and a stall will create a huge convoy of requests. If you are measuring p99 latency then a fixed number of threads will allow you to understate the impact from stalls. This is called coordinated omission and YCSB has been updated to support a steady arrival rate. Don't use that feature. I think fio will also be updated. I still need to fix innosim. A sketch of the difference follows this list.
  • Use a small number of threads to measure response time and a large number of threads to measure throughput. Make sure that writes and erase block cleaning are not in progress when measuring read performance.
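
To make the coordinated omission point concrete, here is a minimal sketch that measures the same request stream two ways: a closed-loop measurement that starts the clock when the request is issued, and an open-loop measurement that starts it at the intended arrival time. The arrival rate, service time and stall length are made up.

#include <algorithm>
#include <cstdio>
#include <vector>

// Sketch: latency of a request stream with one 100ms stall, measured two ways.
// Closed loop: latency is counted from when the single worker issues the request.
// Open loop: requests are supposed to arrive every 1ms and latency is counted
// from the intended arrival time, so requests queued behind the stall count.
int main() {
  const double arrival_interval_ms = 1.0;   // intended steady arrival rate
  const double service_ms = 0.5;            // normal service time
  const double stall_ms = 100.0;            // one stall, e.g. from compaction
  const int num_requests = 1000;
  const int stalled_request = 500;          // the request that hits the stall

  std::vector<double> closed, open;
  double worker_free_at = 0.0;              // time when the worker is idle again
  for (int i = 0; i < num_requests; ++i) {
    double intended_start = i * arrival_interval_ms;
    double work = service_ms + (i == stalled_request ? stall_ms : 0.0);
    // The request cannot be issued before the single worker thread is free.
    double issue_time = std::max(intended_start, worker_free_at);
    double finish = issue_time + work;
    worker_free_at = finish;
    closed.push_back(finish - issue_time);    // ignores time spent queued
    open.push_back(finish - intended_start);  // includes queueing delay
  }
  auto p99 = [](std::vector<double> v) {
    std::sort(v.begin(), v.end());
    return v[static_cast<size_t>(v.size() * 0.99)];
  };
  printf("closed-loop p99 = %.1f ms, open-loop p99 = %.1f ms\n",
         p99(closed), p99(open));
  return 0;
}

The closed-loop number hides the couple hundred requests that queued behind the single stall.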

For more tips specific to LevelDB and RocksDB see an old post by me.

One other tip. If you are using a benchmark with more than 2B rows/keys/documents make sure that the random number generator supports that. It might not in LevelDB.
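
As a concrete example, a 32-bit generator can only name about 4B distinct values (about 2B if treated as signed), so with a larger keyspace part of it would never be selected. A minimal sketch of 64-bit key generation using the standard library, with a made-up key count:

#include <cstdint>
#include <cstdio>
#include <random>

// Sketch: generate random keys for a keyspace larger than 2^31 entries.
// A 31- or 32-bit generator cannot address all keys when num_keys > ~2B.
int main() {
  const uint64_t num_keys = 4000000000ULL;  // 4B keys, more than 2^31
  std::mt19937_64 rng(20150413);            // 64-bit Mersenne Twister, fixed seed
  std::uniform_int_distribution<uint64_t> dist(0, num_keys - 1);
  for (int i = 0; i < 5; ++i) {
    printf("key %020llu\n", (unsigned long long)dist(rng));
  }
  return 0;
}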

Friday, April 10, 2015

Capacity planning and QPS

As much as I like the technology from in-memory DBMS vendors like MemSQL, VoltDB and MySQL Cluster, I am amused by some of the marketing that describes capacity planning as a function of QPS. It usually isn't true that when a workload has a peak QPS of X and the peak per-server QPS for the product is Y, then X/Y nodes are sufficient. The Oracle keynote from Mr. Big is an example of this.

Work in web-scale database land is more complex courtesy of additional constraints including database size, network latency and network bandwidth. Workloads don't just require X QPS, they require X QPS over a database of Y PB with requirements for response time. When Y is large it helps to not keep all data in RAM. This is made easier when a workload has data access skew so that the hot data can be in RAM while much more data is on disk.
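
A small worked example of the point: the node count is the maximum over several constraints, not just QPS divided by per-node QPS. All of the numbers below are made up for illustration, and this ignores replication copies and network constraints.

#include <algorithm>
#include <cmath>
#include <cstdio>

// Sketch: capacity planning as the max over several constraints, not QPS alone.
int main() {
  // Workload requirements (hypothetical).
  const double peak_qps = 5e6;            // 5M queries/second
  const double database_bytes = 500e12;   // 500 TB of data
  const double hot_bytes = 20e12;         // 20 TB working set that must be in RAM

  // Per-server capabilities (hypothetical).
  const double qps_per_server = 200e3;    // 200K queries/second
  const double storage_per_server = 5e12; // 5 TB of SSD
  const double ram_per_server = 256e9;    // 256 GB of RAM

  const double by_qps = std::ceil(peak_qps / qps_per_server);
  const double by_storage = std::ceil(database_bytes / storage_per_server);
  const double by_ram = std::ceil(hot_bytes / ram_per_server);

  printf("servers by QPS: %.0f, by storage: %.0f, by RAM: %.0f, required: %.0f\n",
         by_qps, by_storage, by_ram, std::max({by_qps, by_storage, by_ram}));
  return 0;
}

With these made-up numbers the QPS constraint alone asks for 25 servers while storage asks for 100.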

With fast storage there are clever ways to keep indexes in RAM and data on SSD, and with proper indexing queries are limited to one or two reads from the storage device. This can provide excellent response times without keeping everything in RAM. The latency from an SSD read might double the query response time compared to an in-memory database, which is acceptable for many workloads. Aerospike seems to be good at this, but it can also be done with InnoDB (covering indexes and a clustered PK FTW).

Network latency and bandwidth are additional constraints for capacity planning. While they don't determine whether data must be in RAM they might determine how many copies are required for both the database and any caches that accompany the database and by cache I mean something like memcached. With more copies of the data you are likely to have a copy closer to the clients (clients == web tier) and that reduces the network latency added to queries. Even with a database capable of 1B QPS a deployment can require caches to avoid the extra network latency as latency and bandwidth to that fast database server can be a bottleneck.

Tuesday, April 7, 2015

Compared to what?

It is common to share big numbers for web-scale database deployments. I do it frequently in presentations. I am not alone in this practice. It is easy to get large values for QPS with web-scale OLTP (small data). The same is true for database size and rows-read rates with web-scale data warehousing (big data).

I hope that So What? is the first reaction when these big numbers are shared. Big numbers only mean that a lot of hardware has been used. Context is what makes these big numbers more or less interesting. Note that I am not saying this to take away from the work done by my peers. I have been fortunate to work with extremely talented teams.

Compared to what? is another useful response. When considering stock mutual funds we look at the performance of the fund relative to a benchmark such as the S&P 500. When considering database performance it helps to understand whether an alternative product would have done better. We usually don't have an answer for this because it can be too expensive to do the comparison, but it is still something to keep in mind.

We aren't in the business of growing QPS, database size and rows read rates. We are in the business of answering questions with efficiency and quality of service. The goals include increasing availability, reducing response time, reducing response time variation and doing more work with less HW. Details about these goals are less likely to be shared -- for business reasons and sometimes because the data isn't collected -- so the context required to appreciate the big numbers might always be missing.

Friday, April 3, 2015

Fast index create

What is a fast way to create secondary indexes for a write-optimized database engine?

Back in the day the only way to create a secondary index for InnoDB was via incremental maintenance. Insert, update and delete statements would maintain secondary indexes as needed. The CREATE INDEX command for a secondary index would make a copy of the table, define all indexes (primary and secondary), then scan the source table in PK order and insert rows from the scan into the copy table while maintaining secondary indexes after each insert. In most cases the result is a secondary index that was subject to changes in a random sequence. That means the secondary index is fully fragmented immediately after index create, and there was no way to defragment a secondary index. Fragmentation is bad because it wastes space: the index uses about 1.5X the space of an unfragmented index.

Today InnoDB is able to create a secondary index in the expected way via scan/sort/write and fragmentation is much better. The expected way is to scan the base table or another secondary index to get the columns for the new index, sort those columns and then write out the secondary index in key order. I will ignore the complexity of logging to allow other changes concurrent with the index create. Not only does this reduce fragmentation but it also reduces random IO -- the index is written sequentially and there are no random disk reads to get secondary index leaf pages during the read-modify-write cycle of an update for a b-tree.
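
A minimal sketch of scan/sort/write index creation, with made-up row and key types; a real implementation would use an external sort when the table doesn't fit in memory and would log concurrent changes:

#include <algorithm>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Sketch of scan/sort/write secondary index creation. Rows are scanned in PK
// order, the secondary key is extracted, the (secondary key, PK) pairs are
// sorted, and the index is written out sequentially in key order.
struct Row {
  int64_t pk;
  std::string name;  // the column being indexed
};

std::vector<std::pair<std::string, int64_t>> BuildSecondaryIndex(
    const std::vector<Row>& table) {
  std::vector<std::pair<std::string, int64_t>> entries;
  entries.reserve(table.size());
  for (const Row& row : table) {             // 1. scan the base table in PK order
    entries.emplace_back(row.name, row.pk);  // 2. extract (secondary key, PK)
  }
  std::sort(entries.begin(), entries.end()); // 3. sort by secondary key
  return entries;                            // 4. write sequentially in key order
}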

The best practice for fast index creation can change for a write-optimized database engine when the engine supports a blind write in addition to a read-modify-write internally. By this standard a b-tree is not write-optimized, whether it is update-in-place like InnoDB or copy-on-write like WiredTiger. Updates and inserts with a b-tree must read the old copy of the page, modify it and then eventually write the page back to storage. There is always a read before the write, even with the InnoDB change buffer where the read is deferred. But with an LSM like RocksDB or WiredTiger, and probably with a fractal tree from Tokutek, updates can be done as blind writes in many cases, such as when the secondary index to be updated is not unique. By blind write I mean there isn't a read before the write. This means that the random reads that make incremental index maintenance for a b-tree slow can be avoided with a write-optimized engine. The random writes can also be avoided, assuming the engine really is write-optimized when flushing changes to storage.
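
A sketch of the difference using the RocksDB API; the key encoding (index ID prefix plus indexed value plus PK) is made up, and a real engine would put the index changes and the base-row change into one WriteBatch. The point is that the non-unique case needs no Get:

#include <string>
#include <rocksdb/db.h>
#include <rocksdb/options.h>

// Sketch: secondary index maintenance for an update that changes the indexed
// column from old_val to new_val for primary key pk. The key encoding is
// hypothetical, not what any real engine uses.
std::string IndexKey(const std::string& val, const std::string& pk) {
  return "idx1:" + val + ":" + pk;
}

rocksdb::Status UpdateNonUniqueIndex(rocksdb::DB* db, const std::string& pk,
                                     const std::string& old_val,
                                     const std::string& new_val) {
  // Blind writes: no read of the index is needed because duplicates are
  // allowed and the PK suffix makes each entry unique.
  rocksdb::WriteOptions wo;
  rocksdb::Status s = db->Delete(wo, IndexKey(old_val, pk));
  if (!s.ok()) return s;
  return db->Put(wo, IndexKey(new_val, pk), "");
}

rocksdb::Status UpdateUniqueIndex(rocksdb::DB* db, const std::string& pk,
                                  const std::string& old_val,
                                  const std::string& new_val) {
  // A unique index needs a read-modify-write: check for an existing entry
  // with the new value before writing, which can cause a random read.
  std::string existing;
  rocksdb::Status s = db->Get(rocksdb::ReadOptions(), "idx1:" + new_val, &existing);
  if (s.ok()) return rocksdb::Status::InvalidArgument("duplicate key");
  if (!s.IsNotFound()) return s;
  rocksdb::WriteOptions wo;
  s = db->Delete(wo, "idx1:" + old_val);
  if (!s.ok()) return s;
  return db->Put(wo, "idx1:" + new_val, pk);
}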

It might take some time for us to learn that the rules have changed. It might also take time for the rules to really change when there are robust write-optimized engines available for MySQL and MongoDB.

Update - today I learned the RocksDB storage engine for MongoDB doesn't do reads for non-unique secondary index maintenance.

Wednesday, April 1, 2015

A few possibly true facts about indexes in MongoDB, RocksDB and WiredTiger

These are probably correct as of April 1, 2015. They might not be correct next year. We have a lot of work to do to explain the storage engines available in MongoDB 3. I welcome corrections.

  • The primary key index is not clustered in MongoDB for RocksDB or WiredTiger. I assume this is a limitation of the storage engine API and hope a clustered PK index is supported in the future. The PK index is a map from PK columns to RecordId and (I assume) RocksDB and WiredTiger have a clustered index on RecordId with the documents in the leaf nodes of the RecordId index. The impacts from not having a clustered PK index include:
    • Much less RAM is required to cache the PK index because it is not clustered. But if the table is huge and you want to guarantee that at most one disk read is required for a point query, then a non-clustered PK requires much more RAM than a clustered PK. With the clustered PK you only need the level above the leaf in RAM, and that only requires one (key, block pointer) pair per block in the leaf level. With the non-clustered PK you need one (key, block pointer) pair per row in the table.
    • For RocksDB and WiredTiger going from RecordId to the document requires searching an index (b-tree or LSM for WT, LSM for RocksDB). An index search is not required for mmapv1 as the RecordId is a filename + offset in that case. This might make read-mostly workloads on a cached working set faster with mmapv1.
    • If values adjacent in the PK index are not adjacent in the RecordId index then a scan in PK order can do a lot of random IO as it moves around in the RecordId index.
  • The WiredTiger b-tree can use prefix compression for indexes and block compression for data. I think this means that block compression is used for the leaf nodes of the RecordId index. I am not sure if it is also used for the leaf nodes of the other indexes. I guess that only prefix compression is used for the non-leaf nodes of all indexes.
  • With RocksDB we need to be careful about terminology. With leveled compaction the block-based table format is used for MongoDB, and most of the data in an LSM level is in data pages, with the rest in the block index and bloom filter. This is true for secondary indexes, primary indexes and the RecordId index. Currently all data is in one column family and indexes are distinguished by using the index ID as the prefix of the key given to RocksDB, but we can ignore that for now. Let's also ignore bloom filters and just focus on pages and the block index. The interesting details are:
    • There is one key per page in the block index so this is unlikely to use much space compared to the pages. 
    • Key-value pairs are stored in the pages in key order and prefix compression is done for them. Prefix compression is not done for the block index. Logically the value in the secondary index is the PK columns, but the PK columns are used here to make the secondary index key unique. Logically the value in the PK index is the RecordId, but I think the implementation puts the RecordId at the end of the PK columns in the key. And for the RecordId index the value is the document. A sketch of this layout follows the list.
    • Block compression, when configured, is done for all pages. So it might be possible that more data gets the benefit of block compression with RocksDB than with WiredTiger, and this can be a big deal when wide (covering) secondary indexes are used. However I still have questions about where the block compressor is used for WT.
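
To make the layout above concrete, here is a sketch of the key-value pairs as I understand them. The encodings are simplified (string prefixes instead of binary index IDs, and the real document is BSON), so treat this as an assumption rather than the actual format.

#include <cstdint>
#include <string>
#include <utility>

// Simplified sketch of the key-value layout described above, with everything
// in one column family and a per-index prefix. The encodings are made up.
using KV = std::pair<std::string, std::string>;

// Secondary index: key = index prefix + secondary columns + PK columns,
// value is empty; the PK columns make the key unique.
KV SecondaryEntry(const std::string& sec_cols, const std::string& pk) {
  return {"idx_a:" + sec_cols + ":" + pk, ""};
}

// PK index: logically PK -> RecordId, with the RecordId possibly appended to
// the key rather than stored as the value.
KV PrimaryKeyEntry(const std::string& pk, uint64_t record_id) {
  return {"idx_id:" + pk + ":" + std::to_string(record_id), ""};
}

// RecordId index: RecordId -> document (BSON in the real engine).
KV RecordIdEntry(uint64_t record_id, const std::string& document) {
  return {"coll1:" + std::to_string(record_id), document};
}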