Wednesday, April 23, 2014

Concurrent, read-only, not cached: MongoDB, TokuMX, MySQL

I repeated the tests described here using a database larger than RAM. The test database has 8 collections/tables with 400M documents/rows per collection/table. I previously reported results for this workload using a server with 24 CPU cores and a slightly different flash storage device. This time I provide a graph and use a server with more CPU cores. The goals for this test are to determine whether the DBMS can use the capacity of a high-performance storage device, to measure the impact of different filesystem readahead settings for MongoDB and TokuMX, and to measure the impact of different read page sizes for TokuMX and InnoDB. It will take two blog posts to share everything. I expect much better QPS for MongoDB and TokuMX in my next post so I won't list any conclusions here.

Setup

I used my forked Java and C sysbench clients. The test query fetches one document/row by PK. The test database has 8 collections/tables with 400M rows per collection/table. All are in one database. I still need to enhance the Java sysbench client to support a database per collection. I tested the configurations listed below. I don't think these are the best configurations for TokuMX and MongoDB and am running more tests to confirm. The test server has 144G RAM, 40 CPU cores and a fast flash storage device.
  • fb56.handler - 740G database, MySQL 5.6.12 with the Facebook patch, InnoDB, page_size=8k, data fetched via HANDLER
  • fb56.sql - 740G database, MySQL 5.6.12 with the Facebook patch, InnoDB, page_size=8k, data fetched via SELECT
  • orig57.handler - 740G database, official MySQL 5.7.4, InnoDB, page_size=8k, data fetched via HANDLER
  • orig57.sql - 740G database, official MySQL 5.7.4, InnoDB, page_size=8k, data fetched via SELECT
  • tokumx32 - 554G database, TokuMX 1.4.1, quicklz, readPageSize=32K, 16K filesystem readahead
  • tokumx64 - 582G database, TokuMX 1.4.1, quicklz, readPageSize=64K, 32K filesystem readahead
  • mongo24 - 834G database, MongoDB 2.4.9, powerOf2Sizes=0, 16K filesystem readahead
  • mongo26 - 874G database, MongoDB 2.6.0, powerOf2Sizes=1, 16K filesystem readahead

Results

Results for MySQL 5.7.4 are not in the graph to keep it readable and are similar to MySQL 5.6.12. Note that MySQL is able to get more than 100,000 QPS at high concurrency, TokuMX reaches 30,000 and MongoDB isn't able to reach 20,000. I think MongoDB and TokuMX can do a lot better when I reduce the filesystem readahead for both and reduce the read page size for TokuMX, and results for that will be in my next post. MongoDB also suffers in this test because the PK index is so large that the leaf nodes cannot all fit in RAM, so there is more than one disk read per query. This isn't something that goes away via tuning. The workaround is to make sure the database:RAM ratio isn't too big (and spend more money on hardware).
The table below lists the QPS from the graph.

point queries per second
     8     16     32     40  clients
 39928  63542 102294 107769  fb56.handler
 33630  56834  91132 102336  fb56.sql
 39714  63359 101987 106205  orig57.handler
 33561  56725  90900 101476  orig57.sql
 12586  22738  31407  32167  tokumx32
 10119  16373  18310  18232  tokumx64
 12782  16639  17350  17435  mongo24
 12503  17474  17988  18022  mongo26

Analysis

These tables list the average disk read rate from iostat r/s and the average number of disk reads per query, where disk reads per query is iostat r/s divided by QPS (for example, fb56.handler at 8 clients does 33661 reads/second at 39928 QPS, or about 0.84 reads per query). InnoDB is by far the most efficient with the smallest number of disk reads per query. TokuMX benefits from having the smallest database courtesy of quicklz compression but might suffer from a larger read page size (32k and 64k). However, I don't think that is the only reason why the disk reads per query ratio is so much larger for TokuMX than for InnoDB. I am repeating tests with an 8k read page size to confirm. MongoDB suffers from a PK index that is too large to be cached so disk reads are done both for the index and for the document store. Both TokuMX and MongoDB might also do extra reads because of the filesystem readahead and I am repeating tests with smaller values for it to confirm.

iostat r/s
     8     16     32     40  clients
 33661  53502  86028  90616  fb56.handler
 29120  49155  78748  88423  fb56.sql
 33776  53702  86193  89755  orig57.handler
 29244  49268  78801  88027  orig57.sql
 26756  47813  65885  67840  tokumx32
 23728  37442  41357  42089  tokumx64
 18966  24440  25147  25322  mongo24
 18312  25313  25701  25781  mongo26

disk reads per query
     8     16     32     40  clients
  0.84   0.84   0.84   0.84  fb56.handler
  0.86   0.86   0.86   0.86  fb56.sql
  0.85   0.84   0.84   0.84  orig57.handler
  0.87   0.86   0.86   0.86  orig57.sql
  2.12   2.10   2.09   2.10  tokumx32
  2.34   2.28   2.25   2.29  tokumx64
  1.48   1.46   1.44   1.45  mongo24
  1.54   1.44   1.42   1.43  mongo26


RW locks are hard

MongoDB and TokuMX saturated at a lower QPS rate than MySQL when running read-only workloads on a cached database with high concurrency. Many of the stalls were on the per-database RW-lock and I was curious about the benefit from removing that lock. I hacked MongoDB to not use the RW-lock per query (not safe for production) and repeated the test. I got less than 5% more QPS at 32 concurrent clients. I expected more, looked at performance with PMP and quickly realized there were several other sources of mutex contention that are largely hidden by contention on the per-database RW-lock. So this problem won't be easy to fix but I think it can be fixed.

The easy way to implement a reader-writer lock uses the pattern listed below. That includes pthread_rwlock_t in glibc the last time I checked and the per-database RW-lock used by MongoDB. InnoDB used this pattern many years ago and then we rewrote it to make InnoDB better on multi-core. An implementation like this tends to have problems on multi-core servers. The first problem is locking/unlocking the internal mutex at least twice per use, once to get the lock in read or write mode and then again to unlock it. When there is contention it can be locked/unlocked many more times than twice per use by threads that wait, wake up and then wait again. If the operation protected by this RW-lock is very fast then a mutex is usually a better choice. Note that even when all threads are trying to lock in read mode there is still contention on the internal mutex ("mtx" below). Another problem occurs when the thread trying to unlock a RW-lock is blocked trying to lock the internal state mutex ("mtx" below). There might be other threads waiting to run as soon as the unlock gets through, but the unlock is stalled because incoming lock requests are competing for the same mutex ("mtx"). I have seen many PMP thread stacks where the unlocking thread is stuck on the lock_mutex call.

    lock(mode)
        lock_mutex(mtx)
        wait_until_lock_granted(mode)
        modify_lock_state()
        unlock_mutex(mtx)

    unlock()
        lock_mutex(mtx)
        modify_lock_state()
        notify_some_waiters()
        unlock_mutex(mtx)
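
To make the pattern concrete, below is a minimal C++ sketch of a reader-writer lock built that way. This is illustrative code, not the glibc or MongoDB implementation: both the lock and unlock paths acquire the internal mutex, so even a read-only workload serializes on it.

    // Minimal sketch of the "easy" RW-lock pattern described above.
    // Not production code: no fairness, no upgrade/downgrade, no monitoring.
    #include <condition_variable>
    #include <mutex>

    class SimpleRWLock {
        std::mutex mtx;                 // the internal state mutex ("mtx" above)
        std::condition_variable cond;   // waiters for both modes
        int readers = 0;                // threads holding the lock in read mode
        bool writer = false;            // true when a thread holds write mode

    public:
        void lock_read() {
            std::unique_lock<std::mutex> lk(mtx);     // first mutex acquire per use
            cond.wait(lk, [&] { return !writer; });   // wait_until_lock_granted
            ++readers;                                // modify_lock_state
        }                                             // unlock_mutex when lk goes out of scope

        void unlock_read() {
            std::unique_lock<std::mutex> lk(mtx);     // second mutex acquire per use
            if (--readers == 0)
                cond.notify_all();                    // notify_some_waiters
        }

        void lock_write() {
            std::unique_lock<std::mutex> lk(mtx);
            cond.wait(lk, [&] { return !writer && readers == 0; });
            writer = true;
        }

        void unlock_write() {
            std::unique_lock<std::mutex> lk(mtx);
            writer = false;
            cond.notify_all();                        // may wake many threads that then collide on mtx
        }
    };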

Something better

The alternative that scales better is to use a lock-free approach to get and set internal state in the RW-lock. We did this as part of the Google MySQL patch many years ago and that code was contributed upstream. Such an approach removes much of the contention added by an inefficient RW-lock implementation. It doesn't remove the contention that exists because threads want the lock in read and write modes at the same time; that still requires some threads to wait. When we did the work at Google on the InnoDB RW-lock, Yasufumi Kinoshita was working on a similar change. I am very happy he continues to make InnoDB better.
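
As an illustration of the idea (this is a sketch, not the InnoDB or Google patch code), the read-lock fast path can be a single compare-and-swap on a packed state word, so the common case never touches a mutex; a mutex and condition variable are only needed on the slow path when a writer is present or a waiter must sleep:

    // Sketch of a lock-free fast path for read lock/unlock. The slow path
    // (sleeping and waking waiters) is omitted.
    #include <atomic>
    #include <cstdint>

    class FastPathRWLock {
        static constexpr uint32_t WRITER = 1u << 31;  // writer-present flag
        std::atomic<uint32_t> state{0};               // reader count in the low bits

    public:
        // Returns true when the read lock was taken without using a mutex.
        bool try_lock_read() {
            uint32_t cur = state.load(std::memory_order_relaxed);
            while ((cur & WRITER) == 0) {
                if (state.compare_exchange_weak(cur, cur + 1,
                                                std::memory_order_acquire,
                                                std::memory_order_relaxed))
                    return true;   // no internal mutex was acquired
            }
            return false;          // writer present: the caller falls back to the slow path
        }

        void unlock_read() {
            // A single atomic decrement. A real implementation also has to wake
            // a waiting writer when the reader count reaches zero.
            state.fetch_sub(1, std::memory_order_release);
        }
    };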

A lock-free implementation for a DBMS is likely to be much more complex than what you might read about on the web or a top-tier systems conference paper. There is more complexity because of the need to support performance monitoring, manageability, special semantics and the occasional wrong design decision. For performance monitoring we need to know how frequently a lock is used and how long threads wait on it. For manageability we need to know what threads wait on a lock and which thread holds it. A frequent pattern is for today's special semantics to become tomorrow's design decisions that we regret. But we can't expect perfection given the need to move fast and the rate at which hardware changes.

The low-level reader-writer lock in MongoDB, QLock, is a RW-lock with special semantics. It has two modes each for read and write locks:  r, R, w and W. It also supports upgrades and downgrades: W to R, R to W, w to X and X to w (I didn't mention X above). Internally there are 6 condition variables, one each for r, R, w and W and then two others, U and X, to support upgrades and downgrades. Read the source for more details. I don't understand the code enough to guess whether lock-free state changes can be supported as they were for the InnoDB RW-lock.

MongoDB details

I spent a few hours browsing the source for the MongoDB RW-lock and these are my notes. I hope they help you, otherwise they will be a reference for me in the future. Queries that call find to fetch one row by PK start to run in mongod via the newRunQuery function. That function gets the per-database RW-lock in read mode by creating a Client::ReadContext object on the stack, and the ReadContext constructor acquires the lock:

    /** "read lock, and set my context, all in one operation"
     *  This handles (if not recursively locked) opening an unopened database.
     */
    Client::ReadContext::ReadContext(const string& ns, const std::string& path) {
        {
            lk.reset( new Lock::DBRead(ns) );
            Database *db = dbHolder().get(ns, path);
            if( db ) {
                c.reset( new Context(path, ns, db) );
                return;
            }
        }
        ...

The dbHolder().get() call above locks a mutex in DatabaseHolder while using the database name to find the database object. There is simple string parsing and searching done while the mutex is locked. It might be easy to move some of that work outside the scope of the mutex and perhaps use a mutex per hash table bucket, as in the sketch after the code below.

        Database * get( const string& ns , const string& path ) const {
            SimpleMutex::scoped_lock lk(_m);
            Lock::assertAtLeastReadLocked(ns);
            Paths::const_iterator x = _paths.find( path );
            if ( x == _paths.end() )
                return 0;
            const DBs& m = x->second;
            string db = _todb( ns );
            DBs::const_iterator it = m.find(db);
            if ( it != m.end() )
                return it->second;
            return 0;
        }

        static string __todb( const string& ns ) {
            size_t i = ns.find( '.' );
            if ( i == string::npos ) {
                uassert( 13074 , "db name can't be empty" , ns.size() );
                return ns;
            }
            uassert( 13075 , "db name can't be empty" , i > 0 );
            return ns.substr( 0 , i );

        }
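
A minimal sketch of the mutex-per-bucket idea (my suggestion, not MongoDB code): parse the namespace into a database name before taking any lock, hash the name to pick a shard, and only lock that shard's mutex during the lookup so lookups for different databases don't serialize on one mutex. The class and member names below are hypothetical.

    // Sketch of a sharded database-name -> Database* map where get() only
    // locks one bucket's mutex. Database is represented by void* to keep the
    // sketch self-contained.
    #include <functional>
    #include <map>
    #include <mutex>
    #include <string>

    class ShardedDatabaseHolder {
        static constexpr size_t kShards = 16;
        struct Shard {
            std::mutex m;
            std::map<std::string, void*> dbs;   // database name -> Database*
        };
        Shard shards[kShards];

        Shard& shardFor(const std::string& db) {
            return shards[std::hash<std::string>()(db) % kShards];
        }

    public:
        void* get(const std::string& db) {
            Shard& s = shardFor(db);            // no lock needed to pick the shard
            std::lock_guard<std::mutex> lk(s.m);
            auto it = s.dbs.find(db);
            return it == s.dbs.end() ? nullptr : it->second;
        }

        void put(const std::string& db, void* d) {
            Shard& s = shardFor(db);
            std::lock_guard<std::mutex> lk(s.m);
            s.dbs[db] = d;
        }
    };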

Let's get back to the DBRead constructor that was called in the ReadContext constructor above. It calls lockDB to do the real work. The code below calls other functions that lock mutexes, but no mutex is held by the caller of the code below. In my case the block with "if (DB_LEVEL_LOCKING_ENABLED)" is entered and lockTop gets called to do the real work.

    Lock::DBRead::DBRead( const StringData& ns )
        : ScopedLock( 'r' ), _what(ns.toString()), _nested(false) {
        lockDB( _what );
    }

    void Lock::DBRead::lockDB(const string& ns) {
        fassert( 16254, !ns.empty() );
        LockState& ls = lockState();

        Acquiring a(this,ls);
        _locked_r=false;
        _weLocked=0;

        if ( ls.isRW() )
            return;
        if (DB_LEVEL_LOCKING_ENABLED) {
            StringData db = nsToDatabaseSubstring(ns);
            Nestable nested = n(db);
            if( !nested )
                lockOther(db);
            lockTop(ls);
            if( nested )
                lockNestable(nested);
        }
        else {
            qlk.lock_R();
            _locked_r = true;
        }
    }

Well, lockTop doesn't do the real work during my benchmark. It calls qlk.lock_r to do that.

    void Lock::DBRead::lockTop(LockState& ls) {
        switch( ls.threadState() ) {
        case 'r':
        case 'w':
            break;
        default:
            verify(false);
        case  0  :
            qlk.lock_r();
            _locked_r = true;
        }
    }

Almost there, just one more level of indirection. The call to qlk.lock_r calls the lock_r method on an instance of QLock and then something gets done.

    void lock_r() {
        verify( threadState() == 0 );
        lockState().lockedStart( 'r' );
        q.lock_r();
    }

    inline void QLock::lock_r() {
        boost::mutex::scoped_lock lk(m);
        while( !r_legal() ) {
            r.c.wait(m);
        }
        r.n++;
    }

Eventually the unlock_r method is called for the same instance of QLock. I won't show the path to it, however.

    inline void QLock::unlock_r() {
        boost::mutex::scoped_lock lk(m);
        fassert(16137, r.n > 0);
        --r.n;
        notifyWeUnlocked('r');
    }

And notifyWeUnlocked provides the special semantics. This includes not letting a new reader in when there is a pending write request. The code below also wakes all waiting write requests when one is waiting. This might cause many threads to be scheduled to run even though at most one will get the RW-lock. InnoDB does something similar.

    inline void QLock::notifyWeUnlocked(char me) {
        fassert(16201, W.n == 0);
        if ( me == 'X' ) {
            X.c.notify_all();
        }
        if( U.n ) {
            // U is highest priority
            if( (r.n + w.n + W.n + X.n == 0) && (R.n == 1) ) {
                U.c.notify_one();
                return;
            }
        }
        if ( X_legal() && i_block(me, 'X') ) {
            X.c.notify_one();
        }
        if ( W_legal() && i_block(me, 'W') ) {
            W.c.notify_one();
            if( _areQueueJumpingGlobalWritesPending() )
                return;
        }
        if ( R_legal_ignore_greed() && i_block(me, 'R') ) {
            R.c.notify_all();
        }
        if ( w_legal_ignore_greed() && i_block(me, 'w') ) {
            w.c.notify_all();
        }
        if ( r_legal_ignore_greed() && i_block(me, 'r') ) {
            r.c.notify_all();
        }
    }



Tuesday, April 22, 2014

Concurrent, read-only & cached: MongoDB, TokuMX, MySQL

This has results for a read-only workload where all data is cached. The test query fetches all columns in one document/row by PK. For InnoDB all data is in the buffer pool. For TokuMX and MongoDB all data is in the OS filesystem cache and accessed via mmap'd files. The test server has 40 CPU cores with HT enabled and the test clients share the host with mysqld/mongod to reduce variance from network latency. This was similar to a previous test, except the database is in cache and the test host has more CPU cores. The summary of my results is:
  • MongoDB 2.6 has a performance regression from using more CPU per query. The regression might be limited to simple queries that do single row lookups on the _id index. I spent a bit of time rediscovering how to get hierarchical CPU profile data from gperftools to explain this. JIRAs 13663 and 13685 are open for this.
  • MySQL gets much more QPS at high concurrency than MongoDB and TokuMX
  • MySQL gets more QPS using the HANDLER interface than SELECT. I expect the InnoDB memcached API to be even faster than HANDLER but did not test it.
  • MySQL uses more CPU per query in 5.7.4 than in 5.6.12 but this didn't have an impact on QPS

Setup

The test was repeated for 1, 2, 4, 8, 16, 32 and 40 concurrent clients. It uses my forked versions of the MongoDB and C clients for sysbench. There are 8 collections/tables in one database. Each table has 400M rows but queries are limited to the first 1M. I don't know yet whether using a database per collection would improve the MongoDB results. Each query fetches all columns in one document/row by PK. I have yet to push my changes to the MongoDB sysbench client to make it fetch all columns. I tested these binaries:
  • fb56.handler - MySQL 5.6.12 with the Facebook patch and 8k pages. Uses HANDLER to fetch data.
  • fb56.sql - MySQL 5.6.12 with the Facebook patch and 8k pages. Uses SELECT to fetch data.
  • orig57.handler - MySQL 5.7.4 without the Facebook patch and 8k pages. Uses HANDLER to fetch data.
  • orig57.sql - MySQL 5.7.4 without the Facebook patch and 8k pages. Uses SELECT to fetch data.
  • tokumx - TokuMX 1.4.1 using quicklz and 32kb pages. There should be no decompression during the test as all data used by the test (1M documents) is much smaller than 50% of RAM.
  • mongo24 - MongoDB 2.4.9
  • mongo26 - MongoDB 2.6.0

Results

At last I included a graph. I have been reluctant to include graphs on previous posts comparing MongoDB, TokuMX and MySQL because I want to avoid benchmarketing and drive-by analysis. These tests have been time consuming to run and document and I don't want to make it too easy to misinterpret the results. Results for MySQL 5.7.4 are not in the graph to make it easier to read. The top two bars (blue & red) are for MySQL and you can see that QPS increases with more concurrency. QPS for MongoDB and TokuMX saturates at a lower level of concurrency.
Numbers used for the graph above.

point queries per second
    1      2      4      8     16     32     40  clients
17864  32397  60294 106374 184566 298276 350665  fb56.handler
11730  22884  39646  73485 131533 215487 249402  fb56.sql
18161  33262  59413 107505 185894 306084 371045  orig57.handler
11775  21838  40528  75322 135331 227450 266917  orig57.sql
14298  25219  45743  83214 142489 168498 161840  tokumx
17203  30158  52476  94705 161922 174453 170177  mongo24
10705  19502  34318  61977 109684 152667 151555  mongo26

Analysis

I used vmstat to measure the average CPU utilization (user + system) during the test. The numbers below are: (CPU_utilization / QPS) * 1,000,000. There are some interesting details.
  • the values are larger for MySQL 5.7 than for 5.6 at low concurrency. Note that in both cases the performance schema was disabled at compile time.
  • the values are much larger for MongoDB 2.6 than for 2.4 and hopefully this can be fixed via JIRAs 13663 and 13685.

(CPU_utilization / QPS) * 1,000,000
  1      2      4      8     16     32     40  clients
218    197    197    208    216    251    268  fb56.handler
323    310    287    298    304    352    372  fb56.sql
357    279    240    216    215    248    250  orig57.handler
407    380    313    288    302    342    359  orig57.sql
272    269    251    254    266    302    296  tokumx
232    215    219    225    234    257    252  mongo24
373    333    340    342    355    425    422  mongo26

I also used vmstat to measure the context switch rate and the table below lists the number of context switches per query. Note that the rate decreases with concurrency for MySQL but not for MongoDB and TokuMX. I don't know enough about Linux internals to interpret this.

context switches per query (vmstat.cs / QPS)
     1      2      4      8     16     32     40  clients
  4.44   4.14   4.01   3.79   3.47   3.05   2.19  fb56.handler
  4.61   4.32   4.03   3.84   3.59   3.23   2.65  fb56.sql
  4.53   4.27   4.07   3.88   3.52   3.08   2.20  orig57.handler
  4.81   4.48   4.19   3.96   3.63   3.07   2.19  orig57.sql
  4.59   4.30   4.08   3.87   3.77   4.32   4.32  tokumx
  4.54   4.23   4.03   3.84   3.79   4.29   4.30  mongo24
  4.80   4.43   4.21   3.99   3.93   4.58   4.63  mongo26

Hierarchical CPU profiling for MongoDB

One day it will be easy to get hierarchical CPU profile results for open-source databases using open-source profiling tools. Support for CPU profiling via Google perftools can be compiled into MongoDB via the --use-cpu-profiler option. Given the use of a compiler toolchain in a nonstandard location I also used --extrapath and --extralib to help it find libunwind. However, the profiler output file was mostly empty after doing this and did not have any per-thread results.

For future reference, profiling was started and stopped via:
echo "db.runCommand({ _cpuProfilerStart: { profileFilename: '/path/to/output' } })" | bin/mongo admin
echo "db.runCommand({ _cpuProfilerStop:1})" | bin/mongo admin
Eventually I remembered that it might help to add a call to ProfilerRegisterThread() at the start of each new thread. I did something similar in the Google patch for MySQL many years ago. And now I have hierarchical CPU profiles to help understand the source of performance regressions in MongoDB 2.6. I updated JIRA 6628 with details from this and then was asked to create JIRA 13683. The change is to handleIncomingMsg():
        static void* handleIncomingMsg(void* arg) {
            TicketHolderReleaser connTicketReleaser( &Listener::globalTicketHolder );
            ::ProfilerRegisterThread(); // Add this

Friday, April 18, 2014

Biebermarks

Yet another microbenchmark result. This one is based on behavior that has caused problems in the past for a variety of reasons, which led to a few interesting discoveries. The first was that using a short lock-wait timeout was better than the InnoDB deadlock detection code. The second was that stored procedures could be used to overcome network latency.

The workload is a large database where all updates are done to a small number of rows. I think it is important to use a large database to include the overhead from searching multiple levels of a b-tree. The inspiration for this is maintaining counts for popular entities like Justin Bieber and One Direction. This comes from serving the social graph. For more on that read about TAO and LinkBench.

The most popular benchmark for MySQL is sysbench and it is usually run with a uniform distribution so that all rows are equally likely to be queried or modified. But real workloads have skew which can cause stress in unexpected places and I describe one such place within InnoDB from this microbenchmark. YCSB and LinkBench are benchmarks that have skew and can be run for MySQL. I hope that more of the MySQL benchmark results in the future include skew.

Configuration

See a previous post for more details. Eight collections/tables with 400M documents/rows per collection/table were created. All collections/tables are in one database so MongoDB suffers from the per-database RW-lock. But MySQL and TokuMX also suffer from a similar issue when all clients are trying to update the same row. Tests were run for 1, 2, 4 and 8 tables where one row per table was updated. So when the test used 4 tables there were 4 rows getting updates. For each number of tables tests were run for up to 64 concurrent clients/threads. The result tables listed in the next section should make that clear.

The workload is updating the non-indexed column of one document/row by PK per transaction. There are no secondary indexes on the table. In this case the document/row with ID=1 is chosen for every update. For MySQL and TokuMX an auto-commit transaction is used. The journal (redo log) is used but the update does not wait for the journal/log to be forced to disk. The updates should not require disk reads as all relevant index and data blocks remain in cache. TokuMX might do reads in the background to maintain fractal trees but I don't understand their algorithm to be certain.
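
For reference, the sketch below shows roughly what each MySQL transaction does: an auto-commit UPDATE of the non-indexed column for the row with id=1, issued via the MySQL C API. The table and column names follow the usual sysbench conventions (sbtest1, id, c) and the connection parameters are placeholders; this is not the actual benchmark client code.

    /* Sketch of the per-transaction work for the MySQL test: an auto-commit
       UPDATE of a non-indexed column for the row with id=1. Table and column
       names are assumptions based on the usual sysbench schema. */
    #include <mysql.h>
    #include <stdio.h>

    int main(void) {
        MYSQL *conn = mysql_init(NULL);
        if (!mysql_real_connect(conn, "127.0.0.1", "user", "password", "test",
                                3306, NULL, 0)) {
            fprintf(stderr, "connect failed: %s\n", mysql_error(conn));
            return 1;
        }
        /* autocommit is on by default, so each UPDATE is its own transaction */
        for (int i = 0; i < 1000; i++) {
            if (mysql_query(conn, "UPDATE sbtest1 SET c = 'x' WHERE id = 1")) {
                fprintf(stderr, "update failed: %s\n", mysql_error(conn));
                break;
            }
        }
        mysql_close(conn);
        return 0;
    }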

The database was loaded in PK order and about 8 hours of concurrent & random updates were done to warmup the database prior to this test. The warmup was the same workload as described in a previous post.

The MySQL test client limits clients to one table. So when there are 64 clients and 8 tables then there are 8 clients updating the 1 row per table. The MongoDB/TokuMX client does not do that. It lets all clients update all tables so in this case there are at most 64 clients updating the row per table and on average there would be 8.

The test server has 40 CPU cores with HT enabled, fast flash storage and 144G of RAM. The benchmark client and database servers shared the host. Tests were run for several configurations:
  • mongo26 - MongoDB 2.6.0rc2, powerOf2Sizes=1, journalCommitInterval=300, w:1,j:0
  • mongo24 - MongoDB 2.4.9, powerOf2Sizes=0, journalCommitInterval=300, w:1,j:0
  • mysql - MySQL 5.6.12, InnoDB, no compression, flush_log_at_trx_commit=2, buffer_pool_size=120G, flush_method=O_DIRECT, page_size=8k, doublewrite=0, io_capacity=16000, lru_scan_depth=2000, buffer_pool_instances=8, write_io_threads=32, flush_neighbors=0
  • toku-32 - TokuMX 1.4.1, readPageSize=32k, quicklz compression, logFlushPeriod=300, w:1,j:0. I don't have results for toku-32 yet.
  • toku-64 - TokuMX 1.4.1, readPageSize=64k, quicklz compression, logFlushPeriod=300, w:1,j:0

Results per DBMS

I first list the results by DBMS to show the impact from spreading the workload over more rows/tables. The numbers below are the updates per second rate. I use "DOP=X" to indicate the number of concurrent clients and "DOP" stands for Degree Of Parallelism (it is an Oracle thing). A few conclusions from the results below:

  • MySQL/InnoDB does much better with more tables for two reasons. The first is that it allows for more concurrency. The second is that it avoids some of the overhead in the code that maintains row locks and threads waiting for row locks. I describe that in more detail at the end of this post.
  • MongoDB 2.4.9 is slightly faster than 2.6.0rc2. I think the problem is that mongod requires more CPU per update in 2.6 versus 2.4 and this looks like a performance regression in 2.6 (at least in 2.6.0rc2). I am still profiling to figure out where. More details on this are at the end of the post. I filed JIRA 13663 for this.
  • MongoDB doesn't benefit from spreading the load over more collections when all collections are in the same database. This is expected given the per-database RW-lock.

Updates per second
config  #tables  DOP=1  DOP=2  DOP=4  DOP=8  DOP=16  DOP=32  DOP=64
mysql         1   8360  15992  30182  24932   23924   23191   21048
mysql         2      X  16527  30824  49999   41045   40506   38357
mysql         4      X      X  32351  51791   67423   62116   59137
mysql         8      X      X      X  54826   80409   73782   68128

config  #tables  DOP=1  DOP=2  DOP=4  DOP=8  DOP=16  DOP=32  DOP=64
mongo24       1  10212  17844  30204  34003   33895   33564   33451
mongo24       2      X  10256  17698  30547   34125   33717   33573
mongo24       4      X      X  10670  17690   30903   34027   33586
mongo24       8      X      X      X  10379   17702   30920   33758

config  #tables  DOP=1  DOP=2  DOP=4  DOP=8  DOP=16  DOP=32  DOP=64
mongo26       1   9187  16131  27648  28506   27784   27437   27021
mongo26       2      X   9367  16035  27490   28326   27746   27354
mongo26       4      X      X   9179  16028   27666   28330   27647
mongo26       8      X      X      X   9125   16038   27275   27858

config  #tables  DOP=1  DOP=2  DOP=4  DOP=8  DOP=16  DOP=32  DOP=64
toku-64       1   7327  12804  16179  12154   11021    9990    8344
toku-64       2      X   7173  12690  20483   23064   22354   20349
toku-64       4      X      X   7191  12943   21399   33485   40124
toku-64       8      X      X      X   7121   12727   22096   38207

Results per number of tables

This reorders the results from above to show them for all configurations at the same number of tables. You are welcome to draw conclusions about which is faster.

Updates per second
config  #tables  DOP=1  DOP=2  DOP=4  DOP=8  DOP=16  DOP=32  DOP=64
mysql         1   8360  15992  30182  24932   23924   23191   21048
mongo24       1  10212  17844  30204  34003   33895   33564   33451
mongo26       1   9187  16131  27648  28506   27784   27437   27021
toku-64       1   7327  12804  16179  12154   11021    9990    8344

config  #tables  DOP=1  DOP=2  DOP=4  DOP=8  DOP=16  DOP=32  DOP=64
mysql         2      X  16527  30824  49999   41045   40506   38357
mongo24       2      X  10256  17698  30547   34125   33717   33573
mongo26       2      X   9367  16035  27490   28326   27746   27354
toku-64       2      X   7173  12690  20483   23064   22354   20349

config  #tables  DOP=1  DOP=2  DOP=4  DOP=8  DOP=16  DOP=32  DOP=64
mysql         4      X      X  32351  51791   67423   62116   59137
mongo24       4      X      X  10670  17690   30903   34027   33586
mongo26       4      X      X   9179  16028   27666   28330   27647
toku-64       4      X      X   7191  12943   21399   33485   40124

config  #tables  DOP=1  DOP=2  DOP=4  DOP=8  DOP=16  DOP=32  DOP=64
mysql         8      X      X      X  54826   80409   73782   68128
mongo24       8      X      X      X  10379   17702   30920   33758
mongo26       8      X      X      X   9125   16038   27275   27858
toku-64       8      X      X      X   7121   12727   22096   38207

Row locks for InnoDB

I used PMP to understand MySQL/InnoDB on this workload. I frequently saw all user threads blocked on a condition variable with this stack trace. It seems odd that all threads are sleeping. I think the problem is that one thread can run but has yet to be scheduled by Linux. My memory of the row lock code is that it wakes threads in FIFO order and when N threads wait for a lock on the same row then each thread waits on a separate condition variable. I am not sure if this code has been improved in MySQL 5.7. A quick reading of some of the 5.6.12 row lock code showed many mutex operations. Problems in this code have escaped scrutiny in the past because much of our public benchmark activity has used workloads with uniform distributions.
pthread_cond_wait@@GLIBC_2.3.2,os_cond_wait,os_event_wait_low2,lock_wait_suspend_thread,row_mysql_handle_errors,row_search_for_mysql,ha_innobase::index_read,handler::read_range_first,handler::multi_range_read_next,QUICK_RANGE_SELECT::get_next,rr_quick,mysql_update,mysql_execute_command,mysql_parse,dispatch_command,do_command,do_handle_one_connection,handle_one_connection
This was a less frequent stack trace from the test ...
lock_get_mode,lock_table_other_has_incompatible,lock_table,row_search_for_mysql,ha_innobase::index_read,handler::read_range_first,handler::multi_range_read_next,QUICK_RANGE_SELECT::get_next,rr_quick,mysql_update,mysql_execute_command,mysql_parse,dispatch_command,do_command,do_handle_one_connection,handle_one_connection

Row locks for TokuMX

TokuMX has a similar point at which all threads wait. It isn't a big surprise given that both provide fine-grained concurrency control but there is no granularity finer than a row lock.
pthread_cond_timedwait@@GLIBC_2.3.2,toku_cond_timedwait,toku::lock_request::wait,toku_db_wait_range_lock,toku_c_getf_set(__toku_dbc*,,db_getf_set,autotxn_db_getf_set(__toku_db*,,mongo::CollectionBase::findByPK(mongo::BSONObj,mongo::queryByPKHack(mongo::Collection*,,mongo::updateObjects(char,mongo::lockedReceivedUpdate(char,mongo::receivedUpdate(mongo::Message&,,mongo::assembleResponse(mongo::Message&,,mongo::MyMessageHandler::process(mongo::Message&,,mongo::PortMessageServer::handleIncomingMsg(void*)

MongoDB 2.4 versus 2.6

I get about 1.2X more updates/second with MongoDB 2.4.9 compared to 2.6.0rc2. I think the problem is that 2.6 uses more CPU per update. I filed JIRA 13663 for this but am still trying to profile the code. So far I know the following, all of which indicates that the 2.4.9 test is running 1.2X faster than 2.6.0rc2 with 32 client threads and 1 table:
  • I get ~1.2X more updates/second with 2.4.9
  • the Java sysbench client uses ~1.2X more CPU per "top" with 2.4.9
  • the context switch rate is ~1.2X higher with 2.4.9
The interesting point is that mongod for 2.4.9 only uses ~1.03X more CPU than 2.6.0rc2 per "top" during this test even though it is doing 1.2X more updates/second. So 2.6.0rc2 uses more CPU per update. I will look at "perf" output. I can repeat this with the GA version of 2.6.

Wednesday, April 16, 2014

TokuMX, MongoDB and InnoDB, IO-bound update-only with fast storage

I repeated the update-only IO-bound tests using pure-flash servers to compare TokuMX, MongoDB and InnoDB. The test setup was the same as on the pure-disk servers except for the hardware. In this case the servers have fast flash storage, 144G of RAM and 24 CPU cores with HT enabled. As a reminder, the InnoDB change buffer and TokuMX fractal tree don't help on this workload because there are no secondary indexes to maintain. Note that all collections/tables are in one database for this workload thus showing the worst-case for the MongoDB per-database RW-lock. The result summary:
  • InnoDB is much faster than MongoDB and TokuMX. This test requires a high rate of dirty page writeback and thanks to a lot of work from the InnoDB team at MySQL with help from Percona and Facebook (and others) the InnoDB engine is now very good at that. Relative to MongoDB, InnoDB also benefits from a clustered PK index.
  • MongoDB is much slower than InnoDB for two reasons. First it doesn't have a clustered PK index so it might do storage reads for both the index search and then while reading the document. The second reason is the per-database RW-lock. As I described previously this lock appears to be held during disk reads when the index is searched so at most one thread searches the index at a time even though there are concurrent update requests. I created JIRA 3177 to make that obvious in the documentation. Because of this the peak rate for MongoDB is approximately the number of reads per second that one thread can do from the flash device. The device can sustain many more reads/second with concurrency but MongoDB doesn't get much benefit from it. I think there will be at most 2 concurrent flash/disk reads at any time -- one while searching the index and the other while prefetching the document into RAM after releasing the per-database RW-lock in Record::touch.
  • TokuMX also benefits from the clustered PK index but it suffers from other problems that I was unable to debug. I think it can do much better once a Toku expert reproduces the problem on their hardware.

Configuration

This test used the sysbench clients as described previously. Tests were run for 8, 16, 32 and 64 concurrent clients. There were 8 collections/tables in one database with 400M documents/rows per collection/table. The test server has fast flash storage that can do more than 5000 reads/second from one thread and more than 50,000 reads/second from many threads.  The server also has 24 CPU cores with HT enabled and 144G of RAM. The sysbench clients ran on the same host as mysqld/mongod. Tests were first run for 30 minutes at each concurrency level to warmup the DBMS and then for either 60 or 120 minutes when measurements were taken. I tested the configurations listed below. I ran tests for more configurations but forgot to adjust read_ahead_kb so I won't publish results from those hosts.
  • mongo-p2y - 874 GB database, MongoDB 2.6.0rc2, powerOf2Sizes=1, journalCommitInterval=300, w:1,j:0
  • mysql-4k - 698 GB database, MySQL 5.6.12, InnoDB, no compression, flush_log_at_trx_commit=2, buffer_pool_size=120G, flush_method=O_DIRECT, page_size=4k, doublewrite=0, io_capacity=16000, lru_scan_depth=2000, buffer_pool_instances=8, write_io_threads=32, flush_neighbors=0
  • mysql-8k - 698 GB database, MySQL 5.6.12, InnoDB, no compression, flush_log_at_trx_commit=2, buffer_pool_size=120G, flush_method=O_DIRECT, page_size=8k, doublewrite=0, io_capacity=16000, lru_scan_depth=2000, buffer_pool_instances=8, write_io_threads=32, flush_neighbors=0
  • tokumx-quicklz - 513 GB database, TokuMX 1.4.1 with quicklz compression, logFlushPeriod=300, w:1,j:0

Results

That probably isn't a typo below. InnoDB sustained about 5 to 10 times more updates/second. MongoDB does many more disk reads per update which is similar to the pure-disk results. I don't have the expertise to explain why TokuMX results weren't better but I shared information with the Tokutek team. Bytes written to storage per update is listed for InnoDB to show the impact on the write rate from using a smaller page. That can be important when flash endurance must be improved.

TPS
configuration  8 clients  16 clients  32 clients  64 clients
mysql-8k           24834       33886       37573       40198
mysql-4k           24826       31704       34644       34987
tokumx-quicklz      3706        3950        3552        3357
mongo-p2y           5194        5167        5173        5102

disk reads/second from iostat r/s
configuration  8 clients  16 clients  32 clients  64 clients
mysql-8k           20995       28371       31397       33537
mysql-4k           22016       27985       30553       30972
tokumx-quicklz      4943        5641        4962        4783
mongo-p2y           8960        8921        8951        8859

disk reads per update
configuration  8 clients  16 clients  32 clients  64 clients
mysql-8k            0.85        0.84        0.84        0.83
mysql-4k            0.89        0.88        0.88        0.89
tokumx-quicklz      1.33        1.43        1.40        1.42
mongo-p2y           1.73        1.73        1.73        1.74 

KB written per update
configuration  8 clients  16 clients  32 clients  64 clients
mysql-8k            6.56        6.40        5.36        5.36
mysql-4k            3.86        3.72        3.76        3.78

Types of writes

What does it mean to make writes fast? It helps to distinguish between the different types of writes. The slowest is a write that must be implemented as read-modify-write. This might require a disk read and can also create contention from preventing concurrent changes to the row for the duration of the read, modify and write. The row might not be unlocked until the change is made durable on storage (commit, fsync, etc) which lets you estimate the peak rate at which a single row can be changed on a traditional DBMS. And this latency between changes can get even worse when there is sync replication or multiple client-server round trips per transaction. The UPDATE statement in SQL is usually implemented as read-modify-write. Some DBMS engines require locking to be done above the DBMS because they don't support locking across operations where read and write are separate operations (RocksDB is an example). Other DBMS engines compensate for that with a conditional put that performs the write only when part of the row has not changed, like checkAndPut in HBase. But if the client does a read prior to the write then the overhead from the read still exists.

Some UPDATE statements perform changes that are commutative and it is possible to avoid the read prior to the write. That optimization is rarely implemented but it is possible in RocksDB (via the merge operator), TokuDB and Redis. Increment and decrement are examples of commutative operations. But this also requires upsert behavior to handle the case where the row to be changed doesn't exist. If the read is to be skipped it also means that a result cannot be returned -- that is, the count of rows changed or the old/new value of the changed column.

A blind-write is the easiest write to make fast. This can be done via a Put call with a key-value store like LevelDB or RocksDB. Some SQL operations can also use a blind-write if there is an option to not return the count of changed rows when the statement is executed and the operation behaves like an upsert. This is rarely implemented but TokuDB might help people appreciate it.

So there are at least 3 types of writes and from slowest to fastest they are read-modify-write, commutative-write and blind-write. Hopefully these optimizations will become more common in new database engines. For the write operation itself there shouldn't be a performance difference between commutative-write and blind-write. But the query latency for a row subject to commutative-write can be much worse than for blind-write because many old updates might have to be read and merged.
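
The difference is easy to see with a toy in-memory counter store (a sketch, not tied to any real engine): a read-modify-write needs the old value before it can write and can return a result, a commutative write just records a delta to be merged by a later read, and a blind write overwrites whatever is there.

    // Toy in-memory counter store illustrating the three write types.
    // Not a real engine: no durability, no locking, no compaction.
    #include <cstdint>
    #include <string>
    #include <unordered_map>
    #include <vector>

    class CounterStore {
        std::unordered_map<std::string, int64_t> base;                 // last merged value
        std::unordered_map<std::string, std::vector<int64_t>> deltas;  // pending merges

    public:
        // read-modify-write: must fetch the old value (possibly a disk read in a
        // real engine) before writing, and can return the new value to the caller.
        int64_t increment_rmw(const std::string& key, int64_t delta) {
            int64_t v = read(key) + delta;   // the read we would like to avoid
            deltas.erase(key);               // the merged value becomes the new base
            base[key] = v;
            return v;
        }

        // commutative write: no read, just record the delta; the cost moves to a
        // later read which must merge all pending deltas.
        void increment_commutative(const std::string& key, int64_t delta) {
            deltas[key].push_back(delta);
        }

        // blind write: no read and no result, simply overwrite (upsert) the value.
        void put_blind(const std::string& key, int64_t value) {
            deltas.erase(key);
            base[key] = value;
        }

        int64_t read(const std::string& key) {
            int64_t v = base.count(key) ? base.at(key) : 0;
            auto it = deltas.find(key);
            if (it != deltas.end())
                for (int64_t d : it->second) v += d;   // merge cost paid by readers
            return v;
        }
    };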

More details on disk IO-bound, update only for MongoDB, TokuMX and InnoDB

This has a few more details on the results for update-only sysbench using a disk IO-bound workload. I describe the impact from changing innodb_flush_neighbors. The parameter can be set to write back some dirty pages early when other pages in the same extent must be written back. The goal is to reduce the number of disk seeks consumed by page writeback and this can help on disk based servers.

Judging from both the TPS results and the amount of data written to disk per update, there might be a small impact from changing innodb_flush_neighbors on this workload. In a previous blog post I explained the impact of this parameter on the insert benchmark for pure-disk servers. The benefit there was much greater than here. I think there are fewer dirty pages per extent in this workload because the database is much larger than RAM, so the feature is less likely to be used.

The test configuration is described in a previous post. The only difference here is that I repeated the test for innodb_flush_neighbors set to 0, 1 and 2.

TPS
configuration      8 clients  16 clients  32 clients  64 clients
flush_neighbors=0        498         677         926         969
flush_neighbors=1        506         692         848        1004
flush_neighbors=2        543         737         913        1043

KB/update written to disk
configuration      8 clients  16 clients  32 clients  64 clients
flush_neighbors=0       7.95        6.75        5.88        6.55
flush_neighbors=1       7.91        6.71        6.53        6.50
flush_neighbors=2       7.78        6.68        6.68        6.74




Tuesday, April 15, 2014

MongoDB, TokuMX and InnoDB for disk IO-bound, update-only by PK

I used sysbench to measure TPS for a workload that does 1 update by primary key per transaction. The database was much larger than RAM and the server has a SAS disk array that can do at least 2000 IOPs with a lot of concurrency. The update is to a non-indexed column so there is no secondary index maintenance which also means there is no benefit from a fractal tree in TokuMX or the change buffer in InnoDB. I also modified the benchmark client to avoid creating a secondary index. Despite that TokuMX gets almost 2X more TPS than InnoDB and InnoDB gets 3X to 5X more TPS than MongoDB.
  • TokuMX is faster because it doesn't use (or waste) random IOPs on writes so more IO capacity is available for reads. In this workload an update is a read-modify-write operation where the read is likely to require a disk read.
  • MongoDB is slower for two reasons. The first reason is the per-database RW-lock and the result doesn't get better with more concurrent clients. For this test all collections were in one database. The lock is held while the b-tree index for the PK is searched to find the document to update. Disk reads might be done while the lock is held. The second reason is that it does twice the number of disk reads per update while InnoDB & TokuMX do about 1 per update. Part of the difference is that InnoDB and TokuMX have clustered PK indexes but the results are much worse than I expected for MongoDB. I wonder if caching of index leaf blocks is not as effective as I expect or if I am wrong to expect this. Maybe this is one of the problems of depending on the OS virtual memory manager to cache the right data.

Yield on page fault

The TPS results for MongoDB are limited by disk read latency. Even though there is a disk array that can do a few thousand random reads per second, the array sustains about 150 reads/second when there is a single stream of IO requests. And the per-database RW-lock guarantees that is the case. So MongoDB won't get more than 1 / disk-read-latency updates per second for this test regardless of the number of disks in the array or number of concurrent clients.

MongoDB documentation mentions that the per-database RW-lock can be yielded on page faults but the documentation wasn't specific enough for me. I think this is what you need to know and I hope MongoDB experts correct any mistakes.
  1. Yield is only done for access to documents. It is not done while accessing primary or secondary indexes. To see in the code where a yield might be done search for calls to Record::_accessing() which throws PageFaultException. The record might also be "prefetched" after releasing the per-database RW-lock via a call to Record::touch().
  2. Yield is done on predicted page faults, not on actual page faults. AFAIK, a signal handler for SIGSEGV could be used to do this for actual page faults and MongoDB creates a handler for SIGSEGV but only to print a stack trace before exiting. MongoDB has something like an LRU to track memory references and predict page faults. I haven't spent much time trying to figure out that code but have seen those functions use a lot of CPU time for some benchmarks. I am curious why the btree code uses that tracking code (it calls likelyInPhysicalMemory). To learn more about the page fault prediction code read the functions Record::likelyInPhysicalMemory and Record::_accessing and the classes PageFaultException and Rolling.
From reading the above you should assume that you really want all indexes to be cached in RAM. Alas, that can be hard to do for big databases. For this test my server has 72G of RAM and the PK indexes are 83G, so I know that all of the indexes won't be cached.
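
The yield pattern itself is simple and is sketched below. This is illustrative C++, not MongoDB's code: likely_in_memory() stands in for Record::likelyInPhysicalMemory, touch() for Record::touch and PageFaultNeeded for PageFaultException.

    // Sketch of yield-on-predicted-page-fault. All names here are hypothetical.
    #include <shared_mutex>

    struct PageFaultNeeded {};

    struct Record {
        bool resident = false;
        bool likely_in_memory() const { return resident; }
        void touch() { resident = true; }   // fault the page in, no DB lock held
        void update() {}                    // the real work
    };

    void update_with_yield(std::shared_timed_mutex& db_lock, Record& rec) {
        for (;;) {
            try {
                std::shared_lock<std::shared_timed_mutex> lk(db_lock);  // read mode
                if (!rec.likely_in_memory())
                    throw PageFaultNeeded();  // predicted fault: don't block under the lock
                rec.update();                 // do the work while the page is resident
                return;
            } catch (PageFaultNeeded&) {
                // lk was released during stack unwinding, so the prefetch below
                // does not hold the per-database lock; then retry from the top.
                rec.touch();
            }
        }
    }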

I tried to overcome disk read stalls during index searching by changing the Java sysbench client to manually prefetch the to-be-updated document by calling findOne prior to the update. That improved TPS by about 20%. I hoped for more but the prefetch attempt needs a read-lock and pending write-lock requests on the per-database RW-lock appear to block new read-lock requests. I think this is done to prevent write-lock requests from getting starved. My attempt is not a workaround.

Configuration

This test used the sysbench clients as described previously. Tests were run for 8, 16, 32 and 64 concurrent clients. There were 8 collections/tables in one database with 400M documents/rows per collection/table. The test server has a SAS disk array that can do more than 2000 IOPs with many concurrent requests, 16 CPU cores with HT enabled and 72G of RAM. The sysbench clients ran on the same host as mysqld/mongod. Tests were first run for 30 minutes at each concurrency level to warmup the DBMS and then for either 60 or 120 minutes when measurements were taken. I tested these configurations:
  • mongo-p2y - 874 GB database, MongoDB 2.6.0rc2, powerOf2Sizes=1, journalCommitInterval=300, w:1,j:0
  • mongo-p2n - 828 GB database, MongoDB 2.6.0rc2, powerOf2Sizes=0, journalCommitInterval=300, w:1,j:0
  • mysql - 698 GB database, MySQL 5.6.12, InnoDB, no compression, flush_log_at_trx_commit=2, buffer_pool_size=60G, flush_method=O_DIRECT, page_size=8k, doublewrite=0, io_capacity=3000, lru_scan_depth=500, buffer_pool_instances=8, write_io_threads=32, flush_neighbors=2
  • mysql-zlib - 349 GB database, MySQL 5.6.12, InnoDB 2X compression (key_block_size=4) via zlib, flush_log_at_trx_commit=2, buffer_pool_size=60G, flush_method=O_DIRECT, page_size=8k, doublewrite=0, io_capacity=3000, lru_scan_depth=500, buffer_pool_instances=8, write_io_threads=32, flush_neighbors=1
  • tokumx-quicklz - 513 GB database, TokuMX 1.4.1 with quicklz compression, logFlushPeriod=300, w:1,j:0
  • tokumx-zlib - 385 GB database, TokuMX 1.4.1 with zlib compression, logFlushPeriod=300, w:1,j:0

Results

MongoDB does twice the number of disk reads per update compared to TokuMX and InnoDB. MongoDB TPS does not increase with concurrency. TPS does increase with concurrency for InnoDB and TokuMX which benefit from having many more concurrent pending disk reads. TokuMX does better than InnoDB because it doesn't use random IOPs for database page writes so there is more capacity remaining for reads.

TPS
configuration  8 clients  16 clients  32 clients  64 clients
tokumx-zlib          888        1267        1647        2034
tokumx-quicklz       870        1224        1567        1915
mysql-zlib           562         809         983        1140
mysql                543         737         913        1043
mongo-p2y            168         168         169         169
mongo-p2n            168         169         168         169

iostat r/s
configuration  8 clients  16 clients  32 clients  64 clients
tokumx-zlib          924        1279        1650        2032
tokumx-quicklz       891        1243        1600        1948
mysql-zlib           520         727         862         966
mysql                512         695         855         970
mongo-p2y            337         340         342         344
mongo-p2n            343         347         350         350

disk reads per update
configuration  8 clients  16 clients  32 clients  64 clients
tokumx-zlib         1.04        1.01        1.00        1.00
tokumx-quicklz      1.02        1.02        1.02        1.02
mysql-zlib          0.93        0.90        0.88        0.85
mysql               0.94        0.94        0.94        0.93
mongo-p2y           2.01        2.02        2.02        2.04
mongo-p2n           2.04        2.05        2.08        2.07

Sunday, April 13, 2014

Why aren't you using X, version 2

Sometimes I get asked why I am not using product X where X is anything but MySQL. The products that are suggested change over time and the value of X very much depends on the person asking the question. An ex-manager from my days at Oracle told me that Oracle would be better and developers from the SQL Server team told me the same. For those keeping score there was a social network that ran SQL Server and they were kind enough to explain why.

Too often this is an assertion rather than a question and it would be more clear to say "I think you should be using X". A better question would be "Why are you using MySQL". This is the burden we carry for running MySQL at scale, but I am not in search of sympathy. There are several possible answers.

Getting better with age

This might explain existing deployments:
  1. It was there when we arrived
  2. We made it better (along with Oracle, Percona, MariaDB, etc)

It is pretty good

With the quality of the 5.6 release and features likely to appear in 5.7, I expect MySQL to get many new deployments. This isn't just a legacy thing. MySQL performance is excellent for IO-bound workloads and almost excellent for in-memory. Manageability is about to get much better when GTID, parallel apply and enhanced semi-sync are deployed (or deployable). It isn't perfect. We need more features, PS usability is a work in progress and all of the replication goodness might not be there for most users until 5.7 is GA. 

Scalability

Sometimes I am told that something else scales better, but scalability is rarely defined. Context is very important here. If your deployment has a few servers then you want to minimize management overhead as the cost for people is larger than for hardware. But things change when a small team runs a huge number of servers and for that it is very important to minimize hardware cost by using a DBMS that is efficient for some combination of high-QPS, IO-bound, read-heavy or write-heavy workloads. Note that a small team running a huge deployment is an existence proof that one or both of these are true -- the team is awesome, the product is manageable.

Leaving out quality of service, a simple definition for scalability is that a given workload requires A people, B hardware units and C lines of automation code. For something to scale better than MySQL it should reduce some of A, B and C. For many web-scale deployments the cost of C has mostly been paid and migrating to something new means a large cost for C. Note that B represents many potential bottlenecks. The value of B might be large to get more IOPs for IO-bound workloads with databases that are much bigger than RAM. It might be large to get more RAM to keep everything cached. Unfortunately, some deployments are not going to fully describe that context (some things are secret). The value of A is influenced by the features in C and the manageability features in the DBMS but most web-scale companies don't disclose the values of B and A.

Thursday, April 10, 2014

MongoDB, TokuMX and InnoDB for concurrent inserts

I used the insert benchmark with concurrent insert threads to understand performance limits in MongoDB, TokuMX and InnoDB. The database started empty and eventually was much larger than RAM. The benchmark requires many random writes for secondary index maintenance for an update-in-place b-tree used by MongoDB and InnoDB. The test server has fast flash storage. The work per transaction for this test is inserting 1000 documents/rows where each document/row is small (100 bytes) and has 3 secondary indexes to maintain. The test used 10 client connections to run these transactions concurrently and each client uses a separate collection/table. The performance summaries listed below are based on the context for this test -- fast storage, insert heavy with secondary index maintenance. My conclusion from running many insert benchmark tests is that I don't want to load big databases with MongoDB when secondary index maintenance must be done during the load. Sometimes creating the indexes after the load is an option but performance in this area must be improved.

The performance summary for the workload when the database is cached (smaller than RAM).
  • InnoDB and TokuMX are always much faster than MongoDB except when database-per-collection is used and the MongoDB journal is disabled. I did not attempt to run InnoDB or TokuMX with the journal disabled or even on tmpfs so I am not comparing the same thing in that case. Reasons for better performance from InnoDB include the insert buffer, less bloat in the database, more internal concurrency and a more mature b-tree (sometimes older is better). Reasons for better performance from TokuMX include fractal trees, compression and more internal concurrency. AFAIK, the MongoDB write lock is not released when there is a disk read (page fault) during secondary index maintenance. Even when there aren't faults at most one client searches the b-tree indexes at a time.
  • InnoDB is faster than TokuMX
  • MongoDB inserts/second are doubled by using a database-per-collection compared to one database for all collections
  • MongoDB inserts/second are doubled by disabling the journal
  • MongoDB performance doesn't change much between using j:1 (fsync-on-commit) and w:1,j:0 (fsync a few times per second). 
The performance summary for the workload when the database is much larger than RAM. 
  • Eventually TokuMX is much faster than InnoDB. This is expected for the insert benchmark.
  • TokuMX and InnoDB are much faster than MongoDB. TPS degrades as the database size grows: not much for TokuMX, faster for InnoDB, really fast for MongoDB.
  • Disabling the journal doesn't help MongoDB. The bottleneck is elsewhere.
  • Not using fsync-on-commit doesn't help MongoDB. The bottleneck is elsewhere.
  • Using database-per-collection doesn't do much to help MongoDB. The bottleneck is elsewhere.

Configuration

The Java and Python insert benchmark clients were used to load up to 3B documents/rows into 10 collections/tables, which is also up to 300M documents/rows per collection/table. For MongoDB the tests were run with all collections in one database and then again with a database-per-collection. Tests were run as a sequence of rounds where 100M documents/rows were loaded per round (10M per client) and performance metrics were computed per round. The test hosts have 144GB of RAM, PCIe flash and 24 CPU cores with HT enabled. The client and DBMS software ran on the same host. Several configurations were tested:
  • inno-lazy - MySQL 5.6.12, InnoDB, doublewrite=0, page_size=8k, flush_log_at_trx_commit=2
  • inno-sync - MySQL 5.6.12, InnoDB, doublewrite=0, page_size=8k, flush_log_at_trx_commit=1
  • toku-lazy - TokuMX 1.4.1, logFlushPeriod=300, w:1,j:0
  • toku-sync - TokuMX 1.4.1, logFlushPeriod=0, j:1
  • mongo24-1db-noj - MongoDB 2.4.9, nojournal=true, 10 collections in 1 database
  • mongo26-1db-noj - MongoDB 2.6.0rc2, nojournal=true, 10 collections in 1 database
  • mongo24-10db-noj - MongoDB 2.4.9, nojournal=true, database per collection
  • mongo26-10db-noj - MongoDB 2.6.0rc2, nojournal=true, database per collection
  • mongo24-1db-lazy - MongoDB 2.4.9, journalCommitInterval=300, w:1,j:0, 10 collections in 1 database
  • mongo26-1db-lazy - MongoDB 2.6.0rc2, journalCommitInterval=300, w:1,j:0, 10 collections in 1 database
  • mongo24-10db-lazy - MongoDB 2.4.9, journalCommitInterval=300, w:1,j:0, database per collection
  • mongo26-10db-lazy - MongoDB 2.6.0rc2, journalCommitInterval=300, w:1,j:0, database per collection
  • mongo24-1db-sync - MongoDB 2.4.9, journalCommitInterval=2, j:1, 10 collections in 1 database
  • mongo26-1db-sync - MongoDB 2.6.0rc2, journalCommitInterval=2, j:1, 10 collections in 1 database
  • mongo24-10db-sync - MongoDB 2.4.9, journalCommitInterval=2, j:1, database per collection
  • mongo26-10db-sync - MongoDB 2.6.0rc2, journalCommitInterval=2, j:1, database per collection
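To make the MongoDB durability variants in the list above concrete, here is a minimal client-side sketch. It assumes a recent PyMongo; nojournal and journalCommitInterval are server-side mongod options, so only the per-operation write concern differs on the client.

  # Write concern for the three MongoDB durability variants. The *-noj
  # configurations start mongod with nojournal=true, so j is irrelevant there.
  from pymongo import MongoClient
  from pymongo.write_concern import WriteConcern

  client = MongoClient()
  db = client["ib"]
  docs = [{"x": 1}] * 3   # placeholder batch

  # *-sync: j:1, the insert waits for the journal fsync
  sync_coll = db.get_collection("c0", write_concern=WriteConcern(w=1, j=True))
  # *-lazy: w:1,j:0, the insert is acked without waiting for the journal
  lazy_coll = db.get_collection("c0", write_concern=WriteConcern(w=1, j=False))

  sync_coll.insert_many(docs)   # fsync-on-commit
  lazy_coll.insert_many(docs)   # journal flushed in the background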

Results @100M rows

Legend for the columns:
  • DB-size - size of the database at the end of the test round
  • Bytes-per-doc - DB-size divided by the count of documents/rows
  • Write-rate - average rate of bytes written to storage during the test round as measured by iostat
  • Bytes-written - total bytes written to storage during the test round
  • Test-secs - number of seconds to complete the test round
  • TPS - average insert rate during the test round in documents/rows per second. Each transaction inserts 1000 documents/rows, so the transaction rate is TPS/1000.
  • Server - the tested configuration
Notes:
  • I don't know why inno-sync is a bit faster than inno-lazy; maybe HW variance was the cause. The key point is that fsync-on-commit isn't a significant cost for this test. It also has only a small impact for TokuMX.
  • MongoDB TPS is only close to InnoDB & TokuMX when the journal is disabled and database-per-collection is used
  • There is up to a 2X benefit with MongoDB from using a database per collection

DB-size  Bytes-per-doc Write-rate  Bytes-written  Test-secs     TPS   Server
  17 GB     182          164 MB/s     77 GB           494    202672   inno-sync
  17 GB     182          159 MB/s     81 GB           540    185742   inno-lazy
  15 GB     161           60 MB/s     42 GB           686    145720   toku-sync
  14 GB     150           61 MB/s     41 GB           663    150718   toku-lazy
   X GB       X          140 MB/s   1606 GB         11451      8743   mongo24-1db-sync
  38 GB     400          148 MB/s   1566 GB         10586      9463   mongo26-1db-sync
  60 GB     644          205 MB/s   1337 GB          6485     15921   mongo24-10db-sync
  40 GB     429          216 MB/s   1310 GB          6069     16933   mongo26-10db-sync
  36 GB     387          138 MB/s   1611 GB         11640      8593   mongo24-1db-lazy
   X GB       X          147 MB/s   1574 GB         10702      9352   mongo26-1db-lazy
  60 GB     644          181 MB/s   1267 GB          6989     14684   mongo24-10db-lazy
  40 GB     429          189 MB/s   1251 GB          6629     15610   mongo26-10db-lazy
  37 GB     397          206 MB/s    995 GB          4819     20752   mongo24-1db-noj
   X GB       X          234 MB/s    878 GB          3742     26736   mongo26-1db-noj
  60 GB     644          370 MB/s    456 GB          1226     81663   mongo24-10db-noj
  40 GB     429          600 MB/s    348 GB           577    179224   mongo26-10db-noj
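One way to read the table above is to derive per-document costs. This is a quick sketch using a few rows from the table; it assumes GB in the table means 2^30 bytes (consistent with the Bytes-per-doc column) and the derived insert rate only roughly matches the TPS column because of rounding in the reported values.

  # Derived metrics from the @100M table: each round inserts 100M documents of
  # ~100 bytes, so bytes written per inserted document shows write amplification
  # and ROWS / Test-secs approximates the TPS column.
  ROWS = 100e6

  def per_doc(bytes_written_gb, test_secs):
      return (bytes_written_gb * 2**30 / ROWS,   # bytes written per document
              ROWS / test_secs)                  # documents inserted per second

  print(per_doc(77, 494))       # inno-sync:        ~830 bytes/doc, ~202K docs/s
  print(per_doc(1606, 11451))   # mongo24-1db-sync: ~17 KB/doc,     ~8.7K docs/s
  print(per_doc(348, 577))      # mongo26-10db-noj: ~3.7 KB/doc,    ~173K docs/s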

Results @500M rows

MongoDB TPS starts to drop as the database becomes larger than RAM.

DB-size  Bytes-per-doc Write-rate  Bytes-written  Test-secs     TPS   Server
  83 GB     178          269 MB/s    183 GB           708    141290   inno-sync
  83 GB     178          269 MB/s    180 GB           710    141611   inno-lazy
  44 GB      94           62 MB/s     59 GB           948    105421   toku-sync
  46 GB      99           61 MB/s     56 GB           918    108930   toku-lazy
 180 GB     387          147 MB/s   2284 GB         15519      6454   mongo24-1db-sync
 200 GB     429          158 MB/s   2277 GB         14394      6964   mongo26-1db-sync
 220 GB     472          249 MB/s   2230 GB          8938     11886   mongo24-10db-sync
 220 GB     472          262 MB/s   2232 GB          8528     12041   mongo26-10db-sync
 180 GB     387          146 MB/s   2287 GB         15696      6374   mongo24-1db-lazy
 199 GB     427          154 MB/s   2278 GB         14791      6773   mongo26-1db-lazy
 219 GB     470          226 MB/s   2197 GB          9718     10667   mongo24-10db-lazy
 219 GB     470          233 MB/s   2190 GB          9389     10817   mongo26-10db-lazy
 180 GB     387          261 MB/s   1997 GB          7640     13094   mongo24-1db-noj
   X GB       X          291 MB/s   1980 GB          6800     14736   mongo26-1db-noj
 220 GB     472          758 MB/s   1721 GB          2271     44067   mongo24-10db-noj
 220 GB     472          741 MB/s   1438 GB          1703     60036   mongo26-10db-noj

Results @1B rows

The gap widens between InnoDB/TokuMX and MongoDB. Another result is uneven durations for the test clients with MongoDB. With InnoDB/TokuMX the clients usually finish within a few seconds of each other. With MongoDB I frequently see test rounds where a few clients take hundreds of seconds longer than the others.

DB-size  Bytes-per-doc Write-rate  Bytes-written  Test-secs     TPS   Server
 159 GB     170          292 MB/s    292 GB          1051     95067   inno-sync
 159 GB     170          291 MB/s    291 GB          1057     94994   inno-lazy
  74 GB      79           63 MB/s     67 GB          1053     95443   toku-sync
  74 GB      79           67 MB/s     69 GB          1030     97071   toku-lazy
 400 GB     430           83 MB/s   2328 GB         28151      3557   mongo24-1db-sync
 380 GB     408           87 MB/s   2323 GB         26771      3746   mongo26-1db-sync
 419 GB     450          146 MB/s   2300 GB         15821      6880   mongo24-10db-sync
 400 GB     430          148 MB/s   2293 GB         15520      6989   mongo26-10db-sync
 400 GB     430           83 MB/s   2327 GB         28461      3517   mongo24-1db-lazy
 380 GB     408           86 MB/s   2322 GB         26937      3718   mongo26-1db-lazy
 420 GB     451          143 MB/s   2280 GB         15949      6555   mongo24-10db-lazy
 400 GB     430          148 MB/s   2267 GB         15361      6880   mongo26-10db-lazy
 400 GB     430          104 MB/s   2100 GB         20291      4938   mongo24-1db-noj
   X GB       X          110 MB/s   2098 GB         19147      5246   mongo26-1db-noj
 420 GB     451          206 MB/s   2068 GB         10039     10037   mongo24-10db-noj
 400 GB     430          214 MB/s   2063 GB          9623     10662   mongo26-10db-noj

Results @1.5B rows

Tests for mongo24-1db-lazy and mongo26-1db-lazy were stopped because I wasn't willing to wait. TPS for MongoDB continues to degrade faster than for InnoDB & TokuMX.

DB-size  Bytes-per-doc Write-rate  Bytes-written  Test-secs     TPS   Server
 229 GB     163          304 MB/s    371 GB          1290     77491   inno-sync
 229 GB     163          303 MB/s    372 GB          1300     77108   inno-lazy
 110 GB      78           75 MB/s     80 GB          1072     93240   toku-sync
 110 GB      78           75 MB/s     81 GB          1068     93575   toku-lazy
 540 GB     387           54 MB/s   2336 GB         43013      2327   mongo24-1db-sync
 580 GB     415           56 MB/s   2330 GB         41803      2400   mongo26-1db-sync
 560 GB     401           84 MB/s   2322 GB         27778      3790   mongo24-10db-sync
 600 GB     429           82 MB/s   2315 GB         28257      3708   mongo26-10db-sync
 560 GB     401           93 MB/s   2316 GB         24887      4218   mongo24-10db-lazy
 600 GB     429           94 MB/s   2312 GB         24638      4185   mongo26-10db-lazy
 544 GB     389           60 MB/s   2116 GB         35192      2845   mongo24-1db-noj
   X GB       X           62 MB/s   2118 GB         34123      2951   mongo26-1db-noj
 560 GB     401           96 MB/s   2104 GB         21903      4837   mongo24-10db-noj
 600 GB     429           98 MB/s   2090 GB         21276      5115   mongo26-10db-noj

Results @1.7B rows

Tests for the mongo2?-*db-sync configurations were stopped at 1.7B rows because I wasn't willing to wait.

DB-size  Bytes-per-doc Write-rate  Bytes-written  Test-secs     TPS   Server
 620 GB     392           50 MB/s   2336 GB         46506      2153   mongo24-1db-sync
 620 GB     392           52 MB/s   2333 GB         45273      2218   mongo26-1db-sync
 639 GB     404           74 MB/s   2318 GB         31264      3348   mongo24-10db-sync
 640 GB     404           73 MB/s   2317 GB         31960      3266   mongo26-10db-sync

Results @2B rows

More of the same.

DB-size  Bytes-per-doc Write-rate  Bytes-written  Test-secs     TPS   Server
 305 GB     163          267 MB/s    449 GB          1777     56930   inno-sync
 305 GB     163          265 MB/s    445 GB          1774     56466   inno-lazy
 141 GB      76           77 MB/s     80 GB          1037     96401   toku-sync
 141 GB      76           78 MB/s     79 GB          1019     97993   toku-lazy
 704 GB     378           50 MB/s   2125 GB         42575      2352   mongo24-1db-noj
   X GB       X           51 MB/s   2124 GB         41401      2422   mongo26-1db-noj
 720 GB     387           69 MB/s   2114 GB         30779      3269   mongo24-10db-noj
 740 GB     397           65 MB/s   2111 GB         32501      3102   mongo26-10db-noj

Results @2.5B rows

More of the same. The inno-sync test was stopped. I was impatient but TPS was still pretty good.

DB-size  Bytes-per-doc Write-rate  Bytes-written  Test-secs     TPS   Server
 376 GB     161          228 MB/s    617 GB          2863     35489   inno-lazy
 180 GB      76           82 MB/s     87 GB          1044     94770   toku-sync
 180 GB      76           83 MB/s     85 GB          1023     97588   toku-lazy
 840 GB     361           47 MB/s   2202 GB         46782      2139   mongo24-1db-noj
   X GB       X           49 MB/s   2207 GB         45456      2209   mongo26-1db-noj
 860 GB     369           62 MB/s   2198 GB         35462      2824   mongo24-10db-noj
 920 GB     395           58 MB/s   2199 GB         37755      2652   mongo26-10db-noj

Results @3B rows

More of the same.  A few tests are still running and I will update this in a few days if they finish.

DB-size  Bytes-per-doc Write-rate  Bytes-written  Test-secs     TPS   Server
 448 GB     160          220 MB/s    665 GB          3186     31831   inno-lazy
 212 GB      76           84 MB/s     93 GB          1091     91657   toku-sync
 214 GB      77           86 MB/s     92 GB          1066     93724   toku-lazy
 960 GB     344           43 MB/s   2105 GB         48793      2052   mongo24-1db-noj
1100 GB     394           45 MB/s   2107 GB         47366      2123   mongo26-1db-noj
 980 GB     351           53 MB/s   2529 GB         39545      2530   mongo24-10db-noj
1100 GB     394           50 MB/s   2109 GB         42416      2362   mongo26-10db-noj