Wednesday, October 15, 2014

Updates with secondary index maintenance: 5.7 vs previous releases

My previous results for an update-only workload against cached and IO-bound databases used a table that did not require secondary index maintenance for the update. Prior to that I ran tests with mysqlslap and might have found a regression in InnoDB when the update requires secondary index maintenance (bug 74325). The mysqlslap test did all updates to the same row. The test described here chooses the row to update at random, and this workload does not reproduce the worst case from bug 74325.

The results: newer releases tend to do worse at 1 thread, 5.7.5 does best at 32 threads, and 5.7.5 does slightly worse than 5.6.21 below 32 threads. We need to fix the regressions in 5.7.5 for low- and mid-concurrency workloads.

setup

The test here was the same as described in my previous posts with one exception: the test tables have a secondary index on (k,c). Note that the column c is set to 0 at test start. The updates are of the form: UPDATE $table SET c = c + 1 WHERE id = $X. The test tables look like:
CREATE TABLE sbtest1 (
  id int(10) unsigned NOT NULL auto_increment,
  k int(10) unsigned NOT NULL default '0',
  c char(120) NOT NULL default '',
  pad char(60) NOT NULL default '',
  PRIMARY KEY  (id),
  KEY k (k,c)
) ENGINE=InnoDB AUTO_INCREMENT=8000001 DEFAULT CHARSET=latin1
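
Since c is a char column, MySQL coerces it to a number for the increment and stores the result back as a string. A minimal sketch of the per-run reset and one workload update (the id value 17 is just an example):
UPDATE sbtest1 SET c = 0;
UPDATE sbtest1 SET c = c + 1 WHERE id = 17;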

There were 8 tables in the database with 8M rows per table. Tests were run at 1 thread with 1 table and at 8, 16 and 32 threads with 8 tables. The test at each concurrency level ran for 10 minutes. Tests were run for a 1G (IO-bound) and a 32G (cached) InnoDB buffer pool. For the cached test the tables were read into the buffer pool at test start. The updates/second rate from these tests had more variance than in previous tests; I ran the tests many times and report rates that were typical. For the 1-thread tests, which use 1 table, the database is 2G. For the 8, 16 and 32 thread tests, which use all 8 tables, the database is 16G.

results at 1 thread

The rates for a cached database and 1 thread match most previous results -- newer releases are much slower than older releases and MySQL 5.7.5 is about 1.5X worse than MySQL 5.0.85. The results for the IO-bound setup don't follow that trend. Newer releases do better. I regret that I did not have the time to explain the difference between 5.6.21 and 5.7.5.


results at many threads

MySQL 5.7.5 and 5.6.21 are much better than older releases. 5.6.21 does better than 5.7.5 at 8 and 16 threads while 5.7.5 is the best at 32 threads.



configuration

The my.cnf was the same as in the October 13 post below, except innodb_buffer_pool_size was 1G for the IO-bound tests and 32G for the cached tests.

Tuesday, October 14, 2014

Sysbench IO-bound updates: MySQL 5.7 vs previous releases

I repeated the sysbench update-only test after reducing the InnoDB buffer pool to 1G. The test database is 16G so the test should be IO-bound. MySQL 5.7.5 is 10% worse than 5.0.85 at 1 thread and much better at 8+ threads.

The previous blog post has more details on the setup. The configuration is not crash safe as the doublewrite buffer and binlog are disabled and innodb_flush_log_at_trx_commit=2. This test continued to use direct IO for InnoDB, but unlike the previous test, which did many writes and no reads, it requires a high rate of reads from storage. A fast flash device is used for storage.
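
For reference, these are the my.cnf settings (the full file is in the October 13 post below) that trade crash safety for throughput:
skip_log_bin
innodb_doublewrite=0
innodb_flush_log_at_trx_commit=2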

results at 1 thread

Only 1 table is used so there is a 2G database and 1G buffer pool. This test is less IO-bound than the many-thread tests reported in the next section. Two graphs are displayed below. The first has the updates/second rates and the second has the rates normalized to the rate from 5.0.85. MySQL 5.7.5 is about 10% worse than 5.0.85 and all servers do between 3000 and 4000 updates/second.

results at many threads

These tests uniformly distribute clients across the 8 tables. MySQL 5.7.5 is the best at 16 and 32 threads. MySQL 5.0.85 and 5.1.63 saturate at 8 threads.

Monday, October 13, 2014

Sysbench cached updates: MySQL 5.7 vs previous releases

I used MySQL 5.7.5 to repeat tests I did for MySQL 5.7.2 and 5.6.10 with the sysbench client and a cached database to compare performance for low and high concurrency. My configuration was optimized for throughput rather than crash safety. The performance summary:
  • MySQL 5.7.5 is 1.47X slower than 5.0.85 at 1 thread
  • MySQL 5.7.5 is a bit slower than 5.6.21 at 8 and 16 threads
  • MySQL 5.7.5 is faster than others at 32 threads

setup

I used my sysbench 0.4 fork and set the value of the c column to 0 in the sbtest tables as that column is incremented during the test. The test used 8 tables with 8M rows per table (about 16G total) and the InnoDB buffer pool was 32G. One table was used for the 1 thread case. For the other tests the clients were uniformly distributed across the 8 tables. All data was in the buffer pool at test start. The test was run for 10 minutes at each concurrency level (1, 8, 16, 32 threads). Throughput appeared to be stable throughout the test.

Tests were run for MySQL versions 5.0.85, 5.1.63, 5.6.21 and 5.7.5. Tests were repeated for 4k and 8k innodb_page_size with MySQL 5.7.5 (see -16k, -8k and -4k below). Tests were also repeated with changes to innodb_flush_neighbors (=0), innodb_write_io_threads (=8), innodb_page_cleaners (=4) and innodb_purge_threads (=4). Results for those changes are not provided here as there wasn't an impact on QPS. The change to flush_neighbors reduced the InnoDB page write rate, which is a good thing.
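
For reference, the variant settings tested were:
innodb_flush_neighbors=0
innodb_write_io_threads=8
innodb_page_cleaners=4
innodb_purge_threads=4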

results

The regression from 5.0.85 to recent versions is visible in the single-thread results below.
MySQL 5.7.5 is the champ at 32 threads but is slightly worse than 5.6.21 at 8 and 16 threads. Recent versions are much better than old versions.

configuration

I used this my.cnf for MySQL 5.7.5 and then something similar for other versions:
sql_mode=NO_ENGINE_SUBSTITUTION,STRICT_TRANS_TABLES 
table-definition-cache=1000
table-open-cache=2000
table-open-cache-instances=8
max_connections=20000
key_buffer_size=200M
metadata_locks_hash_instances=256 
query_cache_size=0
query_cache_type=0
server_id=9
performance_schema=0
binlog_format=row
skip_log_bin
innodb_buffer_pool_instances=8
innodb_io_capacity=32000
innodb_lru_scan_depth=4000
innodb_checksum_algorithm=CRC32
innodb_thread_concurrency=0
innodb_buffer_pool_size=32G
innodb_log_file_size=1900M
innodb_flush_log_at_trx_commit=2
innodb_doublewrite=0
innodb_flush_method=O_DIRECT
innodb_max_dirty_pages_pct=80
innodb_file_format=barracuda
innodb_file_per_table

This was the command line for one of the tests:
./sysbench --batch --batch-delay=10 --test=oltp --mysql-host=127.0.0.1 --mysql-db=test --oltp-table-size=8000000 --max-time=600 --max-requests=0 --mysql-table-engine=innodb --db-ps-mode=disable --mysql-engine-trx=yes --oltp-table-name=sbtest1 --oltp-skip-trx --oltp-test-mode=incupdate --oltp-dist-type=uniform --oltp-range-size=10 --oltp-connect-delay=0 --percentile=99 --num-threads=1 --seed-rng=1413171854 run

Sunday, October 12, 2014

Page read performance: MySQL 5.7 vs previous releases

For this test I used sysbench to determine the peak rate for disk reads per second from InnoDB with fast storage. Fast storage in this case is the OS filesystem cache. I want to know the peak for MySQL 5.7.5 and whether there is a regression from 5.6. My summary is:
  • 5.7.5 is 12% to 15% worse than 5.6.21 at 1 thread. I hope we can fix this via bug 74342.
  • 5.7.5 is 2% to 4% worse than 5.6.21 at 8, 16 and 32 threads
  • 5.6/5.7 are much better than 5.0 especially at high concurrency
  • Page size (16k, 8k, 4k) has a big impact on the peak only when compression is used. 

setup

The test server has 40 hyperthread cores and is shared by the sysbench client and mysqld server.

I used my sysbench 0.4 fork configured to fetch 1 row by primary key via SELECT and HANDLER statements. The InnoDB buffer pool was 1G with innodb_flush_method="" (buffered IO). There were 8 tables in the test database with 16M rows per table. The test database was either 32G uncompressed or 16G compressed, so tests with compression had a better InnoDB buffer pool hit rate. Tests were run with 1, 8, 16 and 32 threads. The 1 thread test was limited to 1 table (2G or 4G of data) so it also benefits from a better InnoDB buffer pool hit rate. The 8, 16 and 32 thread tests uniformly distribute clients across 8 tables.
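
A sketch of the two statement forms used to fetch one row by primary key (the key value 12345 is just an example; the exact statements come from my sysbench fork):
SELECT * FROM sbtest1 WHERE id = 12345;
HANDLER sbtest1 OPEN;
HANDLER sbtest1 READ `PRIMARY` = (12345);
HANDLER sbtest1 CLOSE;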

Tests were done for MySQL 5.0.85, 5.6.21 and 5.7.5. All versions were tested for 16k pages without compression (see -16k-c0 below). Some MySQL 5.7.5 tests were repeated for 8k and 4k pages (see -4k- and -8k- below). Some MySQL 5.7.5 tests were repeated for 2X compression via zlib (see -c1- below).
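
As a hedged sketch, assuming 16k pages and the sbtest schema from the October 15 post above, the 2X-compressed (-c1-) tables could be declared like this (KEY_BLOCK_SIZE=8 is half the 16k page size; barracuda and file_per_table are set in the my.cnf below):
CREATE TABLE sbtest1 (
  id int(10) unsigned NOT NULL auto_increment,
  k int(10) unsigned NOT NULL default '0',
  c char(120) NOT NULL default '',
  pad char(60) NOT NULL default '',
  PRIMARY KEY (id),
  KEY k (k,c)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
  ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;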

results at 1 thread

The results at 1 thread show that peak QPS for 5.7.5 is 12% to 15% worse than for 5.6.21. I hope this can be fixed. I assume the problem is a longer code path from new features in 5.7.

results at 8 threads

The difference between 5.6.21 and 5.7.5 is smaller than at 1 thread. I assume this is the benefit from less mutex contention in 5.7.

results at 16 threads

The difference between 5.6.21 and 5.7.5 is smaller than at 1 thread. I assume this is the benefit from less mutex contention in 5.7.

results at 32 threads

The difference between 5.6.21 and 5.7.5 is smaller than at 1 thread. I assume this is the benefit from less mutex contention in 5.7.


configuration

I used this my.cnf for 5.7.5 and modified it for 5.6:
sql_mode=NO_ENGINE_SUBSTITUTION,STRICT_TRANS_TABLES 
table-definition-cache=1000
table-open-cache=2000
table-open-cache-instances=8
max_connections=20000
key_buffer_size=200M
metadata_locks_hash_instances=256 
query_cache_size=0
query_cache_type=0
server_id=9
performance_schema=0
binlog_format=row
skip_log_bin
innodb_buffer_pool_instances=8
innodb_io_capacity=32000
innodb_lru_scan_depth=4000
innodb_checksum_algorithm=CRC32
innodb_thread_concurrency=0
innodb_buffer_pool_size=1G
innodb_log_file_size=1900M
innodb_flush_log_at_trx_commit=2
innodb_doublewrite=0
# innodb_flush_method=O_DIRECT
innodb_max_dirty_pages_pct=80
innodb_file_format=barracuda
innodb_file_per_table
datadir=...

This is an example command line for a test that uses SELECT and 1 thread per table. My testing script starts 8 sysbench clients, one per table. The value for --seed-rng is different for every sysbench client (a function of the current time and the table number, 1 to 8) so that different keys are accessed on each run:
./sysbench --batch --batch-delay=10 --test=oltp --mysql-host=127.0.0.1 --mysql-db=test --oltp-table-size=16000000 --max-time=120 --max-requests=0 --mysql-table-engine=innodb --db-ps-mode=disable --mysql-engine-trx=yes --oltp-table-name=sbtest1 --oltp-read-only --oltp-skip-trx --oltp-test-mode=simple --oltp-point-select-all-cols --oltp-dist-type=uniform --oltp-range-size=10 --oltp-connect-delay=0 --percentile=99 --num-threads=1 --seed-rng=1413134256 run
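
A hypothetical sketch of the launcher loop (the seed arithmetic and the loop itself are assumptions; the real script derives the seed from the current time and the table number):
for n in $(seq 1 8); do
  ./sysbench --batch --batch-delay=10 --test=oltp --mysql-host=127.0.0.1 \
    --mysql-db=test --oltp-table-size=16000000 --max-time=120 --max-requests=0 \
    --mysql-table-engine=innodb --db-ps-mode=disable --mysql-engine-trx=yes \
    --oltp-table-name=sbtest$n --oltp-read-only --oltp-skip-trx \
    --oltp-test-mode=simple --oltp-point-select-all-cols --oltp-dist-type=uniform \
    --oltp-range-size=10 --oltp-connect-delay=0 --percentile=99 --num-threads=1 \
    --seed-rng=$(( $(date +%s) + n )) run &
done
wait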

Friday, October 10, 2014

Low-concurrency performance for updates with InnoDB: MySQL 5.7 vs previous releases

First I tested updates with the Heap engine as a base case and now I report on InnoDB to find regressions that are specific to it. The workload is described in the previous blog post. Response time is measured at 1, 4 and 32 threads for two types of updates. The unindexed update increments an unindexed column. The indexed update updates a column that is in a secondary index. Tests are run with the binlog enabled and disabled. Tests are done for MySQL versions 5.0.85, 5.1.63, 5.5.40, 5.6.21 and 5.7.5. In some cases tests are also done with the adaptive hash index disabled. The test database has 64,000 rows but all updates are to one row (lots of contention). The client is mysqlslap running on the same host as mysqld. The test server has 40 hyperthread cores.

For the single-thread update unindexed workload MySQL 5.7.5 is about 1.6X slower than 5.0.85. This is similar to what I reported in previous posts. For the single-thread update indexed workload MySQL 5.7.5 is about 4X slower than 5.0.85 and much slower than 5.6.21. This is a regression for which I created bug 74325. This and a few other bugs might be motivation to get similar tests into the upstream release process.

The binaries below are tagged with -noahi, -nobl and -bl. The -noahi binary has the binlog and adaptive hash index disabled. The -nobl binary has the binlog disabled. The -bl binary has the binlog enabled.
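
For reference, the settings behind those tags (I assume the adaptive hash index was disabled via its server variable):
# -bl: binlog enabled (binlog_format=row); -nobl and -noahi add:
skip_log_bin
# -noahi also disables the adaptive hash index:
innodb_adaptive_hash_index=0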

update indexed, binlog off

Result summary:
  • at 1 thread: 5.7.5 is 3.8X slower, 5.6.21 is 1.43X slower
  • at 4 threads: 5.7.5 is 3.88X slower, 5.6.21 is slightly faster
  • at 32 threads: 5.7.5 is 1.96X slower, 5.6.21 is faster

update indexed, binlog on

Result summary:
  • at 1 thread: 5.7.5 is 4X slower, 5.6.21 is 1.37X slower
  • at 4 threads: 5.7.5 is 2.1X slower, 5.6.21 is 1.09X slower
  • at 32 threads: 5.7.5 is 2.26X slower, 5.6.21 is faster


update unindexed, binlog off

Result summary:
  • at 1 thread: 5.7.5 is 1.65X slower, 5.6.21 is 1.44X slower
  • at 4 threads: 5.7.5 is 1.33X slower, 5.6.21 is faster. Note that 5.7.5 is the only post-5.0.85 release still slower than 5.0.85 at 4 threads.
  • at 32 threads: 5.7.5 is faster, 5.6.21 is faster



update unindexed, binlog on

Result summary:
  • at 1 thread: 5.7.5 is 1.62X slower, 5.6.21 is 1.43X slower
  • at 4 threads: 5.7.5 is 1.39X slower, 5.6.21 is 1.18X slower
  • at 32 threads: 5.7.5 is faster, 5.6.21 is faster



Thursday, October 9, 2014

Low-concurrency performance for updates and the Heap engine: MySQL 5.7 vs previous releases

The low-concurrency performance regression evaluation continues after evaluating range queries. This post covers updates with the Heap engine. The Heap engine is used as the base case and the next post has results for InnoDB. The database has 64,000 rows but all updates were for the same row. Tests were done for updates to an indexed and then an unindexed column at 1, 4 and 32 threads. Tests were also repeated with the binlog disabled and then enabled. There are regressions at 1, 4 and 32 threads. The summary is:
  • Response time at 1 thread for 5.7.5 is between 1.55X and 1.67X worse than 5.0.85 
  • Response time at 32 threads for 5.7.5 is between 1.19X and 1.49X worse than 5.0.85 
  • In all cases response time is worse in 5.7 than in 5.6
  • Regressions at 4 and 32 threads are much worse when the binlog is enabled
All of the graphs use normalized response time: for each tested version this is the response time at X threads divided by the response time for MySQL 5.0.85 at X threads, where X is 1, 4 or 32. For example, a value of 1.5 at 1 thread means that version is 1.5X slower than 5.0.85 at 1 thread.

update unindexed, binlog off

This section has results for updates to an unindexed column with the binlog disabled. Newer releases are faster at 4 and 32 threads. There is a gradual regression at 1 thread and the response time for MySQL 5.7.5 is 1.55X worse than for 5.0.85.



update unindexed, binlog on

This section has results for updates to an unindexed column with the binlog enabled. Newer releases are faster at 4 and 32 threads. There is a gradual regression at 1 thread and the response time for MySQL 5.7.5 is 1.67X worse than for 5.0.85. The single-thread regression here is worse than with the binlog disabled, so there is some regression from the binlog code.


update indexed, binlog off

This section has results for updates to an indexed column with the binlog disabled. Newer releases are faster at 4 and 32 threads. There is a gradual regression at 1 thread and the response time for MySQL 5.7.5 is 1.56X worse than for 5.0.85.



update indexed, binlog on

This section has results for updates to an indexed column with the binlog enabled. Newer releases are faster at 4 and 32 threads. There is a gradual regression at 1 thread and the response time for MySQL 5.7.5 is 1.67X worse than for 5.0.85. The single-thread regression here is worse than with the binlog disabled, so there is some regression from the binlog code.



command lines

Tests used mysqlslap as described in previous posts. The command lines for updates to the non-indexed and indexed columns are:
mysqlslap --concurrency=$t -h127.0.0.1 --create-schema=test1 --number-of-queries=$(( $t * 100000 )) --create=${ddl} --query="update foo set l = l + 2 where i=100"

mysqlslap --concurrency=$t -h127.0.0.1 --create-schema=test1 --number-of-queries=$(( $t * 100000 )) --create=${ddl} --query="update foo set k = k + 2 where i=100"
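
The ${ddl} variable holds the CREATE TABLE statement for the test table. A hypothetical version for the Heap base case, with column names taken from the queries above (i is the lookup key, k is indexed, l is not):
ddl="create table foo (i int not null primary key, k int, l int, index kx (k)) engine=heap"
For the InnoDB tests in the October 10 post above, engine=innodb would be used instead.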

Monday, October 6, 2014

Details on range scan performance regression in MySQL 5.7

My last blog post was too long to include the details on performance regressions for range scans with InnoDB in MySQL 5.7, so I provide them here.
  • For a secondary index a non-covering scan degrades much more at high-concurrency than a covering scan. The problem gets worse as the scan length increases. This problem isn't new in 5.7 but it is much worse in 5.7 versus 5.6. At 32 threads the problem is waits on btr_search_latch and hash_table_locks. I opened bug 74283 for this.
  • A covering scan of a secondary index degrades much more at high-concurrency than a PK scan. This problem is equally bad in 5.7 and 5.6. Wait times reported by the performance schema do not explain the problem -- search below for PFS_NOT_INSTRUMENTED to understand why. Using Linux perf I definitely see more mutex contention in the covering index scan. But I don't understand the internal differences between PK and covering secondary index scans. I opened bug 74280 for this.
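
Hedged sketches of the three range scans, assuming an sbtest-like table with a primary key on id and a secondary index on (k,c); the real queries come from my sysbench fork:
SELECT id, k, c, pad FROM sbtest1 WHERE id BETWEEN 1000 AND 1009;  -- q1: PK scan
SELECT pad FROM sbtest1 WHERE k BETWEEN 1000 AND 1009;             -- q4: non-covering secondary
SELECT k, c FROM sbtest1 WHERE k BETWEEN 1000 AND 1009;            -- q5: covering secondary
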
non-covering vs covering

These graphs show the absolute response time in seconds, not the normalized response time used in my previous posts. The response time difference between non-covering and covering secondary index scans is much larger at high concurrency than at low concurrency. For the 10-row scan the non-covering index is close to the covering index response time at 1 thread but is almost 2X worse at 32 threads. For the 1000-row scan the non-covering index is about 2X worse at 1 thread, and at 32 threads it is about 8X worse in 5.6 and about 13X worse in 5.7. From the PS data that follows, the problem is much more wait time on btr_search_latch and hash_table_locks.


covering vs PK

These graphs show absolute response time for the 1000-row scan using MySQL 5.6 and 5.7. There is not much difference in response time between the PK and covering secondary index scans at 1 thread. There is a big difference at 32 threads. 

Performance Schema

Assuming the problem is mutex contention, we have the perfect tool to diagnose it -- the Performance Schema. I enabled it to instrument wait/synch/% objects and the results are interesting. I ran the scan tests at 32 threads for the q1, q4 and q5 range scans (q1 is PK, q4 is non-covering secondary, q5 is covering secondary). Tests were done for 10-row and 1000-row scans with the adaptive hash index enabled and then disabled. For each configuration the top-10 wait events are listed below. Note that the q4 query runs for much more time than both q1 and q5, and that q5 runs for about 2X longer than q1. I have not normalized the wait times to account for that.
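
A sketch of how that can be done with the PS enabled at startup (SUM_TIMER_WAIT is in picoseconds, so dividing by 1e12 gives the seconds shown below):
UPDATE performance_schema.setup_instruments SET ENABLED='YES', TIMED='YES'
WHERE NAME LIKE 'wait/synch/%';
UPDATE performance_schema.setup_consumers SET ENABLED='YES'
WHERE NAME LIKE 'events_waits%';
-- after a run, list the top-10 wait events
SELECT EVENT_NAME, SUM_TIMER_WAIT/1e12 AS seconds
FROM performance_schema.events_waits_summary_global_by_event_name
ORDER BY SUM_TIMER_WAIT DESC LIMIT 10;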

The tests took the following amounts of time to finish:
  • adaptive hash on, 10 rows: 23, 35, 23 seconds for q1, q4, q5
  • adaptive hash off, 10 rows: 23, 36, 23 seconds for q1, q4, q5
  • adaptive hash on, 1000 rows: 173, 3540, 237 seconds for q1, q4, q5
  • adaptive hash off, 1000 rows: 137, 3440, 254 seconds for q1, q4, q5

10-row PK scan at 32 threads, adaptive hash index enabled

This is the base case. It was about as fast as the covering index scan.

Seconds    Event
7.50       wait/synch/sxlock/innodb/hash_table_locks
2.57       wait/synch/rwlock/sql/LOCK_grant
2.51       wait/synch/mutex/innodb/fil_system_mutex
2.48       wait/synch/mutex/sql/THD::LOCK_query_plan
1.56       wait/synch/mutex/sql/THD::LOCK_thd_data
1.21       wait/synch/mutex/sql/LOCK_table_cache
1.20       wait/synch/sxlock/innodb/btr_search_latch
0.84       wait/synch/mutex/sql/THD::LOCK_thd_query
0.76       wait/synch/mutex/innodb/trx_pool_mutex
0.75       wait/synch/sxlock/innodb/index_tree_rw_lock

10-row covering scan at 32 threads, adaptive hash index enabled

Compared to the 10-row PK scan, this has:

  • More wait time on btr_search_latch (3.89 vs 1.20 seconds)

Seconds    Event
7.49       wait/synch/sxlock/innodb/hash_table_locks
3.89       wait/synch/sxlock/innodb/btr_search_latch
2.69       wait/synch/mutex/sql/THD::LOCK_query_plan
2.60       wait/synch/mutex/innodb/fil_system_mutex
2.37       wait/synch/rwlock/sql/LOCK_grant
1.45       wait/synch/mutex/sql/THD::LOCK_thd_data
1.22       wait/synch/mutex/sql/LOCK_table_cache
0.88       wait/synch/mutex/sql/THD::LOCK_thd_query
0.74       wait/synch/sxlock/innodb/index_tree_rw_lock
0.68       wait/synch/mutex/innodb/trx_pool_mutex

10-row non-covering scan at 32 threads, adaptive hash index enabled

Compared to the 10-row covering scan, this has:
  • Much more wait time on btr_search_latch (15.60 vs 3.89 seconds)
  • A bit more time on hash_table_locks (9.16 vs 7.49 seconds)
  • A bit less time on LOCK_grant and fil_system_mutex (~1.5 vs ~2.5 seconds)
Seconds    Event
15.60      wait/synch/sxlock/innodb/btr_search_latch
 9.16      wait/synch/sxlock/innodb/hash_table_locks
 2.75      wait/synch/mutex/sql/THD::LOCK_query_plan
 1.54      wait/synch/mutex/innodb/fil_system_mutex
 1.47      wait/synch/mutex/sql/THD::LOCK_thd_data
 1.42      wait/synch/rwlock/sql/LOCK_grant
 1.15      wait/synch/mutex/sql/LOCK_table_cache
 0.96      wait/synch/mutex/sql/THD::LOCK_thd_query
 0.84      wait/synch/sxlock/innodb/index_tree_rw_lock
 0.71      wait/synch/mutex/innodb/trx_pool_mutex

10-row PK scan at 32 threads, adaptive hash index disabled

This is the base case. It was about as fast as the covering index scan whether or not the adaptive hash index was enabled.

Seconds    Event
8.47       wait/synch/sxlock/innodb/hash_table_locks
2.65       wait/synch/rwlock/sql/LOCK_grant
2.58       wait/synch/mutex/innodb/fil_system_mutex
2.42       wait/synch/mutex/sql/THD::LOCK_query_plan
1.52       wait/synch/mutex/sql/THD::LOCK_thd_data
1.25       wait/synch/sxlock/innodb/index_tree_rw_lock
1.17       wait/synch/mutex/sql/LOCK_table_cache
0.87       wait/synch/mutex/sql/THD::LOCK_thd_query
0.65       wait/synch/mutex/innodb/trx_pool_mutex
0.08       wait/io/file/innodb/innodb_data_file

10-row covering scan at 32 threads, adaptive hash index disabled

The waits here are similar to the PK scan.

Seconds    Event
8.53       wait/synch/sxlock/innodb/hash_table_locks
2.67       wait/synch/rwlock/sql/LOCK_grant
2.54       wait/synch/mutex/innodb/fil_system_mutex
2.47       wait/synch/mutex/sql/THD::LOCK_query_plan
1.52       wait/synch/mutex/sql/LOCK_table_cache
1.44       wait/synch/mutex/sql/THD::LOCK_thd_data
1.25       wait/synch/sxlock/innodb/index_tree_rw_lock
0.93       wait/synch/mutex/sql/THD::LOCK_thd_query
0.68       wait/synch/mutex/innodb/trx_pool_mutex
0.08       wait/io/file/innodb/innodb_data_file

10-row non-covering scan at 32 threads, adaptive hash index disabled

Compared to the covering index scan, this has:
  • Much more time in hash_table_locks (30.97 vs 8.53 seconds)
  • Much more time in index_tree_rw_lock (8.80 vs 1.25 seconds)

Seconds    Event
30.97      wait/synch/sxlock/innodb/hash_table_locks
 8.80      wait/synch/sxlock/innodb/index_tree_rw_lock
 2.76      wait/synch/mutex/sql/THD::LOCK_query_plan
 2.14      wait/synch/rwlock/sql/LOCK_grant
 2.12      wait/synch/mutex/innodb/fil_system_mutex
 1.46      wait/synch/mutex/sql/THD::LOCK_thd_data
 1.24      wait/synch/mutex/sql/LOCK_table_cache
 1.00      wait/synch/mutex/sql/THD::LOCK_thd_query
 0.66      wait/synch/mutex/innodb/trx_pool_mutex
 0.08      wait/io/file/innodb/innodb_data_file

1000-row PK scan at 32 threads, adaptive hash index enabled

This is the base case.

Seconds    Event
8.21       wait/synch/sxlock/innodb/hash_table_locks
3.13       wait/synch/mutex/sql/THD::LOCK_query_plan
1.57       wait/synch/mutex/sql/THD::LOCK_thd_data
1.08       wait/synch/mutex/sql/LOCK_table_cache
1.04       wait/synch/mutex/sql/THD::LOCK_thd_query
0.69       wait/synch/rwlock/sql/LOCK_grant
0.66       wait/synch/mutex/innodb/trx_pool_mutex
0.64       wait/synch/sxlock/innodb/btr_search_latch
0.62       wait/synch/mutex/innodb/fil_system_mutex
0.60       wait/synch/sxlock/innodb/index_tree_rw_lock

1000-row covering scan at 32 threads, adaptive hash index enabled

This is about 1.4X slower than the PK scan at 1000 rows. The waits here are about the same as in the PK case, but the top wait event, hash_table_locks, is smaller here than in the PK scan. There isn't a wait event that explains the difference in performance; perhaps the difference comes from CPU instructions or cache misses. The reason the PS doesn't explain the difference is that the per-block rw-lock doesn't use the PS. From buf_block_init() there is this code:
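   /* created with PFS_NOT_INSTRUMENTED, so waits on block->lock are invisible to the PS */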
   rw_lock_create(PFS_NOT_INSTRUMENTED, &block->lock, SYNC_LEVEL_VARYING);


Seconds    Event
6.45       wait/synch/sxlock/innodb/hash_table_locks
3.22       wait/synch/mutex/sql/THD::LOCK_query_plan
1.52       wait/synch/mutex/sql/THD::LOCK_thd_data
1.00       wait/synch/mutex/sql/THD::LOCK_thd_query
0.93       wait/synch/mutex/sql/LOCK_table_cache
0.72       wait/synch/mutex/innodb/trx_pool_mutex
0.70       wait/synch/sxlock/innodb/btr_search_latch
0.65       wait/synch/mutex/innodb/fil_system_mutex
0.63       wait/synch/rwlock/sql/LOCK_grant
0.56       wait/synch/sxlock/innodb/index_tree_rw_lock

1000-row non-covering scan at 32 threads, adaptive hash index enabled

This is 15 to 20 times slower than the PK and covering index scans. Compared to the covering index scan this has:
  • Much more time in btr_search_latch (655.57 vs 0.70 seconds)
  • Much more time in hash_table_locks (35.47 vs 6.45 seconds)
  • Much more time in index_tree_rw_lock (9.68 vs 0.56 seconds)
Seconds    Event
655.57     wait/synch/sxlock/innodb/btr_search_latch
 35.47     wait/synch/sxlock/innodb/hash_table_locks
  9.68     wait/synch/sxlock/innodb/index_tree_rw_lock
  3.42     wait/synch/mutex/sql/THD::LOCK_query_plan
  1.65     wait/synch/mutex/sql/THD::LOCK_thd_data
  1.26     wait/synch/mutex/sql/THD::LOCK_thd_query
  0.86     wait/synch/mutex/sql/LOCK_table_cache
  0.74     wait/synch/mutex/innodb/trx_pool_mutex
  0.73     wait/synch/rwlock/sql/LOCK_grant
  0.43     wait/synch/mutex/innodb/fil_system_mutex

1000-row PK scan at 32 threads, adaptive hash index disabled

This is the base case.

Seconds    Event
9.80       wait/synch/sxlock/innodb/hash_table_locks
3.00       wait/synch/mutex/sql/THD::LOCK_query_plan
1.53       wait/synch/mutex/sql/THD::LOCK_thd_data
1.13       wait/synch/sxlock/innodb/index_tree_rw_lock
1.05       wait/synch/mutex/sql/LOCK_table_cache
1.02       wait/synch/mutex/sql/THD::LOCK_thd_query
0.79       wait/synch/mutex/innodb/trx_pool_mutex
0.64       wait/synch/rwlock/sql/LOCK_grant
0.58       wait/synch/mutex/innodb/fil_system_mutex
0.08       wait/io/file/innodb/innodb_data_file

1000-row covering scan at 32 threads, adaptive hash index disabled

This is about 2X slower than the PK scan at 1000 rows. The waits here are about the same as in the PK case, but the top wait event, hash_table_locks, is smaller here than in the PK scan. There isn't a wait event that explains the difference in performance; perhaps the difference comes from CPU instructions or cache misses.

Seconds    Event
8.14       wait/synch/sxlock/innodb/hash_table_locks
3.00       wait/synch/mutex/sql/THD::LOCK_query_plan
1.51       wait/synch/mutex/sql/THD::LOCK_thd_data
0.99       wait/synch/sxlock/innodb/index_tree_rw_lock
0.99       wait/synch/mutex/sql/THD::LOCK_thd_query
0.95       wait/synch/mutex/sql/LOCK_table_cache
0.77       wait/synch/mutex/innodb/trx_pool_mutex
0.66       wait/synch/rwlock/sql/LOCK_grant
0.62       wait/synch/mutex/innodb/fil_system_mutex
0.08       wait/io/file/innodb/innodb_data_file

1000-row non-covering scan at 32 threads, adaptive hash index disabled

Wow, compared to the covering index scan this has:

  • Much more wait time on hash_table_locks (1434.73 vs 8.14 seconds)
  • Much more time on index_tree_rw_lock (659.07 vs 0.99 seconds)

Seconds    Event
1434.73    wait/synch/sxlock/innodb/hash_table_locks
 659.07    wait/synch/sxlock/innodb/index_tree_rw_lock
   3.25    wait/synch/mutex/sql/THD::LOCK_query_plan
   1.51    wait/synch/mutex/sql/THD::LOCK_thd_data
   0.97    wait/synch/mutex/sql/THD::LOCK_thd_query
   0.87    wait/synch/mutex/sql/LOCK_table_cache
   0.71    wait/synch/mutex/innodb/trx_pool_mutex
   0.67    wait/synch/rwlock/sql/LOCK_grant
   0.44    wait/synch/mutex/innodb/fil_system_mutex
   0.08    wait/io/file/innodb/innodb_data_file

Missing time?

I am trying to explain why the 1000-row covering index scan is slower than the PK scan. Output from the performance schema doesn't explain the difference; perhaps the PS isn't instrumenting something. Looking at non-hierarchical (flat) output from perf I see a difference.
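The perf commands looked something like this (the exact flags are an assumption):

perf record -g -p $(pidof mysqld) -- sleep 30
perf report -g none   # flat, non-hierarchical output
perf report           # hierarchical output with call chains

This is from the PK scan: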

     8.58%   mysqld  mysqld                [.] row_search_mvcc(...)                                     
     8.00%   mysqld  mysqld                [.] ut_delay(unsigned long)                
     4.87%   mysqld  mysqld                [.] rec_get_offsets_func(...)
     3.50%   mysqld  mysqld                [.] mtr_t::Command::release_all(
     3.12%   mysqld  libc-2.14.1.so        [.] __memcpy_ssse3_back
     3.02%   mysqld  mysqld                [.] TTASEventMutex<TrackPolicy>::spin_and_wait(...)
     2.94%   mysqld  mysqld                [.] buf_page_optimistic_get(...)
     2.50%   mysqld  mysqld                [.] ha_innobase::general_fetch(...)

And this is from the covering index scan. For the secondary index scan the top two CPU consumers are the mutex busy-wait loop functions, so there is mutex contention that isn't reported by the performance schema.

    24.30%   mysqld  mysqld                [.] ut_delay(unsigned long)
    16.51%   mysqld  mysqld                [.] TTASEventMutex<TrackPolicy>::spin_and_wait(...)
     4.54%   mysqld  mysqld                [.] mtr_t::Command::release_all()
     4.51%   mysqld  mysqld                [.] row_search_mvcc(...)
     2.70%   mysqld  mysqld                [.] _ZL22pfs_rw_lock_s_lock_lowP9rw_lock_tmPKcm.isra.17
     2.25%   mysqld  libc-2.14.1.so        [.] __memcpy_ssse3_back
     2.15%   mysqld  mysqld                [.] buf_page_optimistic_get(...)
     2.03%   mysqld  mysqld                [.] rec_get_offsets_func(...)

From hierarchical perf output this is the result for the PK scan:

     8.55%   mysqld  mysqld                [.] row_search_mvcc(...)
             |
             --- row_search_mvcc(...)
                |          
                |--99.11%-- ha_innobase::general_fetch(unsigned char*, unsigned int, unsigned int)
                |          handler::ha_index_next(unsigned char*)
                |          handler::read_range_next()
                |          handler::multi_range_read_next(char**)
                |          QUICK_RANGE_SELECT::get_next()
                |          _ZL8rr_quickP11READ_RECORD
                |          sub_select(JOIN*, QEP_TAB*, bool)
                |          JOIN::exec()
                |          mysql_select(...)
                |          handle_select(THD*, select_result*, unsigned long)
                |          _ZL21execute_sqlcom_selectP3THDP10TABLE_LIST
                |          mysql_execute_command(THD*)
                |          mysql_parse(THD*, Parser_state*)
                |          dispatch_command(enum_server_command, THD*, char*, unsigned long)
                |          handle_connection
                |          pfs_spawn_thread
                |          start_thread
                |          __clone
                |          
                |--0.54%-- ha_innobase::index_read(...)
                |          handler::ha_index_read_map(...)
                |          handler::read_range_first(st_key_range const*, st_key_range const*, bool, bool)
                |          handler::multi_range_read_next(char**)
                |          QUICK_RANGE_SELECT::get_next()
                |          _ZL8rr_quickP11READ_RECORD
                |          sub_select(JOIN*, QEP_TAB*, bool)
                |          JOIN::exec()
                |          mysql_select(...)
                |          handle_select(THD*, select_result*, unsigned long)
                |          _ZL21execute_sqlcom_selectP3THDP10TABLE_LIST
                |          mysql_execute_command(THD*)
                |          mysql_parse(THD*, Parser_state*)
                |          dispatch_command(enum_server_command, THD*, char*, unsigned long)
                |          handle_connection
                |          pfs_spawn_thread
                |          start_thread
                |          __clone
                 --0.34%-- [...]

     7.96%   mysqld  mysqld                [.] ut_delay(unsigned long)
             |
             --- ut_delay(unsigned long)
                |          
                |--99.96%-- _ZN11PolicyMutexI14TTASEventMutexI11TrackPolicyEE5enterEmmPKcm.constprop.95
                |          |          
                |          |--99.85%-- buf_page_optimistic_get(...)
                |          |          btr_cur_optimistic_latch_leaves(...)
                |          |          btr_pcur_restore_position_func(...)
                |          |          _ZL30sel_restore_position_for_mysqlPmmP10btr_pcur_tmP5mtr_t.constprop.75
                |          |          row_search_mvcc(...)
                |          |          ha_innobase::general_fetch(unsigned char*, unsigned int, unsigned int)
                |          |          handler::ha_index_next(unsigned char*)
                |          |          handler::read_range_next()
                |          |          handler::multi_range_read_next(char**)
                |          |          QUICK_RANGE_SELECT::get_next()
                |          |          _ZL8rr_quickP11READ_RECORD
                |          |          sub_select(JOIN*, QEP_TAB*, bool)
                |          |          JOIN::exec()
                |          |          mysql_select(...)
                |          |          handle_select(THD*, select_result*, unsigned long)
                |          |          _ZL21execute_sqlcom_selectP3THDP10TABLE_LIST
                |          |          mysql_execute_command(THD*)
                |          |          mysql_parse(THD*, Parser_state*)
                |          |          dispatch_command(enum_server_command, THD*, char*, unsigned long)
                |          |          handle_connection
                |          |          pfs_spawn_thread
                |          |          start_thread
                |          |          __clone
                |           --0.15%-- [...]
                 --0.04%-- [...]

Which looks very different from the result for the covering index scan:

    24.49%   mysqld  mysqld                [.] ut_delay(unsigned long)
             |
             --- ut_delay(unsigned long)
                |          
                |--99.98%-- _ZN11PolicyMutexI14TTASEventMutexI11TrackPolicyEE5enterEmmPKcm.constprop.95
                |          |          
                |          |--98.79%-- buf_page_optimistic_get(...)
                |          |          btr_cur_optimistic_latch_leaves(...)
                |          |          btr_pcur_restore_position_func(...)
                |          |          _ZL30sel_restore_position_for_mysqlPmmP10btr_pcur_tmP5mtr_t.constprop.75
                |          |          row_search_mvcc(...)
                |          |          ha_innobase::general_fetch(unsigned char*, unsigned int, unsigned int)
                |          |          handler::ha_index_next(unsigned char*)
                |          |          handler::read_range_next()
                |          |          handler::multi_range_read_next(char**)
                |          |          QUICK_RANGE_SELECT::get_next()
                |          |          _ZL8rr_quickP11READ_RECORD
                |          |          sub_select(JOIN*, QEP_TAB*, bool)
                |          |          JOIN::exec()
                |          |          mysql_select(...)
                |          |          handle_select(THD*, select_result*, unsigned long)
                |          |          _ZL21execute_sqlcom_selectP3THDP10TABLE_LIST
                |          |          mysql_execute_command(THD*)
                |          |          mysql_parse(THD*, Parser_state*)
                |          |          dispatch_command(enum_server_command, THD*, char*, unsigned long)
                |          |          handle_connection
                |          |          pfs_spawn_thread
                |          |          start_thread
                |          |          __clone

    16.22%   mysqld  mysqld                [.] TTASEventMutex<TrackPolicy>::spin_and_wait(...)
             |
             --- TTASEventMutex<TrackPolicy>::spin_and_wait(...)
                |          
                |--99.71%-- _ZN11PolicyMutexI14TTASEventMutexI11TrackPolicyEE5enterEmmPKcm.constprop.95
                |          |          
                |          |--98.90%-- buf_page_optimistic_get(...)
                |          |          btr_cur_optimistic_latch_leaves(...)
                |          |          btr_pcur_restore_position_func(...)
                |          |          _ZL30sel_restore_position_for_mysqlPmmP10btr_pcur_tmP5mtr_t.constprop.75
                |          |          row_search_mvcc(...)
                |          |          ha_innobase::general_fetch(unsigned char*, unsigned int, unsigned int)
                |          |          handler::ha_index_next(unsigned char*)
                |          |          handler::read_range_next()
                |          |          handler::multi_range_read_next(char**)
                |          |          QUICK_RANGE_SELECT::get_next()
                |          |          _ZL8rr_quickP11READ_RECORD
                |          |          sub_select(JOIN*, QEP_TAB*, bool)
                |          |          JOIN::exec()
                |          |          mysql_select(...)
                |          |          handle_select(THD*, select_result*, unsigned long)
                |          |          _ZL21execute_sqlcom_selectP3THDP10TABLE_LIST
                |          |          mysql_execute_command(THD*)
                |          |          mysql_parse(THD*, Parser_state*)
                |          |          dispatch_command(enum_server_command, THD*, char*, unsigned long)
                |          |          handle_connection
                |          |          pfs_spawn_thread
                |          |          start_thread
                |          |          __clone