diff --git a/README.md b/README.md
index 85d813a..a034843 100644
--- a/README.md
+++ b/README.md
@@ -9,3 +9,14 @@ Extract some insightful technical points from the book.
[How to explain why Repeatable Read surprisingly outperforms Read Committed?](isolation.md)
[The Significant Differences Between BenchmarkSQL and SysBench](sysbench_vs_benchmarksql.md)
+
+[Profile-Guided Optimization](PGO.md)
+
+[Performance Degradation in Query Execution Plans](performance_degradation.md)
+
+[Enhancing the InnoDB Storage Engine](innodb_storage.md)
+
+[Improving Binlog Group Commit Scalability](binlog_group.md)
+
+[Evaluating Performance Gains in MySQL Lock Scheduling Algorithms](cats.md)
+
diff --git a/binlog_group.md b/binlog_group.md
new file mode 100644
index 0000000..0eaecec
--- /dev/null
+++ b/binlog_group.md
@@ -0,0 +1,139 @@
+# Improving Binlog Group Commit Scalability
+
+The binlog group commit mechanism is quite complex, and this complexity makes it challenging to identify its inherent performance problems.
+
+First, performance problems were captured during a TPC-C test at 500 concurrency using the *perf* tool, as shown in the following figure:
+
+
+
+Figure 1. *_pthread_mutex_con_lock* bottleneck reveals performance problems.
+
+It is evident that *_pthread_mutex_con_lock* is a significant bottleneck, accounting for approximately 9.5% of the overhead. Although *perf* does not directly pinpoint the exact problem, it indicates the presence of this bottleneck.
+
+To address the problem, MySQL internals were explored in depth to uncover the factors behind this bottleneck. A conventional binary-search approach with minimal logging was used to narrow down the functions or code segments that incur significant overhead during execution. Minimal logging was chosen to limit interference while diagnosing the root cause: excessive logging distorts performance analysis, and MySQL's built-in diagnostic mechanisms often introduce substantial overhead of their own.
+
+After thorough investigation, the bottleneck was identified within the following code segment.
+
+```c++
+ /*
+ If the queue was not empty, we're a follower and wait for the
+ leader to process the queue. If we were holding a mutex, we have
+ to release it before going to sleep.
+ */
+ if (!leader) {
+ CONDITIONAL_SYNC_POINT_FOR_TIMESTAMP("before_follower_wait");
+ mysql_mutex_lock(&m_lock_done);
+ ...
+ ulonglong start_wait_time = my_micro_time();
+ while (thd->tx_commit_pending) {
+ if (stage == COMMIT_ORDER_FLUSH_STAGE) {
+ mysql_cond_wait(&m_stage_cond_commit_order, &m_lock_done);
+ } else {
+ mysql_cond_wait(&m_stage_cond_binlog, &m_lock_done);
+ }
+ }
+ ulonglong end_wait_time = my_micro_time();
+ ulonglong wait_time = end_wait_time - start_wait_time;
+ if (wait_time > 100000) {
+ fprintf(stderr, "wait too long:%llu\n", wait_time);
+ }
+ mysql_mutex_unlock(&m_lock_done);
+ return false;
+ }
+```
+
+Numerous occurrences of 'wait too long' output indicate that the bottleneck has been exposed. To investigate why 'wait too long' was being reported, the logging was extended as shown in the code below:
+
+```c++
+ /*
+ If the queue was not empty, we're a follower and wait for the
+ leader to process the queue. If we were holding a mutex, we have
+ to release it before going to sleep.
+ */
+ if (!leader) {
+ CONDITIONAL_SYNC_POINT_FOR_TIMESTAMP("before_follower_wait");
+ mysql_mutex_lock(&m_lock_done);
+ ...
+ ulonglong start_wait_time = my_micro_time();
+ while (thd->tx_commit_pending) {
+ if (stage == COMMIT_ORDER_FLUSH_STAGE) {
+ mysql_cond_wait(&m_stage_cond_commit_order, &m_lock_done);
+ } else {
+ mysql_cond_wait(&m_stage_cond_binlog, &m_lock_done);
+ }
+ fprintf(stderr, "wake up thread:%p,total wait time:%llu, stage:%d\n",
+ thd, my_micro_time() - start_wait_time, stage);
+ }
+ ulonglong end_wait_time = my_micro_time();
+ ulonglong wait_time = end_wait_time - start_wait_time;
+ if (wait_time > 100000) {
+ fprintf(stderr, "wait too long:%llu for thread:%p\n", wait_time, thd);
+ }
+ mysql_mutex_unlock(&m_lock_done);
+ return false;
+ }
+```
+
+After another round of testing, a peculiar phenomenon was observed: whenever 'wait too long' messages appeared, the 'wake up thread' logs showed that many user threads had been awakened multiple times.
+
+The problem was traced to the *thd->tx_commit_pending* value not changing, causing threads to repeatedly re-enter the wait loop. Further inspection revealed where this variable is set to false, as illustrated in the following code:
+
+```c++
+void Commit_stage_manager::signal_done(THD *queue, StageID stage) {
+ mysql_mutex_lock(&m_lock_done);
+ for (THD *thd = queue; thd; thd = thd->next_to_commit) {
+ thd->tx_commit_pending = false;
+ thd->rpl_thd_ctx.binlog_group_commit_ctx().reset();
+ }
+ /* if thread belong to commit order wake only commit order queue threads */
+ if (stage == COMMIT_ORDER_FLUSH_STAGE)
+ mysql_cond_broadcast(&m_stage_cond_commit_order);
+ else
+ mysql_cond_broadcast(&m_stage_cond_binlog);
+ mysql_mutex_unlock(&m_lock_done);
+}
+```
+
+From the code, it is evident that *thd->tx_commit_pending* is set to false in the *signal_done* function, after which *mysql_cond_broadcast* wakes every waiting thread: a classic thundering herd. Each awakened user thread checks whether its *tx_commit_pending* has been set to false; if so, it proceeds, otherwise it goes back to waiting.
+
+Despite the complexity of the binlog group commit mechanism, a straightforward analysis identifies the root cause: threads that should not be activated are being triggered, leading to unnecessary context switches with each activation.
+
+During one test, additional statistics were collected on the number of times user threads entered the wait state. The details are shown in the following figure:
+
+
+
+Figure 2. Statistics of threads activated 1, 2, 3 times.
+
+Waiting once is normal and indicates 100% efficiency. Waiting twice suggests 50% efficiency, and waiting three times indicates 33.3% efficiency. Based on the figure, the overall activation efficiency is calculated to be 52.7%.
+
+To solve this problem, an ideal solution would be a multicast activation mechanism with 100% efficiency, where user threads with tx_commit_pending set to false are activated together. However, implementing this requires a deep understanding of the complex logic behind binlog group commit.
+
+Instead, a point-to-point activation mechanism was adopted: each thread is signaled individually, achieving 100% activation efficiency at the cost of additional system calls. The following figure illustrates the relationship between TPC-C throughput and concurrency before and after optimization.
+
+
+
+Figure 3. Impact of group commit optimization with innodb_thread_concurrency=128.
+
+From the figure, it is evident that with innodb_thread_concurrency=128, the optimization of binlog group commit significantly improves throughput under high concurrency.
+
+Note that the effectiveness of this optimization varies with configuration settings and workload, but overall it notably improves throughput, especially under high concurrency.
+
+Below is the comparison of TPC-C throughput and concurrency before and after optimization using standard configurations:
+
+
+
+Figure 4. Impact of group commit optimization using standard configurations.
+
+From the figure, it is clear that this optimization is less pronounced compared to the previous one, but it still shows overall improvement. Extensive testing indicates that the worse the scalability of MySQL, the more significant the effectiveness of binlog group commit optimization.
+
+At the same time, the previously identified bottleneck of *_pthread_mutex_con_lock* has been significantly alleviated after optimization, as shown in the following figure:
+
+
+
+Figure 5. Mitigation of *_pthread_mutex_con_lock* bottleneck.
+
+In summary, this optimization helps address scalability problems associated with binlog group commit.
+
+## References:
+
+[1] Bin Wang (2024). The Art of Problem-Solving in Software Engineering: How to Make MySQL Better.
\ No newline at end of file
diff --git a/cats.md b/cats.md
new file mode 100644
index 0000000..0259e61
--- /dev/null
+++ b/cats.md
@@ -0,0 +1,171 @@
+# Evaluating Performance Gains in MySQL Lock Scheduling Algorithms
+
+Scheduling is crucial in computer system design. The right policy can significantly reduce mean response time without needing faster machines, effectively improving performance for free. Scheduling also optimizes other metrics, such as user fairness and differentiated service levels, ensuring some job classes have lower mean delays than others [1].
+
+MySQL 8.0 uses the Contention-Aware Transaction Scheduling (CATS) algorithm to prioritize transactions waiting for locks. When multiple transactions compete for the same lock, CATS determines the priority based on scheduling weight, calculated by the number of transactions a given transaction blocks. The transaction blocking the most others gets higher priority; if weights are equal, the longest waiting transaction goes first.
+
+A deadlock occurs when multiple transactions cannot proceed because each holds a lock needed by another, causing all involved to wait indefinitely without releasing their locks.
+
+After understanding the MySQL lock scheduling algorithm, let's examine how this algorithm affects throughput. Before testing, it is necessary to understand the previous FIFO algorithm and how to restore it. For relevant details, refer to the git log explanations provided below.
+
+```
+This WL improves the implementation of CATS to the point where the FCFS will be redundant (as often slower, and easy to "emulate" by setting equal schedule weights in CATS), so it removes FCFS from the code, further simplifying the lock_sys's logic.
+```
+
+Based on the above explanation, restoring the FIFO lock scheduling algorithm in MySQL is straightforward. Throughput was then tested in the improved MySQL 8.0.32 using SysBench under a Pareto distribution at varying concurrency levels. Details are provided in the following figure.
+
+
+
+Figure 1. Impact of CATS on throughput at various concurrency levels.
+
+From the figure, it can be seen that the throughput of the CATS algorithm significantly exceeds that of the FIFO algorithm. To compare these two algorithms in terms of user response time, refer to the following figure.
+
+
+
+Figure 2. Impact of CATS on response time at various concurrency levels.
+
+From the figure, it can be seen that the CATS algorithm provides significantly better user response times.
+
+Furthermore, comparing deadlock error statistics during the Pareto distribution test process, details can be found in the following figure.
+
+
+
+Figure 3. Impact of CATS on ignored errors at various concurrency levels.
+
+Comparative analysis shows that the CATS algorithm significantly reduces deadlocks. This reduction in deadlocks likely plays a key role in improving performance. The theoretical basis for this correlation is as follows [2]:
+
+*Under a high-contention setting, the throughput of the target system will be determined by the concurrency control mechanism of the target system: systems which can release locks earlier or reduce the number of aborts will have advantages in such a setting.*
+
+The above test results align closely with MySQL's official findings. The following two figures, based on official tests [3], demonstrate the significant effectiveness of the CATS algorithm.
+
+
+
+Figure 4. Comparison of CATS and FIFO in TPS and mean latency: insights from the MySQL blog.
+
+Additionally, MySQL's official requirements for implementing the CATS algorithm are stringent. Specific details are provided in the following figure:
+
+
+
+Figure 5. Requirements of the official worklog for CATS.
+
+Therefore, with the adoption of the CATS algorithm, performance degradation should not occur in any scenario. The matter would seem settled, but the summary in the CATS paper [1] raises doubts. Details are provided in the following figure:
+
+
+
+Figure 6. Doubts about the CATS paper.
+
+From the information above, it can be inferred that either the industry has overlooked serious flaws in FIFO, or the paper's assessment is flawed and FIFO does not have the problems it suggests. One of these conclusions must be wrong; both cannot be correct.
+
+Contradictions often present valuable opportunities for in-depth problem analysis and resolution. They highlight areas where existing understanding may be challenged or where new insights can be gained.
+
+This time, testing on the improved MySQL 8.0.27 revealed significant bottlenecks under severe conflicts. The *perf* screenshot is shown in the following figure:
+
+
+
+Figure 7. The *perf* screenshot highlighting deadlock problems.
+
+Based on the figure, the bottleneck seems to be related to deadlock problems. The MySQL error log file shows numerous error logs, with a partial screenshot provided below:
+
+
+
+Figure 8. Partial screenshot of numerous error logs.
+
+Continuing the analysis of the corresponding code, the specifics are as follows:
+
+```c++
+void Deadlock_notifier::notify(const ut::vector<const trx_t *> &trxs_on_cycle,
+                               const trx_t *victim_trx) {
+ ut_ad(locksys::owns_exclusive_global_latch());
+ start_print();
+ const auto n = trxs_on_cycle.size();
+ for (size_t i = 0; i < n; ++i) {
+ const trx_t *trx = trxs_on_cycle[i];
+ const trx_t *blocked_trx = trxs_on_cycle[0 < i ? i - 1 : n - 1];
+ const lock_t *blocking_lock =
+ lock_has_to_wait_in_queue(blocked_trx->lock.wait_lock, trx);
+ ut_a(blocking_lock);
+ print_title(i, "TRANSACTION");
+ print(trx, 3000);
+ print_title(i, "HOLDS THE LOCK(S)");
+ print(blocking_lock);
+ print_title(i, "WAITING FOR THIS LOCK TO BE GRANTED");
+ print(trx->lock.wait_lock);
+ }
+ const auto victim_it =
+ std::find(trxs_on_cycle.begin(), trxs_on_cycle.end(), victim_trx);
+ ut_ad(victim_it != trxs_on_cycle.end());
+ const auto victim_pos = std::distance(trxs_on_cycle.begin(), victim_it);
+ ut::ostringstream buff;
+ buff << "*** WE ROLL BACK TRANSACTION (" << (victim_pos + 1) << ")\n";
+ print(buff.str().c_str());
+ DBUG_PRINT("ib_lock", ("deadlock detected"));
+ ...
+ lock_deadlock_found = true;
+}
+```
+
+From the code analysis, it's clear that deadlocks lead to a substantial amount of log output. The ignored errors observed during testing are connected to these deadlocks. The CATS algorithm helps reduce the number of ignored errors, resulting in fewer log outputs. This problem can be consistently reproduced.
+
+Given this context, several considerations emerge:
+
+1. **Impact on Performance Testing:** The extensive error logs and the resulting disruptions could potentially skew the performance evaluation, leading to inaccurate assessments of the system's capabilities.
+2. **Effectiveness of the CATS Algorithm:** The performance improvement of the CATS algorithm may need re-evaluation. If the extensive output of error logs significantly impacts performance, its actual effectiveness may not be as high as initially believed.
+
+All logging was removed from the **Deadlock_notifier::notify** function, MySQL was recompiled, and SysBench read-write tests were rerun under a Pareto distribution. Details are provided in the following figure:
+
+
+
+Figure 9. Impact of CATS on throughput at various concurrency levels for improved MySQL 8.0.27 after eliminating interference.
+
+From the figure, it is evident that there has been a significant change in throughput comparison. In scenarios with severe conflicts, the CATS algorithm slightly outperforms the FIFO algorithm, but the difference is minimal and much less pronounced than in previous tests. Note that these tests were conducted on the improved MySQL 8.0.27.
+
+Let's conduct performance comparison tests on the improved MySQL 8.0.32, with deadlock log interference removed, using Pareto distribution.
+
+
+
+Figure 10. Impact of CATS on throughput at various concurrency levels for improved MySQL 8.0.32 after eliminating interference.
+
+From the figure, it is evident that removing the interference leaves only a slight performance difference. This small gap makes it understandable that the supposed severity of the FIFO scheduling problems went unnoticed: the conclusions of the CATS authors and MySQL developers were likely skewed by the extensive log output that deadlocks produce.
+
+Using the same 32 warehouses as in the CATS algorithm paper, TPC-C tests were conducted at various concurrency levels. MySQL was based on the improved MySQL 8.0.27, and BenchmarkSQL was modified to support 100 concurrent transactions per warehouse.
+
+
+
+Figure 11. Impact of CATS on throughput at different concurrency levels under NUMA after eliminating interference, according to the CATS paper.
+
+From the figure, it's evident that the CATS algorithm performs worse than the FIFO algorithm. To avoid NUMA-related interference, MySQL was bound to NUMA node 0 for a new round of throughput versus concurrency tests.
+
+
+
+Figure 12. Impact of CATS on throughput at different concurrency levels under SMP after eliminating interference, according to the CATS paper.
+
+In this round of testing, the FIFO algorithm continued to outperform the CATS algorithm. The decline in performance of the CATS algorithm in BenchmarkSQL TPC-C testing compared to improvements in SysBench Pareto testing can be attributed to the following reasons:
+
+1. **Additional Overhead**: The CATS algorithm inherently introduces some extra overhead.
+2. **NUMA Environment Problems**: The CATS algorithm may not perform optimally in NUMA environments.
+3. **Conflict Severity**: The conflict severity in TPC-C testing is less pronounced than in SysBench Pareto testing.
+4. **Different Concurrency Scenarios**: SysBench creates concurrency scenarios that differ significantly from those in BenchmarkSQL.
+
+Finally, standard TPC-C testing was performed again with 1000 warehouses at varying concurrency levels. Specific details are shown in the following figure:
+
+
+
+Figure 13. Impact of CATS on BenchmarkSQL throughput after eliminating interference.
+
+From the figure, it is evident that there is little difference between the two algorithms in low-conflict scenarios. In other words, the CATS algorithm does not offer significant benefits in situations with fewer conflicts.
+
+Overall, while CATS shows some improvement in Pareto testing, it is less pronounced than expected. The interference from deadlock log outputs during performance testing impacted the results. The CATS algorithm significantly reduces transaction deadlocks, leading to fewer log outputs and less performance degradation compared to the FIFO algorithm. When deadlock logs are suppressed, the difference between these algorithms is minimal, clarifying the confusion surrounding the CATS algorithm's performance [4].
+
+Database performance testing is inherently complex and error-prone [5]. It cannot be judged by data alone and requires thorough investigation to ensure logical consistency.
+
+## References:
+
+[1] B. Tian, J. Huang, B. Mozafari, and G. Schoenebeck. Contention-aware lock scheduling for transactional databases. PVLDB, 11(5), 2018.
+
+[2] Y. Wang, M. Yu, Y. Hui, F. Zhou, Y. Huang, R. Zhu, et al. 2022. A study of database performance sensitivity to experiment settings. Proceedings of the VLDB Endowment, vol. 15, no. 7.
+
+[3] Sunny Bains. 2017. Contention-Aware Transaction Scheduling Arriving in InnoDB to Boost Performance. https://dev.mysql.com/blog-archive/.
+
+[4] Bin Wang (2024). The Art of Problem-Solving in Software Engineering: How to Make MySQL Better.
+
+[5] M. Raasveldt, P. Holanda, T. Gubner, and H. Muhleisen. Fair Benchmarking Considered Difficult: Common Pitfalls In Database Performance Testing. In 7th International Workshop on Testing Database Systems, DBTest, 2:1--2:6, 2018.
\ No newline at end of file
diff --git a/images/149ca62fd014bd12b60c77573a49757d.gif b/images/149ca62fd014bd12b60c77573a49757d.gif
new file mode 100644
index 0000000..9e902ed
Binary files /dev/null and b/images/149ca62fd014bd12b60c77573a49757d.gif differ
diff --git a/images/19a92bf7065627b28a3403c000eba095.png b/images/19a92bf7065627b28a3403c000eba095.png
new file mode 100644
index 0000000..4b9338e
Binary files /dev/null and b/images/19a92bf7065627b28a3403c000eba095.png differ
diff --git a/images/2eba13133f9eb007d89459cce5d4055b.png b/images/2eba13133f9eb007d89459cce5d4055b.png
new file mode 100644
index 0000000..01d10e2
Binary files /dev/null and b/images/2eba13133f9eb007d89459cce5d4055b.png differ
diff --git a/images/47c39411c3240713c75e848ef5ef4b59.png b/images/47c39411c3240713c75e848ef5ef4b59.png
new file mode 100644
index 0000000..21cda9c
Binary files /dev/null and b/images/47c39411c3240713c75e848ef5ef4b59.png differ
diff --git a/images/47f963cdd950abd91a459ffb66a3744e.png b/images/47f963cdd950abd91a459ffb66a3744e.png
new file mode 100644
index 0000000..d6b63db
Binary files /dev/null and b/images/47f963cdd950abd91a459ffb66a3744e.png differ
diff --git a/images/4ca52ffeebc49306e76c74ed9062257d.png b/images/4ca52ffeebc49306e76c74ed9062257d.png
new file mode 100644
index 0000000..8352220
Binary files /dev/null and b/images/4ca52ffeebc49306e76c74ed9062257d.png differ
diff --git a/images/4cc389ee95fbae485f1e014aad393aa8.gif b/images/4cc389ee95fbae485f1e014aad393aa8.gif
new file mode 100644
index 0000000..ed94ebd
Binary files /dev/null and b/images/4cc389ee95fbae485f1e014aad393aa8.gif differ
diff --git a/images/4f0ea97ad117848a71148849705e311e.png b/images/4f0ea97ad117848a71148849705e311e.png
new file mode 100644
index 0000000..d3d804e
Binary files /dev/null and b/images/4f0ea97ad117848a71148849705e311e.png differ
diff --git a/images/5136495bbdfbe2cefac98d74bd36a88f.png b/images/5136495bbdfbe2cefac98d74bd36a88f.png
new file mode 100644
index 0000000..1008a2d
Binary files /dev/null and b/images/5136495bbdfbe2cefac98d74bd36a88f.png differ
diff --git a/images/63cf3b8bd556c6b1ce5a0436883c8f7b.png b/images/63cf3b8bd556c6b1ce5a0436883c8f7b.png
new file mode 100644
index 0000000..95b9bcc
Binary files /dev/null and b/images/63cf3b8bd556c6b1ce5a0436883c8f7b.png differ
diff --git a/images/66a031ab14d56289d0987c65c73323af.png b/images/66a031ab14d56289d0987c65c73323af.png
new file mode 100644
index 0000000..91515dc
Binary files /dev/null and b/images/66a031ab14d56289d0987c65c73323af.png differ
diff --git a/images/853d21533f748c1c56a4151869a82a27.gif b/images/853d21533f748c1c56a4151869a82a27.gif
new file mode 100644
index 0000000..7cbe669
Binary files /dev/null and b/images/853d21533f748c1c56a4151869a82a27.gif differ
diff --git a/images/8f9080ee71094d948ab7592b449954bb.png b/images/8f9080ee71094d948ab7592b449954bb.png
new file mode 100644
index 0000000..667130a
Binary files /dev/null and b/images/8f9080ee71094d948ab7592b449954bb.png differ
diff --git a/images/a54faa33502b8c17066b1e2af09bdbb0.png b/images/a54faa33502b8c17066b1e2af09bdbb0.png
new file mode 100644
index 0000000..8af7b8b
Binary files /dev/null and b/images/a54faa33502b8c17066b1e2af09bdbb0.png differ
diff --git a/images/b89dd5988d7d7ead6923dbc2d20e146c.png b/images/b89dd5988d7d7ead6923dbc2d20e146c.png
new file mode 100644
index 0000000..2774a97
Binary files /dev/null and b/images/b89dd5988d7d7ead6923dbc2d20e146c.png differ
diff --git a/images/bb111919295b7678530a1adcfa8b7d29.png b/images/bb111919295b7678530a1adcfa8b7d29.png
new file mode 100644
index 0000000..429ce4b
Binary files /dev/null and b/images/bb111919295b7678530a1adcfa8b7d29.png differ
diff --git a/images/cba8723ae722e8d2d13b94e0cf1fda7a.png b/images/cba8723ae722e8d2d13b94e0cf1fda7a.png
new file mode 100644
index 0000000..8d66837
Binary files /dev/null and b/images/cba8723ae722e8d2d13b94e0cf1fda7a.png differ
diff --git a/images/ccb771014f600402fee72ca7134aea10.gif b/images/ccb771014f600402fee72ca7134aea10.gif
new file mode 100644
index 0000000..b4f9c1a
Binary files /dev/null and b/images/ccb771014f600402fee72ca7134aea10.gif differ
diff --git a/images/ce8a6d4b9cd6df4c48cf914fae8a70d2.png b/images/ce8a6d4b9cd6df4c48cf914fae8a70d2.png
new file mode 100644
index 0000000..10fb5bd
Binary files /dev/null and b/images/ce8a6d4b9cd6df4c48cf914fae8a70d2.png differ
diff --git a/images/da7d6c8c12d18b915018939970d2b911.png b/images/da7d6c8c12d18b915018939970d2b911.png
new file mode 100644
index 0000000..325d5c0
Binary files /dev/null and b/images/da7d6c8c12d18b915018939970d2b911.png differ
diff --git a/images/f25b3e3bc94bed108b0c454413f79873.png b/images/f25b3e3bc94bed108b0c454413f79873.png
new file mode 100644
index 0000000..de9cbae
Binary files /dev/null and b/images/f25b3e3bc94bed108b0c454413f79873.png differ
diff --git a/images/image-20240829101222447.png b/images/image-20240829101222447.png
new file mode 100644
index 0000000..27a67c4
Binary files /dev/null and b/images/image-20240829101222447.png differ
diff --git a/images/image-20240829101254601.png b/images/image-20240829101254601.png
new file mode 100644
index 0000000..7c8d9f0
Binary files /dev/null and b/images/image-20240829101254601.png differ
diff --git a/images/image-20240829101332034.png b/images/image-20240829101332034.png
new file mode 100644
index 0000000..19555ad
Binary files /dev/null and b/images/image-20240829101332034.png differ
diff --git a/images/image-20240829101534550.png b/images/image-20240829101534550.png
new file mode 100644
index 0000000..e26eb9a
Binary files /dev/null and b/images/image-20240829101534550.png differ
diff --git a/images/image-20240829101612063.png b/images/image-20240829101612063.png
new file mode 100644
index 0000000..fd5d9f4
Binary files /dev/null and b/images/image-20240829101612063.png differ
diff --git a/images/image-20240829101632142.png b/images/image-20240829101632142.png
new file mode 100644
index 0000000..8689d39
Binary files /dev/null and b/images/image-20240829101632142.png differ
diff --git a/images/image-20240829101650730.png b/images/image-20240829101650730.png
new file mode 100644
index 0000000..918435c
Binary files /dev/null and b/images/image-20240829101650730.png differ
diff --git a/images/image-20240829101712694.png b/images/image-20240829101712694.png
new file mode 100644
index 0000000..de4cd83
Binary files /dev/null and b/images/image-20240829101712694.png differ
diff --git a/images/image-20240829102323261.png b/images/image-20240829102323261.png
new file mode 100644
index 0000000..645c326
Binary files /dev/null and b/images/image-20240829102323261.png differ
diff --git a/images/image-20240829102344815.png b/images/image-20240829102344815.png
new file mode 100644
index 0000000..2454696
Binary files /dev/null and b/images/image-20240829102344815.png differ
diff --git a/images/image-20240829102642856.png b/images/image-20240829102642856.png
new file mode 100644
index 0000000..bc131b9
Binary files /dev/null and b/images/image-20240829102642856.png differ
diff --git a/images/image-20240829102703396.png b/images/image-20240829102703396.png
new file mode 100644
index 0000000..5da1311
Binary files /dev/null and b/images/image-20240829102703396.png differ
diff --git a/images/image-20240829102722393.png b/images/image-20240829102722393.png
new file mode 100644
index 0000000..7d1e450
Binary files /dev/null and b/images/image-20240829102722393.png differ
diff --git a/images/image-20240829103131857.png b/images/image-20240829103131857.png
new file mode 100644
index 0000000..f65fd74
Binary files /dev/null and b/images/image-20240829103131857.png differ
diff --git a/images/image-20240829103236734.png b/images/image-20240829103236734.png
new file mode 100644
index 0000000..8b14507
Binary files /dev/null and b/images/image-20240829103236734.png differ
diff --git a/images/image-20240829103259992.png b/images/image-20240829103259992.png
new file mode 100644
index 0000000..bf8e650
Binary files /dev/null and b/images/image-20240829103259992.png differ
diff --git a/images/image-20240829104512068.png b/images/image-20240829104512068.png
new file mode 100644
index 0000000..43f7e00
Binary files /dev/null and b/images/image-20240829104512068.png differ
diff --git a/images/image-20240829104533718.png b/images/image-20240829104533718.png
new file mode 100644
index 0000000..572f0af
Binary files /dev/null and b/images/image-20240829104533718.png differ
diff --git a/images/image-20240829104554155.png b/images/image-20240829104554155.png
new file mode 100644
index 0000000..a7be52c
Binary files /dev/null and b/images/image-20240829104554155.png differ
diff --git a/images/image-20240829104639402.png b/images/image-20240829104639402.png
new file mode 100644
index 0000000..60e815e
Binary files /dev/null and b/images/image-20240829104639402.png differ
diff --git a/images/image-20240829113916829.png b/images/image-20240829113916829.png
new file mode 100644
index 0000000..d21ddcb
Binary files /dev/null and b/images/image-20240829113916829.png differ
diff --git a/images/image-20240829113948830.png b/images/image-20240829113948830.png
new file mode 100644
index 0000000..179be21
Binary files /dev/null and b/images/image-20240829113948830.png differ
diff --git a/images/image-20240829114017540.png b/images/image-20240829114017540.png
new file mode 100644
index 0000000..65d4255
Binary files /dev/null and b/images/image-20240829114017540.png differ
diff --git a/images/image-20240829114037360.png b/images/image-20240829114037360.png
new file mode 100644
index 0000000..ee13c96
Binary files /dev/null and b/images/image-20240829114037360.png differ
diff --git a/images/image-20240829114100020.png b/images/image-20240829114100020.png
new file mode 100644
index 0000000..4b94eeb
Binary files /dev/null and b/images/image-20240829114100020.png differ
diff --git a/innodb_storage.md b/innodb_storage.md
new file mode 100644
index 0000000..e4ba093
--- /dev/null
+++ b/innodb_storage.md
@@ -0,0 +1,304 @@
+## Enhancing the InnoDB Storage Engine
+
+### 1.1 MVCC ReadView: Identified Problems
+
+A key component of any MVCC scheme is the mechanism for quickly determining which tuples are visible to which transactions. A transaction's snapshot is created by building a ReadView (RV) vector that holds the TXIDs of all concurrent transactions smaller than the transaction's TXID. The cost of acquiring a snapshot increases linearly with the number of concurrent transactions, even if the transaction only reads tuples written by a single committed transaction, highlighting a known scalability limitation [1].
+
+After understanding the scalability problems with the MVCC ReadView mechanism, let's examine how MySQL implements MVCC ReadView. Under the Read Committed isolation level, during the process of reading data, the InnoDB storage engine triggers the acquisition of the ReadView. A screenshot of part of the ReadView data structure is shown below:
+
+```c++
+private:
+ // Disable copying
+ ReadView(const ReadView &);
+ ReadView &operator=(const ReadView &);
+private:
+ /** The read should not see any transaction with trx id >= this
+ value. In other words, this is the "high water mark". */
+ trx_id_t m_low_limit_id;
+ /** The read should see all trx ids which are strictly
+ smaller (<) than this value. In other words, this is the
+  "low water mark". */
+ trx_id_t m_up_limit_id;
+ /** trx id of creating transaction, set to TRX_ID_MAX for free
+ views. */
+ trx_id_t m_creator_trx_id;
+ /** Set of RW transactions that was active when this snapshot
+ was taken */
+ ids_t m_ids;
+ /** The view does not need to see the undo logs for transactions
+ whose transaction number is strictly smaller (<) than this value:
+ they can be removed in purge if not needed by other views */
+ trx_id_t m_low_limit_no;
+ ...
+```
+
+Here, *m_ids* is a data structure of type *ids_t*, which closely resembles *std::vector*. See the specific explanation below:
+
+```c++
+ /** This is similar to a std::vector but it is not a drop
+ in replacement. It is specific to ReadView. */
+ class ids_t {
+ typedef trx_ids_t::value_type value_type;
+ /**
+ Constructor */
+ ids_t() : m_ptr(), m_size(), m_reserved() {}
+ /**
+ Destructor */
+ ~ids_t() { ut::delete_arr(m_ptr); }
+ /** Try and increase the size of the array. Old elements are copied across.
+ It is a no-op if n is < current size.
+ @param n Make space for n elements */
+ void reserve(ulint n);
+ ...
+```
+
+The MVCC ReadView visibility determination algorithm is implemented in the *changes_visible* function, shown below:
+
+```c++
+ /** Check whether the changes by id are visible.
+ @param[in] id transaction id to check against the view
+ @param[in] name table name
+ @return whether the view sees the modifications of id. */
+ [[nodiscard]] bool changes_visible(trx_id_t id,
+ const table_name_t &name) const {
+ ut_ad(id > 0);
+ if (id < m_up_limit_id || id == m_creator_trx_id) {
+ return (true);
+ }
+ check_trx_id_sanity(id, name);
+ if (id >= m_low_limit_id) {
+ return (false);
+ } else if (m_ids.empty()) {
+ return (true);
+ }
+ const ids_t::value_type *p = m_ids.data();
+ return (!std::binary_search(p, p + m_ids.size(), id));
+ }
+```
+
+From the code, it can be seen that the visibility algorithm works efficiently when concurrency is low. However, as concurrency increases, the efficiency of using binary search to determine visibility significantly decreases, particularly in NUMA environments.
+
+### 1.2 Solutions for Enhancing MVCC ReadView Scalability
+
+There are two fundamental approaches to improving scalability here [2]:
+
+*First, finding an algorithm that improves the complexity, so that each additional connection does not increase the snapshot computation costs linearly.*
+
+*Second, perform less work for each connection, hopefully reducing the total time taken so much that even at high connection counts the total time is still small enough to not matter much (i.e. reduce the constant factor).*
+
+For the first solution, adopting a multi-version visibility algorithm based on Commit Sequence Numbers (CSN) offers benefits [7]: *the cost of taking snapshots can be reduced by converting snapshots into CSNs instead of maintaining a transaction ID list.* Specifically, under the Read Committed isolation level, there's no need to replicate an active transaction list for each read operation, thereby improving scalability.
+
+Considering the complexity of implementation, this book opts for the second solution, which directly modifies the MVCC ReadView data structure to mitigate MVCC ReadView scalability problems.
+
+### 1.3 Improvements to the MVCC ReadView Data Structure
+
+In the ReadView structure, the original approach used a vector to store the list of active transactions. Now, it has been changed to the following data structure:
+
+```c++
+class ReadView {
+ ...
+ private:
+ // Disable copying
+ ReadView &operator=(const ReadView &);
+ public:
+ bool skip_view_list{false};
+ private:
+ unsigned char top_active[MAX_TOP_ACTIVE_BYTES];
+ trx_id_t m_short_min_id;
+ trx_id_t m_short_max_id;
+ bool m_has_short_actives;
+ /** The read should not see any transaction with trx id >= this
+ value. In other words, this is the "high water mark". */
+ trx_id_t m_low_limit_id;
+ /** The read should see all trx ids which are strictly
+ smaller (<) than this value. In other words, this is the "low water mark". */
+ trx_id_t m_up_limit_id;
+ /** trx id of creating transaction, set to TRX_ID_MAX for free views. */
+ trx_id_t m_creator_trx_id;
+ /** Set of RW transactions that was active when this snapshot
+ was taken */
+ ids_t m_long_ids;
+ ...
+```
+
+Furthermore, corresponding code modifications were made in the related interface functions, as changes to the data structure necessitate adjustments to the internal code within these functions.
+
+This new MVCC ReadView data structure can be seen as a hybrid data structure, as shown in the following figure [3].
+
+
+
+Figure 1. A new hybrid data structure suitable for active transaction list in MVCC ReadView.
+
+Typically, online transactions are short rather than long, and transaction IDs increase continuously. To leverage these characteristics, a hybrid data structure is used: a static array for consecutive short transaction IDs and a vector for long transactions. With a 2048-byte array, up to 16,384 consecutive active transaction IDs can be stored, each bit representing a transaction ID.
+
+The minimum short transaction ID is used to differentiate between short and long transactions. IDs smaller than this minimum go into the long transaction vector, while IDs equal to or greater than it are placed in the short transaction array. For an ID in changes_visible, if it is below the minimum short transaction ID, a direct query is made to the vector, which is efficient due to the generally small number of long transactions. If the ID is equal to or above the minimum short transaction ID, a bitwise query is performed, with a time complexity of O(1), compared to the previous O(log n) complexity. This improvement enhances efficiency and reduces cache migration between NUMA nodes, as O(1) queries typically complete within a single CPU time slice.
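+
+The hybrid lookup described above can be sketched as follows (a simplified illustration; *HybridActiveSet*, the sizes, and the method names are invented for this example and are not the actual patch):
+
+```c++
+#include <algorithm>
+#include <cassert>
+#include <cstddef>
+#include <cstdint>
+#include <vector>
+
+using trx_id_t = uint64_t;
+constexpr size_t MAX_TOP_ACTIVE_BYTES = 2048;  // 2048 * 8 = 16384 bit slots
+
+// Illustrative hybrid active-transaction set: a bitmap covers recent
+// ("short") transaction IDs; a sorted vector holds the rest ("long").
+struct HybridActiveSet {
+  unsigned char top_active[MAX_TOP_ACTIVE_BYTES] = {};
+  trx_id_t short_min_id = 0;       // smallest ID mapped by the bitmap
+  std::vector<trx_id_t> long_ids;  // sorted IDs outside the bitmap range
+
+  bool in_bitmap_range(trx_id_t id) const {
+    return id >= short_min_id && id - short_min_id < MAX_TOP_ACTIVE_BYTES * 8;
+  }
+
+  void add(trx_id_t id) {
+    if (in_bitmap_range(id)) {
+      const trx_id_t bit = id - short_min_id;
+      top_active[bit / 8] |= static_cast<unsigned char>(1u << (bit % 8));
+    } else {
+      long_ids.insert(std::upper_bound(long_ids.begin(), long_ids.end(), id),
+                      id);
+    }
+  }
+
+  // Active == invisible to the snapshot. The bitmap test is O(1); the
+  // vector keeps the old O(log n) path but stays small in practice.
+  bool is_active(trx_id_t id) const {
+    if (in_bitmap_range(id)) {
+      const trx_id_t bit = id - short_min_id;
+      return ((top_active[bit / 8] >> (bit % 8)) & 1u) != 0;
+    }
+    return std::binary_search(long_ids.begin(), long_ids.end(), id);
+  }
+};
+
+int main() {
+  HybridActiveSet set;
+  set.short_min_id = 1000;
+  set.add(1005);  // short transaction: a single bit in the array
+  set.add(42);    // long transaction: falls back to the sorted vector
+  assert(set.is_active(1005) && set.is_active(42));
+  assert(!set.is_active(1006) && !set.is_active(43));
+  return 0;
+}
+```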
+
+In addition to the previously mentioned transformation, similar modifications were applied to the global transaction active list. The original data structure used for this list is shown in the following code snippet:
+
+```c++
+ /** Array of Read write transaction IDs for MVCC snapshot. A ReadView would
+ take a snapshot of these transactions whose changes are not visible to it.
+ We should remove transactions from the list before committing in memory and
+ releasing locks to ensure right order of removal and consistent snapshot. */
+ trx_ids_t rw_trx_ids;
+```
+
+Now it has been changed to the following data structure:
+
+```c++
+ /** Array of Read write transaction IDs for MVCC snapshot. A ReadView would
+ take a snapshot of these transactions whose changes are not visible to it.
+ We should remove transactions from the list before committing in memory and
+ releasing locks to ensure right order of removal and consistent snapshot. */
+ trx_ids_t long_rw_trx_ids;
+ unsigned char short_rw_trx_ids_bitmap[MAX_SHORT_ACTIVE_BYTES];
+ int short_rw_trx_valid_number;
+ trx_id_t min_short_valid_id;
+ trx_id_t max_short_valid_id;
+```
+
+In the *short_rw_trx_ids_bitmap* structure, *MAX_SHORT_ACTIVE_BYTES* is set to 65536, theoretically accommodating up to 524,288 consecutive short transaction IDs. If the limit is exceeded, the oldest short transaction IDs are converted into long transactions and stored in *long_rw_trx_ids*. Global long and short transactions are distinguished by *min_short_valid_id*: IDs smaller than this value are treated as global long transactions, while IDs equal to or greater are considered global short transactions.
+
+During the copying process from the global active transaction list, the *short_rw_trx_ids_bitmap* structure, which uses only one bit per transaction ID, allows for much higher copying efficiency compared to the native MySQL solution. For example, with 1000 active transactions, the native MySQL version would require copying at least 8000 bytes, whereas the optimized solution may only need a few hundred bytes. This results in a significant improvement in copying efficiency.
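+
+The arithmetic behind this claim can be checked with a short sketch (assuming 8-byte transaction IDs and, hypothetically, 1000 active transactions spread over 2000 consecutive IDs):
+
+```c++
+#include <cassert>
+#include <cstddef>
+#include <cstdint>
+
+int main() {
+  // Hypothetical workload: 1000 active transactions spread over a span
+  // of 2000 consecutive transaction IDs.
+  const size_t active_trx = 1000;
+  const size_t id_span = 2000;
+
+  // Native MySQL: rw_trx_ids copies one 8-byte trx_id_t per transaction.
+  const size_t native_bytes = active_trx * sizeof(uint64_t);
+
+  // Bitmap: one bit per ID in the spanned range, rounded up to bytes.
+  const size_t bitmap_bytes = (id_span + 7) / 8;
+
+  assert(native_bytes == 8000);  // "at least 8000 bytes", as stated above
+  assert(bitmap_bytes == 250);   // a few hundred bytes
+  return 0;
+}
+```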
+
+After implementing these modifications, performance comparison tests were conducted to evaluate the effectiveness of the MVCC ReadView optimization. The figure below shows a comparison of TPC-C throughput with varying concurrency levels, before and after modifying the MVCC ReadView data structure.
+
+
+
+Figure 2. Performance comparison before and after adopting the new hybrid data structure in NUMA.
+
+From the figure, it is evident that this transformation primarily optimized scalability and improved MySQL's peak throughput in NUMA environments. Further performance comparisons before and after optimization can be analyzed using tools like *perf*. Below is a screenshot from *perf* at 300 concurrency, prior to optimization:
+
+
+
+Figure 3. Latch-related bottleneck observed in *perf* screenshot.
+
+From the figure, it can be seen that the first two bottlenecks were significant, accounting for approximately 33% of the overhead. After optimization, the *perf* screenshot at 300 concurrency is as follows:
+
+
+
+Figure 4. Significant alleviation of latch-related bottleneck.
+
+After optimization, as shown in the screenshot above, the proportions of the previous top two bottlenecks have been significantly reduced.
+
+Why does changing the MVCC ReadView data structure significantly enhance scalability? This is because accessing these structures involves acquiring a global latch. Optimizing the data structure accelerates access to critical resources, reducing concurrency conflicts and minimizing cache migration across NUMA nodes.
+
+The native MVCC ReadView uses a vector to store the list of active transactions. In high-concurrency scenarios, this list can become large, leading to a larger working set. In NUMA environments, both querying and replication can become slower, potentially causing a single CPU time slice to miss its deadline and resulting in significant context-switching costs. The theoretical basis for this aspect is as follows [4]:
+
+*Context-switches that occur in the middle of a logical operation evict a possibly larger working set from the cache. When the suspended thread resumes execution, it wastes time restoring the evicted working set.*
+
+Throughput improvement under the ARM architecture is evaluated next. Details are shown in the following figure:
+
+
+
+Figure 5. Throughput improvement under the ARM architecture.
+
+From the figure, it is evident that there is also a significant improvement under the ARM architecture. Extensive test data confirms that the MVCC ReadView optimization yields clear benefits in NUMA environments, regardless of whether the architecture is ARM or x86.
+
+How much improvement can this optimization achieve in an SMP environment?
+
+
+
+Figure 6. Performance comparison before and after adopting the new hybrid data structure in SMP.
+
+From the figure, it can be observed that after binding to NUMA node 0, the improvement from the MVCC ReadView optimization is not significant. This suggests that the optimization primarily enhances scalability in NUMA architectures.
+
+In practical MySQL usage, preventing excessive user threads from entering the InnoDB storage engine can significantly reduce the size of the global active transaction list. This transaction throttling mechanism complements the MVCC ReadView optimization effectively, improving overall performance. Combined with double latch avoidance, discussed in the next section, the TPC-C test results in the following figure clearly demonstrate these improvements.
+
+
+
+Figure 7. Maximum TPC-C throughput in BenchmarkSQL with transaction throttling mechanisms.
+
+### 1.4 Avoiding Double Latch Problems
+
+During testing after the MVCC ReadView optimization, a noticeable decline in throughput was observed under extremely high concurrency conditions. The specific details are shown in the following figure:
+
+
+
+Figure 8. Performance degradation at concurrency levels exceeding 500.
+
+From the figure, it can be seen that throughput significantly decreases once concurrency exceeds 500. The problem was traced to frequent acquisitions of the *trx-sys* latch, as shown in the code snippet below:
+
+```c++
+ } else if (trx->isolation_level <= TRX_ISO_READ_COMMITTED &&
+ MVCC::is_view_active(trx->read_view)) {
+ mutex_enter(&trx_sys->mutex);
+ trx_sys->mvcc->view_close(trx->read_view, true);
+ mutex_exit(&trx_sys->mutex);
+ }
+```
+
+The other code snippet is shown below:
+
+```c++
+ if (lock_type != TL_IGNORE && trx->n_mysql_tables_in_use == 0) {
+ trx->isolation_level =
+ innobase_trx_map_isolation_level(thd_get_trx_isolation(thd));
+ if (trx->isolation_level <= TRX_ISO_READ_COMMITTED &&
+ MVCC::is_view_active(trx->read_view)) {
+ /* At low transaction isolation levels we let
+ each consistent read set its own snapshot */
+ mutex_enter(&trx_sys->mutex);
+ trx_sys->mvcc->view_close(trx->read_view, true);
+ mutex_exit(&trx_sys->mutex);
+ }
+ }
+```
+
+InnoDB introduces a global trx-sys latch during the view close process, impacting scalability under high concurrency. To address this, an attempt was made to remove the global latch. One of the modifications is shown in the code snippet below:
+
+```c++
+ } else if (trx->isolation_level <= TRX_ISO_READ_COMMITTED &&
+ MVCC::is_view_active(trx->read_view)) {
+ trx_sys->mvcc->view_close(trx->read_view, false);
+  }
+```
+
+The other modification is shown in the code snippet below:
+
+```c++
+ if (lock_type != TL_IGNORE && trx->n_mysql_tables_in_use == 0) {
+ trx->isolation_level =
+ innobase_trx_map_isolation_level(thd_get_trx_isolation(thd));
+ if (trx->isolation_level <= TRX_ISO_READ_COMMITTED &&
+ MVCC::is_view_active(trx->read_view)) {
+ /* At low transaction isolation levels we let
+ each consistent read set its own snapshot */
+ trx_sys->mvcc->view_close(trx->read_view, false);
+ }
+ }
+```
+
+Using the MVCC ReadView optimized version, compare TPC-C throughput before and after the modifications. Details are shown in the following figure:
+
+
+
+Figure 9. Performance improvement after eliminating the double latch bottleneck.
+
+From the figure, it is evident that the modifications significantly improved scalability under high-concurrency conditions. To understand the reasons for this improvement, let's use the *perf* tool for further investigation. Below is the *perf* screenshot at 2000 concurrency before the modifications:
+
+
+
+Figure 10. Latch-related bottleneck observed in *perf* screenshot.
+
+From the figure, it is evident that the latch-related bottlenecks are quite pronounced. After the code modifications, here is the *perf* screenshot at 3000 concurrency:
+
+
+
+Figure 11. Significant alleviation of latch-related bottleneck.
+
+Even with higher concurrency, such as 3000, the bottlenecks are not pronounced. This suggests that the optimizations have effectively alleviated the latch-related performance problems, improving scalability under extreme conditions.
+
+Removing the global latch around the *view_close* call improves scalability, whereas retaining it severely degrades scalability under high concurrency. Although the *view_close* function operates efficiently within its critical section, frequent acquisition of the global *trx-sys* latch, which is shared across the entire *trx-sys* subsystem, causes significant contention and head-of-line blocking. This is the 'double latch' problem: both *view_open* and *view_close* require the global *trx-sys* latch. Notably, removing the latch from the final stage (the *view_close* call) or introducing a separate latch for it significantly mitigates the problem.
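+
+The essence of the fix can be sketched as follows (an illustrative model, not the actual MySQL code): closing a view only flips per-view state, so no global latch is needed on the close path.
+
+```c++
+#include <atomic>
+#include <cassert>
+#include <mutex>
+
+// Illustrative model, not MySQL code: the global trx-sys latch is taken
+// only on view_open (which must copy the active-transaction list), while
+// view_close just flips per-view state, avoiding a second acquisition.
+struct View {
+  std::atomic<bool> closed{true};
+};
+
+std::mutex trx_sys_mutex;  // stands in for the global trx-sys latch
+
+void view_open(View &v) {
+  std::lock_guard<std::mutex> guard(trx_sys_mutex);
+  // ... snapshot of the global active-transaction list is taken here ...
+  v.closed.store(false, std::memory_order_release);
+}
+
+// The "double latch" variant would lock trx_sys_mutex here as well; the
+// fix closes the view locally and defers cleanup to the next view_open.
+void view_close_lock_free(View &v) {
+  v.closed.store(true, std::memory_order_release);
+}
+
+int main() {
+  View v;
+  view_open(v);
+  assert(!v.closed.load());
+  view_close_lock_free(v);  // no global latch acquired on the close path
+  assert(v.closed.load());
+  return 0;
+}
+```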
+
+
+
+## References:
+
+[1] Adnan Alhomssi and Viktor Leis. 2023. Scalable and Robust Snapshot Isolation for High-Performance Storage Engines. Proc. VLDB Endow. 16, 6 (2023), 1426–1438.
+
+[2] Andres Freund. 2020. Improving Postgres Connection Scalability: Snapshots. techcommunity.microsoft.com.
+
+[3] Bin Wang (2024). The Art of Problem-Solving in Software Engineering: How to Make MySQL Better.
+
+[4] Harizopoulos, S. and Ailamaki, A. 2003. A case for staged database systems. In Proceedings of the Conference on Innovative Data Systems Research (CIDR). Asilomar, CA.
\ No newline at end of file
diff --git a/isolation.md b/isolation.md
index 67a518d..623ef79 100644
--- a/isolation.md
+++ b/isolation.md
@@ -10,7 +10,7 @@ The figure below presents results from the SysBench uniform test, where concurre
-Figure 2-8. SysBench read-write performance comparison with low conflicts under different isolation levels.
+Figure 1. SysBench read-write performance comparison with low conflicts under different isolation levels.
Below 400 concurrency, the differences are minor because of fewer conflicts in the uniform test. With fewer conflicts, the impact of lock strategies under different transaction isolation levels is reduced. However, Read Committed is mainly constrained by frequent acquisition of MVCC ReadView, resulting in performance inferior to Repeatable Read.
@@ -18,7 +18,7 @@ Continuing with the SysBench test under pareto distribution conditions, specific
-Figure 2-9. SysBench read-write performance comparison with high conflicts under different isolation levels.
+Figure 2. SysBench read-write performance comparison with high conflicts under different isolation levels.
The figure clearly illustrates that in scenarios with significant conflicts, performance differences due to lock strategies under different transaction isolation levels are pronounced. As anticipated, higher transaction isolation levels generally exhibit lower throughput, particularly under severe conflict conditions.
@@ -28,5 +28,6 @@ In summary, in low-conflict tests like SysBench uniform, the overhead of MVCC Re
## References:
-1. https://dev.mysql.com/doc/refman/8.0/en/.
-2. Bin Wang (2024). The Art of Problem-Solving in Software Engineering:How to Make MySQL Better.
\ No newline at end of file
+[1] https://dev.mysql.com/doc/refman/8.0/en/.
+
+[2] Bin Wang (2024). The Art of Problem-Solving in Software Engineering: How to Make MySQL Better.
\ No newline at end of file
diff --git a/performance_degradation.md b/performance_degradation.md
new file mode 100644
index 0000000..8e48f40
--- /dev/null
+++ b/performance_degradation.md
@@ -0,0 +1,146 @@
+# Solving Performance Degradation in Query Execution Plans
+
+During secondary development on MySQL 8.0.27, TPC-C tests with BenchmarkSQL became unstable: throughput declined rapidly, which complicated optimization work. Because the official release was trusted, the problem was initially overlooked despite these testing difficulties.
+
+Only after a user reported a significant performance drop following an upgrade did we begin to take it seriously. The feedback was reliable: MySQL 8.0.25 performed well, but upgrading to MySQL 8.0.29 led to a substantial decline. This confirmed that a genuine performance problem existed.
+
+Simultaneously, it was confirmed that the performance degradation problem in MySQL 8.0.27 was the same as in MySQL 8.0.29. MySQL 8.0.27 had undergone two scalability optimizations specifically for trx-sys, which theoretically should have increased throughput. Reviewing the impact of latch sharding in trx-sys on performance:
+
+
+
+Figure 1. Impact of latch sharding in trx-sys under different concurrency levels.
+
+Let's continue examining the comparison of throughput and concurrency between trx-sys latch sharding optimization and the MySQL 8.0.27 release version. Specific details are shown in the following figure:
+
+
+
+Figure 2. Performance degradation in MySQL 8.0.27 release version.
+
+From the figure, it is evident that the performance degradation of the MySQL 8.0.27 release version is significant under low concurrency conditions, with a noticeable drop in peak performance. This aligns with user feedback regarding decreased throughput and is easily reproducible using BenchmarkSQL.
+
+The MySQL 8.0.27 release version already exhibited this problem, whereas the earlier MySQL 8.0.25 release did not. Using this information, the goal was to identify the specific git commit that caused the performance degradation, a complex process that typically involves binary search. After extensive testing, the problem was initially narrowed down to a single commit. However, that commit contained tens of thousands of lines of code, making it nearly impossible to pinpoint the exact segment responsible. It was later discovered that the commit was a collective merge from a particular branch, which allowed further breakdown and ultimately identified the root cause in the following commit:
+
+```c++
+commit 9a13c1c6971f4bd56d143179ecfb34cca8ecc018
+Author: Steinar H. Gunderson
+Date: Tue Jun 8 15:14:35 2021 +0200
+
+ Bug #32976857: REMOVE QEP_TAB_STANDALONE [range optimizer, noclose]
+
+ Remove the QEP_TAB dependency from test_quick_select() (ie., the range
+ optimizer).
+
+ Change-Id: Ie0fcce71dfc813920711c43c3d62635dae0d7d20
+```
+
+Using the commit information, two versions were compiled and SQL queries performing exceptionally slow in TPC-C tests were identified. The execution plans of these slow SQL queries were analyzed using *'explain'*. Specific details are shown in the following figure:
+
+
+
+Figure 3. Abnormalities indicated by rows in *'explain'*.
+
+From the figure, it can be seen that most of the execution plans are identical, except for the *'rows'* column. In the normal version, the *'rows'* column shows just over 200, whereas in the problematic version, it shows over 1,000,000. After continuously simplifying the SQL, a highly representative SQL query was finally identified. Specific details are shown in the following figure:
+
+
+
+Figure 4. Significant discrepancies between SQL execution results and *'explain'* output.
+
+Based on the *Filter* information obtained from '*explain*', the last query shown in the figure was constructed. The figure reveals that while the last query returned only 193 rows, '*explain*' displayed over 1.17 million rows for *'rows'*. This discrepancy highlights a complex problem, as execution plans are not always fully understood by all MySQL developers. Fortunately, identifying the commit responsible for the performance degradation provided a critical foundation for solving the problem. Although solving the problem was relatively straightforward with this information, analyzing the root cause from the SQL statement itself proved to be far more challenging.
+
+Let's continue with an in-depth analysis of this problem. The following figure displays the '*explain*' result for a specific SQL query:
+
+
+
+Figure 5. Sample SQL query representing the problem.
+
+From the figure, it can be seen that the number of rows is still large, indicating that this SQL query is representative.
+
+Two different debug versions of MySQL were compiled: one with anomalies and one normal. Debug versions were used to capture useful function call relationships through debug traces. When executing the problematic SQL statement on the version with anomalies, the relevant debug trace information is as follows:
+
+
+
+Figure 6. Debug trace information for the abnormal version.
+
+Similarly, for the normal version, the relevant debug trace information is as follows:
+
+
+
+Figure 7. Debug trace information for the normal version.
+
+Comparing the two figures above, it is noticeable that the normal version includes additional content within the green box, indicating that conditions are applied in the normal version, whereas the abnormal version lacks these conditions. To understand why the abnormal version is missing these conditions, it is necessary to add additional trace information in the *get_full_func_mm_tree* function to capture specific details about the cause of this difference.
+
+After adding extra trace information, the debug trace result for the abnormal version is as follows:
+
+
+
+Figure 8. Supplementary debug trace information for the abnormal version.
+
+The debug trace result for the normal version is as follows:
+
+
+
+Figure 9. Supplementary debug trace information for the normal version.
+
+Upon comparing the two figures above, significant differences are observed. In the normal version, the value of *param_comp* is 16140901064495857660, while in the abnormal version, it is 16140901064495857661, differing by 1. To understand this discrepancy, let's first examine how the *param_comp* value is calculated, as detailed in the following code snippet:
+
+```c++
+static SEL_TREE *get_full_func_mm_tree(THD *thd, RANGE_OPT_PARAM *param,
+ table_map prev_tables,
+ table_map read_tables,
+ table_map current_table,
+ bool remove_jump_scans, Item *predicand,
+ Item_func *op, Item *value, bool inv) {
+ SEL_TREE *tree = nullptr;
+ SEL_TREE *ftree = nullptr;
+ const table_map param_comp = ~(prev_tables | read_tables | current_table);
+ DBUG_TRACE;
+ ...
+```
+
+From the code, it's evident that *param_comp* is calculated using a bitwise OR operation on three variables, followed by a bitwise NOT operation. The difference of 1 suggests that at least one of these variables differs, helping to narrow down the problem.
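+
+The effect can be reproduced with a toy calculation (the masks below are hypothetical; bit 0 stands for the first table in the query): omitting a single table's bit from the OR flips exactly one bit of the complement, matching the observed difference of 1.
+
+```c++
+#include <cassert>
+#include <cstdint>
+
+using table_map = uint64_t;
+
+int main() {
+  // Hypothetical masks: bit 0 stands for the first table in the query.
+  const table_map prev_tables = 0;
+  const table_map current_table = 1ULL << 1;
+
+  const table_map read_tables_normal = 1ULL << 0;  // first table included
+  const table_map read_tables_abnormal = 0;        // first table missing
+
+  const table_map comp_normal =
+      ~(prev_tables | read_tables_normal | current_table);
+  const table_map comp_abnormal =
+      ~(prev_tables | read_tables_abnormal | current_table);
+
+  // Dropping one table bit from the OR flips exactly one bit of the
+  // complement, matching the observed difference of 1 in the traces.
+  assert(comp_abnormal - comp_normal == 1);
+  return 0;
+}
+```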
+
+The calculation involves three *table_map* variables with lengthy values, making ordinary calculators insufficient and the process too complex to detail here.
+
+The key point is that debug tracing revealed critical differences. Coupled with the information provided by identifying the Git commit responsible for the performance discrepancy, analyzing the root cause is no longer difficult.
+
+Here is the final fix patch, detailed as follows:
+
+
+
+Figure 10. Final patch for solving performance degradation in query execution plans.
+
+When calling the *test_quick_select* function, reintroduce the *const_table* and *read_tables* variables (related to the previously discussed variables). This ensures that filtering conditions in the execution plan are not overlooked.
+
+After applying the above patch to MySQL 8.0.27, the performance degradation problem was solved. A test comparing TPC-C throughput at various concurrency levels, both before and after applying the patch, was conducted. Specific details are shown in the following figure:
+
+
+
+Figure 11. Effects of the patch on solving performance degradation.
+
+From the figure, it is evident that after applying the patch, throughput and peak performance have significantly improved under low concurrency conditions. However, under high concurrency conditions, throughput not only failed to increase but actually decreased, likely due to scalability bottlenecks in MVCC ReadView.
+
+After addressing the MVCC ReadView scalability problem, reassess the impact of this patch, as detailed in the figure below:
+
+
+
+Figure 12. Actual effects of the patch after addressing the MVCC ReadView scalability problem.
+
+From the figure, it is evident that this patch has significantly improved MySQL's throughput. This case demonstrates that scalability problems can disrupt certain optimizations. To scientifically assess the effectiveness of an optimization, it is essential to address most scalability problems beforehand to achieve a more accurate evaluation.
+
+Finally, let's examine the results of long-term stability testing for TPC-C. The following figure shows the results of an 8-hour test under 100 concurrency, with throughput captured at each hour n (1 ≤ n ≤ 8).
+
+
+
+Figure 13. Comparison of stability tests: MySQL 8.0.27 vs. improved MySQL 8.0.27.
+
+From the figure, it is evident that after applying the patch, the rate of throughput decline has been significantly mitigated. The MySQL 8.0.27 version experienced a dramatic throughput decline, failing to meet the stability requirements of TPC-C testing. However, after applying the patch, MySQL's performance returned to normal.
+
+Addressing this problem directly presents considerable challenges, particularly for MySQL developers unfamiliar with query execution plans. Using logical reasoning and a systematic approach to identify and address code differences before and after the problem arose is a more elegant problem-solving method, though it is complex.
+
+It is noteworthy that no regression testing problems were encountered after applying the patch, demonstrating high stability and providing a solid foundation for future performance improvements. Currently, MySQL 8.0.38 still hasn't solved this problem, suggesting potential shortcomings in MySQL's testing system. Given the complexity of MySQL databases, users should exercise caution when upgrading and consider using tools like TCPCopy [2] to avoid potential regression testing problems.
+
+## References:
+
+[1] Bin Wang (2024). The Art of Problem-Solving in Software Engineering: How to Make MySQL Better.
+
+[2] https://github.com/session-replay-tools/tcpcopy.
\ No newline at end of file
diff --git a/pgo.md b/pgo.md
new file mode 100644
index 0000000..58d94b2
--- /dev/null
+++ b/pgo.md
@@ -0,0 +1,56 @@
+# Profile-Guided Optimization (PGO)
+
+Profile-guided optimization (PGO) typically improves program execution efficiency. The following figure illustrates how PGO improves the throughput of a standalone MySQL instance under various concurrency levels, following the resolution of MySQL MVCC ReadView scalability problems.
+
+
+
+Figure 1. Impact of PGO after solving MVCC ReadView scalability problems.
+
+From the figure, it is evident that PGO has a notable impact.
+
+For MySQL 8.0.27 with PGO, throughput decreases under high concurrency conditions. The specific details are shown in the figure below:
+
+
+
+Figure 2. Performance comparison tests before and after using PGO in MySQL 8.0.27.
+
+The test results above indicate that PGO's benefits for MySQL are only fully realized once scalability problems have been addressed. It should be noted that both comparative tests above were conducted in mainstream NUMA environments. When MySQL is bound to a single NUMA node, creating an SMP environment, the following figure shows the relationship between TPC-C throughput and concurrency levels before and after PGO.
+
+
+
+Figure 3. Performance comparison tests before and after using PGO in MySQL 8.0.27 under SMP.
+
+From the figure, it can be seen that PGO consistently improves throughput in SMP environments, without decreasing as concurrency levels increase. The following figure compares the performance improvement of PGO between NUMA and SMP environments.
+
+
+
+Figure 4. Performance of PGO optimization in different environments.
+
+From the figure, it is evident that PGO achieves a maximum performance improvement of up to 30% in SMP environments, whereas in NUMA environments, the performance improvement decreases as concurrency increases. This suggests that PGO has greater potential in SMP environments.
+
+Continuing the analysis, the performance of PGO in a Group Replication cluster environment compared to a single MySQL instance is examined. The following diagram depicts a simplified queue model of Group Replication.
+
+
+
+Figure 5. A simplified queue model of Group Replication.
+
+Because the network portion cannot be optimized by PGO, the MySQL primary consumes a lower proportion of time compared to a single MySQL instance. According to Amdahl's Law, the performance gains from PGO will be less pronounced compared to those of a standalone MySQL instance. Generally, as network latency increases, the improvement from PGO tends to diminish.
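+
+Amdahl's Law makes this concrete: with a speedup factor s applied to the locally executed fraction p of total time, the overall speedup is 1 / ((1 - p) + p / s). The fractions below are illustrative, not measured values.
+
+```c++
+#include <cassert>
+
+// Amdahl's Law: overall speedup when a fraction p of total time is
+// accelerated by factor s and the remainder (e.g. network waits) is not.
+double amdahl(double p, double s) { return 1.0 / ((1.0 - p) + p / s); }
+
+int main() {
+  const double s = 1.3;  // assume PGO speeds up CPU-bound work by 30%
+
+  // Standalone instance: nearly all time is local MySQL execution.
+  const double standalone = amdahl(0.95, s);
+  // Group Replication: a large share of time sits in the network path.
+  const double group_replication = amdahl(0.60, s);
+
+  // The cluster sees a smaller end-to-end gain from the same PGO build.
+  assert(group_replication < standalone);
+  assert(standalone > 1.25 && group_replication < 1.20);
+  return 0;
+}
+```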
+
+The following figure compares the throughput improvement of a standalone MySQL instance and Group Replication using PGO.
+
+
+
+Figure 6. PGO Performance Improvement in Group Replication vs. Standalone MySQL.
+
+From the figure, it can be observed that the performance improvement from PGO in a Group Replication cluster environment is generally less than that of a standalone MySQL instance.
+
+In conclusion, PGO can be summarized as follows:
+
+1. For MySQL, PGO is a worthwhile optimization that theoretically improves performance comprehensively, especially in SMP environments.
+2. In NUMA environments, addressing scalability problems is necessary to achieve significant benefits from PGO.
+3. PGO is less effective in a Group Replication cluster compared to a standalone MySQL instance.
+
+## References:
+
+[1] Bin Wang (2024). The Art of Problem-Solving in Software Engineering: How to Make MySQL Better.
+
diff --git a/scalability.md b/scalability.md
index bc868c6..160ad2b 100644
--- a/scalability.md
+++ b/scalability.md
@@ -196,9 +196,13 @@ Overall, it is entirely feasible for MySQL to maintain performance without colla
# References:
-1. Bin Wang (2024). The Art of Problem-Solving in Software Engineering:How to Make MySQL Better.
-2. https://dev.mysql.com/blog-archive/the-new-mysql-thread-pool/.
-3. Paweł Olchawa. 2018. MySQL 8.0: New Lock free, scalable WAL design. dev.mysql.com/blog-archive.
-4. Xiangyao Yu. An evaluation of concurrency control with one thousand cores. PhD thesis, Massachusetts Institute of Technology, 2015.
-5. https://dev.mysql.com/doc/refman/8.0/en/.
+[1] Bin Wang (2024). The Art of Problem-Solving in Software Engineering: How to Make MySQL Better.
+
+[2] https://dev.mysql.com/blog-archive/the-new-mysql-thread-pool/.
+
+[3] Paweł Olchawa. 2018. MySQL 8.0: New Lock free, scalable WAL design. dev.mysql.com/blog-archive.
+
+[4] Xiangyao Yu. An evaluation of concurrency control with one thousand cores. PhD thesis, Massachusetts Institute of Technology, 2015.
+
+[5] https://dev.mysql.com/doc/refman/8.0/en/.
diff --git a/sysbench_vs_benchmarksql.md b/sysbench_vs_benchmarksql.md
index 58b3478..92b8608 100644
--- a/sysbench_vs_benchmarksql.md
+++ b/sysbench_vs_benchmarksql.md
@@ -6,7 +6,7 @@ First, use SysBench's standard read/write tests to evaluate the optimization of
-Figure 5-21. Comparison of SysBench read-write tests before and after lock-sys optimization.
+Figure 1. Comparison of SysBench read-write tests before and after lock-sys optimization.
From the figure, it can be observed that after optimization, the overall performance of the SysBench tests has actually decreased.
@@ -14,7 +14,7 @@ Next, using BenchmarkSQL to test this optimization, the results are shown in the
-Figure 5-22. Comparison of BenchmarkSQL tests before and after lock-sys optimization.
+Figure 2. Comparison of BenchmarkSQL tests before and after lock-sys optimization.
From the figure, it can be seen that the results of BenchmarkSQL's TPC-C test indicate that the lock-sys optimization is effective. Why does such a significant difference occur? Let's analyze the differences in characteristics between these testing tools to understand why their tests differ.
@@ -32,5 +32,6 @@ It is worth noting that the main basis for performance testing and comparison in
## References:
-1. Bin Wang (2024). The Art of Problem-Solving in Software Engineering:How to Make MySQL Better.
-2. R. N. Avula and C. Zou. Performance evaluation of TPC-C benchmark on various cloud providers, Proc. 11th IEEE Annu. Ubiquitous Comput. Electron. Mobile Commun. Conf. (UEMCON), pp. 226-233, Oct. 2020.
\ No newline at end of file
+[1] Bin Wang (2024). The Art of Problem-Solving in Software Engineering: How to Make MySQL Better.
+
+[2] R. N. Avula and C. Zou. Performance evaluation of TPC-C benchmark on various cloud providers, Proc. 11th IEEE Annu. Ubiquitous Comput. Electron. Mobile Commun. Conf. (UEMCON), pp. 226-233, Oct. 2020.
\ No newline at end of file