Catching and fixing regressions in the MySQL Ecosystem


Guest blog by Amrendra Kumar, former MySQL Performance Engineer

I worked for the past 14 years in the MySQL Performance & QA team at Oracle. Like many others, I was recently laid off. I thought I would share my experience of how we used continuous benchmarking to prevent performance regressions in MySQL releases. In my opinion, the system we built over the years was quite effective in preventing many regressions from reaching GA releases.

We had a very robust structure of continuous benchmarking for MySQL Server and for MySQL in OCI cloud. It covered trunk (what git calls “main” or “master”; “trunk” is the name used in bzr and cvs, which MySQL used before git existed) and the latest LTS release (which in my time was 8.4), and we supported both the Oracle Linux and Windows platforms.

We developed a dashboard to track the performance of nightly builds for trunk and 8.4. The dashboard showed details such as the test name, the type of test (daily or weekly), the platform (Linux or Windows), the commit of the build, and the difference in percent, negative or positive, with respect to a baseline result. We would periodically reset the baseline to some recent build that had been triaged and was well understood, for example because the infrastructure had been reconfigured or after regressions had been fixed.

Something like this:

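| Test | Schedule | Branch | Platform | 2026-03-15 | 2026-03-16 | 2026-03-17 | 2026-03-18 | 2026-03-19 |
|---|---|---|---|---|---|---|---|---|
| Point-select | Daily | trunk | Linux | +0.5 | +6.0 | +6.0 | +6.1 | +5.9 |
| Point-select | Daily | trunk | Windows | +0.2 | -4.0 | -4.0 | -4.1 | -4.2 |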

(The above is just an example with fake numbers, as the dashboard isn’t public.)
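A minimal sketch of how one cell of such a dashboard could be computed, assuming the value shown is the signed percentage change of a nightly result relative to the baseline result. The function name and the throughput numbers are hypothetical and are not taken from the actual dashboard.

```python
def percent_diff(result: float, baseline: float) -> float:
    """Signed percentage difference of a result relative to a baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (result - baseline) / baseline * 100.0


# Hypothetical throughput numbers (queries per second), not real MySQL results.
baseline_qps = 20000.0
nightly_qps = 21200.0

# For a throughput metric, a positive value is a gain and a negative one a regression.
print(f"{percent_diff(nightly_qps, baseline_qps):+.1f}")
```

For a latency metric the sign would read the other way around (higher is worse), so a real dashboard would need to know the direction of each metric.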

As we can see, the regression on Windows started on 2026-03-16 and stayed for several days. That means the regression on Windows and the gain on Linux for the sysbench point-select test were definitely introduced in the daily trunk nightly build of 2026-03-16. A nightly build simply combines all commits of the day into a single build, which then gets a single test run. Since performance benchmarks run longer and on larger instances than QA tests, it is common that many benchmarks aren’t run for every commit. But for a large project like MySQL, daily/nightly is a good frequency.
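The reading of the table described above (a shift that appears on one day and then stays) can be sketched as a small check. This is only an illustration: the threshold of 2 percent and the requirement of 3 consecutive days are assumptions chosen for the example, and in our team this call was made by a person, not by code.

```python
from typing import Optional, Sequence


def first_persistent_shift(
    diffs: Sequence[float], threshold: float = 2.0, min_days: int = 3
) -> Optional[int]:
    """Index of the first day that starts a run of `min_days` values which
    are all beyond `threshold` in the same direction, or None if there is none."""
    for start in range(len(diffs) - min_days + 1):
        window = diffs[start:start + min_days]
        if all(d <= -threshold for d in window) or all(d >= threshold for d in window):
            return start
    return None


dates = ["2026-03-15", "2026-03-16", "2026-03-17", "2026-03-18", "2026-03-19"]
windows_diffs = [0.2, -4.0, -4.0, -4.1, -4.2]  # Windows row of the example table

idx = first_persistent_shift(windows_diffs)
print(dates[idx] if idx is not None else "no persistent shift")
```

A single bad night followed by normal results returns `None` here, which matches the idea that an environmental hiccup that goes away the next day is not a regression.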

The next step would then be to reproduce the regression on similar hardware to rule out an environment change, which is unlikely but better to rule out. In fact, if a performance regression only happens on one kind of hardware, it might not be a real regression at all, or in any case it may not be worth optimizing for just one type of CPU, cache size, and so on. But if we can reproduce the regression on my workstation and then on the developer's workstation, then there is a clear problem that affects everybody and is not caused by the environment.

If it was reproduced, then using the git bisect good and git bisect bad technique we were able to track down the commit that caused the gain or regression in just 1 day.
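The idea behind bisecting can be sketched as a plain binary search over an ordered list of commits. This is a simplified illustration, not what git does internally on a real commit graph: the commit names and the `is_bad` check are hypothetical, and the sketch assumes that the first commit is good, the last is bad, and that the regression stays once introduced.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Binary search for the first bad commit.

    Assumes commits[0] is known good, commits[-1] is known bad, and that
    every commit after the first bad one is also bad.
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):  # in practice: build, run the benchmark, compare
            bad = mid
        else:
            good = mid
    return commits[bad]


# Hypothetical history of 40 commits where the regression arrives with "c27".
commits = [f"c{i}" for i in range(40)]
print(first_bad_commit(commits, lambda c: int(c[1:]) >= 27))
```

Each step halves the remaining range, so even a full day of commits needs only a handful of benchmark runs. With git itself, `git bisect run` can drive the same loop with a script whose exit code marks each commit as good or bad.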

Yes, someone has to watch this dashboard to make a judgment call on whether it’s a real regression/gain and not an environmental issue that goes away the next day. So in a team of 3 performance engineers we had a quarterly rotation to watch the dashboard.

Consider, instead, a situation like this.

| Test name | Schedule | Branch | Platform | 2026-02-12 | 2026-05-12 |
|---|---|---|---|---|---|
| Point-select | GA Release | Release-9.x | Linux | 0.5 | +6.9 |
| Point-select | GA Release | Release-9.x | Windows | 0.2 | -4.0 |

If we only focus on performance before (or after…) a release:

  1. You have 90 days of builds to search through to hunt down the issue.
  2. You have limited time before the release, so hunting for a regression or gain is difficult to do during the release and can often only be done afterwards. If the problem is easy, it can be fixed during release time, but if it is complex then the fix will go out post-release as a bug fix. And often performance bugs are not easy.
  3. This means that if this release, with a known regression, is deployed in production, in the cloud or elsewhere, then it is a risk, and it needs to be patched later, which has a high cost. And if a customer is affected and reports this bug, we might not be able to fix it very quickly.

In my view, an approach where performance issues are mostly in focus during the release or post-release is not the right approach. With a proactive approach, issues should be identified before the release, not during or after it. Preferably even well before the release testing starts: when the patch containing a regression is merged. (Or ideally, it is not merged at all.)

We uncovered many issues, which helped to keep the quality of MySQL Server and MySQL in OCI cloud in check. Whether it was a new feature like the new DDL, the new redo log, the refactoring of undo, a new thread pool implementation, or bug fixes, our continuous benchmarking process with the above dashboard definitely helped MySQL keep performance in check.

Our dashboard also provided a way to go back and see historical data and a trendline.

The main thing that could be improved in this system is that it’s still fairly labor intensive. It requires a performance specialist to triage the results and also for the reproduction and analysis.

Discovering Nyrkiö

I found that Nyrkiö addresses this issue in a very easy manner. It’s a SaaS-based solution, and you can keep adding test cases as you require more test coverage. It’s easy to set up, using curl and standard REST calls with a JSON payload. I tried a POC, which gave me steady results that I can trust, with no noise in the result.
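To give an idea of what “a JSON payload” for a benchmark result can look like, here is a hedged sketch that only builds the JSON body. The field names (`timestamp`, `metrics`, `attributes`, and so on), the metric names, and all values are my assumptions for illustration; the exact schema, endpoint, and authentication should be taken from Nyrkiö's API documentation. Sending the request is deliberately left out.

```python
import json
import time


def build_result_payload(commit, branch, repo_url, metrics, timestamp=None):
    """Build a JSON-serializable list holding one benchmark result.

    The field names are assumptions for this sketch, not a confirmed schema.
    """
    return [{
        "timestamp": int(time.time()) if timestamp is None else timestamp,
        "metrics": [{"name": n, "unit": u, "value": v} for (n, u, v) in metrics],
        "attributes": {"git_repo": repo_url, "branch": branch, "git_commit": commit},
    }]


payload = build_result_payload(
    commit="0123abc",                          # placeholder commit id
    branch="main",
    repo_url="https://example.com/your/repo",  # placeholder repository URL
    metrics=[("latency_p95", "ms", 1.23), ("throughput", "qps", 18000)],
    timestamp=1700000000,
)
body = json.dumps(payload)
print(body)
```

The resulting string is what would go into the body of the REST call, for example with curl's `-d` option.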

Here is a snippet of sysbench’s oltp-ro test on a 16-core client with 16 threads.

  • As you can see, the latency graph is very stable. The variation is +/- 20 microseconds, which is actually limited by how sysbench measures time rather than by the Nyrkiö test runner.
  • If you click a point, you can see the commit and go directly to the commit change.

Nyrkiö allows users to publish their results. You can see the above graph here.

You can test either each commit or a nightly build in your own GitHub repository. You will have historical data points in a trendline to see when the gain or regression started happening and which commit caused it.

With automated change detection you can identify issues early, long before your code grows further and hunting them down takes a lot of time.

One good thing about Nyrkiö is the Nyrkiö 3rd party runners for GitHub, which are servers configured to minimize noise caused by the infrastructure. This means you have better quality data to work with from the start, giving you a clear trendline to make out when a gain or regression happened.

But even if you have more noise in your benchmark results than I got above, the change detection algorithm used by Nyrkiö is designed to filter out the noise and find the true points where the mean of the data changes permanently. This is based on the open source Apache Otava (incubating) project. You can read a very detailed explanation of the algorithm here if you’re interested in the internals.
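To illustrate the general idea of finding a point where the mean changes permanently, here is a much-simplified sketch. It is not the algorithm used by Nyrkiö or Apache Otava: it only looks for a single change point by trying every split and scoring the difference of the two means against their standard error (a Welch-style t statistic). The series, the minimum segment length, and the score threshold are assumptions for the example.

```python
from math import sqrt
from statistics import mean, variance
from typing import Optional, Sequence, Tuple


def best_change_point(
    values: Sequence[float], min_segment: int = 3, min_score: float = 5.0
) -> Optional[Tuple[int, float]]:
    """Return (index, score) of the split with the largest mean shift,
    or None if no split reaches `min_score`. The index is the first
    element of the new regime."""
    best = None
    for k in range(min_segment, len(values) - min_segment + 1):
        left, right = values[:k], values[k:]
        diff = abs(mean(right) - mean(left))
        # Standard error of the difference between the two segment means.
        se = sqrt(variance(left) / len(left) + variance(right) / len(right))
        if se > 0:
            score = diff / se
        else:
            score = float("inf") if diff > 0 else 0.0
        if best is None or score > best[1]:
            best = (k, score)
    if best is not None and best[1] >= min_score:
        return best
    return None


# Synthetic series: stable around 100, then a permanent drop to about 94.
step_series = [100.2, 99.8, 100.1, 99.9, 100.0, 100.3, 94.1, 93.8, 94.0, 94.2, 93.9, 94.1]
# Synthetic series with noise only.
flat_series = [100.2, 99.8, 100.1, 99.9, 100.0, 100.3, 99.9, 100.1, 100.0, 99.8, 100.2, 100.0]

print(best_change_point(step_series))
print(best_change_point(flat_series))
```

A production-grade detector additionally has to handle several change points in one series and much noisier data, which is where the more careful statistics come in.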

Looking at a MariaDB regression

To also work with a time series that includes a regression, I looked into a problem Mark Callaghan reported: a regression in the points-covered-pk test in MariaDB. According to Mark, this regression happened somewhere between 10.4.34 and 10.5.29.

I reproduced Mark’s finding and reported it as MDEV-39574. Note that my infrastructure is significantly smaller than Mark’s, only 6 CPUs, so mutex contention may behave differently for us.

I was interested in pinpointing the exact patch release that had introduced the regression. So I ran the points-covered-pk test for the MariaDB builds between 10.4.34 and 10.5.29. The results are in the table and graph below.

points-covered-pk, 16 threads, on: 6-core AMD, Ubuntu 24.0
8 tables x 0.5 million rows = 1 GB data, config: 2 GB BP, AHI=on, PFS=off

| Version | Throughput (qps) | Difference | Release date | Note |
|---|---|---|---|---|
| 10.4.34 | 19692 | | 2024-05-16 | Stable (GA) |
| 10.5.0 | 8080 | 41% | 2019-12-03 | Alpha |
| 10.5.1 | 7509 | 93% | 2020-02-14 | Beta |
| 10.5.2 | 18673 | 249% | 2020-03-26 | Beta |
| 10.5.3 | 18442 | 99% | 2020-05-12 | RC |
| 10.5.4 | 15030 | 82% | 2020-06-24 | Stable (GA) |
| 10.5.5 | 17765 | 118% | 2020-08-10 | Stable (GA) |
| 10.5.6 | 18444 | 104% | 2020-10-07 | Stable (GA) |
| 10.5.7 | 7418 | 40% | 2020-11-03 | Stable (GA) |
| 10.5.8 | 17482 | 236% | 2020-11-11 | Stable (GA) |
| 10.5.9 | 17006 | 97% | 2021-02-22 | Stable (GA) |
| 10.5.10 | 16475 | 97% | 2021-05-07 | Stable (GA) |
| 10.5.11 | 17648 | 107% | 2021-06-23 | Stable (GA) |
| 10.5.12 | 17532 | 99% | 2021-08-06 | Stable (GA) |
| 10.5.13 | 17452 | 100% | 2021-11-08 | Stable (GA) |
| 10.5.15 | 17475 | 100% | 2022-02-12 | Stable (GA) |
| 10.5.16 | 17702 | 101% | 2022-05-20 | Stable (GA) |
| 10.5.17 | 18090 | 102% | 2022-08-15 | Stable (GA) |
| 10.5.18 | 17859 | 99% | 2022-11-07 | Stable (GA) |
| 10.5.19 | 17678 | 99% | 2023-02-06 | Stable (GA) |
| 10.5.20 | 18079 | 102% | 2023-05-10 | Stable (GA) |
| 10.5.21 | 18195 | 101% | 2023-06-07 | Stable (GA) |
| 10.5.22 | 17528 | 96% | 2023-08-14 | Stable (GA) |
| 10.5.23 | 16977 | 97% | 2023-11-13 | Stable (GA) |
| 10.5.24 | 17815 | 105% | 2024-02-07 | Stable (GA) |
| 10.5.25 | 17845 | 100% | 2024-05-16 | Stable (GA) |
| 10.5.26 | 18030 | 101% | 2024-08-08 | Stable (GA) |
| 10.5.27 | 18173 | 101% | 2024-11-01 | Stable (GA) |
| 10.5.28 | 18002 | 99% | 2025-02-04 | Stable (GA) |
| 10.5.29 | 17846 | 99% | 2025-05-08 | Stable (GA) |
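The Difference column appears to be each version's throughput expressed as a percentage of the previous row, up to rounding. A small sketch using the first four rows of the table shows that reading, and also the throughput relative to 10.4.34. The function names are mine and only illustrative.

```python
# First four rows of the table above: (version, throughput in qps).
rows = [("10.4.34", 19692), ("10.5.0", 8080), ("10.5.1", 7509), ("10.5.2", 18673)]


def percent_of_previous(rows):
    """Throughput of each version as a rounded percentage of the previous row."""
    return [(v, round(q / prev_q * 100)) for (_, prev_q), (v, q) in zip(rows, rows[1:])]


def percent_of_first(rows):
    """Throughput of each version as a rounded percentage of the first row."""
    base = rows[0][1]
    return [(v, round(q / base * 100)) for v, q in rows[1:]]


print(percent_of_previous(rows))
print(percent_of_first(rows))
```

Comparing against the first row makes the cumulative effect easier to read than the step-by-step column, which is useful when a drop is spread over two consecutive releases.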

The history of this benchmark, even though I only tested each patch release, is interesting! In MariaDB 10.5.0 (an alpha release) and 10.5.1 (a beta release) the performance drops significantly: over 60% in total, from 19692 to 7509 qps. But in 10.5.2 (another beta) this is fixed. And so the way was open to release MariaDB 10.5.4 as the GA (Stable) release in June 2020.

Then in 10.5.7, 5 months later, the regression returned with a drop of 60%, a very similar magnitude to that observed by Mark C in the Small Datum blog post around these versions.

In fact, MariaDB fixed this regression (at least most of it) by 10.5.8, and the performance of this benchmark has been quite stable since. So it is not the case that the regression went unfixed for years from 10.4.34 to 10.5.29. This is of course commendable and shows that MariaDB cares about fixing performance regressions as soon as they are made aware of them.

But the point I’m trying to make with this blog post is that they can only fix something that they are aware of. In this case the second regression at 10.5.7 could have been avoided if Continuous Benchmarking had been in place, at least after the first time this test showed regressions during the 10.5.1 beta. Recurring regressions in the same area are what you get if you don’t actively guard against returning regressions with Continuous Benchmarking.

I know that my former teammate from the Oracle performance team, Jonathan Miller, is now working on building a similar system to introduce continuous benchmarking to MariaDB as well. It’s called the Test Automation Framework. This historical example from the MariaDB 10.5.x series is a good example of a case that should be detected and prevented by such a system.

As part of this work I made my first contribution to TAF.
