Optimizer, what are you doing?

Note: I am currently working on setting up a generic example of this behavior – so please forgive the lack of detail.

I recently started a contracting position, where they had a huge table (14 GB), with an even bigger index (64 GB), and a problem with not being able to use the index. That means “outage” in the DBA playbook – and it was constantly occurring. Without a dedicated MySQL expert – they eventually tried indexing every column of the table – hoping one would stick. But it wouldn’t. And after a few days of learning their system – I eventually figured out why.

It may be a bit difficult to provide all the information due to non-disclosure agreements, and to ensure the confidentiality of the system – but I’m going to try – as this is probably a common production issue.

So let’s start at the beginning….

The complaint was that several queries were causing high CPU and taking longer than 30 secs to complete. After running an EXPLAIN – it was easy to see why, but not necessarily why it was happening. The query ran, and it was taking 8-10 mins to do so – with no other queries running at the same time. So it wasn’t table locking related. The EXPLAIN extra was “using where”, and it ref’ed the column (with high cardinality) in the order by – which wasn’t a conditional in the where clause.

So – my first thought was that maybe the issue was an EXISTs function in the where clause. I put together the query with an inner join instead, which dropped the query time from 8-10 mins to .02 secs. Okay – maybe this query was fixed – sent it to development to update in code.

Which is when I found out that they are using an ORM, which to be able to sort by any column, and the query itself was generated with no guaranteed order, sort or anything else. This kept me from using things like “USE INDEX” or even the inner join (because they could also do a NOT EXISTS())

I started thinking of a way to get it past the optimizer – especially since there were better indexes to use – so I did the DBA thing: ANALYZE TABLE.

Same Result…

The column was high cardinality – so it made sense for the optimizer to choose it . . . except – it shouldn’t have been choosing it.

The column in the order by didn’t exist in the conditionals (in the WHERE clauses) so it couldn’t actually use the index – even though it was ref’ing it. It didn’t even use the index for sort (at least, it wasn’t telling me that it was).

So I googled a bit on “optimizer chooses the wrong index.” That, of course, was a rabbit hole. Apparently, optimizer does that all the time, but for certain reasons of course.

This was different. It took a couple of days, including an outage caused by long running queries, when I found a second, similar query, this time with a different column (also high cardinality). I bet on this as a pattern and did a search “mysql optimizer choosing order by” and found a similar issue in Percona’s blog under “Naughty Aberrations on Queries Combine WHERE, ORDER BY and LIMIT” (all of which was in the query).

https://www.percona.com/community-blog/2019/07/29/mysql-optimizer-naughty-aberrations-on-queries-combining-where-order-by-and-limit/

Although I tried the workarounds written in the blog, especially adding “+0” to the column name, the workarounds didn’t work. Adding “+0” to a numeric column was fine, and sort worked as expected, but for a string, it broke completely. I actually had expected for it to work regardless as it had always been my understanding that column names are aliases for the column number – but I was wrong.

Inevitably, the fix was to add a second, throw-away column to the order by that was almost guaranteed to not affect the originally query. Choosing from the lowest cardinality columns, I found one that was either a 1 or NULL, with 98% of the values being NULL (I did not design this table, and am only there to fix certain things). It worked with the ORM, and kept the integrity of the queries. It also dropped the query times from 8 mins, to approx .2-.6 secs, which is far faster than 8 mins.

So what was it doing? The fix was to keep the optimizer from choosing the order by as its referenced key. The workaround is not 100% perfect, but should keep revenue flowing – as its removing queries from an 8-10 minute table scan.

Seriously, optimizer…. seriously.