Monday, April 4, 2016

MySQL 5.7: Optimizer finds best index by expression

The optimizer in MySQL 5.7 leverages generated columns. Generated columns will physically store data in two cases: Either the column is defined as STORED or you create an index on a virtual column. The optimizer will leverage such an index automatically if it encounters the same expression in a statement. Let's see an example:

mysql> DESC squares;
+-------+------------------+------+-----+---------+-------+
| Field | Type             | Null | Key | Default | Extra |
+-------+------------------+------+-----+---------+-------+
| dx    | int(10) unsigned | YES  |     | NULL    |       |
| dy    | int(10) unsigned | YES  |     | NULL    |       |
+-------+------------------+------+-----+---------+-------+
2 rows in set (0.00 sec)

mysql> SELECT COUNT(*) FROM squares;
+----------+
| COUNT(*) |
+----------+
|  2097152 |
+----------+
1 row in set (0.77 sec)


We have a large table with 2 million rows. Selecting rows by the surface area of squares can hardly leverage an index on dx or dy:

mysql> EXPLAIN SELECT * FROM squares WHERE dx*dy=221\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: squares
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 2092860
     filtered: 100.00
        Extra: Using where
1 row in set, 1 warning (0.00 sec)

Now let's add an index over a generated, virtual column that defines the area:

mysql> ALTER TABLE squares ADD COLUMN (area INT AS (dx*dy));
Query OK, 0 rows affected (0.02 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> ALTER TABLE squares ADD INDEX (area);
Query OK, 0 rows affected (5.24 sec)
Records: 0  Duplicates: 0  Warnings: 0


Now we can run query again:

mysql> EXPLAIN SELECT * FROM squares WHERE dx*dy=221\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: squares
   partitions: NULL
         type: ref
possible_keys: area
          key: area

      key_len: 5
          ref: const
         rows: 18682
     filtered: 100.00
        Extra: NULL
1 row in set, 1 warning (0.00 sec)

 I did not change the query! The WHERE condition is still dx*dy. Nevertheless the optimizer finds the generated column, sees the index and decides to leverage that.
So you can add complex indexes and without changing the application code you can benefit from these indexes. That makes life much easier.

One limitation though: It seems the optimizer recognizes expressions only in the WHERE clause. It will not use the generated column and index for the SELECT expression:

mysql> EXPLAIN SELECT SUM(dx*dy) FROM squares\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: squares
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL

      key_len: NULL
          ref: NULL
         rows: 2092860
     filtered: 100.00
        Extra: NULL
1 row in set, 1 warning (0.00 sec)

mysql> EXPLAIN SELECT SUM(area) FROM squares\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: squares
   partitions: NULL
         type: index
possible_keys: NULL
          key: area
      key_len: 5
          ref: NULL
         rows: 2092860
     filtered: 100.00
        Extra: Using index
1 row in set, 1 warning (0.00 sec)


2 comments:

  1. The aggregate function is indeed not using an index, but also something simpler like this is not (yet) using the index:

    CREATE TABLE squares (dx int unsigned, dy int unsigned);
    insert into squares select round(rand()*100,0), round(rand()*100,0);
    insert into squares select round(rand()*100,0), round(rand()*100,0) from squares; -- until COUNT(*) == 2097152
    ALTER TABLE squares ADD COLUMN (area INT AS (dx*dy));
    ALTER TABLE squares ADD INDEX (area);
    ALTER TABLE squares ADD INDEX `dx__area` (dx, area);
    EXPLAIN SELECT * FROM squares WHERE dx*dy=221\G
    EXPLAIN SELECT dx*dy FROM squares WHERE dx=5\G
    EXPLAIN FORMAT=JSON SELECT dx*dy FROM squares WHERE dx=5\G
    EXPLAIN FORMAT=JSON SELECT area FROM squares WHERE dx=5\G

    ReplyDelete
    Replies
    1. Good point. Your example is even simpler without the aggregation function.
      I also tested ... WHERE dy*dx=221; so reverse order of factors. That is mathematically the same but the optimizer does not recognize the expression. However I doubt that is a big issue. If your application is really creating both types of expressions either create both virtual columns with indexes (probably costly) or use the Query Rewrite Plugin.

      Delete