Donnerstag, August 24, 2006

Basic documentation

While the work to refactor the DuplicateIndexFinder is going on (however not quite as fast as I would like it to), here is some general information about what exists already.

Why does it exist?

The DuplicateIndexFinder class was written, because we have an application that works with a database schema of some 500 tables. Those tables are generated by an object relation persistence mapper from UML diagrams. While the framework used (proprietary) works very well and allows for quick and efficient application development, the generated schema is not perfect.

The schema contains a lot of generically generated index definitions that are not necessary from the database perspective. For example for all tables an index that covers the primary key fields is created, in addition to the primary key. This is especially useless, because it will never be used by the optimizer.

However this is not the only problem. As the schema is generated automatically and the generator tool does not analyze table depencies in advance for the whole schema, a good deal of indices are created that cover a prefix of a longer index.

While there may be situations where indices become very long and a second, shorter index on a prefix may be sensible, usually the extra space used in the tablespace is not worth it. As we are dealing with several hundred databases, each about 100GB in size, superfluous indices can become a real pain in the neck. This is why I hacked this little tool.

What can it do?

Currently the whole point of the MySQL Index Analyzer is really just the detection of duplicate/superfluous indices. However I plan to include much more functionality over time. Right now you can let it look at a database and it will output a text report of which indices are obsoleted by other, longer indices on the same table. Moreover it can generate DDL statements in MySQL syntax that remove the unneeded indices from the tables, therefore releasing the space they occupied before.

How to use it?

The DuplicateIndexFinder is a single Java class that can be run as a command line application. Apart from a Java Runtime Environment you will also need the MySQL JDBC Driver (Connector/J) and the JSAP library (Java-based Simple Argument Parser). The tool has been written using Java 5,but apart from some generic collections nothing fancy is used. I believe it should be possible to compile Java 1.4 compliant version, even though I did not try it myself.

To get an overview of its usage, just run it as follows on a command line:

java de.shipdown.util.mysql.DuplicateIndexFinder --help

This will output general usage information and short explanations for the available options:


Usage: java de.shipdown.util.mysql.DuplicateIndexFinder
                [--help] [(-o|--outfile) <outfile>] [-v|--verbose] [-a|--generate-alter-table] (-D|--database) <dbname> (-u|--user) <dbuser> [(-p|--password) <dbpass>] [(-h|--host) <dbhost>] [(-P|--port) <dbport>]

  [--help]
        Prints this help message.

  [(-o|--outfile) <outfile>]
        Name of an output file. StdOut if omitted.

  [-v|--verbose]
        Requests verbose output (includes column names).

  [-a|--generate-alter-table]
        Generate alter table statements to drop superfluous indices.

  (-D|--database) <dbname>
        Name of the database to analyze.

  (-u|--user) <dbuser>
        User name for database connection.

  [(-p|--password) <dbpass>]
        Password for database connection.

  [(-h|--host) <dbhost>]
        Database hostname or address to connect to. Defaults to localhost.
        (default: localhost)

  [(-P|--port) <dbport>]
        Database network port. Defaults to 3306. (default: 3306)

As you can see most options are designed to be compatible with the usual MySQL command line tools. The only required parameters are the user and database names. All other options have sensible defaults.

If you specify the --generate-alter-table option, you will not only get an analysis of the indices in you database, but also suggessted ALTER TABLE statements. Please make sure you do not blindly apply those to any database. While I do not think anything horrible can happen, I cannot make any guarantee. It never hurts to have a backup!

Using the --verbose option the report will contain column names, so you can see for yourself if an index is really just a prefix of a longer one.

Example

Consider a database "test" with a single table "people" with the following definition:

 CREATE TABLE `people` (
  `id` int(10) unsigned NOT NULL,
  `firstname` varchar(32) default NULL,
  `lastname` varchar(32) default NULL,
  `birthday` date default NULL,
  PRIMARY KEY  (`id`),
  KEY `idx_id` (`id`),
  KEY `idx_first` (`firstname`),
  KEY `idx_last` (`lastname`),
  KEY `idx_fullname` (`firstname`,`lastname`),
  KEY `idx_reversename` (`lastname`,`firstname`),
  KEY `idx_id_birth` (`id`,`birthday`),
  KEY `idx_id_birth_first` (`id`,`birthday`,`firstname`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

Now without further ado just run the DuplicateIndexFinder (the example assumes that you checked out the tag "0.01" from the SVN repository to ~/workspace/MIA0.01):

ds@yavin:~/workspace/MIA0.01
$ java -cp lib/mysql-connector-java-3.1.13-bin.jar:lib/JSAP-2.1.jar:build de.shipdown.util.mysql.DuplicateIndexFinder -u ds -D test

Table people
--------------------------------------------------------------------------------
PRIMARY (len:1)
    obsoletes idx_id (len:1)
idx_reversename (len:2)
    obsoletes idx_last (len:1)
idx_fullname (len:2)
    obsoletes idx_first (len:1)
idx_id_birth (len:2)
    obsoletes idx_id (len:1)
idx_id_birth_first (len:3)
    obsoletes idx_id (len:1)
    obsoletes idx_id_birth (len:2)

This output means

  • that the PRIMARY key obsoletes the idx_id, because they are identical. The primary key is always preserved.
  • that idx_reversename makes idx_last obsolete, because any query that searches for just a last name can just as well use the idx_reversename index, as can a query that looks for a full name combination
  • that idx_fullname makes idx_first useless for the same reason as above
  • that idx_id_birth obsoletes idx_id for the same reason as above. Note that the id column is also covered by the primary key.
  • that idx_id_birth_first obsoletes both idx_id and id_id_birth, because it starts with the same columns as the other two.

If you follow the tool's advice you would modify your schema to remove all indices except for the primary key, idx_reversename, idx_fullname, and idx_id_birth_first. You can get the necessary statements to drop the obsolete one like this:

ds@yavin:~/workspace/MIA 0.01
$ java -cp lib/mysql-connector-java-3.1.13-bin.jar:lib/JSAP-2.1.jar:build de.shipdown.util.mysql.DuplicateIndexFinder -u ds -D test --verbose --generate-alter-table

Table people
--------------------------------------------------------------------------------
PRIMARY (cols: 1:id)
    obsoletes idx_id (cols: 1:id)
idx_reversename (cols: 1:lastname, 2:firstname)
    obsoletes idx_last (cols: 1:lastname)
idx_fullname (cols: 1:firstname, 2:lastname)
    obsoletes idx_first (cols: 1:firstname)
idx_id_birth (cols: 1:id, 2:birthday)
    obsoletes idx_id (cols: 1:id)
idx_id_birth_first (cols: 1:id, 2:birthday, 3:firstname)
    obsoletes idx_id (cols: 1:id)
    obsoletes idx_id_birth (cols: 1:id, 2:birthday)


ALTER TABLE STATEMENTS (PLEASE(!) REVIEW, NO WARRANTY FOR ANY DAMAGE TO YOUR DATABASE)
======================================================================================
ALTER TABLE people DROP INDEX idx_first, DROP INDEX idx_last, DROP INDEX idx_id_birth, DROP INDEX idx_id;

As you see the --verbose switch provides a little more detail in the report. But more interestingly you get a suggested DDL statement. Note that all indices are dropped at once. This is especially important for large tables, as MySQL needs to create a full copy of a table if you issue an ALTER TABLE command.

Using this syntax requires only a single copy process, while removing the indices in separate statements would have required 4 copies and therefore about 4x as long.

Keine Kommentare: