Reloading your development database from a copy on the staging or production server can be done with a single command. With the magic of ssh, key pair authentication, and .my.cnf files on the server and your development box, you can execute the following:

ssh -C username@remotehost.com 'mysqldump myapp_staging' | mysql myapp_development

Let’s examine the entire command in pieces. ssh followed by a command allows you to execute that remote command and have its output sent to the originating server’s stdout. mysqldump myapp_staging is executed on the remote server and the -C flag instructs ssh to compress the data from the remote server. Then, mysql executes locally to run the sql statements on the myapp_development database. Voilà, you have reloaded your development database with a copy from the staging server with a single command.

I’m not sure why this setting is not enabled by default, but every MySQL installation should have this line in the [mysqld] section of its my.cnf file:

innodb_file_per_table

This setting instructs MySQL to create a separate file for each innodb table’s data. By default, MySQL stores all innodb data (from all databases) in a single file. Now recall that when you delete data from an innodb table, the actual disk storage is not recovered. Instead, MySQL marks the storage region so that when new data is added, that region can be re-used.  That is a sensible strategy, but fails miserably when you need to delete a lot data and recover the corresponding disk storage.

With a single file for all innodb data, recovering disk storage amounts to backing up the database, dropping and re-creating the database, and then reloading the database from the backup.  In the process, you have to stop your server and remove the /var/lib/mysql/ibdata1 file.  This is rarely a viable strategy.  If you are in desperate need to recover disk space on your server, it’s doubtful that you have room to store the database backup. Moreover, the backup and restore process is far from instantaneous when you’re dealing with hundreds of gigabytes of data. This all amounts to a lot of downtime.

Do things right from the start.  Add innodb_file_per_table to your my.cnf file.  You will have to restart MySQL for this change to take effect. Now when you want to free up disk space, you can work at the granularity of a single table instead of all databases utilizing innodb.

For example, if you want to clear out the majority of a fictitious table named really_big_table, start by running an INSERT SELECT statement to store a copy of the data you want to keep into a temporary table. Then DROP really_big_table and re-CREATE it. Now load the data from the temporary table back into really_big_table via another INSERT SELECT. Your disk storage has been recovered.

I recently completed an optimization pass for a website that does reporting for its clients. A number of reports took more than 30 seconds to compute. As the data grew, performance continued to degrade. The main table in question consisted of tens of millions of rows.

Here was the strategy I employed:

  1. De-normalize the data
  2. Partition the tables
  3. Optimize the indexes on the table
  4. Optimize the queries

Step 1 is to avoid joins. Joins are expensive operations. If you can put all the necessary fields in a single table, your queries will also be much simpler and the database won’t have to perform complex operations. Of course, nothing comes for fee. The cost is duplication of data. While space will rarely be your limiting factor, it’s important to take steps to avoid data inconsistencies.

Step 2 serves to divide your large table into smaller ones. One huge advantage is that the indexes for any of the smaller tables will be much smaller. If a table’s index can be loaded entirely into memory, you will notice huge speed improvements. For this particular application, the database is Postgres and I was able to utilize the inheritance feature to implement partitioning.

Steps 3 and 4 really go hand-in-hand. In an existing application, you have the luxury of examining the slow query logs to determine which queries to focus on. On Postgres, I recommend using pgFouine. As optimization is a never-ending pursuit, these two steps can take as little or as much time as you have. Set realistic goals for your optimization efforts. With Postgres, the explain analyze command will yield vast amounts of data about your queries and the indexes. Use them often.

This optimization effort allows reports that took in excess of 30 seconds to now return in under a second.

© 2011 Technically Speaking Suffusion theme by Sayontan Sinha