I recently completed an optimization pass for a website that does reporting for its clients. A number of reports took more than 30 seconds to compute. As the data grew, performance continued to degrade. The main table in question consisted of tens of millions of rows.
Here was the strategy I employed:
- Denormalize the data
- Partition the tables
- Optimize the indexes on the table
- Optimize the queries
Step 1 is to avoid joins. Joins are expensive operations. If you can put all the necessary fields in a single table, your queries will also be much simpler and the database won't have to perform complex operations. Of course, nothing comes for free. The cost is duplicated data. While disk space will rarely be your limiting factor, it's important to take steps to keep the duplicated data consistent.
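As a sketch of what this looks like, assume a hypothetical `report_events` table where the client's name is copied in from a separate `clients` table so reports never have to join:

```sql
-- Denormalized reporting table: every field a report needs lives in one row.
-- All table and column names here are hypothetical, for illustration only.
CREATE TABLE report_events (
    event_id    bigint      PRIMARY KEY,
    client_id   integer     NOT NULL,
    client_name text        NOT NULL,  -- duplicated from a clients table
    event_type  text        NOT NULL,
    amount      numeric     NOT NULL,
    created_at  timestamptz NOT NULL
);

-- The report becomes a simple single-table aggregate, no joins involved:
SELECT client_name, sum(amount)
FROM report_events
GROUP BY client_name;
```

The duplicated `client_name` is where inconsistencies can creep in; a trigger on the source table, or repopulating the reporting table during a load step, are two common ways to keep the copies in sync.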
Step 2 serves to divide your large table into smaller ones. One huge advantage is that the indexes for any of the smaller tables will be much smaller. If a table’s index can be loaded entirely into memory, you will notice huge speed improvements. For this particular application, the database is Postgres and I was able to utilize the inheritance feature to implement partitioning.
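A minimal sketch of inheritance-based partitioning, again using a hypothetical `report_events` table split by month (names and date ranges are illustrative, not from the original setup):

```sql
-- Children inherit the parent's columns; a CHECK constraint on each child
-- tells the planner which date range that child holds.
CREATE TABLE report_events_2011_01 (
    CHECK (created_at >= '2011-01-01' AND created_at < '2011-02-01')
) INHERITS (report_events);

CREATE TABLE report_events_2011_02 (
    CHECK (created_at >= '2011-02-01' AND created_at < '2011-03-01')
) INHERITS (report_events);

-- Each child carries its own, much smaller index.
CREATE INDEX report_events_2011_01_client_idx
    ON report_events_2011_01 (client_id, created_at);

-- With constraint exclusion on, queries against the parent skip any child
-- whose CHECK constraint cannot match the WHERE clause.
SET constraint_exclusion = partition;
```

Note that with inheritance partitioning, rows inserted into the parent are not routed automatically; a trigger or rule on the parent (or inserting directly into the correct child) is needed.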
Steps 3 and 4 really go hand in hand. In an existing application, you have the luxury of examining the slow query logs to determine which queries to focus on. On Postgres, I recommend pgFouine. Since optimization is a never-ending pursuit, these two steps can take as little or as much time as you have, so set realistic goals for your optimization efforts. With Postgres, the EXPLAIN ANALYZE command yields vast amounts of data about your queries and how they use your indexes. Use it often.
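The loop in practice looks something like this, again against a hypothetical `report_events` table (the query and index are illustrative):

```sql
-- Run the slow report under EXPLAIN ANALYZE to see the actual plan,
-- row counts, and timings rather than just the planner's estimates.
EXPLAIN ANALYZE
SELECT client_id, sum(amount)
FROM report_events
WHERE created_at >= '2011-01-01' AND created_at < '2011-02-01'
GROUP BY client_id;

-- If the plan shows a sequential scan over millions of rows, an index
-- matching the WHERE clause is the usual first thing to try:
CREATE INDEX report_events_created_idx ON report_events (created_at);

-- Then re-run the EXPLAIN ANALYZE and compare timings before moving on.
```

The key habit is measuring before and after every change: an index that doesn't show up in the new plan is pure write overhead and should be dropped.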
This optimization effort brought reports that previously took in excess of 30 seconds down to under a second.