Pensieri di un lunatico minore

16 September 2008 PostgreSQL

PostgreSQL EXPLAIN

Reading through the blogs today, I came across a little posting on the output of EXPLAIN. The “problem” someone was reffering to was that EXPLAIN wasn’t showing any index usage. The answer?

PostgreSQL’s query optimizer is smart, really smart and unless you have evidence otherwise you should trust what it is doing. In this particular case, the optimizer realizes that if a table has only a few rows that using the index is actually slower than just spinning through the entire table. Just because PostgreSQL isn’t using your index today with a small number of rows, does not mean it won’t choose to use it later when you have more data or the query changes. Because he was just mocking up a design he didn’t have real world data, which is almost always a bad way to performance tune your system unless you are very familiar with how PostgreSQL behaves.

If you read the manual, you begin to understand the depth of “intelligence” that the query planner exhibits. For those who have spent a lot of time working with MySQL, this kind of behavior is probably a bit shocking. Last I knew, MySQL didn’t really have anything that would remotely compare with the query planner in PostgresSQL, Oracle, DB2 or even SQL Server.

The manual discusses it best:

Among all relational operators the most difficult one to process and optimize is the join. The number of possible query plans grows exponentially with the number of joins in the query. Further optimization effort is caused by the support of a variety of join methods (e.g., nested loop, hash join, merge join in PostgreSQL) to process individual joins and a diversity of indexes (e.g., B-tree, hash, GiST and GIN in PostgreSQL) as access paths for relations.

The normal PostgreSQL query optimizer performs a near-exhaustive search over the space of alternative strategies. This algorithm, first introduced in IBM’s System R database, produces a near-optimal join order, but can take an enormous amount of time and memory space when the number of joins in the query grows large. This makes the ordinary PostgreSQL query optimizer inappropriate for queries that join a large number of tables.

So what was the solution? Genetic algorithms, which won’t necessarily find the perfect solution, but will find a good solution very quickly. It doesn’t get used for the kind of queries that most people shoot at it, but when you start to join dozens of tables in weird ways—primarily in business intelligence or data warehousing applications—it can make the difference between finding the answer and crashing the server.

What this means is that the database adapts to the scale you’re working at. When you work with tiny data-sets (dozens/hundreds of rows), it may never use the indexes, and so you might not even bother creating them—especially if it’s a table with high turn-over. As you scale up, and change how you query the data, the database changes to using indexing, query planning and finally genetic optimizations. This raises the question: how do you deal with this in development?

The best strategy, and the one I use as often as possible, is to develop tools/scripts to create bogus data that “looks” like your real data but is largely random (or sequential depending) and can be regenerated as needed. Over time, you’ll amass a large number of utilities that can be glued together to fake databases with thousands of rows.

Why not use the real data? If you’re starting from scratch, you may have no idea what that data is, and it may not exist yet. If you’re working inside an organization, then it’s likely that there are all sorts of privacy and governance issues surrounding getting access to the data. Even if you could, you don’t want to have the responsibility of working with it.

As always, know your tools. The documentation to PostgreSQL is generally quite good, and the community is amazingly talented. Work with them.

This entry was posted at 3:44 pm on 16 September 2008 and is filed under PostgreSQL. You can follow any responses to this entry through the post-specific RSS 2.0 feed.

[...] PostgreSQL EXPLAIN at Pensieri di un lunatico minore "As always, know your tools. The documentation to PostgreSQL is generally quite good, and the community is amazingly talented. Work with them." (tags: postgresql chrispetrilli databases documentation queryplanner) [...]

Both comments and pings are currently closed.