Pensieri di un lunatico minore

27 January 2005 PostgreSQL

Enormous databases in PostgreSQL

So, I’ve been toying with storing huge amounts of data in PostgreSQL, mostly to avoid dealing with Oracle, and to find out if an open-source database can compete on the larger projects. By “huge amounts,” we’re talking hundreds of millions of rows, not just a few thousand.

The first problem I hit was INSERT performance, which is obvious once I thought about it. Since I’ve used SQL*Loader in Oracle a lot, I just switched over to using the COPY SQL syntax, which gave me anything from a 20-100x increase in performance. With that solved, I had to look at table organization and indexing.

I need to keep roughly 6 indexes on the table, one on a string, but the rest on integer fields. Unfortunately, I’ve found that performance of inserts seems to arithmetically degrade as you get above 1M rows. Before that it’s pretty flat, for all intents and purposes. My personal box drops from being able to do 5,000 rows/second to doing 500 rows/second, and so I’m investigating two options to deal with this, as it seems to be index-driven:

  1. Use partial indexing, based on some other factor, like time
  2. Use views to hide multiple tables
  1. Leverage inheritance in PostgreSQL

    The first is more “normal” of a solution, and similar to how one might do data partitioning on Oracle, although Oracle would also partition the data. Since I’d like to partition the data, I’m looking at inheritance as a solution. This appeals to me more than doing views, since it doesn’t have the maintenance overhead of keeping a view up to date. The idea being that I have a “root” in my inheritance graph:

    CREATE TABLE fooroot (
        ...
        ...
        ...);

    That creates everything I need in all the partitioned tables. From there, I just do this:

    CREATE TABLE fooinst1 () INHERITS (fooroot);

    That means queries work right, etc., but I do have to make sure each time I create one of these partition tables, I have to rebuild the indexes and foreign keys. Fortunately, this table only has a few foreign keys, and everything else can be rebuilt on the fly in a templated manner.

    The question then becomes, can I apply the use of tablespaces to solve the partitioning of data at that point? That would transform the DDL to:

    CREATE TABLE fooinst1 () INHERITS (fooroot) TABLESPACE foo2;

    I’ve got to test it out first, then build some tools to benchmark it, and make sure that I can build the tablespaces on the fly, and come up with a scheme for allocating them and determining when/where to break. The problem is that PostgreSQL, like Oracle, and any other MVCC, some simple queries with aggregates, like:

    SELECT COUNT(*) FROM FOO;

    which would normally give you a row count, can be very expensive operations, because they require a sequential scan due to the nature of multiple-versions being visible at the same time. That means either I need to trip off an internal counter, which isn’t so bad since I know who is doing the loading (me), or use a “range” setting, like time.

    Either way, I’ll then need to look at the performance differences between the two approaches, especially on concurrent retrieval time.

    This entry was posted at 4:49 pm on 27 January 2005 and is filed under PostgreSQL. You can follow any responses to this entry through the post-specific RSS 2.0 feed.

    Enormous PostgreSQL Databases

    Christoper Petrilli, over at PENSIERI DI UN LUNATICO MINORE discusses the inherent challenges of using the PostgreSQL DBMS in what is traditionally Oracle-dominated territory, tables with hundreds of millions of rows.

    Responses are currently closed, but you can trackback from your own site.