Pensieri di un lunatico minore

1 August 2006 Programming

Post-relational thought and the temporal organization of data

As a general contrarian and otherwise trouble maker, I’ve always attempted to look at a problem from any perspective except the commonly accepted one. Recently, I was having a discussion with some people about my general distaste for RDBMS and how I think that the collapse of the alternative database architectures1 in the 1980s, along with the hype surrounding object databases in the 1990s contributed to a stagnation in conceptualization of data management. While the general theoretical underpinnings of the relational model are useful, I think the rigid interpretation of the mathematical foundations into implementations has limited people’s ability to think about the problem2 in different ways.

So that leads us to the primary thing that people struggle with when thinking about data management in a non-RDBMS: how do I find something? We are taught that the only way to get something out of a data store is to use a query language, with SQL being the most readily visible. For example:

SELECT * FROM invoices WHERE customer_id = 120;

If you think about it, this is a great way to create ad-hoc queries into a system3. And there-in lies the trouble with thinking this way. A vast majority of the data retrieval that is performed is most emphatically not ad-hoc and is driven by the needs of software, not the arbitrary whims of a person.

This is where I’m going to turn to a problem domain that categorically is a horrendous fit for traditional RDBMS, namely a problem with the following characteristics:

  1. Substantially higher write than historical read4 ratio (perhaps 100:1)
  2. Low probability of specific record retrieval
  3. High temporal locality of reference
  4. Strong dependence on statistics and summarization for decision making
  1. Zero, or near-zero update likelihood

    Specifically, I’m talking about event management systems. In this case, I’m most familiar with security events (e.g., firewall logs, IDS, VPN, etc.), but much of this is highly correlated with other event-type systems, such as stock-market databases, and command-and-control infrastructure.

    The way this is usually solved is to throw a huge database systems (Oracle lots of times, sometimes MySQL on the lower end) and then use some features of the DBMS, such as partitioning, to make all of this “manageable.” What you end up with is a tiny number of tables, sometimes as few as one, which contain billions of rows of data. Absurd. An atrocity if ever there was one.

    One of the keys to proper system architecture is understanding the usage pattern. This applies doubly so when you are working with very large amounts of data. If you are storing a few thousand records, even a million, the impact of your data design is measured in percentage points. When you speak of billions, or in the case of one project, trillions, of data items, then you’re speaking about orders of magnitude.

    So why do RDBMS suck so badly for this application? It really comes down to the goals of a RDBMS being counterproductive to the goals we’ve outlined above.

    Updates. We don’t want any, ever, and vast amounts of code are written to manage updates to the database. This is, to a large extent, what the ACID model is all about. If you can’t update a record once it’s put in, then a lot of things get easier. Even ideas like MVCC, which don’t actually update existing data, but append new copies, add huge amounts of overhead for something that will never happen.

    Databases are designed for reads, not writes. The average database, in my experience, has less than 1% writes into it. I’m speaking here of accounting, HR, web sites, etc., which make up the backbone of the database industry. Because of this, the systems are designed for retrieval performance, but the minute you start to increase the write-speed into the database, the system falls apart. Even the idea of indexes collapse when your writes are more frequent than your reads. Some systems, notably those targeted at data warehousing, are better suited to this, and focus on other types of retrieval approaches, but still, they assume a “bulk load” mentality with data that stays constant for some period of time.

    Statistics cost a lot of time. There are two ways, in a traditional RDBMS, to get statistics out of your data. The first is to use all sorts of magical incantations in the SQL to summarize the data how you want it. This works great if you need it once, but is a farce if you need constant access to the statistics, and perhaps even more access than you need to the underlying data. The second is to use something like stored procedures and triggers to automatically keep a parallel summary table up-to-date with the statistics. This can work reasonably well when you have a low insert rate, but my experience is that higher and higher insert rates tend to cause this to collapse.

    So how do we store data for this kind of environment? First, the data is primarily temporal in nature, meaning that it is time sensitive. When you go to remove data, 95% of the time, you are cutting through your data store based on time, and no other characteristic. So, in order to organize data viably in the time domain, I propose this:

    The circles at the bottom represent individual events in the system. They are aggregated in real-time into minutes, hours, days, months and years. The accumulation of new events in monotonic and things are appended to each minute until the next one starts, whereby the next event is attached to that minute6.

    One of the nice things about this is that the objects can “auto-summarize” themselves. For example, if you were to send the message count to a Minute, then it can count the number of events it contains and report back. Once the minute is “closed,” then we never have to count them again and can keep that statistic for future reference. The same thing is true as we move up the scale. As an Hour closes, we can freeze all the statistics, even very expensive ones, so that we never have to calculate them again.

    By making these calculations lazy and on-demand, we can ignore them until someone actually cares, deferring some of the most expensive operations. For example, simple operations like count, average, standard deviation, etc., might be calculated as soon as possible, but others might be delayed until they are actually discovered.

    The question that always comes up when discussing this sort of organization of data is: how do I find anything? The answer is the same as it always is: you search for it. How you do that might be different though. In the case of this kind of temporal data organization, traversal is actually one of the easier ways to find data when your primary limiting characteristic is the time domain. Once you’ve limited the available data to examine to, let say, a month, then you have a more fine-grained retrieval mechanism you need to deploy.

    That mechanism is the topic of my next post on this issue.

    1 Alternative architectures included network, hierarchical and a selection of blended ideas.

    2 It is an instantiation of the Sapir-Worf hypothesis, whereby our ability to think about problems is limited by the language that we have available. Given few people are familiar with any model for databases other than strict relational models, we are limited in our discussion.

    3 I personally think that the popularity of SQL is a boon overall to the data management community. The familiarity of people with an ad-hoc query language is useful, however it has given rise to a level of laziness that can be terrifying sometimes.

    4 Historical read refers to a read that does not query against data that was just inserted. This is the kind of abuse that I find abhorrent in many database applications.

    5 The entire market for WORM optical media was driven by the financial and securities communities in order to provide a provable record that hadn’t been tampered with. An old employer of mine, Ten X Technologies even went so far as to develop a time-travel system that would allow you to revert all media to previous states in time. They sold very well.

    6 There is always the discussion of time in the security space, especially when the topic of data correlation comes up. Unfortunately it is my experience that one can rarely trust the time stamps that are received from the edge devices. More often than not, the clocks are simply wrong on them, and even the basics of deploying NTP have been ignored. While this time should not be forgotten, it can never be fully trusted.

    This entry was posted at 1:31 pm on 1 August 2006 and is filed under Programming. You can follow any responses to this entry through the post-specific RSS 2.0 feed.

    In a prior life I built datawarehouses for casinos. The patterns there were:

    • Huge aggregated reads, never by record.
    • Daily inserts, perhaps even less frequently.

      We made use of Sybase’s IQ system. While it was a pain in the ass in some circumstances, it was ‘fundamentally different’ in a number of useful ways.

      Data was stored by column, not by row. So ‘AVG’ was extremely fast, but select * from whatevers; was much slower.

      Its optimiser was particularly good at the kind of OLAP queries we’d issue (f.ex http://pastie.caboo.se/6927).

      On the downsides, INSERTs and UPDATEs were abominable. The custom LOAD TABLE functionality could pull in 100k rows per second, but an insert would often take 4 seconds per row.

      We had access to SQL, it was optimised for our usage, basically it was great fun.

    MonetDB is another example of a vertically-partitioned database that is designed for OLAP-style work. I’ve played with it some, and for certain types of queries, it’s 10x faster, just as you discuss.

    I also find a lot of the work that Google has done on BigTable to be interesting in this regard as they implement vertical partitioning as well.

    Responses are currently closed, but you can trackback from your own site.