Data management in SIM/SEM systems
Having spent many years working in the area of security information/event management, and dealing with the complicated nature of data management, I thought maybe I’d address what I see as one of the largest issues with the approach taken by most systems. Put quite simply, they treat the database as the hammer for all their nails, and not just any database, but usually some gigantic behemoth database, like Oracle. Oracle has its purposes, but as I’ll outline below, this isn’t really one of them. In reality, it’s a golden hammer.
The reason many people choose to deploy a SIM or SEM system is to attempt to deal with the influx of data coming from the multitude of sensors in their network. It is not unusual for a mid-size organization to deal with hundreds of events per second during many hours of the day. There are actually three data management issues in play, each of which has substantially different requirements, and no single technological approach can solve all three. That is because three distinct activities are being performed.
Reporting
While this isn’t the flashiest aspect, it’s actually the one that drives adoption for most people. Reporting, and by way of it compliance, is a boring, tedious and periodic activity that runs over large amounts of data, but usually just once. To do this, you could use a large database with lots of indexes, and write gigantic SQL queries that join dozens of tables together. You could do that, but it’s not a very good way to approach the problem. The problem is better addressed through strategies like MapReduce that make a minimal number of passes through the data, and generally do so sequentially.
The reason a sequential approach to data management is appropriate here is twofold. First, you are dealing with huge amounts of data, and often throwing it into various “buckets” for reporting. That means lots of queries, which can be difficult to run with high performance. In fact, experience tells me that most of them end up as a sequential scan over a large chunk of the data anyway, because keeping enough indexes around to make them any faster would be a killer for insertion performance. Second, if the data is correctly structured, a sequential scan of a huge chunk of it can run at the maximum bandwidth of the I/O system. Today’s computers often have substantially more CPU power than I/O bandwidth.
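A minimal sketch of the single-pass idea, assuming events are plain dictionaries with hypothetical `ts`, `src` and `severity` fields (none of these names come from a real product). Instead of running one query per report, every report bucket is updated while the data streams past once:

```python
from collections import Counter

def bucket_counts(events, key_funcs):
    """Make one sequential pass and update every report bucket per event."""
    counts = {name: Counter() for name in key_funcs}
    for event in events:
        for name, key_func in key_funcs.items():
            counts[name][key_func(event)] += 1
    return counts

# Hypothetical sample events; a real system would stream these from disk.
events = [
    {"ts": 1188300000, "src": "10.0.0.5", "severity": "high"},
    {"ts": 1188300030, "src": "10.0.0.5", "severity": "low"},
    {"ts": 1188303700, "src": "10.0.0.9", "severity": "high"},
]

reports = bucket_counts(events, {
    "by_source": lambda e: e["src"],
    "by_severity": lambda e: e["severity"],
    "by_hour": lambda e: e["ts"] // 3600,  # hour bucket since the epoch
})
```

Adding another report here costs one more key function, not another scan of the data.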
In fact, flat files are often the best strategy for this sort of approach. That is not to say that the data should be stored in its “original” form, such as syslog; instead it is often partially normalized, and things like timestamps are certainly dealt with appropriately. By partitioning the data along the time dimension, and potentially others, it is possible to reduce the amount of data that has to be sequentially read to only slightly more than a traditional RDBMS would read, but with a tiny fraction of the overhead. In addition, since I/O bandwidth is often at a premium, and this bulk data can accumulate at a terrifying pace, compression is a huge win for this kind of scenario.
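As a rough illustration, here is one way time-partitioned, compressed flat files might look. The one-file-per-UTC-day layout, the JSON-lines encoding, and the helper names `append_event` and `scan` are all assumptions made for this sketch, not a description of any particular product:

```python
import gzip
import json
import os
import tempfile
from datetime import datetime, timedelta, timezone

def partition_path(root, day):
    """One gzip-compressed JSON-lines file per UTC day (assumed layout)."""
    return os.path.join(root, f"events-{day.isoformat()}.jsonl.gz")

def utc_day(ts):
    return datetime.fromtimestamp(ts, tz=timezone.utc).date()

def append_event(root, event):
    # Appending to a gzip file adds a new compressed member, which reads back fine.
    with gzip.open(partition_path(root, utc_day(event["ts"])), "at", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")

def scan(root, start_ts, end_ts):
    """Sequentially read only the daily partitions overlapping the time range."""
    day, last = utc_day(start_ts), utc_day(end_ts)
    while day <= last:
        path = partition_path(root, day)
        if os.path.exists(path):
            with gzip.open(path, "rt", encoding="utf-8") as f:
                for line in f:
                    event = json.loads(line)
                    if start_ts <= event["ts"] <= end_ts:
                        yield event
        day += timedelta(days=1)

root = tempfile.mkdtemp()
day0 = 1188259200  # midnight UTC at the start of an arbitrary day
for i, offset_days in enumerate([0, 1, 3], start=1):
    append_event(root, {"id": i, "ts": day0 + offset_days * 86400 + 10})

# Only the first two daily files are opened; the third is never touched.
two_days = list(scan(root, day0, day0 + 2 * 86400))
```

The time range prunes whole files before any byte is read, which is the same job an index on the timestamp column would do in an RDBMS, without the insertion cost.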
Forensic Analysis
The next aspect that is often behind purchases is being able to do postmortem, or at least post-breach, research, which requires searching for various bits of interesting information and tying them together. Often this is by IP address, but sometimes it involves visualization of historical data and cross-linking of huge swaths of information. This is not something that is overly response-time sensitive; however, full-blown sequential scans are probably too slow for this kind of environment. What is needed is either “predictive indexing” or a new data model. Predictive indexing looks at the past history of searches and attempts to form indexes (temporarily) on information of interest. In many ways, this is a form of link analysis [1], and the tools for accomplishing it should feed the needs of this. The alternative data model is one that can hold direct links between items and can substantially accelerate the performance of the system. Objectivity/DB is one player that has a lot of traction in the intelligence community, as does AllegroGraph, which is built around the idea of RDF.
This is not to say that you would want to store duplicate copies. Instead, you are often taking a slice of data (potentially gigabytes) and wanting to “spin it around” to look at how things are interconnected. Often it’s sliced along either the time or the address dimension, though there are other methods that may be useful in some situations. Either way, you do not have real-time requirements, but you do need responsiveness once you’re working with the data.
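To make the predictive indexing idea concrete, here is a toy sketch over an in-memory slice of records. The class name `PredictiveIndex`, the `threshold` parameter and the record fields are hypothetical; the only point is that an index on a field is built once the search history suggests analysts keep asking about it:

```python
from collections import Counter, defaultdict

class PredictiveIndex:
    """Build a temporary index on a field once it has been searched often enough."""

    def __init__(self, records, threshold=2):
        self.records = records
        self.threshold = threshold      # searches on a field before indexing it
        self.search_counts = Counter()  # history of which fields get searched
        self.indexes = {}               # field name -> {value: [records]}

    def search(self, field, value):
        self.search_counts[field] += 1
        if field not in self.indexes and self.search_counts[field] >= self.threshold:
            index = defaultdict(list)
            for record in self.records:
                index[record.get(field)].append(record)
            self.indexes[field] = index
        if field in self.indexes:
            return self.indexes[field].get(value, [])
        # No index yet: fall back to a sequential scan of the slice.
        return [r for r in self.records if r.get(field) == value]

records = [
    {"src": "10.0.0.5", "dst": "10.0.0.9", "port": 22},
    {"src": "10.0.0.5", "dst": "10.0.0.12", "port": 445},
    {"src": "10.0.0.9", "dst": "10.0.0.12", "port": 22},
]
store = PredictiveIndex(records, threshold=2)
first = store.search("src", "10.0.0.5")   # answered by a sequential scan
second = store.search("src", "10.0.0.9")  # second search on "src" builds the index
```

A real implementation would also need to expire indexes that stop being used, which this sketch leaves out.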
Real-time Analysis
The final element is real-time analysis of the incoming stream, and it is the bit that is both the flashiest and holds the most promise for easing the life of the analyst. Unfortunately, in my experience, the systems are so rudimentary that what they turn into is a simplistic forward-looking “rule” system: IF x AND y THEN z, or something equally trivial. This requires the continual creation of more and more rules, and explodes into an unmaintainable morass. Attempts at statistical analysis can sometimes be more fruitful, but unfortunately the time window they look at is often measured in minutes, and the truly dangerous attacks often occur over hours or days. The requirements here are similar to those of forensic analysis, but constrained in both time and space. Because you have to make decisions quickly, you can’t afford even the luxury of looking something up in an index on disk; you need an answer now, which means in-memory databases.
Pish posh, you might say, insinuating that it’s simply too much data, but that’s not true. Assume 1KB per record, including indexing overhead, and consider that a relatively modest machine can hold 64GB of RAM. If you instead stuck 32GB in it and reserved 2GB for the OS and software, you’re talking about approximately 31.5 million events. At 500 events/second (very high for a normal enterprise), that’s about 17 hours of data stored in memory. For a more normal organization that sees perhaps 50 events/second, you’re talking about roughly 175 hours, or over a week. That’s on a single machine, which costs less than $10k. That might seem expensive, but the license on much of this software is $100K+, so that isn’t an outrageous sum of money for hardware. Shove it on a nice blade server, and you can cram it all into 4-8U of space in a rack.
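The back-of-the-envelope numbers can be checked with a few lines. The function name `retention_hours` is made up for this sketch, and it treats 1KB as 1,024 bytes and 1GB as 1,024^3 bytes:

```python
KIB = 1024
GIB = 1024 ** 3

def retention_hours(ram_gib, reserved_gib, record_bytes, events_per_second):
    """Return (records that fit in RAM, hours of history they represent)."""
    usable_bytes = (ram_gib - reserved_gib) * GIB
    records = usable_bytes // record_bytes
    return records, records / events_per_second / 3600

records, hours_busy = retention_hours(32, 2, KIB, 500)
_, hours_typical = retention_hours(32, 2, KIB, 50)

print(f"{records:,} records")              # 31,457,280 records
print(f"{hours_busy:.1f} hours at 500/s")  # 17.5 hours at 500/s
print(f"{hours_typical / 24:.1f} days at 50/s")  # 7.3 days at 50/s
```

Doubling the record size halves the retention, so the 1KB-per-record assumption is the number to check first against real data.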
The point of all this isn’t to ramble endlessly, though that often can be the case. It’s that people who approach problems with a single hammer (e.g., an RDBMS) often don’t understand that different tasks have different goals. In a SIM/SEM, the RDBMS is really not that useful except maybe for the forensic-analysis component. Everything else? Doesn’t belong there, and by continuing to lean on the crutch we’re holding ourselves back.
[1] Link analysis is a form of network analysis, which is part of the whole field of graph theory.
This entry was posted at 12:56 pm on 28 August 2007 and is filed under Technology.
Having worked on large data warehousing applications in Oracle, Sybase, Informix, DB2, PostgreSQL and MySQL, I’d say that if every single implementation suffers from similar issues, then whether or not the concept of relational theory works is really irrelevant. Additionally, and this is something I should have addressed directly, the primary goals of ACID compliance are simply not interesting in certain applications. The overhead of ACID compliance is huge, and while it is absolutely critical to the historical mainstay of the RDBMS, it’s nothing but needless cruft for some applications.
As for using an RDBMS as a “flat file,” I can tell you that benchmarks on commodity hardware show a hit of roughly 30-50% on sequential scan rates. While this is not an issue in OLTP applications, as you will no doubt be aware, in data warehousing sequential scans are the rule, not the exception. Because of that, traditional RDBMS systems are not overly interesting for large data scans.
When your data access model is largely pointer-driven (such as in a network database), the performance difference is gigantic. In some rough comparisons between PostgreSQL and AllegroCache, I observed differences of an order of magnitude or more in graph workloads. No relational database is designed to handle massive graph-style queries like reachability, condensation, cycle removal, etc. This is especially painful with digraphs that have cycles in them.
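For readers unfamiliar with these queries, here is a small sketch of reachability over a directed graph that contains a cycle. The adjacency-dictionary representation and the host addresses are purely illustrative and say nothing about how AllegroCache or PostgreSQL work internally:

```python
from collections import deque

def reachable(graph, start):
    """Breadth-first search; the seen set keeps cycles from looping forever."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

# Hypothetical "host A connected to host B" links with a cycle among three hosts.
graph = {
    "10.0.0.5": ["10.0.0.9"],
    "10.0.0.9": ["10.0.0.12"],
    "10.0.0.12": ["10.0.0.5"],  # closes the cycle
    "10.0.0.20": ["10.0.0.5"],
}
hosts = reachable(graph, "10.0.0.5")
```

With direct links, each step is a pointer dereference. Expressed relationally, each hop typically becomes another join or an iteration of a recursive query.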
I understand that for many applications an RDBMS is an acceptable solution, but the fact that you’ve never seen it collapse into a gigantic morass of tangled query planner attempts doesn’t mean it doesn’t happen.
The problem is that you don’t really understand databases. You identify Oracle woes as inherent to RDBMSs, while they are Oracle or even SQL particularities.
When you propose flat files for reporting, you forget RDBMSs can be used in exactly the same way, while also making normalised data available for more dynamic uses.
When you talk about forensic analysis proposing object or graph DBMSs, you forget you are fixing a few data access paths to the detriment of all others. It may be nice for reporting, but you can do the same with foreign keys on the logical side, and indexing and materialised views on the physical one; when you talk about linking, you are mixing up the logical and physical levels.
Same for real-time analysis. In fact, due to normalisation, RDBMSs are uniquely suited for in-memory databases.
Don’t let current implementations blind you to concepts and possibilities.