Why Processing is Imperative

Posted by Lekhnath Bhusal on August 17, 2011

In data analytics, incremental processing for aggregation is essential. If we want to serve real-time data, we cannot afford to re-scan both the old data and the newly added data to recompute the overall aggregate. This makes incremental processing the first priority for real-time analytics, and it in turn requires processing over a structured dataset. In this article I will compare some MPP (Massively Parallel Processing) architectures in this light.
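To make the point concrete, here is a minimal sketch in plain Java (my own names, not any product's API) of an incrementally maintained average: a running (sum, count) pair is folded forward with each new batch, so the historical rows never need to be re-scanned.

```java
// Minimal sketch of incremental aggregation: keep a running (sum, count)
// pair and fold each newly arrived batch into it, instead of re-scanning
// the full history to recompute the average.
public class IncrementalAverage {
    private long sum = 0;
    private long count = 0;

    // Fold one new batch into the running aggregate; old rows are never
    // touched again.
    public void addBatch(int[] batch) {
        for (int v : batch) {
            sum += v;
        }
        count += batch.length;
    }

    public double average() {
        return count == 0 ? 0.0 : (double) sum / count;
    }

    public static void main(String[] args) {
        IncrementalAverage agg = new IncrementalAverage();
        agg.addBatch(new int[] {1, 2, 3}); // historical load, processed once
        agg.addBatch(new int[] {4, 5});    // new batch: only these rows scanned
        System.out.println(agg.average()); // prints 3.0
    }
}
```

The same pattern works for any aggregate that can be merged from partial states (sums, counts, min/max), which is exactly the property a real-time serving layer needs.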

Closed-source MPP architectures are normally suited to online serving of large datasets. Their vendors claim they are optimized for read-intensive queries and can support very large numbers of concurrent users. To that end, each product is optimized in one way or another for the specific domain it addresses. Where they fall short, however, is in processing.
As a first example, consider Vertica. It is a projection-based MPP architecture with separate storage for writing and reading. Its biggest gains come from projections, columnar structure, and compression. On the processing side, it uses a write-optimized store and moves data into a read-optimized store for serving. Yet it does not provide a convincing processing framework. It supports integration with Hadoop, which allows processing over semi-structured datasets. However, most of the time we need processing after we have structured the data into some tabular form, rather than over the semi-structured data. This is a gap in Vertica.
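The write/read split can be pictured with a small, hypothetical simulation (the class and method names here are mine, not Vertica's API): inserts land in an unsorted write-optimized buffer, and a mover step periodically merges them into the sorted store that reads scan.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy simulation of split storage: a write-optimized store (WOS) that
// accepts cheap unsorted appends, and a read-optimized store (ROS) kept
// sorted for scans. A mover step drains the WOS into the ROS.
public class WosRosSketch {
    private final List<Integer> wos = new ArrayList<>(); // fast, unsorted writes
    private final List<Integer> ros = new ArrayList<>(); // sorted, scan-friendly

    public void insert(int value) {
        wos.add(value); // no sorting work on the write path
    }

    // Periodic move-out: merge buffered writes into the read store.
    public void moveout() {
        ros.addAll(wos);
        wos.clear();
        Collections.sort(ros); // keep the read store in sorted order
    }

    public int rowsVisibleToFastReads() {
        return ros.size();
    }

    public static void main(String[] args) {
        WosRosSketch store = new WosRosSketch();
        store.insert(3);
        store.insert(1);
        store.insert(2);
        store.moveout();
        System.out.println(store.rowsVisibleToFastReads()); // prints 3
    }
}
```

Note what this design optimizes: write latency and read throughput, not general computation over the stored rows. That is exactly the processing gap discussed above.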
Another platform worth considering is Greenplum. Greenplum is a shared-nothing architecture well suited to processing, built on the PostgreSQL database. It also works well with the MapReduce framework. However, it lacks columnar structures, limiting its use in NoSQL-friendly environments, which are increasingly common for large datasets. Here too, we do not have full support for processing over structured data.
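The shared-nothing idea itself is simple to sketch (hypothetical names, not Greenplum's API): each row is routed to a segment by hashing a distribution key, so every segment owns a disjoint slice of the table and can scan or aggregate it independently of its peers.

```java
import java.util.ArrayList;
import java.util.List;

// Toy shared-nothing placement: hash each row's distribution key to pick
// a segment. Segments share no data, so each can process its own slice
// in parallel with no coordination on the scan path.
public class SharedNothingDistribution {
    public static List<List<String>> distribute(List<String> keys, int segments) {
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < segments; i++) {
            out.add(new ArrayList<>());
        }
        for (String key : keys) {
            int seg = Math.floorMod(key.hashCode(), segments); // deterministic routing
            out.get(seg).add(key);
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> segs =
                distribute(List.of("u1", "u2", "u3", "u4"), 2);
        // Every row lands on exactly one segment.
        System.out.println(segs.get(0).size() + segs.get(1).size()); // prints 4
    }
}
```

Because routing is deterministic, queries that filter on the distribution key can be sent to a single segment, which is where shared-nothing systems get their parallel speedup.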
The examples above are software solutions. Beyond these, there are complete appliance packages that optimize from the hardware up. These solutions (including Oracle's Exadata, Netezza, and Teradata) also, to the best of my knowledge, lack support for processing over structured datasets.

Finally, we have open-source database options, mainly BigTable-equivalent implementations, and they stand best in this direction. For example, HBase has full support for MapReduce and for incremental bulk loads over an existing dataset. This, I have to say, is a very important step. However, these implementations are still not fully functional for structured-data processing: HBase in its current stable version does not support custom code execution on the nodes, even though the BigTable paper already mentions this as a very important design feature. I hope HBase will add this support soon, so that we can enjoy the best of HBase.
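As a rough illustration of why per-node custom code matters, here is a local simulation (all names are hypothetical, not the HBase API): each "region" computes a small partial aggregate over its own rows, and the client merges the partials. With server-side execution of the first step, only the small partial results would cross the network instead of the raw rows.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of node-side aggregation: each "region" counts its own rows
// per key (the step that server-side custom code would run on the nodes),
// and the client merges the small partial results.
public class RegionSideAggregation {
    // Partial per-key count over one region's rows.
    public static Map<String, Long> partialCounts(List<String> regionRows) {
        Map<String, Long> partial = new HashMap<>();
        for (String key : regionRows) {
            partial.merge(key, 1L, Long::sum);
        }
        return partial;
    }

    // Client-side merge of partial results from every region.
    public static Map<String, Long> merge(List<Map<String, Long>> partials) {
        Map<String, Long> total = new HashMap<>();
        for (Map<String, Long> p : partials) {
            p.forEach((k, v) -> total.merge(k, v, Long::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Long> region1 = partialCounts(List.of("a", "b", "a"));
        Map<String, Long> region2 = partialCounts(List.of("b", "c"));
        Map<String, Long> total = merge(List.of(region1, region2));
        System.out.println(total.get("a")); // prints 2
    }
}
```

Without node-side execution, the `partialCounts` step has to run at the client, which means shipping every raw row over the network first. That is the cost of the missing feature lamented above.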