NoSQL Database Solution Evaluation
We here at Deerwalk spend a lot of time working with data in various forms and consider ourselves experts in relational databases. Given that deep experience, we have been very interested in the NoSQL movement and how we could apply it to some of the work we are doing with data. For us, some of the attractions of this technology are:
- Schemas built to optimize for reads
- Ability to easily scale horizontally
- Ability to have literally millions of columns, only a handful of which are used for each row
- Ability to apply parallel processing to large data sets
So we decided to run an internal investigation into the different NoSQL solutions and find one that would work well for projects we expect to work on in the future.
Identifying the candidates
The first step in our process was to identify potential solutions that we could use. The initial set of solutions we looked at included:
When looking at these solutions we evaluated them against some high-level requirements:
- Open source with a friendly license (Apache, GNU, etc.)
- Active developers with a history of lots of enhancements
- At least one “anchor” company using the technology
- Something close to production stability
We did a high-level evaluation of each solution and ruled some out because of missing features, difficult horizontal scaling schemes and other factors. Eventually we got the list down to three final candidates: HBase, Cassandra and Hypertable.
Overview of the candidates
As with pretty much all of the NoSQL solutions we looked at, all of our candidates are loose implementations of the design described in Google’s BigTable white paper. As such, they are all designed to support de-normalized, wide tables with lots of sparsely populated columns. These features work well with the data we typically handle.
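To make the "wide, sparsely populated" idea concrete, here is a minimal sketch in Python. It is a toy in-memory model and not the API of any of the candidates; the class name, row keys and column names are all made up for illustration.

```python
from collections import defaultdict


class SparseWideTable:
    """Toy model of a BigTable-style table: row key -> {column name -> value}."""

    def __init__(self):
        # Only cells that were actually written are stored, so a table can
        # have a huge number of distinct columns while each row stays small.
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        self._rows[row_key][column] = value

    def get(self, row_key, column, default=None):
        # Use .get() on the outer dict so that reads never create empty rows.
        return self._rows.get(row_key, {}).get(column, default)

    def columns_used(self, row_key):
        """Sorted list of the columns populated for one row."""
        return sorted(self._rows.get(row_key, {}))

    def distinct_columns(self):
        """Number of distinct column names across the whole table."""
        return len({col for cells in self._rows.values() for col in cells})


# Hypothetical data: each row uses only the columns it needs.
table = SparseWideTable()
table.put("user-001", "profile:city", "Boston")
table.put("user-001", "visit:2010-01-15", "checkup")
table.put("user-002", "visit:2010-03-02", "follow-up")

print(table.columns_used("user-001"))
print(table.get("user-002", "profile:city"))  # unset cell, falls back to None
```

A missing cell costs nothing here, which is the property that makes de-normalized schemas with very many columns practical.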
Key design differences between candidates:
|Feature|HBase|Cassandra|Hypertable|
|Distributed File System|HDFS|Built-in|HDFS|
|Server Coordination|ZooKeeper|Gossip|Hyperspace (Chubby)|
One other difference between these systems is whether they are eventually consistent or consistent. Both Hypertable and HBase claim to be consistent, which means that at any time all of the replicas of the data on the nodes agree with each other, so there is no chance of getting inconsistent results from a query. Cassandra takes a different approach and by default guarantees only that all replicas will eventually be consistent. This is done to give the system better performance. Cassandra recently added an option that lets the user specify that the system be consistent across all nodes.
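One common way to reason about this trade-off is the quorum overlap rule: with N replicas, a read that waits for R replicas is guaranteed to see a write acknowledged by W replicas whenever R + W > N. The sketch below is our own illustration of that general rule and not Cassandra code; the function names are hypothetical, and we are assuming (not quoting) that a per-request consistency setting maps onto choices of R and W.

```python
from itertools import combinations


def is_consistent_setting(n_replicas, write_acks, read_acks):
    """Quorum rule of thumb: reads see the latest write when R + W > N."""
    return read_acks + write_acks > n_replicas


def quorums_always_overlap(n_replicas, write_acks, read_acks):
    """Brute-force check: does every possible read set share at least one
    replica with every possible write set?"""
    replicas = range(n_replicas)
    return all(
        set(written) & set(read)
        for written in combinations(replicas, write_acks)
        for read in combinations(replicas, read_acks)
    )


# With 3 replicas: write to 1 and read from 1 can miss the newest value,
# while write to 2 and read from 2 always touches a replica that has it.
for w, r in [(1, 1), (2, 2), (3, 1)]:
    print(w, r, is_consistent_setting(3, w, r), quorums_always_overlap(3, w, r))
```

Requiring every replica to acknowledge each write (W = N) is the extreme case: any single-replica read is then consistent, at the cost of slower writes that fail if any replica is down.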
Key Decision Points
Having identified HBase, Cassandra and Hypertable as our final candidates, we then identified the key attributes on which we were going to evaluate them. These attributes were:
- System Setup
- Dynamic Scaling
- Batch Updates
Once these attributes were identified, the team was ready to evaluate each candidate. The evaluation was done by having a member of the team set up each of these systems in a fully distributed 5 node cluster. The team first set up identical base systems under our Amazon EC2 account. These instances were configured as EBS-backed machines, which allowed us to start and stop them as we were working on them.
We should say that we expected to collect metrics on all three systems and then make a decision based on those metrics. Instead, we found that only one solution met the first three criteria, so we did not get into an extensive performance test. We are comfortable with the performance we are able to get out of the system and will share that at a later date.
Now that we have the criteria set up, we will review our evaluation of each of these systems.
Hypertable
When we first started this evaluation we had great hopes for Hypertable. It is written in C++ and we believed that this would mean better performance than the other, Java-based solutions. Combined with published reports by the major contributors that it was orders of magnitude faster than either HBase or Cassandra, this made it the solution that held the most promise.
Our initial encounter with Hypertable was trying to set up the cluster. This required the team to become familiar with Capistrano. In addition, because Hypertable lets the implementer decide which distributed file system to use, the team first had to set up the Hadoop Distributed File System (HDFS) and then navigate the configuration of Hypertable to work with HDFS. This made the setup the most difficult of any of the solutions.
After the cluster was set up, the team found that while Hypertable did perform well in raw tests, the loose integration with Hadoop made for slow overall performance. The final straw for Hypertable was the lack of dynamic scaling: our solution needed to be able to scale with the needs of our clients, and Hypertable was not yet up to the task. This led to the disappointing decision to rule Hypertable out of consideration.
Cassandra
Cassandra was originally developed by Facebook and had a pretty impressive team of engineers working on it, so it seemed like a strong contender. One of the big advantages of Cassandra is supposed to be its easy configuration and deployment. We found that to be true, as Cassandra was by far the easiest solution to get up and running. The team had a single node installation working in a couple of hours and found the setup of the multi-node cluster to be fairly straightforward.
Once the cluster was up and running we started work on a batch loader that would test the batch processing of Cassandra. The obvious solution for this work was to create a MapReduce task that would read the raw data and insert it into the Cassandra database. This is where we found Cassandra to have some significant issues. We ran into multiple problems getting a MapReduce task to interact with Cassandra, and eventually decided that the lack of documentation and the issues we hit were significant enough to rule out Cassandra.
HBase
The final solution was the more mature HBase. The first task was to get the Hadoop and HBase services up and running. We initially got a single node cluster up and running, which did not use the Hadoop Distributed File System (HDFS) and was consequently extremely easy to set up.
The 5 node cluster, on the other hand, was much more difficult. Originally we tried using the base Apache versions of both Hadoop and HBase and had significant issues getting the servers to communicate. While HBase has extensive documentation, the configuration items are also extensive, and it is very easy to misconfigure part of the system and cause significant issues. Eventually the team found the Cloudera versions of Hadoop and HBase. These versions come with scripts that help significantly with the configuration, especially in the EC2 environment.
Once the system was up and running we took the raw test data and began the task of writing a batch loader. Initially we took a naïve approach and did raw inserts inside of MapReduce tasks. Since HBase is built on top of Hadoop it was extremely easy to execute MapReduce tasks against the database, and we quickly had the naïve approach working. Finding the initial performance to be less than desirable, we turned to the HBase bulk load facility (http://hbase.apache.org/bulk-loads.html). The basic mechanics of this approach are to read the raw data into a MapReduce task that writes, directly to the file system, a set of files that represent the HBase table. This allows for very fast loading of data, and we achieved acceptable performance using this method. We will have a later blog article that gives a walkthrough of how this can be done, along with some of the issues we encountered and how we overcame them.
Having become comfortable with HBase, having found that its community support and activity were much larger than those of the other candidates, and given the issues with the other candidates, the team decided to adopt HBase as Deerwalk’s distributed database engine.
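The idea behind the bulk load can be sketched in a few lines of Python. This is only an in-memory illustration of the mechanics described above and not the HBase API or our actual loader: the real facility runs as a MapReduce job and writes HBase's own file format. The record layout (`row_key,column,value`), the split points and all function names are assumptions made for the example.

```python
import bisect
from collections import defaultdict


def parse_record(line):
    """Map step: turn a raw 'row_key,column,value' line into a cell tuple."""
    row_key, column, value = line.strip().split(",", 2)
    return row_key, column, value


def region_for(row_key, split_points):
    """Index of the key range containing row_key.

    split_points must be sorted; each split point is the first key of the
    next range, so a key equal to a split point lands in the later range.
    """
    return bisect.bisect_right(split_points, row_key)


def build_region_files(lines, split_points):
    """Group cells by key range and sort each group, mimicking one sorted
    output file per range instead of one insert per cell."""
    regions = defaultdict(list)
    for line in lines:
        cell = parse_record(line)
        regions[region_for(cell[0], split_points)].append(cell)
    return {index: sorted(cells) for index, cells in regions.items()}


# Hypothetical raw input and a single split point.
raw = ["row-k,cf:a,1", "row-t,cf:a,2", "row-c,cf:b,3", "row-c,cf:a,4"]
for index, cells in sorted(build_region_files(raw, ["row-m"]).items()):
    print(index, cells)
```

The speedup comes from doing the sorting and partitioning in bulk and then handing finished files to the database, rather than pushing every cell through the normal write path.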