Introduction to Non-Relational Data Storage using HBase
Big Talk About BigTable
Relational databases have long dominated the way data is stored. Proponents of relational databases emphasize normalization, and for valid reasons: maintainability, integrity (summed up in ACID) and security have been the primary concerns of classical relational database design. Backed by Moore’s law, it was easy to theorize that processing speed would inevitably become a trivial factor next to engineering problems like consistency and integrity. In some sense that is true, but what went unaccounted for were sites like Google and Facebook, which need to process enormous volumes of data, on the order of petabytes, in near real time.
Processing is much cheaper than it was even a decade ago, but it is still very costly compared to storage. We can get terabyte hard disks that fit in a pocket, but the curve for processing speed at a constant price is less steep. With the advent of sites like Google it became necessary to exploit this difference and to spend storage for the sake of efficiency. Out of this came what is now known as Google BigTable.
BigTable is a non-relational database which leverages distributed processing and storage to allow large amounts of data to be read in a relatively short time. Following the same design, the Apache Software Foundation built a similar system called HBase on top of HDFS, the distributed file system of its Hadoop project. HBase is currently used by sites like Twitter and Facebook, and for such workloads it can be much more efficient than a relational database, since expensive operations like joins are avoided.
What is HBase?
Like BigTable, HBase is a sparse, distributed, persistent, multidimensional sorted map.
From the Wikipedia article, a map is "an abstract data type composed of a collection of keys and a collection of values, where each key is associated with one value." This means that each row, like a row in an RDBMS, can be identified by a unique key. The key is just a sequence of bytes that is left uninterpreted by the system. Rows are kept sorted in storage by key, compared byte by byte, so that rows with similar keys end up adjacent to each other.
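To make the idea of a sorted map concrete, here is a minimal in-memory sketch. It is only a toy model of the behaviour described above, not how HBase stores data; the class name `SortedByteMap` and the example email keys are made up for illustration.

```python
import bisect


class SortedByteMap:
    """Toy map that keeps its byte-string keys in sorted order (hypothetical name)."""

    def __init__(self):
        self._keys = []    # row keys, kept in byte-wise sorted order
        self._values = {}  # row key -> value

    def put(self, key: bytes, value):
        if key not in self._values:
            bisect.insort(self._keys, key)  # insert while preserving order
        self._values[key] = value

    def get(self, key: bytes):
        return self._values.get(key)

    def scan(self, start: bytes, stop: bytes):
        """Yield (key, value) pairs with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._values[key]


table = SortedByteMap()
for email in (b"zed@example.com", b"alice@example.com",
              b"bob@example.org", b"alan@example.com"):
    table.put(email, {"seen": True})

# Keys that share the prefix "al" sit next to each other,
# so one range scan returns them together.
al_rows = [key for key, _ in table.scan(b"al", b"am")]
```

Because the keys are sorted, reading a group of related rows is a single pass over a contiguous range rather than a lookup per row.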
Logically, a table is divided into rows, and each row into column families. A column family contains an arbitrary number of columns, each addressed as “column-family:column-name”. To make the concept of a column family clearer, let's take the example of a person table.
The row key can be the email address of the person, and the first column family can be “name”. The column family “name” can contain the columns “first-name”, “last-name” and “middle-name”. So for the first row you would get the first name as “name:first-name”. An important point here is that even though every row has the same column families, the columns inside a family can differ from row to row. This means that the second row may not contain “first-name” at all, while the first row may contain all the columns. This is what is meant by sparse.
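The person table can be pictured as nested maps. The sketch below uses plain Python dictionaries, and the helper `get_cell` as well as the sample names are hypothetical; it only illustrates the addressing scheme and the sparseness, not the HBase client API.

```python
# Hypothetical "person" table: row key -> column family -> column -> value.
person = {
    b"alice@example.com": {
        "name": {"first-name": "Alice", "middle-name": "Mary", "last-name": "Smith"},
    },
    b"bob@example.org": {
        "name": {"last-name": "Jones"},  # this row stores no first-name at all
    },
}


def get_cell(table, row_key, address):
    """Look up a cell by its 'family:column' address; return None if absent."""
    family, column = address.split(":", 1)
    return table.get(row_key, {}).get(family, {}).get(column)


first = get_cell(person, b"alice@example.com", "name:first-name")
missing = get_cell(person, b"bob@example.org", "name:first-name")
```

Here `first` is "Alice" while `missing` is None: the second row simply has no such column, and nothing is stored for it.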
Each column can hold several versions of its data, each identified by a timestamp. For example, the person table might contain a column family “work-place” with a column “company” to record where the person works. The value of “work-place:company” changes when the person changes jobs, so it will have different versions over time.
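Versioning can be sketched as one more level of the map, from timestamp to value. The class `VersionedCell`, the integer timestamps and the company names below are assumptions made purely for illustration.

```python
class VersionedCell:
    """Toy cell that keeps every value written, keyed by timestamp (hypothetical name)."""

    def __init__(self):
        self._versions = {}  # timestamp -> value

    def put(self, timestamp: int, value):
        self._versions[timestamp] = value

    def get(self, as_of=None):
        """Return the newest value, or the newest one with timestamp <= as_of."""
        candidates = [ts for ts in self._versions if as_of is None or ts <= as_of]
        if not candidates:
            return None
        return self._versions[max(candidates)]


# "work-place:company" for one person, written at two points in time.
company = VersionedCell()
company.put(1000, "Acme")
company.put(2000, "Globex")

latest = company.get()              # value after the job change
earlier = company.get(as_of=1500)   # value as it was before the job change
before = company.get(as_of=500)     # nothing had been written yet
```

A plain read returns the most recent version, while a read that names a point in time sees the value that was current then.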
HBase is not ACID-compliant, but it does guarantee certain specific properties
Atomicity
All mutations are atomic within a row. Any put will either wholly succeed or wholly fail. APIs that mutate several rows will not be atomic across the multiple rows. Mutations are seen to happen in a well-defined order for each row, with no interleaving.
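The difference between a single-row put and a multi-row batch can be shown with a small single-threaded sketch. Everything here is hypothetical (the `ToyTable` class, its methods, and the rule that a None value makes a put fail); it models the guarantee described above and leaves concurrency out entirely.

```python
class ToyTable:
    """Toy table with all-or-nothing puts per row (hypothetical, single-threaded)."""

    def __init__(self):
        self.rows = {}  # row key -> {"family:column": value}

    def put(self, row_key, cells):
        """Apply every cell to one row, or none of them."""
        updated = dict(self.rows.get(row_key, {}))  # work on a private copy
        for column, value in cells.items():
            if value is None:  # stand-in for any failure during the put
                raise ValueError(f"no value for {column}")
            updated[column] = value
        self.rows[row_key] = updated  # publish the whole row in one step

    def put_many(self, batch):
        """Multi-row helper: each row is atomic, the batch as a whole is not."""
        for row_key, cells in batch:
            self.put(row_key, cells)


table = ToyTable()
table.put("alice", {"name:first-name": "Alice"})

# A put that fails halfway leaves the row exactly as it was.
try:
    table.put("alice", {"name:last-name": "Smith", "name:middle-name": None})
except ValueError:
    pass
row_after_failed_put = table.rows["alice"]

# A batch that fails on its second row has already changed the first one.
try:
    table.put_many([("bob", {"name:first-name": "Bob"}),
                    ("carol", {"name:first-name": None})])
except ValueError:
    pass
```

After the failed batch, the row for "bob" exists and the row for "carol" does not, which is the sense in which multi-row mutations are not atomic.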
Consistency and Isolation
All rows returned via any access API will consist of a complete row that existed at some point in the table's history.
Consistency of Scans
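A scan is not a consistent view of a table, and scans do not exhibit snapshot isolation. Each row a scan returns is itself a complete, committed row, but different rows may reflect different points in time. Those familiar with relational databases will recognize this isolation level as "read committed".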
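A small sketch can show what the absence of snapshot isolation looks like in practice. The `scan` generator and the row contents below are hypothetical, and the model is a plain dictionary rather than anything HBase does internally; a write is treated as committed the moment the dictionary entry is replaced.

```python
def scan(rows):
    """Yield (key, copy of row) in key order, reading each row only when reached."""
    for key in sorted(rows):
        yield key, dict(rows[key])  # a complete copy of the row as it is right now


rows = {
    "row1": {"cf:a": 1, "cf:b": 1},
    "row2": {"cf:a": 1, "cf:b": 1},
}

scanner = scan(rows)
first = next(scanner)

# A write to row2 commits while the scan is still in progress.
rows["row2"] = {"cf:a": 2, "cf:b": 2}

second = next(scanner)
```

The scan sees row1 as it was before the write and row2 as it is after it, so the result is not a snapshot of the table at any single moment. Each row on its own is still complete: row2 never shows up with one old column and one new column.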
Durability
All visible data is also durable data. That is to say, a read will never return data that has not been made durable on disk. Any operation that returns a "success" code (e.g. does not throw an exception) will be made durable. Any operation that returns a "failure" code will not be made durable (subject to the Atomicity guarantees above).
No reasonable failure scenario will affect any of the guarantees listed above.
In short, we went over how HBase is structured and what its features are in general. We also looked at why it came into being and how it is serving its purpose now.