Designing Hadoop/HBase Based Solutions
In this blog post, I discuss a (very) high-level process for designing a Hadoop and HBase based system. Since SQL based solutions are what most people are familiar with, I will start by discussing how things would be designed in a relational manner and then talk about how the NOSQL solution differs. This seems to be the norm when discussing NOSQL solutions.
A good knowledge of data structures always helps in designing good systems, whether we are working with relational databases or NOSQL databases. However, this knowledge is much more important when working with NOSQL systems. One important fact to remember is that while much of the optimization in SQL based solutions happens at query time, NOSQL solutions (especially Hadoop/HBase) are optimized at design time. You design an optimized schema, and the queries become almost straightforward.
While there are differences, some common terms from the relational world have become the norm in the NOSQL school of thought as well. So, it is natural to start with those common terms and see how their interpretation and importance differ. I will start here with a very important term: "Entity". An entity in the relational world represents a record definition. Every instance of an entity represents a database record. If a single object has multiple values in many fields, it is broken down either into multiple records with some duplicated information across rows, or decomposed (normalized) into several entities. For example, if we wanted to store information about people who may have multiple addresses, we might create a person table (Person entity) and an address table (Address entity). To relate multiple addresses to a person, depending upon the read-time requirements, we might even add a third table to capture this relationship. So part of the optimization in the relational world also lies in designing for read requirements.
In HBase, however, an entity usually maps naturally onto a real-world object, folding much more of that object's information into a single entity. An HBase table is not really a table but more of a multi-dimensional map. Going back to the example of a person with multiple addresses, a person table would have:
- RowId of the table as personId
- A column family 'Info', which has multiple columns for the addresses along with other information. An example of a column in the 'Info' column family could be:
- name="John Doe"
If we were to add a list of friends to this table along with the dates they were added, we wouldn't create another table/entity, but simply add columns, and possibly a new column family, depending on how the data needs to be accessed. If some data is usually accessed together, keeping it in a single column family is preferable, because the data of a single column family is stored together, and sequential reads from disk are much faster. If we choose to add another column family called 'Friends', it could have columns like: time1=personId1, time2=personId2, and so on. Here, the column names (the times) are themselves part of the information: the time the friend was added to the list. Whenever a new friend is added, a new column is added to this column family.

Suppose we had to add another piece of information to the person entity, such as comments or messages sent by a person along with their dates. We could add a 'Comments' column family with columns like:
c_date1=comment1. Here, 'c' is the prefix for comment columns.
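To make this concrete, the person row described above can be modeled as a nested map: row key → column family → column qualifier → value. The following is an illustrative Python sketch of that structure, not the HBase API; the row key, timestamps, and address columns are made-up examples.

```python
# A single HBase row modeled as a multi-dimensional map:
# row key -> column family -> column qualifier -> value.
person_table = {
    "person:1001": {
        "Info": {
            "name": "John Doe",
            "address1": "12 Main St",   # hypothetical address columns
            "address2": "34 Oak Ave",
        },
        "Friends": {
            # the column name itself carries the time the friend was added
            "2013-05-01T10:00": "person:2002",
            "2013-06-12T09:30": "person:3003",
        },
        "Comments": {
            "c_2013-07-01": "first comment",  # 'c' prefix marks comment columns
        },
    }
}

# Adding a friend is just adding a column -- no schema change needed.
person_table["person:1001"]["Friends"]["2013-08-15T14:00"] = "person:4004"
```

Notice that each row can have a different set of columns within a family, which is exactly why adding information to an entity is cheap.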
As you can see, adding columns is cheap in HBase. However, changing the row structure or the column families is a bit more expensive. Therefore, an HBase table has to be well thought out at design time. The basic idea in designing HBase solutions is figuring out how to break the problem into entities so that joins are avoided and all information related to an entity is captured in a single record. This is what we call the first principle of NOSQL solutions: denormalization, the opposite of the normalization used heavily in the relational world.
The design of a solution starts with selecting the most generic and important entity and capturing its structure. Then, depending upon the business needs, we identify the specific information to be derived from this entity and design specific, smaller tables (aggregated entities). In the previous example, we identified 'person' as the primary entity of our solution: a message originates from a person, friends belong to a person, comments belong to a person, events belong to a person, and so on. We therefore design a 'big table' for the Person entity.
Once that is done, we can start designing the MapReduce process. The basic steps of MapReduce are:
MapReduce → Divide, Process, Accumulate (i.e. Map and then Reduce)
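The three steps can be sketched generically. This is a toy, in-memory illustration of divide/process/accumulate, not Hadoop code; the function names and the summing example are mine.

```python
from functools import reduce

def divide(records, chunk_size):
    """Divide: split the input into chunks that can be processed independently."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]

def process(chunk):
    """Process (Map): compute a partial result for one chunk."""
    return sum(chunk)

def accumulate(partials):
    """Accumulate (Reduce): combine the partial results into a final answer."""
    return reduce(lambda a, b: a + b, partials, 0)

chunks = divide(list(range(10)), chunk_size=3)
total = accumulate(process(c) for c in chunks)  # 45
```

In a real cluster, `divide` corresponds to input splits, `process` runs as parallel map tasks near the data, and `accumulate` is the reduce phase.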
The Map step requires the input to be in a format where many computations can be lined up and grouped into a chunk: related information is kept together. The person table described in the previous example is ready to be used as input to the Map step.
In the Map phase, we read a single record off the person table and then perform as many computation/aggregation operations as we can (or need) on the information in that record, so as to reduce I/O operations. Examples of such aggregations or summaries are: average number of friends, average number of comments per person, total number of comments by a single person, overall average activity per person, and so on. These aggregated values are written to the smaller summary or aggregate tables for each person record. The basic idea in designing the input table and the MapReduce job is to group the computations so that we do a lot of work per disk I/O, while at the same time limiting the amount of data copied over the network per aggregation task: accessing local data is always cheaper.
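A map phase over the person table might look like the following sketch. It models rows as nested maps; the aggregate names (`friend_count`, `comment_count`) and the row contents are illustrative, and a real job would emit these from a Hadoop Mapper rather than a plain loop.

```python
# Toy person table: row key -> column family -> columns.
person_table = {
    "person:1001": {
        "Friends": {"t1": "person:2002", "t2": "person:3003"},
        "Comments": {"c_d1": "hello", "c_d2": "world", "c_d3": "again"},
    },
    "person:1002": {
        "Friends": {"t1": "person:1001"},
        "Comments": {},
    },
}

# Map phase: read each person row ONCE and compute every per-person
# aggregate we need from it, so a single read feeds many computations.
summary_table = {}
for row_key, families in person_table.items():
    summary_table[row_key] = {
        "friend_count": len(families.get("Friends", {})),
        "comment_count": len(families.get("Comments", {})),
    }

# A reduce step can then fold the per-person values into global summaries.
avg_friends = sum(s["friend_count"] for s in summary_table.values()) / len(summary_table)
```

The per-person numbers land in a summary table, while cross-person averages come out of the reduce side.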
At this point, most of the optimization and computation work related to this data has already been done. Real-time queries are simply lookups against the aggregated values stored in the summary tables; runtime aggregation through MapReduce cannot be expected to meet latency needs. This is the second principle of NOSQL solutions: design for read.
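With the aggregates precomputed, a real-time query becomes a single keyed read rather than a scan-and-aggregate. A minimal sketch (the summary table contents and field names here are made up):

```python
# Precomputed at MapReduce time; queries never aggregate at read time.
summary_table = {
    "person:1001": {"friend_count": 2, "comment_count": 3},
    "person:1002": {"friend_count": 1, "comment_count": 0},
}

def get_summary(person_id, field):
    """Real-time query: a single keyed read, analogous to an HBase Get."""
    return summary_table[person_id][field]

answer = get_summary("person:1001", "friend_count")  # 2
```

The heavy lifting happened at write/batch time, so the read path stays cheap and predictable.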
In short, the basics of designing a solution based on HBase/Hadoop are:
- Capture related information together: Denormalization
- Localize operations: do many computations per unit of I/O and network bandwidth used.
- Design for read, or design-time optimization: since data is written once and read many times, writes are always satisfactorily fast. We need to think about the expected queries at the time of writing.