Deerwalk Blog

Secondary Indexing in HBase for Easy Sorting

Posted by Deerwalk on February 15, 2012

This is an introductory tutorial on a possible methodology for sorting data in HBase in descending order. It also provides insight into what HBase is and how it differs from a conventional RDBMS.

Overview:

HBase is counterintuitive for anyone with experience of traditional row-based Relational Database Systems. The major difference between HBase (or any of the other Google Bigtable clones) and an RDBMS is that while an RDBMS focuses mainly on how the data is going to be stored, HBase focuses on how the data will be accessed. That is, the RDBMS model mainly deals with identifying entities and their relationships, mandating normalization of the data for ACID assurance. In a NoSQL model like HBase, data is stored in a column-oriented big table which is, in reality, a huge hash table. Each quantum of a record is stored in a cell which is addressed by a row key / column family / column qualifier / timestamp tuple. Since the underlying hash table data structure has, in general, an average and amortized time complexity of O(1) and a worst-case complexity of O(n), it is suitable for retrieving large data as quickly as possible. Further, HBase is extremely efficient at dealing with sparse data (records where most of the column values are missing).

The biggest intuitive hurdle HBase imposes is its immutable column families: a column family, fixed when the table is created, is a container for columns, and each column name can itself contain data.
For instance, if we are tracking records of patients arriving at a hospital, the following is a plausible schema:

PatientId1 -> Person:Name -> Ram
Person:Age -> 30
Person:Gender -> Male
Disease:Cold:DiagnoseDate -> 2-10-2010
Disease:Cold:Condition -> Complain
Disease:Fever:DiagnoseDate -> 12-25-2012
Disease:Fever:Condition -> Mild

PatientId2 -> Person:Name -> Radha
Person:Age -> 32
Person:Gender -> Female
Disease:Bronchitis:DiagnoseDate -> 2-20-2007
Disease:Bronchitis:Condition -> Serious

In the above schema the table has row key PatientId and two column families, Person and Disease. PatientId1 visited the hospital twice, for two different cases, Cold and Fever. So the column name (Disease:Cold:DiagnoseDate) itself contains the data 'Cold' as part of the column qualifier (Cold:DiagnoseDate), and the two cases get two different column names. Likewise, in the case of PatientId2, Ms. Radha visited the hospital with a Bronchitis case, so under the column family Disease new columns Bronchitis:DiagnoseDate and Bronchitis:Condition are added.

In a relational database this would have been modeled with a one-to-many mapping table, but here a new column is simply added for each visit. Therefore in HBase there are two kinds of column family: static (Person) and dynamic (Disease).

Meanwhile, to access the value of a particular tuple, a GET operation with a combination of row key (PatientId) and column qualifier key (Disease:$DiseaseName$:DiagnoseDate) needs to be carried out.
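
For illustration, here is a minimal sketch of such a GET using the Java client API of the era (the table name NewPatientInfo is taken from the later sections; error handling is omitted):

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "NewPatientInfo");
Get get = new Get(Bytes.toBytes("PatientId1"));
// The disease name is embedded in the qualifier, so it must be known up front
get.addColumn(Bytes.toBytes("Disease"), Bytes.toBytes("Cold:DiagnoseDate"));
Result result = table.get(get);
String diagnoseDate = Bytes.toString(
        result.getValue(Bytes.toBytes("Disease"), Bytes.toBytes("Cold:DiagnoseDate")));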

Sorting Problem:

In an RDBMS there is a provision for sorting data values using the ORDER BY clause in SQL. HBase, on the contrary, lacks such a direct sorting mechanism. Instead, HBase maintains each table sorted in ascending order of row key. In our example the data therefore appears sorted as PatientId1, PatientId2. This inherent architecture imposes the biggest challenge when sorting by some data value in ascending or descending order: if we want to sort patients by diagnose date in descending order, or by name in descending order, there is no direct means to achieve it.

The Secondary Index Solution:

The workaround for this issue is to use a secondary index. The idea is to maintain a separate column family called 'Index' inside the table. This column family will contain the index keys as its qualifiers, and each particular index will be accessed through one particular row key value.
The following illustration explains this in more detail.

If we have data in following manner:

PatientId 1, Person:Name Ram, Person:Age 30, Disease:Cold:DiagnoseDate 2-10-2010

PatientId 1, Person:Name Ram, Person:Age 30, Disease:Fever:DiagnoseDate 12-15-2012

PatientId 2, Person:Name Sita, Person:Age 27, Disease:Fever:DiagnoseDate 5-11-2009

PatientId 3, Person:Name Hari, Person:Age 35, Disease:HeartProblem:DiagnoseDate 2-20-2011

PatientId 3, Person:Name Hari, Person:Age 35, Disease:ChestPain:DiagnoseDate 6-17-2010

PatientId 4, Person:Name Radha, Person:Age 32, Disease:Bronchitis:DiagnoseDate 2-20-2007

PatientId 5, Person:Name Shyam, Person:Age 19, Disease:Insomnia:DiagnoseDate 8-19-2011

then to sort this we first use MapReduce to read all the diagnose dates. Each date is converted to its long value (milliseconds since the epoch) and subtracted from Long.MAX_VALUE. The subtracted value is now an index for the diagnose dates in descending order, and it is added as a column qualifier to the 'Index' family. The value of the column is the row key identifier of the original data, that is, the PatientId concatenated with the disease name. This index is then stored under one specific row key, “PatientArrival”.
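
As a quick sanity check of why the subtraction inverts the order, consider two of the dates above (a standalone sketch using java.text.SimpleDateFormat; the exact long values depend on the JVM's time zone):

DateFormat f = new SimpleDateFormat("MM-dd-yyyy");
long older = f.parse("2-10-2010").getTime();  // smaller epoch milliseconds
long newer = f.parse("12-15-2012").getTime(); // larger epoch milliseconds
long olderKey = Long.MAX_VALUE - older;       // larger index key
long newerKey = Long.MAX_VALUE - newer;       // smaller index key
// Since HBase sorts qualifiers in ascending order, newerKey < olderKey
// means 12-15-2012 is listed before 2-10-2010, i.e. dates descend.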

The MapReduce job produces a secondary index of the form shown in the Result listing below.

Similarly, if we have to sort the data in the table by patient name, we employ a MapReduce job that reads Person:Name and creates an index key by performing bit inversion on it.

Implementation of Date sort:

a) The MapReduce job was fed the following scanner:
private Scan configureScan()
{
    String family = "Disease";
    Scan s = new Scan();
    s.addFamily(Bytes.toBytes(family));
    // Only match those qualifiers ending with "DiagnoseDate"
    Filter filter = new QualifierFilter(CompareFilter.CompareOp.EQUAL,
            new RegexStringComparator("DiagnoseDate$"));
    s.setFilter(filter);
    return s;
}
Explanation: Here we scan only those tuples inside the column family Disease whose qualifier ends in 'DiagnoseDate'. Since the column family Disease is dynamic, we resort to a regular-expression filter to glean only the necessary data.

b) Implementation of the Map class: the Map class is implemented as follows:

public static class Map extends TableMapper<Text, Text>
{
    @Override
    public void map(ImmutableBytesWritable row, Result result, Context context) throws IOException
    {
        try
        {
            List<KeyValue> list = result.list();
            for (KeyValue kv : list)
            {
                String rowkey = Bytes.toString(kv.getRow());
                String diseasedate = Bytes.toString(kv.getValue());
                // MM is months; the lowercase mm pattern would parse minutes
                DateFormat formatter = new SimpleDateFormat("MM-dd-yyyy");
                Date date = formatter.parse(diseasedate);

                // Obtain the disease name from the dynamic qualifier as well
                String qualifiername = Bytes.toString(kv.getQualifier());
                StringTokenizer st = new StringTokenizer(qualifiername, ":");
                String diseasename = st.nextToken();

                // Subtract from Long.MAX_VALUE to put the dates in descending order
                long longval = date.getTime();
                long maxlongval = Long.MAX_VALUE - longval; // descending sort key

                // Value will be originalrowkey-diseasename
                context.write(new Text("" + maxlongval), new Text(rowkey + "-" + diseasename));
            }
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
    }
}

Explanation: All the cells matched by the scanner are returned; from each, the diagnose date is extracted and converted to its equivalent long value. To place the data in descending order, the long value is subtracted from Long.MAX_VALUE to obtain the intermediate key, which is emitted together with its value pair to the Reduce class.

c) Implementation of the Reduce class
The reduce class is implemented in the following fashion:

public static class Reduce extends TableReducer<Text, Text, ImmutableBytesWritable>
{
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException
    {
        context.setStatus("Inside the reduce");
        String rowkeyforsecondaryindex = "PatientArrival";
        String family = "Index";
        for (Text secondaryindexvalue : values)
        {
            String columnqualifier = key.toString();
            Put put = new Put(Bytes.toBytes(rowkeyforsecondaryindex));
            put.add(Bytes.toBytes(family), Bytes.toBytes(columnqualifier),
                    Bytes.toBytes(secondaryindexvalue.toString()));

            context.write(new ImmutableBytesWritable(Bytes.toBytes(rowkeyforsecondaryindex)), put);
        }
    }
}

Explanation: The reduce class simply Puts each secondary-index key/value pair into the HBase table under column family 'Index' with row key 'PatientArrival'.

d) Invocation

The MapReduce job is invoked using the following code snippet:

String tablename = "NewPatientInfo";
Scan s = configureScan(); // the configureScan() defined in step a)
Job job = new Job(getHbaseConfiguration(), "This is my HbaseMapReduce Secondary Index Generation");
job.setJobName("SecondaryIndexGenerator");
job.setJarByClass(SecondaryIndexGenerator.class);

TableMapReduceUtil.initTableMapperJob(tablename, s, Map.class, Text.class, Text.class, job);

TableMapReduceUtil.initTableReducerJob(tablename, Reduce.class, job);

job.waitForCompletion(true); // submit the job and block until it finishes

Result: After running the job, a new secondary index mapped by row key 'PatientArrival' is generated:

hbase(main):013:0> get 'NewPatientInfo', 'PatientArrival'
COLUMN                CELL
Index:9223370710250   timestamp=1329191331200, value=patient1-Fever
Index:9223370741355   timestamp=1329191331202, value=patient3-HeartProblem
Index:9223370741441   timestamp=1329191331202, value=patient5-Insomnia
Index:9223370773150   timestamp=1329191331202, value=patient3-ChestPain
Index:9223370773755   timestamp=1329191331203, value=patient1-Cold
Index:9223370805204   timestamp=1329191331203, value=patient2-Fever
Index:9223370867585   timestamp=1329191331203, value=patient4-Bronchitis
7 row(s) in 0.0140 seconds

Note: In the above listing the index entries appear in descending order of diagnose date, with the PatientId row key as the value and the disease name conveniently concatenated to make the data readable.
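
Reading the whole index back in date-descending order is then a single GET on the 'PatientArrival' row, since HBase returns the qualifiers in ascending order. A minimal sketch, assuming the same client API as the rest of this post:

HTable table = new HTable(HBaseConfiguration.create(), "NewPatientInfo");
Result indexRow = table.get(new Get(Bytes.toBytes("PatientArrival")));
// KeyValues arrive sorted by qualifier, i.e. most recent diagnose date first
for (KeyValue kv : indexRow.list())
{
    System.out.println(Bytes.toString(kv.getValue())); // e.g. patient1-Fever
}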

Implementing String sort:

Implementing a string-based descending sort is very similar. For instance, when we want to sort the table by patient name in descending order, only the scanner and the Mapper class need to change.

I) Scanner to be fed into the Mapper class:

private Scan configureScan()
{
    String family = "Person";
    String qualifier = "Name";
    Scan s = new Scan();
    s.addColumn(Bytes.toBytes(family), Bytes.toBytes(qualifier));
    return s;
}

Explanation: Since Person is a static family we do not have to worry about its qualifier containing data in itself, as was the case with Disease. So all we have to issue is a general scan command with the family:qualifier combination Person:Name.

II) Implementation of the Mapper class:

public static class Map extends TableMapper<Text, Text>
{
    @Override
    public void map(ImmutableBytesWritable row, Result result, Context context) throws IOException
    {
        try
        {
            List<KeyValue> list = result.list();
            for (KeyValue kv : list)
            {
                String rowkey = Bytes.toString(kv.getRow());
                String personname = Bytes.toString(kv.getValue());
                // Performing bit inversion on every character of the person's name
                StringBuilder bitreversedperson = new StringBuilder();
                for (int i = 0; i < personname.length(); i++)
                {
                    bitreversedperson.append(Integer.toString(personname.charAt(i) ^ 0xFF, 16));
                }

                // Value will be originalrowkey-personname
                context.write(new Text(bitreversedperson.toString()), new Text(rowkey + "-" + personname));
                // In a previous version we had used the Reporter class to increment the counter
                context.getCounter(Counters.LINES).increment(1);
            }
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
    }
}

Explanation: In the above code snippet, to generate the descending-order index on the person's name, all one has to do is invert the bits of every character in the string, which is accomplished by the following loop:

for (int i = 0; i < personname.length(); i++)
{
    bitreversedperson.append(Integer.toString(personname.charAt(i) ^ 0xFF, 16));
}
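
For instance, 'Ram' maps to ad9e92: 'R' is 0x52 and 0x52 ^ 0xFF = 0xAD, 'a' is 0x61 giving 0x9E, and 'm' is 0x6D giving 0x92. Because a larger character code yields a smaller inverted value, names later in the alphabet receive smaller index keys and therefore sort first.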

The bit-inverted value is then stored in HBase under the row key “PatientName” by the following reduce class:

public static class Reduce extends TableReducer<Text, Text, ImmutableBytesWritable>
{
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException
    {
        context.setStatus("Inside the reduce");
        String rowkeyforsecondaryindex = "PatientName";
        String family = "Index";
        for (Text patientrowkey : values)
        {
            String columnqualifier = key.toString();
            Put put = new Put(Bytes.toBytes(rowkeyforsecondaryindex));
            put.add(Bytes.toBytes(family), Bytes.toBytes(columnqualifier),
                    Bytes.toBytes(patientrowkey.toString()));

            context.write(new ImmutableBytesWritable(Bytes.toBytes(rowkeyforsecondaryindex)), put);
        }
    }
}

Result:

hbase(main):003:0> get 'NewPatientInfo', 'PatientName'
COLUMN                CELL
Index:ac968b9e        timestamp=1329198392197, value=patient2-Sita
Index:ac97869e92      timestamp=1329198392198, value=patient5-Shyam
Index:ad9e92          timestamp=1329198392198, value=patient1-Ram
Index:ad9e9b979e      timestamp=1329198392198, value=patient4-Radha
Index:b79e8d96        timestamp=1329198392198, value=patient3-Hari
5 row(s) in 0.0260 seconds

The result shows the index sorted in descending order of the patients' names.

Limitation: HBase schema design is based on one simple principle, 'anything goes'. But the biggest stumbling block is certainly key design. A meticulous reader will detect the inherent flaws in the above column qualifier key design of the secondary index, which are as follows:

I) Column Key Collision
II) Widening of table

I) Column Key Collision: This occurs when two people have the same name, or two different people visit the hospital on the same date. In either case both records map to exactly the same column qualifier key, so one overwrites the other. To solve this problem it is better to concatenate the original row key with the index key. For instance, the bit-inversion key of patient Ram is “ad9e92”; so instead of generating the column “Index:ad9e92” to index the data, generate “Index:ad9e92-patient1” as the column key.

II) Widening of the Table: With every index entry added, the table gets wider, since the secondary index is stored column-wise. For efficient access it is suggested that an HBase table be tall and narrow. For this purpose a partial-key technique should be employed: instead of adding all the index entries inside one monolithic row such as “PatientArrival” or “PatientName”, create a separate row for each entry, where the column qualifier replaces the row key, as sketched below. For example, a separate row with row key ad9e92-patient1 is inserted into HBase.
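
A minimal sketch of what the reducer's Put would look like under this partial-key scheme (indexTable and the fixed qualifier 'patient' are hypothetical placeholders; any constant qualifier works, because each entry now owns its own row):

// Each index entry becomes its own row; the old column qualifier becomes the row key
Put put = new Put(Bytes.toBytes("ad9e92-patient1"));
put.add(Bytes.toBytes("Index"), Bytes.toBytes("patient"), Bytes.toBytes("patient1-Ram"));
indexTable.put(put);

// Reading the index back is then an ordinary scan: rows arrive in ascending
// key order, which is descending alphabetical order of the original names
ResultScanner scanner = indexTable.getScanner(new Scan());
for (Result r : scanner)
{
    System.out.println(Bytes.toString(r.getRow())); // e.g. ac968b9e-patient2
}
scanner.close();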

Conclusion: Though sorting in HBase is not as straightforward as in an RDBMS, it is not impossible to achieve. A fine balance of frequent MapReduce jobs and meticulous schema design will accomplish it. For further details I suggest that readers go through the brilliant book “HBase: The Definitive Guide” by Lars George.
