How do I integrate HBase with Spark?

Spark cannot get all of the HBase data in certain columns - hadoop, apache-spark, mapreduce, hbase

My HBase table has 30 million records. Each record has one column, raw:sample, where raw is the column family and sample is the column qualifier. This column is very large, ranging from a few KB to 50 MB. When I run the following Spark code it can only get 40,000 records, but I should get 30 million:
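
In sketch form, such a TableInputFormat-based read looks roughly like this (a minimal sketch, not the exact snippet; the table name mytable is a placeholder, and raw:sample is the family:qualifier described above):

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("hbase-read"))

    // Point TableInputFormat at the table and the single large column
    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "mytable")      // table name: an assumption
    conf.set(TableInputFormat.SCAN_COLUMNS, "raw:sample")  // family:qualifier from the question

    // Each element is (row key, Result) for one HBase row
    val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])

    println(rdd.count())  // prints ~40,000 instead of the expected 30 million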

Right now I'm working around it by first getting the list of row keys with Spark, then iterating over that list inside a Spark foreach and fetching the column with the plain HBase Java client (see the sketch below). Any ideas why I can't get all of the columns from Spark? Is the column too big?
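
The workaround amounts to something like the following (a minimal sketch using foreachPartition so each executor opens one HBase connection; mytable and string row keys are assumptions):

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.spark.rdd.RDD

    def fetchColumn(rowKeys: RDD[String]): Unit =
      rowKeys.foreachPartition { keys =>
        // One connection per partition, created on the executor
        val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
        val table = connection.getTable(TableName.valueOf("mytable"))  // assumption
        keys.foreach { key =>
          val get = new Get(Bytes.toBytes(key))
          get.addColumn(Bytes.toBytes("raw"), Bytes.toBytes("sample"))
          val value = table.get(get).getValue(Bytes.toBytes("raw"), Bytes.toBytes("sample"))
          // ... process the (possibly 50 MB) value here ...
        }
        table.close()
        connection.close()
      }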

A few days ago one of my ZooKeeper nodes and one of my DataNodes went down, but I fixed it soon after, and the replication factor is 3. Do you think rerunning the job would help? Thank you very much!

Replies:

Answer № 1 (1 vote):

TableInputFormat internally creates a Scan object to read the data from HBase.

Try creating a Scan object (without using Spark) configured to fetch the same column from HBase, to see whether the error repeats:
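
Something along these lines (a minimal sketch against the HBase 1.x client API; mytable is an assumption, and setCaching(1) keeps each RPC down to one of these very large rows):

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Scan}
    import org.apache.hadoop.hbase.util.Bytes

    val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = connection.getTable(TableName.valueOf("mytable"))  // assumption

    val scan = new Scan()
    scan.addColumn(Bytes.toBytes("raw"), Bytes.toBytes("sample"))
    scan.setCaching(1)  // one row per RPC, since a single cell can be 50 MB

    // Count every row the scanner returns; compare against the expected 30 million
    val scanner = table.getScanner(scan)
    var count = 0L
    var result = scanner.next()
    while (result != null) {
      count += 1
      result = scanner.next()
    }
    println(s"rows scanned: $count")

    scanner.close()
    table.close()
    connection.close()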

In addition, TableInputFormat is configured by default to request a very small block of data per RPC from the HBase server. Set the following to increase the block size:
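
One way to read "increase the block size" for a workload with 50 MB cells is to raise the per-RPC byte limit while keeping rows-per-RPC low; a sketch on the same Configuration passed to TableInputFormat (the numbers are assumptions):

    // Rows per scanner RPC; TableInputFormat reads this key (keep it low for huge cells)
    conf.set("hbase.mapreduce.scan.cachedrows", "1")

    // Upper bound on bytes returned per scanner RPC; raise it above the largest
    // cell so a 50 MB value fits in one response
    conf.set("hbase.client.scanner.max.result.size", "104857600")  // 100 MB, an assumption

    // Large cells make individual RPCs slow, so widen the timeouts as well
    conf.set("hbase.rpc.timeout", "120000")                        // 2 minutes
    conf.set("hbase.client.scanner.timeout.period", "120000")

With one row per RPC and a generous result-size cap, each response carries exactly one fat row, which avoids both oversized responses and scanner timeouts.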


Answer № 2 (1 vote):

For a high-throughput workload like yours, Apache Kafka is the best solution to integrate the data flow and keep the data pipeline alive. See http://kafka.apache.org/08/uses.html for some Kafka use cases.

One more: http://sites.computer.org/debull/A12june/pipeline.pdf