Blog do projektu Open Source JavaHotel

środa, 10 maja 2017

Sqoop, Hive, load data incrementally

Introduction
Hive is a popular, SQL-like engine over HDFS data and Sqoop is a tool to transfer data from external RDBMS tables into HDFS. Sqoop simply runs SELECT query against RDBS table and the result is stored in HDFS or as a Hive table directly. After the first loading, the effective way to keep tables synchronized is to update Hive table incrementally in order to avoid moving all data again and again. Theoretically,  the task is simple. Assuming that external table has a primary key and source data are not updated or deleted, take the greatest key already inserted into Hive table and transfer only rows whose primary keys are greater than this threshold.
There is also an additional requirement. A very effective data format for Hive tables is Parquet but Sqoop can only create Hive tables in text format. There is --as-parquetfile Sqoop parameter but I failed to try to enable it for Hive tables.
Solution
The solution is uploaded here.
I decided to implement a two-hop solution. Firstly load delta rows into a staging table in text format using Sqoop and afterward insert rows into the target Parquet Hive table. The whole workflow can be described as follows:
  • Recognize if the target Hive table exists already. If yes, calculate the maximum value for the primary key.
  • Extract from external RDBMS table all rows with the primary key greater than maximum or the whole table if the Hive table does not exist yet. Store data into the staging table.
  • If the target Hive table does not exist, create the table in Parquet format. Execute Hive command "CREATE .. TABLE AS SELECT * FROM stage.table
  • If the target Hive table is created already, simply add new rows with command: INSERT INTO TABLE .. SELECT * FROM stage.table
The solution is implemented as Oozie workflow. Can be launched as a single Oozie task or as Oozie coordinator task. Sample shell scripts for both tasks are available here. common.properties file is used as a template for job.properties and coordinator.properties file.

Brak komentarzy:

Prześlij komentarz