In my application there are four tables, and each table has more than one million rows.
Currently my Java-based reporting engine joins all the tables and fetches the data to show in reports.
Now I want to introduce Hadoop using Sqoop. I have installed Hadoop 2.2 and Sqoop 1.9.
I have done a small POC to import the data into HDFS. The problem is that every time it runs, it creates a new data file.
What I need is a scheduler that will run once a day and will:
- Pick up the data from all four tables and load it into HDFS using Sqoop.
- Run Pig to transform and join the data and prepare a single denormalized data set.
- Use Sqoop again to export this data into a separate reporting table. (A rough sketch of this pipeline is given below.)
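This is roughly the daily job I have in mind. The commands are only a sketch: they assume the Sqoop 1.4-style command line (Sqoop 1.99/Sqoop2 uses a different job/link model), a MySQL source database, and placeholder names such as reportdb, master_table, denormalize.pig, and reporting_table.

    #!/bin/bash
    # Daily job: stage the four source tables, denormalize with Pig, export the result.
    # Connection details, table names, and paths are all placeholders.

    for TABLE in master_table detail_table lookup_table status_table; do
        sqoop import \
            --connect jdbc:mysql://dbhost/reportdb \
            --username report_user --password-file /user/etl/db.password \
            --table "$TABLE" \
            --target-dir /staging/"$TABLE" \
            --delete-target-dir        # full reload every day for now; see my questions below
    done

    # Join and denormalize the staged tables into one output directory.
    pig -f denormalize.pig

    # Push the denormalized result back into the reporting table.
    sqoop export \
        --connect jdbc:mysql://dbhost/reportdb \
        --username report_user --password-file /user/etl/db.password \
        --table reporting_table \
        --export-dir /output/denormalized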
I have a few questions around this:

- Do I need to import the whole data set from the DB into HDFS on every Sqoop import call?
- In the master table some rows are updated and some rows are new; how can I handle that by merging the data while loading it into HDFS?
- At the time of export, do I need to export the whole data set again to the reporting table? If yes, how would I do that? (My current thinking is sketched after these questions.)
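For the last two questions, this is the incremental approach I am considering, but I am not sure it is right. It assumes the same Sqoop 1.4-style client, that the master table has a primary key id and a last_modified timestamp column, and that the target database connector supports upsert-style exports; all names below are placeholders.

    # Incremental import: fetch only rows added or changed since the last run and
    # merge them into the existing HDFS files on the primary key.
    sqoop import \
        --connect jdbc:mysql://dbhost/reportdb \
        --username report_user --password-file /user/etl/db.password \
        --table master_table \
        --target-dir /staging/master_table \
        --incremental lastmodified \
        --check-column last_modified \
        --last-value "2014-01-01 00:00:00" \
        --merge-key id

    # Incremental export: update rows that already exist in the reporting table and
    # insert the ones that do not, instead of reloading everything.
    sqoop export \
        --connect jdbc:mysql://dbhost/reportdb \
        --username report_user --password-file /user/etl/db.password \
        --table reporting_table \
        --export-dir /output/denormalized \
        --update-key id \
        --update-mode allowinsert

If this is the right direction, my understanding is that a saved Sqoop job (sqoop job --create ...) would keep track of --last-value between runs, but I have not verified that.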
Please help me out here, and suggest a better solution if you have one.