
In my application there are four tables, each with more than a million rows. Currently, my Java-based reporting engine joins all the tables and fetches the data to show in reports.

Now I want to introduce Hadoop using Sqoop. I have installed Hadoop 2.2 and Sqoop 1.9.

I have done a small POC to import the data into HDFS. The problem is that it creates a new data file on every import.

My requirement is:

There will be a scheduler that runs once a day, and it will:

  1. Pick the data from all four tables and load it into HDFS using Sqoop.
  2. Pig will do some transformation and joining on the data and prepare the final denormalized data.
  3. Sqoop will then export this data into a separate reporting table (a rough sketch of this flow is shown below).
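
For reference, here is a rough sketch of what I have in mind for the daily job, assuming the Sqoop 1 command-line client and a MySQL source; every connection string, table name, column name and HDFS path below is just a placeholder:

    #!/bin/bash
    # Rough sketch of the planned once-a-day batch job (run from cron or a
    # similar scheduler). All names and paths are placeholders.

    # 1. Full import of one source table into HDFS
    #    (a similar command would be repeated for the other three tables).
    sqoop import \
      --connect jdbc:mysql://dbhost/appdb \
      --username report \
      --password-file /user/report/.dbpass \
      --table MASTER_TABLE \
      --target-dir /staging/master

    # 2. Join and denormalize the four data sets with a Pig script.
    pig -f denormalize.pig

    # 3. Export the denormalized result to the reporting table.
    sqoop export \
      --connect jdbc:mysql://dbhost/appdb \
      --username report \
      --password-file /user/report/.dbpass \
      --table REPORTING_DENORM \
      --export-dir /staging/denormalized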

I have a few questions around this:

  1. Do I need to import the whole data set from the DB to HDFS on every Sqoop import call?
  2. In the master table some rows are updated and some are new, so how can I handle that, i.e. merge the data while loading into HDFS?
  3. At the time of export, do I need to export the whole data set again to the reporting table? If yes, how would I do that?

Please help me out with this, and suggest a better solution if you have one.

1 Answer


Sqoop supports incremental and delta imports. Check the Sqoop documentation for more details.
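
For example, a saved Sqoop job keeps the incremental state in Sqoop's metastore between runs, so only new or changed rows are fetched each day. The following is a hedged sketch assuming the Sqoop 1 command-line client; the job name, connection string, table, columns and paths are placeholders:

    # Hypothetical saved job: after each run the metastore updates the stored
    # --last-value, so the next execution only imports rows modified since the
    # previous run. --merge-key reconciles updated rows with the data already
    # sitting in the target directory instead of creating a new file set.
    sqoop job --create master_incremental -- import \
      --connect jdbc:mysql://dbhost/appdb \
      --username report \
      --password-file /user/report/.dbpass \
      --table MASTER_TABLE \
      --target-dir /staging/master \
      --incremental lastmodified \
      --check-column last_updated \
      --merge-key master_id

    # Executed once a day by the scheduler.
    sqoop job --exec master_incremental

In general, --incremental append (with a monotonically increasing id as the check column) suits insert-only tables, while --incremental lastmodified (with a timestamp check column) handles tables whose existing rows are also updated.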
