In my application there are four tables, and each table has more than one million rows.
Currently my Java-based reporting engine joins all the tables and fetches the data to show in reports.
Now I want to introduce Hadoop using Sqoop. I have installed Hadoop 2.2 and Sqoop 1.9.
I have done a small POC to import the data into HDFS. The problem is that every time it runs, it creates a new data file.
What I need is a scheduler that will run once a day and will:
- Pick up the data from all four tables and load it into HDFS using Sqoop.
- Run Pig to transform and join the data and prepare a single denormalized data set.
- Use Sqoop again to export this data into a separate reporting table. (A rough sketch of this pipeline is given below.)
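This is roughly the daily job I have in mind. The commands are only a sketch: they assume the Sqoop 1.4-style command line (Sqoop 1.99/Sqoop2 uses a different job/link model), a MySQL source database, and placeholder names such as reportdb, master_table, denormalize.pig, and reporting_table.

    #!/bin/bash
    # Daily job: stage the four source tables, denormalize with Pig, export the result.
    # Connection details, table names, and paths are all placeholders.

    for TABLE in master_table detail_table lookup_table status_table; do
        sqoop import \
            --connect jdbc:mysql://dbhost/reportdb \
            --username report_user --password-file /user/etl/db.password \
            --table "$TABLE" \
            --target-dir /staging/"$TABLE" \
            --delete-target-dir        # full reload every day for now; see my questions below
    done

    # Join and denormalize the staged tables into one output directory.
    pig -f denormalize.pig

    # Push the denormalized result back into the reporting table.
    sqoop export \
        --connect jdbc:mysql://dbhost/reportdb \
        --username report_user --password-file /user/etl/db.password \
        --table reporting_table \
        --export-dir /output/denormalized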
I have a few questions around this:

- Do I need to import the whole data set from the DB into HDFS on every Sqoop import call?
- In the master table some rows are updated and some rows are new; how can I handle that by merging the data while loading it into HDFS?
- At the time of export, do I need to export the whole data set again to the reporting table? If yes, how would I do that? (My current thinking is sketched after these questions.)
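For the last two questions, this is the incremental approach I am considering, but I am not sure it is right. It assumes the same Sqoop 1.4-style client, that the master table has a primary key id and a last_modified timestamp column, and that the target database connector supports upsert-style exports; all names below are placeholders.

    # Incremental import: fetch only rows added or changed since the last run and
    # merge them into the existing HDFS files on the primary key.
    sqoop import \
        --connect jdbc:mysql://dbhost/reportdb \
        --username report_user --password-file /user/etl/db.password \
        --table master_table \
        --target-dir /staging/master_table \
        --incremental lastmodified \
        --check-column last_modified \
        --last-value "2014-01-01 00:00:00" \
        --merge-key id

    # Incremental export: update rows that already exist in the reporting table and
    # insert the ones that do not, instead of reloading everything.
    sqoop export \
        --connect jdbc:mysql://dbhost/reportdb \
        --username report_user --password-file /user/etl/db.password \
        --table reporting_table \
        --export-dir /output/denormalized \
        --update-key id \
        --update-mode allowinsert

If this is the right direction, my understanding is that a saved Sqoop job (sqoop job --create ...) would keep track of --last-value between runs, but I have not verified that.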
Please help me out here, and suggest a better solution if you have one.