You are better off using Sqoop: if you go down the path of building this yourself, you may well end up re-implementing exactly what Sqoop already does.
Either way, conceptually you will need a custom mapper backed by an input format that can read partitioned data from the source. To exploit parallelism you need a table column on which the data can be split; a partitioned source table would be ideal. A sketch of what such a setup could look like follows.
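As a rough sketch with the stock Hadoop classes, the job below uses DataDrivenDBInputFormat to split the read on a column instead of writing the slicing logic from scratch. The class names (PartitionedImportJob, OrderRecord), the orders table, the id/amount columns and the MySQL connection details are all placeholders for illustration, not anything from your setup.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;
import org.apache.hadoop.mapreduce.lib.db.DataDrivenDBInputFormat;

public class PartitionedImportJob {

  /** Minimal row holder for one record of the hypothetical "orders" table. */
  public static class OrderRecord implements Writable, DBWritable {
    long id;
    long amount;

    public void readFields(ResultSet rs) throws SQLException {
      id = rs.getLong("id");
      amount = rs.getLong("amount");
    }
    public void write(PreparedStatement st) throws SQLException {
      st.setLong(1, id);
      st.setLong(2, amount);
    }
    public void readFields(DataInput in) throws IOException {
      id = in.readLong();
      amount = in.readLong();
    }
    public void write(DataOutput out) throws IOException {
      out.writeLong(id);
      out.writeLong(amount);
    }
  }

  public static Job createJob() throws IOException {
    Configuration conf = new Configuration();
    // JDBC driver, URL and credentials of the source database (placeholders).
    DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
        "jdbc:mysql://dbhost/sales", "user", "password");

    Job job = Job.getInstance(conf, "partitioned-import");
    job.setJarByClass(PartitionedImportJob.class);

    // Split the table on the "id" column: the framework computes bounds on it
    // and hands each mapper a disjoint range instead of a LIMIT/OFFSET window.
    // This also sets DataDrivenDBInputFormat as the job's input format.
    DataDrivenDBInputFormat.setInput(job, OrderRecord.class,
        "orders",      // table name
        null,          // extra WHERE conditions (none here)
        "id",          // split-by column; ideally the partitioning column
        "id", "amount");

    job.setNumReduceTasks(0); // map-only copy; mapper and output settings go here
    return job;
  }
}
```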
DBInputFormat does not optimise the calls against the source database: the complete dataset is simply sliced into the configured number of splits by the InputFormat. Each mapper then executes the same query and loads only the portion of the data corresponding to its split, so every mapper re-issues the full query, including the sort of the dataset, just to pick out its own slice.
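To make that concrete, here is a simplified sketch (not the actual Hadoop source) of the kind of query the stock DBRecordReader ends up issuing per split: the full SELECT with its ORDER BY, narrowed only by LIMIT/OFFSET. The orders table and columns are made up for the example.

```java
/** Simplified sketch of the per-split query the stock DBRecordReader issues. */
public class SplitQuerySketch {
  static String buildSplitQuery(String table, String[] fields, String orderBy,
                                long splitStart, long splitLength) {
    StringBuilder q = new StringBuilder("SELECT ");
    q.append(String.join(", ", fields)).append(" FROM ").append(table);
    if (orderBy != null) {
      q.append(" ORDER BY ").append(orderBy); // full sort repeated by every mapper
    }
    return q.append(" LIMIT ").append(splitLength)   // rows in this split
            .append(" OFFSET ").append(splitStart)   // rows skipped to reach it
            .toString();
  }

  public static void main(String[] args) {
    // Split 4 of 4 over a 1,000,000-row table:
    System.out.println(buildSplitQuery("orders",
        new String[] {"id", "amount"}, "id", 750_000, 250_000));
    // -> SELECT id, amount FROM orders ORDER BY id LIMIT 250000 OFFSET 750000
  }
}
```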
DBInputFormat does not appear to take any advantage of a partitioned source table. You can extend it to handle partitioned tables more efficiently, for example along the lines of the sketch below.
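One possible direction, assuming you know the partition column and its values up front: extend DataDrivenDBInputFormat so that getSplits() emits one split per partition and let its record reader fold the partition predicate into the WHERE clause of the generated query. The class name, the order_date column and the partition values below are hypothetical; a real job would read them from configuration or the database catalogue.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;
import org.apache.hadoop.mapreduce.lib.db.DataDrivenDBInputFormat;

/**
 * Sketch: one input split per source-table partition, so each mapper issues a
 * query restricted to "its" partition instead of a LIMIT/OFFSET window.
 */
public class PartitionAwareDBInputFormat<T extends DBWritable>
    extends DataDrivenDBInputFormat<T> {

  // Hypothetical partition column and values; a real job would discover these.
  private static final String PARTITION_COLUMN = "order_date";
  private static final String[] PARTITIONS = {"2023-01", "2023-02", "2023-03"};

  @Override
  public List<InputSplit> getSplits(JobContext job) throws IOException {
    List<InputSplit> splits = new ArrayList<>();
    for (String partition : PARTITIONS) {
      // DataDrivenDBInputSplit carries lower/upper bound clauses that the
      // record reader ANDs into the WHERE clause of the generated query.
      // Using the same predicate for both bounds is redundant but harmless.
      String predicate = PARTITION_COLUMN + " = '" + partition + "'";
      splits.add(new DataDrivenDBInputFormat.DataDrivenDBInputSplit(
          predicate, predicate));
    }
    return splits;
  }
}
```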
On the storage side, Hadoop has structured file formats such as Avro, ORC and Parquet to choose from.
If your data does not need to be stored in a columnar format (columnar formats are aimed mainly at OLAP-style workloads, where only a few columns out of a large set are selected per query), go with Avro.
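As a minimal illustration of the row-oriented Avro model, the snippet below writes a couple of records to a local container file using Avro's generic API; inside a MapReduce job you would typically plug in AvroKeyOutputFormat instead. The Order schema and its fields are made up for the example.

```java
import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

/** Writes a couple of rows to a local Avro container file. */
public class AvroWriteExample {
  private static final String SCHEMA_JSON =
      "{\"type\":\"record\",\"name\":\"Order\",\"fields\":["
      + "{\"name\":\"id\",\"type\":\"long\"},"
      + "{\"name\":\"amount\",\"type\":\"double\"}]}";

  public static void main(String[] args) throws IOException {
    Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<>(schema))) {
      writer.create(schema, new File("orders.avro"));

      GenericRecord rec = new GenericData.Record(schema);
      rec.put("id", 1L);
      rec.put("amount", 42.5);
      writer.append(rec);

      rec = new GenericData.Record(schema);
      rec.put("id", 2L);
      rec.put("amount", 13.0);
      writer.append(rec);
    }
  }
}
```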