Writing files on Hadoop line by line using Python

Question

I'm working with files whose lines have varying schemas, so I need to parse each line and decide what to do with it based on its content. That means I have to write files to HDFS line by line.

Is there a way to achieve that in Python?

PySpark writes a DataFrame, but a DataFrame typically has one common schema as a whole. Writing "line by line" to HDFS doesn't work that well because HDFS is not meant to be used for file appends.
– OneCricketeer, Commented Feb 8, 2018 at 13:49

Ged · Accepted Answer · 2019-07-20 14:33:38Z

4

You can get IOUtils through sc._gateway.jvm and use it to stream from one Hadoop (or local) file to a file on Hadoop.

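# Assumes sc is an existing SparkContext (for example, the one in the pyspark shell).
jvm = sc._gateway.jvm
Path = jvm.org.apache.hadoop.fs.Path
FileSystem = jvm.org.apache.hadoop.fs.FileSystem
Configuration = jvm.org.apache.hadoop.conf.Configuration
IOUtils = jvm.org.apache.hadoop.io.IOUtils

conf = Configuration()
fs = FileSystem.get(conf)
# Open the source file and create the destination file.
input_stream = fs.open(Path("/user/test/abc.txt"))
output_stream = fs.create(Path("/user/test/a1.txt"))
# Copy all bytes; this overload closes both streams when it finishes.
IOUtils.copyBytes(input_stream, output_stream, conf)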
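
The copy above moves a whole file at once. For the per-line decision the question describes, one possible approach is to keep a single output stream open per destination and write each line to the stream its content selects, so nothing has to be reopened for append. The sketch below is only an illustration: route_lines, classify_by_field_count, and the in-memory MemoryOutput stand-in are hypothetical names, and the field-count rule is an assumed example of a "schema" check. It runs locally without a cluster.

```python
import io


def route_lines(lines, classify, open_output):
    """Write each line to the output chosen by classify(line).

    open_output(name) must return an object with write(bytes) and close().
    Each output is opened once and stays open until all lines are processed.
    Returns a dict mapping output name to the number of lines written.
    """
    streams = {}
    counts = {}
    try:
        for line in lines:
            name = classify(line)
            if name not in streams:
                streams[name] = open_output(name)
            # Normalize the line ending and write the line as UTF-8 bytes.
            streams[name].write((line.rstrip("\n") + "\n").encode("utf-8"))
            counts[name] = counts.get(name, 0) + 1
    finally:
        for stream in streams.values():
            stream.close()
    return counts


# Hypothetical rule: the number of comma-separated fields picks the output.
def classify_by_field_count(line):
    return "fields_%d.txt" % len(line.split(","))


# Local stand-in for an HDFS output stream so the sketch runs anywhere.
class MemoryOutput(io.BytesIO):
    def close(self):
        self.data = self.getvalue()  # keep the contents readable after close
        super().close()


outputs = {}


def open_in_memory(name):
    outputs[name] = MemoryOutput()
    return outputs[name]


counts = route_lines(
    ["a,b", "c,d,e", "f,g"], classify_by_field_count, open_in_memory
)
```

With the fs and Path objects from the answer, the factory could plausibly be something like lambda name: fs.create(Path("/user/test/" + name)), since the stream returned by fs.create also has write and close. That assumes py4j passes Python bytes to the Java write(byte[]) method, which has not been tested here.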

edited Jul 20, 2019 at 14:33 by Ged

answered Feb 9, 2018 at 12:31 by nevihs
