I'm working with files in which the schema varies from line to line, so I need to parse each line, make a decision based on its contents, and write lines out to HDFS one at a time.
Is there a way to achieve that in Python?
You can use IOUtils from sc._gateway.jvm to stream from one Hadoop (or local) file to another file on HDFS:
# Grab the Hadoop classes through the Py4J gateway
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration
IOUtils = sc._gateway.jvm.org.apache.hadoop.io.IOUtils

fs = FileSystem.get(Configuration())
in_stream = fs.open(Path("/user/test/abc.txt"))
out_stream = fs.create(Path("/user/test/a1.txt"))
# copyBytes(in, out, conf) copies the whole stream and closes both ends
IOUtils.copyBytes(in_stream, out_stream, Configuration())
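Since the question is about making a per-line decision rather than a whole-file copy, here is a minimal sketch of the routing logic. The schema rule (field count) and the function name are made up for illustration; plain file-like objects stand in for the HDFS output streams, which in the gateway approach above you would instead obtain from fs.create(Path(...)) and write to with out_stream.write(line.encode()).

```python
import io

def route_lines(lines, writers):
    """Write each line to the writer keyed by its comma-separated field count.

    writers: dict mapping field count -> file-like object.
    Lines whose field count has no registered writer are skipped.
    """
    for line in lines:
        n_fields = len(line.rstrip("\n").split(","))
        writer = writers.get(n_fields)
        if writer is not None:
            writer.write(line if line.endswith("\n") else line + "\n")

# Example: two-field lines go to one stream, three-field lines to another
two_col, three_col = io.StringIO(), io.StringIO()
route_lines(["a,b", "a,b,c", "x,y"], {2: two_col, 3: three_col})
```

After the call, two_col holds the two-field lines and three_col the three-field ones; swapping the StringIO objects for HDFS output streams gives line-by-line writes to HDFS.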