
I'm working with files whose lines have varying schemas, so I need to parse each line and decide what to do based on its content. This requires me to write files to HDFS line by line.

Is there a way to achieve that in Python?

  • PySpark writes a DataFrame, but a DataFrame typically has one common schema as a whole. Writing "line by line" to HDFS doesn't work that well, because HDFS is not meant to be used for file appends. Commented Feb 8, 2018 at 13:49

1 Answer


You can get IOUtils through sc._gateway.jvm and use it to stream from one Hadoop (or local) file to a file on Hadoop.

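# Assumes an active SparkContext named sc.
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration
IOUtils = sc._gateway.jvm.org.apache.hadoop.io.IOUtils
# Build one Hadoop configuration and reuse it for the file system and the copy.
conf = Configuration()
fs = FileSystem.get(conf)
# Open the source file and create the destination file.
f = fs.open(Path("/user/test/abc.txt"))
output_stream = fs.create(Path("/user/test/a1.txt"))
# Copy all bytes; this overload closes both streams when it finishes.
IOUtils.copyBytes(f, output_stream, conf)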
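
The snippet above copies a whole file, while the question asks about deciding per line. A minimal sketch of that per-line routing might look like the following. It is plain Python and does not touch HDFS: `schema_key`, `route_lines`, and `open_sink` are hypothetical names, and treating "same number of fields" as "same schema" is only an assumption for illustration.

```python
import io


def schema_key(line, delimiter=","):
    """Hypothetical rule: lines with the same field count share a schema."""
    return len(line.rstrip("\n").split(delimiter))


def route_lines(lines, open_sink, delimiter=","):
    """Write each line to the sink that belongs to its schema.

    open_sink(key) is called once per new schema key and must return an
    object with a write(str) method. Returns the sinks and per-key counts.
    """
    sinks = {}
    counts = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        key = schema_key(line, delimiter)
        if key not in sinks:
            sinks[key] = open_sink(key)
        # Make sure every written line ends with a newline.
        sinks[key].write(line if line.endswith("\n") else line + "\n")
        counts[key] = counts.get(key, 0) + 1
    return sinks, counts


# Demo with in-memory buffers standing in for output files.
buffers = {}


def open_buffer(key):
    buffers[key] = io.StringIO()
    return buffers[key]


sample = ["a,b,c\n", "1,2\n", "d,e,f\n", "\n", "3,4"]
sinks, counts = route_lines(sample, open_buffer)
```

Keeping one open sink per schema avoids reopening a file for every line, which matters on HDFS given the comment about appends. To target HDFS, `open_sink` could return a small wrapper around the stream from `fs.create(Path(...))` that encodes each line to bytes before writing; that wrapper is not shown here, and each stream would need to be closed once all lines are processed.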