I'm trying to aggregate a bunch of CSV files into one and output it to S3 in ORC format using an ETL job in AWS Glue. My aggregated CSV looks like this:
header1,header2,header3
foo1,foo2,foo3
bar1,bar2,bar3
I have a string representation of the aggregated CSV, called aggregated_csv, whose content is header1,header2,header3\nfoo1,foo2,foo3\nbar1,bar2,bar3.
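For reference, as a Python literal that string looks like this:

aggregated_csv = 'header1,header2,header3\nfoo1,foo2,foo3\nbar1,bar2,bar3'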
I've read that PySpark has a straightforward way to read CSV files into DataFrames (which I need so that I can leverage Glue's ability to easily output ORC). Here is a snippet of what I've tried:
def f(glueContext, aggregated_csv, schema):
    # write the aggregated CSV string to a local file
    with open('somefile', 'a+') as agg_file:
        agg_file.write(aggregated_csv)
        #agg_file.seek(0)
        # attempt to read the just-written file back into a DataFrame
        df = glueContext.read.csv(agg_file, schema=schema, header="true")
        df.show()
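In case it helps, the schema I pass in is built roughly like this (a sketch; I'm treating every column as a string, and the field names just mirror my headers):

from pyspark.sql.types import StructType, StructField, StringType

# simple all-string schema matching the CSV headers
schema = StructType([
    StructField('header1', StringType(), True),
    StructField('header2', StringType(), True),
    StructField('header3', StringType(), True),
])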
I've tried it both with and without the seek() call. When I don't call seek(), the job completes successfully but df.show() displays only the headers and no data. When I do call seek(0), I get the following exception:
pyspark.sql.utils.AnalysisException: u'Path does not exist: hdfs://ip-172-31-48-255.us-west-2.compute.internal:8020/user/root/header1,header2,header3\n;'
Since seek() changes the behavior, and since the headers from my CSV appear in the exception string, I'm assuming the problem is somehow related to where the file cursor sits when I pass the file to glueContext.read.csv(), but I'm not sure how to resolve it. If I uncomment the seek(0) call and add an agg_file.read() call, I can see the entire contents of the file as expected (see the snippet at the end). What do I need to change so that I can successfully read a CSV file I've just written into a Spark DataFrame?
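For completeness, here is the read-back check I mentioned, i.e. the same function body with the debug lines in place (a sketch; the print is only there to confirm the file contents):

def f(glueContext, aggregated_csv, schema):
    with open('somefile', 'a+') as agg_file:
        agg_file.write(aggregated_csv)
        agg_file.seek(0)
        # this prints the full CSV (headers and both rows) as expected
        print(agg_file.read())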