I am trying to merge the data from multiple JSON files into one DataFrame before performing any operations on it. Let's say I have two files, file1.txt and file2.txt, which contain data like this:

file1.txt

{"a":1 , "b":"abc", "c":"abc2", "d":"abc3"}

file2.txt

{"a":1 , "b":"abc", "c":"abc2", "d":"abc3"}

So I am reading both files one by one like this:

range = ["file1","file2"]
for r in range:
    df = spark.read.json(r)
df.groupby("b","c","d").agg(f.sum(df["a"]))

But df is overwritten on each iteration, so only the data from the second file is left. How can I concatenate these DataFrames? Thanks in advance!

1 Answer

You need to union the DataFrames instead of overwriting the df variable. For example:

>>> from functools import reduce
>>> dataframes = map(lambda r: spark.read.json(r), range)
>>> union = reduce(lambda df1, df2: df1.unionAll(df2), dataframes)

The code above maps each file in the range list to a corresponding DataFrame and unions them all into one.
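
For reference, here is a minimal self-contained sketch of the same approach, assuming Spark 2.0+ (where union() supersedes the deprecated unionAll()) and that each .txt file holds one JSON object per line; the SparkSession setup and file paths are illustrative:

from functools import reduce

import pyspark.sql.functions as f
from pyspark.sql import SparkSession

# Illustrative setup; in a pyspark shell or notebook `spark` already exists.
spark = SparkSession.builder.appName("merge-json-files").getOrCreate()

paths = ["file1.txt", "file2.txt"]

# One DataFrame per file, then union them pairwise into a single DataFrame.
dataframes = [spark.read.json(p) for p in paths]
merged = reduce(lambda df1, df2: df1.union(df2), dataframes)

# Shortcut: spark.read.json also accepts a list of paths directly.
# merged = spark.read.json(paths)

# The aggregation from the question now runs over rows from both files.
merged.groupby("b", "c", "d").agg(f.sum(merged["a"])).show()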

1 Comment

Thanks for the quick reply. It is working perfectly fine.
