I am trying to merge the data from multiple JSON files into one DataFrame before performing any operations on it. Let's say I have two files, file1.txt and file2.txt, which contain data like this:

file1.txt

{"a":1 , "b":"abc", "c":"abc2", "d":"abc3"}

file2.txt

{"a":1 , "b":"abc", "c":"abc2", "d":"abc3"}

So I am reading both files one by one like this:

range = ["file1","file2"]
for r in range:
    df = spark.read.json(r)
df.groupby("b","c","d").agg(f.sum(df["a"]))

But df is overwritten on each iteration, so only the data from the second file is left. How can I concatenate these DataFrames? Thanks in advance!

1 Answer

You need to union the DataFrames instead of overwriting the df variable. For example:

>>> from functools import reduce
>>> dataframes = map(lambda r: spark.read.json(r), range)
>>> union = reduce(lambda df1, df2: df1.unionAll(df2), dataframes)

The code above maps each file in the range list to a corresponding DataFrame and unions them all into one.
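
For reference, here is a minimal self-contained sketch of the same approach, assuming Spark 2.0+ (where union() supersedes the deprecated unionAll()) and that each .txt file holds one JSON object per line; the SparkSession setup and file paths are illustrative:

from functools import reduce

import pyspark.sql.functions as f
from pyspark.sql import SparkSession

# Illustrative setup; in a pyspark shell or notebook `spark` already exists.
spark = SparkSession.builder.appName("merge-json-files").getOrCreate()

paths = ["file1.txt", "file2.txt"]

# One DataFrame per file, then union them pairwise into a single DataFrame.
dataframes = [spark.read.json(p) for p in paths]
merged = reduce(lambda df1, df2: df1.union(df2), dataframes)

# Shortcut: spark.read.json also accepts a list of paths directly.
# merged = spark.read.json(paths)

# The aggregation from the question now runs over rows from both files.
merged.groupby("b", "c", "d").agg(f.sum(merged["a"])).show()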

1 Comment

Thanks for the quick reply. It is working perfectly fine.
