
I have a big database made of several JSON files. They contain tweets, and each tweet looks like this:

{"object": "Message",
"action": "create",
"data": {"id": 152376374, "body": "@PeakySellers right there with you brother. 368$", "created_at": "2019-01-31T23:59:56Z",
"user": {"id": 971815, "username": "mtothe5thpower"}}
}

One file has 3 million rows and is more than 5 GB in size. I use pandas to read the file, and it works well:

data2 = pd.read_table('file', sep="\n", header=None)

Now each row of the resulting DataFrame holds one element (a tweet like the one above) as a string. To access the fields, I convert each string to a dictionary using the code below:

data2["dic"] = ""
for i, row in data2.itertuples():
    data2["dic"][i] = json.loads(data2[0][i])

While this code successfully converts each string to a dictionary, it is very slow. I think there should be a faster way for this task. Thank you in advance for any help or suggestions.

  • This doesn't answer your question but you should enumerate like for i, row in enumerate(data2.itertuples()):. That will save you from having to manage i yourself. Commented May 7, 2020 at 18:16
  • Thank you, Daniel. I am learning. Commented May 7, 2020 at 18:17
  • No problem! That's why we're all here. =) Commented May 7, 2020 at 18:18
  • Is every json string on separate line? Commented May 7, 2020 at 18:24
  • @AndrejKesely Yes. Every line is a separate json. Commented May 7, 2020 at 18:27
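To illustrate the enumerate() suggestion from the comment above, here is a minimal, self-contained sketch; the two-row DataFrame is made up for demonstration, standing in for the real one produced by pd.read_table:

```python
import json
import pandas as pd

# Tiny stand-in for the real DataFrame produced by
# pd.read_table(..., sep="\n", header=None): one JSON string per row.
data2 = pd.DataFrame({0: ['{"id": 1, "body": "first"}',
                          '{"id": 2, "body": "second"}']})

# enumerate() supplies the row counter i, so there is no need to manage
# it by hand; collecting results into a list avoids per-cell writes.
dics = []
for i, row in enumerate(data2.itertuples(index=False)):
    dics.append(json.loads(row[0]))

data2["dic"] = dics
```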

1 Answer

You could load the huge file at once using readlines(), join the lines into one huge JSON array string, and decode it with a single json.loads() call.

Benchmark (using a file with 100k JSON rows):

import json
import pandas as pd
from timeit import timeit

def f1():
    # Original approach: parse each row separately,
    # writing into the DataFrame one cell at a time.
    data2 = pd.read_table('huge.txt', sep="\n", header=None)

    data2['dic'] = ''

    for i in range(len(data2[0])):
        data2["dic"][i] = json.loads(data2[0][i])

    return data2

def f2():
    # Faster approach: read all lines, wrap them in '[...]' to form one
    # big JSON array, and decode everything with a single json.loads().
    with open('huge.txt', 'r') as r:
        l = r.readlines()

    s = '[' + ','.join(l) + ']'

    data = json.loads(s)

    return pd.DataFrame({'dic': data})

t1 = timeit(lambda: f1(), number=1)
t2 = timeit(lambda: f2(), number=1)

print(t1)
print(t2)

Prints on my machine (AMD 2400G, Python 3.8):

102.88191884600383
0.30581573898962233

So this method seems a lot faster.
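For completeness, the same result can also be reached without building the intermediate string at all: parse each line with a list comprehension, or let pandas read newline-delimited JSON directly via pd.read_json(..., lines=True) (which expands the JSON keys into columns rather than keeping one dict per row). A minimal sketch, with a made-up two-line sample file standing in for huge.txt; relative speed will depend on the pandas version:

```python
import json
import pandas as pd

# Two-line JSON-lines sample standing in for the real 'huge.txt'.
with open('sample.txt', 'w') as w:
    w.write('{"id": 1, "body": "first"}\n')
    w.write('{"id": 2, "body": "second"}\n')

# Variant A: one dict per row, like f2() but without joining a giant string.
with open('sample.txt') as r:
    data = pd.DataFrame({'dic': [json.loads(line) for line in r]})

# Variant B: pandas parses the newline-delimited JSON itself and
# expands the keys into ordinary columns.
flat = pd.read_json('sample.txt', lines=True)
```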


1 Comment

Thank you @Andrej. I used your code on a smaller file, and it worked much faster than the initial code. Nice job.
