JSON extract to pandas dataframe

Question

I'm currently trying to process a json as pandas dataframe. What happened here is that I get a continuous stream of json structures. They are simply appended. It's a whole line. I extracted a .txt from it and want to analyse it now via pandas.

Example snippet:

{"positionFlightMessage":{"messageUuid":"95b3b6ca-5dd2-44b4-918a-baa51022d143","schemaVersion":"1.0-RC1","timestamp":1533134514,"flightNumber":"DLH1601","position":{"waypoint":{"latitude":44.14525,"longitude":-1.31849},"flightLevel":340,"heading":24.0},"messageSource":"ADSB","flightUniqueId":"AFR1601-1532928365-airline-0002","airlineIcaoCode":"AFR","atcCallsign":"AFR89GA","fuel":{},"speed":{"groundSpeed":442.0},"altitude":{"altitude":34000.0},"nextPosition":{"waypoint":{}},"messageSubtype":"ADSB"}}{"positionFlightMessage":{"messageUuid":"884708c1-2fff-4ebf-b72c-bbc6ed2c3623","schemaVersion":"1.0-RC1","timestamp":1533134515,"flightNumber":"DLH012","position":{"waypoint":{"latitude":37.34542,"longitude":143.79951},"flightLevel":320,"heading":54.0},"messageSource":"ADSB","flightUniqueId":"EVA12-1532928367-airline-0096","airlineIcaoCode":"DLH","atcCallsign":"EVA012","fuel":{},"speed":{"groundSpeed":462.0},"altitude":{"altitude":32000.0},"nextPosition":{"waypoint":{}},"messageSubtype":"ADSB"}}...

as you see in this light snipped is, that every json starts with {"positionFlightMessage": and ends with messageSubtype":"ADSB"

After a json ends, the next json just appends after it.

What i need is a table out of it, like this:

95b3b6ca-5dd2-44b4-918a-baa51022d143    1.0-RC1 1533134514  DLH1601 4.414.525   -131.849    340 24.0    ADSB    AFR1601-1532928365-airline-0002 AFR AFR89GA 442.0   34000.0 ADSB
884708c1-2fff-4ebf-b72c-bbc6ed2c3623    1.0-RC1 1533134515  DLH012  3.734.542   14.379.951  320 54.0    ADSB    EVA12-1532928367-airline-0096   DLH EVA012  462.0   32000.0 ADSB

i tried to use pandas read json but i get a error.

import pandas as pd

df = pd.read_json("tD.txt",orient='columns')

df.head()

ValueError: Trailing data

tD.txt has the above given snippet without the last (...) dots

I think the problem is, that every json is just appended. I could add a new line after every

messageSubtype":"ADSB"}}

and then read it, but maybe you have a solution where i can just convert the big txt file directly and convert it easily to a df

Looks like there is no separator between your json objects. Either read with lines=True or put the objects in a list and separate the objects with a comma. — user3471881
– user3471881, Commented Jan 9, 2019 at 9:35
So i need to iterate over the .txt file and add a "," after every ended json, correct? — Bernouy
– Bernouy, Commented Jan 9, 2019 at 9:38

NemoMeMeliorEst · Accepted Answer · 2019-01-09 09:27:03Z

Try to get the stream of json to output like the following:

Notice the starting '[' and the ending ']'. Also notice the ',' between each json input.

data = [{
    "positionFlightMessage": {
        "messageUuid": "95b3b6ca-5dd2-44b4-918a-baa51022d143",
        "schemaVersion": "1.0-RC1",
        "timestamp": 1533134514,
        "flightNumber": "DLH1601",
        "position": {
            "waypoint": {
                "latitude": 44.14525,
                "longitude": -1.31849
            },
            "flightLevel": 340,
            "heading": 24.0
        },
        "messageSource": "ADSB",
        "flightUniqueId": "AFR1601-1532928365-airline-0002",
        "airlineIcaoCode": "AFR",
        "atcCallsign": "AFR89GA",
        "fuel": {},
        "speed": {
            "groundSpeed": 442.0
        },
        "altitude": {
            "altitude": 34000.0
        },
        "nextPosition": {
            "waypoint": {}
        },
        "messageSubtype": "ADSB"
    }
}, {
    "positionFlightMessage": {
        "messageUuid": "884708c1-2fff-4ebf-b72c-bbc6ed2c3623",
        "schemaVersion": "1.0-RC1",
        "timestamp": 1533134515,
        "flightNumber": "DLH012",
        "position": {
            "waypoint": {
                "latitude": 37.34542,
                "longitude": 143.79951
            },
            "flightLevel": 320,
            "heading": 54.0
        },
        "messageSource": "ADSB",
        "flightUniqueId": "EVA12-1532928367-airline-0096",
        "airlineIcaoCode": "DLH",
        "atcCallsign": "EVA012",
        "fuel": {},
        "speed": {
            "groundSpeed": 462.0
        },
        "altitude": {
            "altitude": 32000.0
        },
        "nextPosition": {
            "waypoint": {}
        },
        "messageSubtype": "ADSB"
    }
}]

Now you should be able to loop over each 'list' element in the json and append it to the pandas df.

print(len(data))
for i in range(0,len(data)):
    #here is just show messageSource only. Up to you to find out the rest..
    print(data[i]['positionFlightMessage']['messageSource'])
    #instead of printing here you should append it to pandas df.

Hope this helps you out a bit.

Aditya · Accepted Answer · 2019-01-09 10:28:09Z

0

Now here's a solution for your JSON as is using regex.

s = '{"positionFlightMessage":{"messageUuid":"95b3b6ca-5dd2-44b4-918a-baa51022d143","schemaVersion":"1.0-RC1","timestamp":1533134514,"flightNumber":"DLH1601","position":{"waypoint":{"latitude":44.14525,"longitude":-1.31849},"flightLevel":340,"heading":24.0},"messageSource":"ADSB","flightUniqueId":"AFR1601-1532928365-airline-0002","airlineIcaoCode":"AFR","atcCallsign":"AFR89GA","fuel":{},"speed":{"groundSpeed":442.0},"altitude":{"altitude":34000.0},"nextPosition":{"waypoint":{}},"messageSubtype":"ADSB"}}{"positionFlightMessage":{"messageUuid":"884708c1-2fff-4ebf-b72c-bbc6ed2c3623","schemaVersion":"1.0-RC1","timestamp":1533134515,"flightNumber":"DLH012","position":{"waypoint":{"latitude":37.34542,"longitude":143.79951},"flightLevel":320,"heading":54.0},"messageSource":"ADSB","flightUniqueId":"EVA12-1532928367-airline-0096","airlineIcaoCode":"DLH","atcCallsign":"EVA012","fuel":{},"speed":{"groundSpeed":462.0},"altitude":{"altitude":32000.0},"nextPosition":{"waypoint":{}},"messageSubtype":"ADSB"}}'

import re
import json
replaced = json.loads('['+re.sub(r'{\"positionFlightMessage*', ',{\"positionFlightMessage', s)[1:] + ']')

dfTemp = pd.DataFrame(data=replaced)
df = pd.DataFrame()
counter = 0
def newDf(row):
  global df,counter
  counter += 1
  temp = pd.DataFrame([row])
  df = df.append(temp)
dfTemp['positionFlightMessage'] = dfTemp['positionFlightMessage'].apply(newDf)
print(df)

First we replace all occurrences of {"positionFlightMessage with ,{"positionFlightMessage and discard the first separator.
We create a dataframe out of this but we have only one column here. Use the apply function on the column and create a new dataframe out of it.
From this dataframe, you can perform some more cleaning.

answered Jan 9, 2019 at 10:28

Aditya

1,3771 gold badge9 silver badges19 bronze badges

6 Comments

Bernouy Over a year ago

Thanks for your solution. I am getting after replaced = json.loads...the following error when i try to load the full extract (1gb) : Expecting , delimiter: line 1 column 719443613

Aditya Over a year ago

@Bernouy Do the JSON objects always start with a {"positionFlightMessage?

Bernouy Over a year ago

yes it should start exactly like this. how can check what the problem is? how can i look into column 719443613 without open the file in a editor

Aditya Over a year ago

@Bernouy Try doing s[719443607:719443617]

Aditya Over a year ago

@Bernouy Well that is weird. Can you get a bigger chunk? 600-650?

|

Collectives™ on Stack Overflow

JSON extract to pandas dataframe

2 Answers 2

Comments

6 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

6 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related