1

I'm currently trying to process a json as pandas dataframe. What happened here is that I get a continuous stream of json structures. They are simply appended. It's a whole line. I extracted a .txt from it and want to analyse it now via pandas.

Example snippet:

{"positionFlightMessage":{"messageUuid":"95b3b6ca-5dd2-44b4-918a-baa51022d143","schemaVersion":"1.0-RC1","timestamp":1533134514,"flightNumber":"DLH1601","position":{"waypoint":{"latitude":44.14525,"longitude":-1.31849},"flightLevel":340,"heading":24.0},"messageSource":"ADSB","flightUniqueId":"AFR1601-1532928365-airline-0002","airlineIcaoCode":"AFR","atcCallsign":"AFR89GA","fuel":{},"speed":{"groundSpeed":442.0},"altitude":{"altitude":34000.0},"nextPosition":{"waypoint":{}},"messageSubtype":"ADSB"}}{"positionFlightMessage":{"messageUuid":"884708c1-2fff-4ebf-b72c-bbc6ed2c3623","schemaVersion":"1.0-RC1","timestamp":1533134515,"flightNumber":"DLH012","position":{"waypoint":{"latitude":37.34542,"longitude":143.79951},"flightLevel":320,"heading":54.0},"messageSource":"ADSB","flightUniqueId":"EVA12-1532928367-airline-0096","airlineIcaoCode":"DLH","atcCallsign":"EVA012","fuel":{},"speed":{"groundSpeed":462.0},"altitude":{"altitude":32000.0},"nextPosition":{"waypoint":{}},"messageSubtype":"ADSB"}}...

as you see in this light snipped is, that every json starts with {"positionFlightMessage": and ends with messageSubtype":"ADSB"

After a json ends, the next json just appends after it.

What i need is a table out of it, like this:

95b3b6ca-5dd2-44b4-918a-baa51022d143    1.0-RC1 1533134514  DLH1601 4.414.525   -131.849    340 24.0    ADSB    AFR1601-1532928365-airline-0002 AFR AFR89GA 442.0   34000.0 ADSB
884708c1-2fff-4ebf-b72c-bbc6ed2c3623    1.0-RC1 1533134515  DLH012  3.734.542   14.379.951  320 54.0    ADSB    EVA12-1532928367-airline-0096   DLH EVA012  462.0   32000.0 ADSB

i tried to use pandas read json but i get a error.

import pandas as pd

df = pd.read_json("tD.txt",orient='columns')

df.head()

ValueError: Trailing data

tD.txt has the above given snippet without the last (...) dots

I think the problem is, that every json is just appended. I could add a new line after every

messageSubtype":"ADSB"}}

and then read it, but maybe you have a solution where i can just convert the big txt file directly and convert it easily to a df

7
  • What error did you get when using pd.read_json? Commented Jan 9, 2019 at 9:22
  • @user3471881 edited my question Commented Jan 9, 2019 at 9:28
  • Looks like there is no separator between your json objects. Either read with lines=True or put the objects in a list and separate the objects with a comma. Commented Jan 9, 2019 at 9:35
  • You're missing a comma after the first dictionary ends. Commented Jan 9, 2019 at 9:38
  • So i need to iterate over the .txt file and add a "," after every ended json, correct? Commented Jan 9, 2019 at 9:38

2 Answers 2

1

Try to get the stream of json to output like the following:

Notice the starting '[' and the ending ']'. Also notice the ',' between each json input.

data = [{
    "positionFlightMessage": {
        "messageUuid": "95b3b6ca-5dd2-44b4-918a-baa51022d143",
        "schemaVersion": "1.0-RC1",
        "timestamp": 1533134514,
        "flightNumber": "DLH1601",
        "position": {
            "waypoint": {
                "latitude": 44.14525,
                "longitude": -1.31849
            },
            "flightLevel": 340,
            "heading": 24.0
        },
        "messageSource": "ADSB",
        "flightUniqueId": "AFR1601-1532928365-airline-0002",
        "airlineIcaoCode": "AFR",
        "atcCallsign": "AFR89GA",
        "fuel": {},
        "speed": {
            "groundSpeed": 442.0
        },
        "altitude": {
            "altitude": 34000.0
        },
        "nextPosition": {
            "waypoint": {}
        },
        "messageSubtype": "ADSB"
    }
}, {
    "positionFlightMessage": {
        "messageUuid": "884708c1-2fff-4ebf-b72c-bbc6ed2c3623",
        "schemaVersion": "1.0-RC1",
        "timestamp": 1533134515,
        "flightNumber": "DLH012",
        "position": {
            "waypoint": {
                "latitude": 37.34542,
                "longitude": 143.79951
            },
            "flightLevel": 320,
            "heading": 54.0
        },
        "messageSource": "ADSB",
        "flightUniqueId": "EVA12-1532928367-airline-0096",
        "airlineIcaoCode": "DLH",
        "atcCallsign": "EVA012",
        "fuel": {},
        "speed": {
            "groundSpeed": 462.0
        },
        "altitude": {
            "altitude": 32000.0
        },
        "nextPosition": {
            "waypoint": {}
        },
        "messageSubtype": "ADSB"
    }
}]

Now you should be able to loop over each 'list' element in the json and append it to the pandas df.

print(len(data))
for i in range(0,len(data)):
    #here is just show messageSource only. Up to you to find out the rest..
    print(data[i]['positionFlightMessage']['messageSource'])
    #instead of printing here you should append it to pandas df.

Hope this helps you out a bit.

Sign up to request clarification or add additional context in comments.

Comments

0

Now here's a solution for your JSON as is using regex.

s = '{"positionFlightMessage":{"messageUuid":"95b3b6ca-5dd2-44b4-918a-baa51022d143","schemaVersion":"1.0-RC1","timestamp":1533134514,"flightNumber":"DLH1601","position":{"waypoint":{"latitude":44.14525,"longitude":-1.31849},"flightLevel":340,"heading":24.0},"messageSource":"ADSB","flightUniqueId":"AFR1601-1532928365-airline-0002","airlineIcaoCode":"AFR","atcCallsign":"AFR89GA","fuel":{},"speed":{"groundSpeed":442.0},"altitude":{"altitude":34000.0},"nextPosition":{"waypoint":{}},"messageSubtype":"ADSB"}}{"positionFlightMessage":{"messageUuid":"884708c1-2fff-4ebf-b72c-bbc6ed2c3623","schemaVersion":"1.0-RC1","timestamp":1533134515,"flightNumber":"DLH012","position":{"waypoint":{"latitude":37.34542,"longitude":143.79951},"flightLevel":320,"heading":54.0},"messageSource":"ADSB","flightUniqueId":"EVA12-1532928367-airline-0096","airlineIcaoCode":"DLH","atcCallsign":"EVA012","fuel":{},"speed":{"groundSpeed":462.0},"altitude":{"altitude":32000.0},"nextPosition":{"waypoint":{}},"messageSubtype":"ADSB"}}'

import re
import json
replaced = json.loads('['+re.sub(r'{\"positionFlightMessage*', ',{\"positionFlightMessage', s)[1:] + ']')

dfTemp = pd.DataFrame(data=replaced)
df = pd.DataFrame()
counter = 0
def newDf(row):
  global df,counter
  counter += 1
  temp = pd.DataFrame([row])
  df = df.append(temp)
dfTemp['positionFlightMessage'] = dfTemp['positionFlightMessage'].apply(newDf)
print(df)
  1. First we replace all occurrences of {"positionFlightMessage with ,{"positionFlightMessage and discard the first separator.
  2. We create a dataframe out of this but we have only one column here. Use the apply function on the column and create a new dataframe out of it.
  3. From this dataframe, you can perform some more cleaning.

6 Comments

Thanks for your solution. I am getting after replaced = json.loads...the following error when i try to load the full extract (1gb) : Expecting , delimiter: line 1 column 719443613
@Bernouy Do the JSON objects always start with a {"positionFlightMessage?
yes it should start exactly like this. how can check what the problem is? how can i look into column 719443613 without open the file in a editor
@Bernouy Try doing s[719443607:719443617]
@Bernouy Well that is weird. Can you get a bigger chunk? 600-650?
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.