I am trying to write a Python script that can parse the following type of log entry, which consists of keys and values. The value of each key may itself be another nested set of keys and values. An example is below. The depth of the nesting varies depending on the log I get, so the parsing has to be dynamic. Each level of nesting is, however, enclosed in braces.

The string of keys and values looks something like this:

   Countries =     {
    "USA" = 0;
    "Spain" = 0;
    Connections = 1;
    Flights =         {
        "KLM" = 11;
        "Air America" = 15;
        "Emirates" = 2;
        "Delta" = 3;
    };
    "Belgium" = 1;
    "Czech Republic" = 0;
    "Netherlands" = 1;
    "Hungary" = 0;
    "Luxembourg" = 0;
    "Italy" = 0;

};

The data above can have multiple levels of nesting as well. I would like to write a function that parses this and puts it into an array of data (or a similar structure) so that I can get the value of a specific key, like:

    print countries.belgium
          value should be printed as 1

likewise,

    print countries.flights.delta
          value should be printed as 3.

Note that not every key in the input is quoted (for example, Connections and Flights).

Any pointers on where to start? Are there any Python libraries that already do this kind of parsing?

3 Answers

I have created a sample Python script that will do the job; just tweak it as you like. It converts your format into a nested dict, and it handles whatever nesting depth the data has.

The code (also posted on Pastebin):

import ast

data = """ { Countries = { USA = 1; "Connections" = { "1 Flights" = 0; "10 Flights" = 0; "11 Flights" = 0; "12 Flights" = 0; "13 Flights" = 0; "14 Flights" = 0; "15 Flights" = 0; "16 Flights" = 0; "17 Flights" = 0; "18 Flights" = 0; "More than 25 Flights" = 0; }; "Single Connections" = 0; "No Connections" = 0; "Delayed" = 0; "Technical Fault" = 0; "Others" = 0; }; }"""


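def arrify(string):
    # Turn the log syntax into Python dict-literal syntax: "=" -> ":" and ";" -> ",".
    string = string.replace("=", " : ")
    string = string.replace(";", " , ")
    # Drop the existing quotes; every key and value is re-quoted below.
    string = string.replace("\"", "")
    stringDict = string.split()
    newArr = []
    quoteClosed = True
    for i, splitStr in enumerate(stringDict):
        if i > 0:
            if not isDelim(splitStr):
                # A word that follows a delimiter starts a new key or value.
                if isDelim(newArr[i-1]) and quoteClosed:
                    splitStr = "\"" + splitStr
                    quoteClosed = False

                # A word that precedes a delimiter ends the key or value.
                # (Assumes the last token is a delimiter, i.e. the closing brace.)
                if isDelim(stringDict[i+1]) and not quoteClosed:
                    splitStr += "\""
                    quoteClosed = True

        newArr.append(splitStr)

    newString = " ".join(newArr)
    # literal_eval safely evaluates the resulting dict literal.
    newDict = ast.literal_eval(newString)
    return normalizeDict(newDict)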

def isDelim(string):
    return str(string) in "{:,}"


def normalizeDict(dic):
    for key, value in dic.items():
        if type(value) is dict:
            dic[key] = normalizeDict(value)
            continue
        dic[key] = normalize(value)
    return dic

def normalize(string):
    # Convert numeric strings to int; leave everything else unchanged.
    try:
        return int(string)
    except ValueError:
        return string

print(arrify(data))

The result from your sample data:

{'Countries': {'USA': 1, 'Technical Fault': 0, 'No Connections': 0, 'Delayed': 0, 'Connections': {'17 Flights': 0, '10 Flights': 0, '11 Flights': 0, 'More than 25 Flights': 0, '14 Flights': 0, '15 Flights': 0, '12 Flights': 0, '18 Flights': 0, '16 Flights': 0, '1 Flights': 0, '13 Flights': 0}, 'Single Connections': 0, 'Others': 0}}

And you can get values as you would from a normal dict :) Hope it helps.
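
Once the data is a nested dict, the lookups from the question (`countries.belgium`, `countries.flights.delta`) can be approximated with a small path helper. This is only a sketch: `get_path` is a hypothetical name, not part of the answer's script, and it assumes keys are unique when compared case-insensitively. The `parsed` dict is a trimmed copy of the result shown above.

```python
def get_path(data, path, sep="."):
    """Look up a nested value by a dotted path, ignoring key case.

    Hypothetical helper for illustration; raises KeyError if a part is missing.
    """
    current = data
    for part in path.split(sep):
        if not isinstance(current, dict):
            raise KeyError(path)
        # Lowercase view of the keys so "countries" matches "Countries".
        lowered = {str(key).lower(): value for key, value in current.items()}
        current = lowered[part.strip().lower()]
    return current


# Trimmed copy of the nested dict printed by arrify() above.
parsed = {
    "Countries": {
        "USA": 1,
        "Connections": {"10 Flights": 0, "More than 25 Flights": 0},
        "Delayed": 0,
    }
}

print(get_path(parsed, "countries.usa"))                     # 1
print(get_path(parsed, "countries.connections.10 flights"))  # 0
```

Because the path is a plain string, keys containing spaces work without any renaming.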

10 Comments

You really need to include the code in your answer. Just linking to it is not good enough.
@richmondwang, exactly what I was looking for. However, my dynamic string this time is as below, and this gave me a syntax error:
What data did you pass? @user2605278
Ahh, it's because of the leading numeric value in the keys. I'll modify it.
Just enclose your data in braces, { data_string }, so you don't get a parsing error :)

Iterate over the data and check whether each element is itself a set of key-value pairs. If it is, call the function recursively. Something like this:

def parseNestedData(data):
    if isinstance(data, dict):
        # Recurse into every value; nested dicts are walked depth-first.
        for value in data.values():
            parseNestedData(value)
    else:
        # Leaf value: print it.
        print(data)

Output:

>>> Countries =     {
"USA" : 0,
"Spain" : 0,
"Connections" : 1,
"Flights" :         {
    "KLM" : 11,
    "Air America" : 15,
    "Emirates" : 2,
    "Delta" : 3,
},
"Belgium" : 1,
"Czech Republic" : 0,
"Netherlands" : 1,
"Hungary" : 0,
"Luxembourg" : 0,
"Italy" :0
};

>>> Countries
{'Connections': 1,
'Flights': {'KLM': 11, 'Air America': 15, 'Emirates': 2, 'Delta': 3},
 'Netherlands': 1,
'Italy': 0,
'Czech Republic': 0,
'USA': 0,
'Belgium': 1,
'Hungary': 0,
'Luxembourg': 0, 'Spain': 0}
>>> parseNestedData(Countries)
1
11
15
2
3
1
0
0
0
1
0
0
0
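
The session above starts from data that has already been rewritten as a Python dict literal. A minimal sketch of parsing the raw `key = value;` format directly, with no manual pre-processing, is a small tokenizer plus a recursive function. The names `tokenize`, `parse_block`, and `parse_log` are hypothetical, and the sketch assumes multi-word keys are always quoted and that unquoted keys are single words (as with Connections and Flights in the question).

```python
import re

# One token per match: a quoted string, a bare word/number, or a punctuation mark.
TOKEN_RE = re.compile(r'"([^"]*)"|([^\s"={};]+)|([={};])')


def tokenize(text):
    """Yield (kind, value) pairs, where kind is 'str', 'word' or 'punct'."""
    for match in TOKEN_RE.finditer(text):
        quoted, word, punct = match.groups()
        if quoted is not None:
            yield ("str", quoted)
        elif word is not None:
            yield ("word", word)
        else:
            yield ("punct", punct)


def parse_block(tokens):
    """Parse `key = value;` entries until a closing brace or end of input.

    `tokens` must be a shared iterator so nested calls continue where
    the caller left off.
    """
    result = {}
    for kind, value in tokens:
        if (kind, value) == ("punct", "}"):
            break  # end of the current nesting level
        if (kind, value) == ("punct", ";"):
            continue  # separator after a value or nested block
        key = value
        if kind == "punct" or next(tokens, None) != ("punct", "="):
            raise ValueError("expected '=' after key %r" % (key,))
        kind, value = next(tokens, ("punct", ""))
        if (kind, value) == ("punct", "{"):
            result[key] = parse_block(tokens)  # nested block: recurse
        elif kind == "punct":
            raise ValueError("missing value for key %r" % (key,))
        elif kind == "word" and re.fullmatch(r"-?\d+", value):
            result[key] = int(value)
        else:
            result[key] = value
    return result


def parse_log(text):
    """Parse a whole log entry into a nested dict."""
    return parse_block(tokenize(text))


sample = ('Countries = { "USA" = 0; Connections = 1; '
          'Flights = { "KLM" = 11; "Delta" = 3; }; "Czech Republic" = 0; };')
parsed = parse_log(sample)
print(parsed["Countries"]["Flights"]["Delta"])   # 3
print(parsed["Countries"]["Czech Republic"])     # 0
```

The result is an ordinary nested dict, so `parseNestedData` and `data.get(...)` work on it unchanged.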

4 Comments

Thanks Himanshu. How can I get just the value of, say, Czech Republic (it should return just 0)?
Also, does this need some pre-processing? Not all keys are enclosed in double quotes, for example Connections.
If you know that the Czech Republic key is present at the first level, then just do data.get('Czech Republic')
Any key present in data must be hashable, e.g. a string, integer, or tuple. An unquoted Connections is not a valid key, which is why I have edited the question.

Defining a class to process and store the information could give you something like this:

import re

class datastruct():
    def __init__(self,data_in):
        # Capture the body of the "Flights = { ... };" block.
        flights = re.findall(r'(?:Flights\s=\s*\{)([\s"A-Z=0-9;a-z]*)};',data_in)
        flight_dict = {}
        for flight in flights[0].split(';')[0:-1]:
            key,val = self.split_data(flight)
            flight_dict[key] = val

        # Match every quoted "Name" = N pair (one or two words, one or two digits).
        countries = re.findall(r'("[A-Za-z]+\s?[A-Za-z]*"\s=\s[0-9]{1,2})',data_in)
        countries_dict = {}
        for country in countries:
            key,val = self.split_data(country)
            # Quoted airline names match too, so skip keys already stored as flights.
            if key not in flight_dict:
                countries_dict[key]=val

        connections = re.findall(r'(?:Connections\s=\s)([0-9]*);',data_in)
        self.country= countries_dict
        self.flight = flight_dict
        self.connections = int(connections[0])

    def split_data(self,data2):
        item = data2.split('=')
        key = item[0].strip().strip('"')
        val = int(item[1].strip())
        return key,val

Please note that the regexes may need tweaking if the data is not exactly as I've assumed below. The data could be set up and referenced as follows:

raw_data = 'Countries =     {    "USA" = 0;    "Spain" = 0;    Connections = 1;    Flights =         {        "KLM" = 11;        "Air America" = 15;        "Emirates" = 2;        "Delta" = 3;    };    "Belgium" = 1;    "Czech Republic" = 0;    "Netherlands" = 1;    "Hungary" = 0;    "Luxembourg" = 0;    "Italy" = 0;};'

flight_data = datastruct(raw_data)
print("No. Connections:",flight_data.connections)
print("Country 'USA':",flight_data.country['USA'],'\n')
print("Flight 'KLM':",flight_data.flight['KLM'],'\n')

for country in flight_data.country.keys():
    print("Country: {0} -> {1}".format(country,flight_data.country[country]))
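
The class above extracts three fixed fields with regular expressions, so it does not follow arbitrary nesting. For the attribute-style access asked for in the question (`countries.flights.delta`), a generic wrapper around an already-parsed nested dict is one option. This is a sketch only: `LogNode` is a hypothetical name, and it assumes keys stay unique after being lower-cased with spaces replaced by underscores.

```python
class LogNode(object):
    """Read-only wrapper giving case-insensitive attribute access to a nested dict.

    Hypothetical helper for illustration; not part of the answer's class.
    """

    def __init__(self, data):
        # Normalise keys: lower-case, spaces replaced by underscores.
        self._items = {}
        for key, value in data.items():
            name = str(key).lower().replace(" ", "_")
            # Wrap nested dicts so attribute access works at every level.
            self._items[name] = LogNode(value) if isinstance(value, dict) else value

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self._items[name.lower()]
        except KeyError:
            raise AttributeError(name)


countries = LogNode({
    "USA": 0,
    "Belgium": 1,
    "Czech Republic": 0,
    "Flights": {"KLM": 11, "Delta": 3},
})
print(countries.belgium)         # 1
print(countries.flights.delta)   # 3
print(countries.czech_republic)  # 0
```

Keys that are not valid identifiers after normalisation (for example, ones starting with a digit) would still need `getattr(node, name)` or plain dict access.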
