1

I have some text to load: https://sites.google.com/site/iminside1/paste
I'd prefer to create a Python dictionary from it, but any object is OK. I tried pickle, json and eval, but didn't succeed. Can you help me with this?
Thanks!
The results:

a = open("the_file", "r").read()

json.loads(a)
ValueError: Expecting property name: line 1 column 1 (char 1)

pickle.loads(a)
KeyError: '{'

eval(a)
File "<string>", line 19
from: {code: 'DME', airport: "Домодедово", city: 'Москва', country: 'Россия', terminal: ''},
    ^
SyntaxError: invalid syntax
  • How didn't it work? Post the code you tried and how it failed. Commented Aug 30, 2010 at 15:35
  • Wait, are the keys really a wild mix of strings, plain identifiers and plain identifiers that happen to be keywords?? Commented Aug 30, 2010 at 15:54
  • If I understand you right - yes, all the keys are a wild mix of strings :) Maybe I need to quote them first? If so, how can I do it without breaking quoted values? Commented Aug 30, 2010 at 16:00
  • It does sorta look like a pickle file to me. Try f = open('the_file', 'r') to open the file for reading, then pickle.load(f) to get the object named "data". Commented Aug 30, 2010 at 16:03
  • I've tried pickle before, see the result above. KeyError: '{' Commented Aug 30, 2010 at 16:08

4 Answers

4

Lifted almost straight from the pyparsing examples page:

# read text from web page
import urllib
page = urllib.urlopen("https://sites.google.com/site/iminside1/paste")
html = page.read()
page.close()

start = html.index("<pre>")+len("<pre>")+3 #skip over 3-byte header
end = html.index("</pre>")
text = html[start:end]
print text

# parse dict-like syntax    
from pyparsing import (Suppress, Regex, quotedString, Word, alphas, 
alphanums, oneOf, Forward, Optional, dictOf, delimitedList, Group, removeQuotes)

LBRACK,RBRACK,LBRACE,RBRACE,COLON,COMMA = map(Suppress,"[]{}:,")
integer = Regex(r"[+-]?\d+").setParseAction(lambda t:int(t[0]))
real = Regex(r"[+-]?\d+\.\d*").setParseAction(lambda t:float(t[0]))
string_ = Word(alphas,alphanums+"_") | quotedString.setParseAction(removeQuotes)
bool_ = oneOf("true false").setParseAction(lambda t: t[0]=="true")
item = Forward()

key = string_
dict_ = LBRACE - Optional(dictOf(key+COLON, item+Optional(COMMA))) + RBRACE
list_ = LBRACK - Optional(delimitedList(item)) + RBRACK
item << (real | integer | string_ | bool_ | Group(list_ | dict_ ))

result = item.parseString(text,parseAll=True)[0]
print result.data[0].dump()
print result.data[0].segments[0].dump(indent="  ")
print result.data[0].segments[0].flights[0].dump(indent="  -  ")
print result.data[0].segments[0].flights[0].flightLegs[0].dump(indent="  -  -  ")
for seg in result.data[6].segments:
    for flt in seg.flights:
        fltleg = flt.flightLegs[0]
        print "%(airline)s %(airlineCode)s %(flightNo)s" % fltleg,
        print "%s -> %s" % (fltleg["from"].code, fltleg["to"].code)

Prints:

[['index', 0], ['serviceClass', '??????'], ['prices', [3504, ...
- eTicketing: true
- index: 0
- prices: [3504, 114.15000000000001, 89.769999999999996]
- segments: [[['indexSegment', 0], ['stopsCount', 0], ['flights', ... 
- serviceClass: ??????
  [['indexSegment', 0], ['stopsCount', 0], ['flights', [[['index', 0], ...
  - flights: [[['index', 0], ['time', 'PT2H45M'], ['minAvailSeats', 9], ...
  - indexSegment: 0
  - stopsCount: 0
  -  [['index', 0], ['time', 'PT2H45M'], ['minAvailSeats', 9], ['flight...
  -  - flightLegs: [[['flightNo', '309'], ['eTicketing', 'true'], ['air... 
  -  - index: 0
  -  - minAvailSeats: 9
  -  - stops: []
  -  - time: PT2H45M
  -  -  [['flightNo', '309'], ['eTicketing', 'true'], ['airplane', 'Boe... 
  -  -  - airline: ?????????
  -  -  - airlineCode: UN
  -  -  - airplane: Boeing 737-500
  -  -  - availSeats: 9
  -  -  - classCode: I
  -  -  - eTicketing: true
  -  -  - fareBasis: IPROW
  -  -  - flightClass: ECONOMY
  -  -  - flightNo: 309
  -  -  - from:   -  -  [['code', 'DME'], ['airport', '??????????'], ... 
  -  -    - airport: ??????????
  -  -    - city: ??????
  -  -    - code: DME
  -  -    - country: ??????
  -  -    - terminal: 
  -  -  - fromDate: 2010-10-15
  -  -  - fromTime: 10:40:00
  -  -  - time: 
  -  -  - to:   -  -  [['code', 'TXL'], ['airport', 'Berlin-Tegel'], ... 
  -  -    - airport: Berlin-Tegel
  -  -    - city: ??????
  -  -    - code: TXL
  -  -    - country: ????????
  -  -    - terminal: 
  -  -  - toDate: 2010-10-15
  -  -  - toTime: 11:25:00
airBaltic BT 425 SVO -> RIX
airBaltic BT 425 SVO -> RIX
airBaltic BT 423 SVO -> RIX
airBaltic BT 423 SVO -> RIX

EDIT: fixed grouping and expanded output dump to show how to access individual key fields of results, either by index (within list) or as attribute (within dict).

3 Comments

Wow. +1 for kicking ass and for choosing YAML as output ;)
Is that YAML? It's just what the dump() method prints out.
Thanks for pyparsing discovery, very useful
3

If you really have to load the bulls... that this data is (see my comment), you're probably best off with a regex that adds the missing quotes. Something like r"([a-zA-Z_][a-zA-Z_0-9]*)\s*\:" to find things to quote and r"\'\1\'\:" as the replacement (off the top of my head; I have to test it first).

Edit: After some trouble with backreferences in Python 3.1, I finally got it working with these:

>>> import re
>>> pattern = r"([a-zA-Z_][a-zA-Z_0-9]*)\s*\:"
>>> test = '{"foo": {bar: 1}}'
>>> repl = lambda match: '"{}":'.format(match.group(1))
>>> eval(re.sub(pattern, repl, test))
{'foo': {'bar': 1}}

2 Comments

Something wrong.. Trying your code with your example (TypeError: expected string or buffer): /usr/lib/python2.6/re.pyc in sub(pattern, repl, string, count) 149 a callable, it's passed the match object and must return 150 a replacement string to be used.""" --> 151 return _compile(pattern, 0).sub(repl, string, count) 152 153 def subn(pattern, repl, string, count=0): TypeError: expected string or buffer
Mixed repl and string argument order up, fixed it.
1

So far, with delnan's help and a little investigation, I can load it into a dict with eval:

import re

pattern = r"\b(?P<word>\w+):"
# raw string, so that \g<word> reaches re.sub unchanged
x = re.sub(pattern, r'"\g<word>":', open("the_file", "r").read())
y = x.replace("true", '"true"')
d = eval(y)

I'm still looking for a more efficient and maybe simpler solution. I'd rather not use eval, for various reasons.
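
One possible way to avoid a bare eval is ast.literal_eval from the standard library, which accepts only literals (dicts, lists, strings, numbers, booleans, None) and rejects arbitrary expressions. The sketch below assumes the text becomes a valid Python literal once the keys are quoted. The sample string and the mapping of true/false to True/False are illustrative assumptions, not the real file:

```python
import ast
import re

# Hypothetical sample in the style of the data: bare keys (including the
# Python keyword "from"), single-quoted values and a bare "true".
sample = """{flightNo: '309', eTicketing: true,
 from: {code: 'DME', terminal: ''}, prices: [3504, 114.15]}"""

# Quote every bare identifier that is followed by a colon.
quoted = re.sub(r"\b(?P<word>[A-Za-z_]\w*)\s*:", r'"\g<word>":', sample)

# Map JavaScript booleans onto Python literals (assumption: they never
# occur as whole words inside quoted values).
quoted = re.sub(r"\btrue\b", "True", quoted)
quoted = re.sub(r"\bfalse\b", "False", quoted)

# literal_eval parses literals only; it does not execute code.
data = ast.literal_eval(quoted)
print(data["from"]["code"], data["eTicketing"])
```

Both regexes run over the whole text, so an identifier followed by a colon inside a quoted value would be rewritten too; the sample above avoids that case on purpose.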

2 Comments

Well, it will hardly get more efficient than the built-in eval, but I understand. With the quoting fixed, I suppose it is valid JSON?
Unfortunately not :O I still cannot load it with json.loads() or pickle.loads(). Strange and confusing: only eval works, and I don't understand why. (Shouldn't pickle work??)
0

An extension of DominiCane's version:

import json
import re

quote_keys_regex = re.compile(r'([\{\s,])(\w+)(:)')


def js_variable_to_python(js_variable):
    """Convert a javascript variable into JSON and then load the value"""
    # when in_string is not None, it contains the character that opened the string:
    # either a single quote or a double quote
    in_string = None
    # cut the string:
    # r"""{ a:"f\"irst", c:'sec"ond'}"""
    # becomes
    # ['{ a:', '"', 'f\\', '"', 'irst', '"', ', c:', "'", 'sec', '"', 'ond', "'", '}']
    l = re.split(r'(["\'])', js_variable)
    # previous part (to check for the escape character, a backslash)
    previous_p = ""
    for i, p in enumerate(l):
        # parse characters inside an ECMAScript string
        if in_string:
            # we are in a JS string: replace the colon by a temporary character
            # so quote_keys_regex doesn't have to deal with colon inside the JS strings
            l[i] = l[i].replace(':', chr(1))
            if in_string == "'":
                # the JS string is delimited by single quotes.
                # This is not supported by JSON.
                # single-quoted strings are converted to double-quoted strings
                # here, inside a JS string, we escape the double quote
                l[i] = l[i].replace('"', r'\"')

        # deal with delimiters and the escape character
        if not in_string and p in ('"', "'"):
            # we are not in a string
            # but p is a double or single quote:
            # that's the start of a new string
            # replace a single quote by a double quote
            # (JSON doesn't support single quotes)
            l[i] = '"'
            in_string = p
            continue
        if p == in_string:
            # we are in a string and the current part MAY close the string
            if len(previous_p) > 0 and previous_p[-1] == '\\':
                # there is a backslash just before: the JS string continues
                continue
            # the current p closes the string
            # replace a single quote by a double quote
            l[i] = '"'
            in_string = None
        # update previous_p
        previous_p = p
    # join the string
    s = ''.join(l)
    # add quotes around the key
    # { a: 12 }
    # becomes
    # { "a": 12 }
    s = quote_keys_regex.sub(r'\1"\2"\3', s)
    # replace the surrogate character by a colon
    s = s.replace(chr(1), ':')
    # load the JSON and return the result
    return json.loads(s)

It deals only with int, null and string. I don't know about float.

Note the use of chr(1) as a temporary placeholder: the code doesn't work if this character already appears in js_variable.
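
A possible way around that limitation, sketched here as an assumption rather than as part of the code above, is to pick a placeholder that is known to be absent from the input. pick_placeholder is a hypothetical helper name:

```python
def pick_placeholder(text):
    """Return a non-whitespace control character that does not occur in text.

    Hypothetical helper, not part of the original function.
    """
    for code in range(1, 32):
        candidate = chr(code)
        if candidate.isspace():
            # skip tab, newline and friends: the key regex matches on \s
            continue
        if candidate not in text:
            return candidate
    raise ValueError("no unused control character available")


print(repr(pick_placeholder('{a: "x:y"}')))
print(repr(pick_placeholder('\x01{a: 1}')))
```

The returned character would then stand in for both chr(1) occurrences in js_variable_to_python.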
