I'm trying to repair a malformed JSON feed with re.sub() in Python (I'm also working with the feed provider to get it fixed at the source). There are two patterns I need to fix:

1.

      "milepost":       "
      "milepost":       "723.46

which are missing the closing quote, and

2.

    },

}

which shouldn't have the comma. Note that there is no blank line between the two braces; it's just "},\n }" (I had trouble with this editor).

I have a short snippet of the feed, located at: http://hardhat.ahmct.ucdavis.edu/tmp/test.txt

Sample code is below. It first tests that both patterns are found and then does the replacements. The match for #2 gives some odd results, and I can't see why: Brace matches found: [('}', '\r\n }')]

The match for #1 seems good.

The main problem is that after the re.sub() calls, my resulting string contains "\x01\x02". I have no clue where this is coming from. Any advice is greatly appreciated.

Sample code:

import urllib2
import json
import re

if __name__ == "__main__":
    # wget version of real feed:
    # url = "http://hardhat.ahmct.ucdavis.edu/tmp/test.json"
    # Short text, for milepost and brace substitution test:
    url = "http://hardhat.ahmct.ucdavis.edu/tmp/test.txt"
    request = urllib2.urlopen(url)
    rawResponse = request.read()
    # print("Raw response:")
    # print(rawResponse)

    # Find extra comma after end of records:
    p1 = re.compile('(}),(\r?\n *})')
    l1 = p1.findall(rawResponse)
    print("Brace matches found:")
    print(l1)

    # Check milepost:
    #p2 = re.compile('( *\"milepost\": *\")')
    p2 = re.compile('( *\"milepost\": *\")([0-9]*\.?[0-9]*)\r?\n')
    l2 = p2.findall(rawResponse)
    print("Milepost matches found:")
    print(l2)

    # Do brace substitutions:
    subst = "\1\2"
    response = re.sub(p1, subst, rawResponse)

    # Do milepost substitutions:
    subst = "\1\2\""
    response = re.sub(p2, subst, response)
    print(response)

1 Answer

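You need to use raw strings. Otherwise Python's string-literal processing interprets "\1\2" as two octal escapes, i.e. the characters with codes 1 and 2 ("\x01\x02"), instead of backslash 1 backslash 2. That is where the "\x01\x02" in your output comes from.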

Instead of

subst = "\1\2"

use

subst = r"\1\2" # or subst = "\\1\\2"

Things get a bit trickier with the second replacement:

subst = "\1\2\""

needs to become

subst = r'\1\2"' # or subst = "\\1\\2\""
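As a rough end-to-end check, the two replacements can be tried on a small string. This is only a sketch: the `raw` sample is made up to mimic the two defects (it is not the real feed), the names `brace_re` and `milepost_re` are illustrative, and the milepost pattern captures the line ending in a third group so that it is preserved, which is a small change from the pattern in the question.

```python
import json
import re

# Hypothetical sample that mimics the two defects described in the question:
# a milepost value missing its closing quote, and a stray comma in "},\r\n}".
raw = '{"record": {\r\n  "milepost": "723.46\r\n},\r\n}'

# Raw strings keep the backslashes intact for the regex engine.
brace_re = re.compile(r'(}),(\r?\n *})')
# Assumption: a third group keeps the original line ending in the output.
milepost_re = re.compile(r'( *"milepost": *")([0-9]*\.?[0-9]*)(\r?\n)')

# A non-raw "\1\2" is two control characters, not two backreferences.
assert "\1\2" == "\x01\x02"
assert len(r"\1\2") == 4

# Drop the comma between the braces, then add the missing closing quote.
fixed = brace_re.sub(r'\1\2', raw)
fixed = milepost_re.sub(r'\1\2"\3', fixed)

# The repaired text should now parse as JSON.
parsed = json.loads(fixed)
print(parsed)
```

If the replacement strings were written without the `r` prefix, `fixed` would contain the control characters instead of the captured text and `json.loads` would fail.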

2 Comments

Outstanding (and quick), Tim! That resolved it. I think I'm off and running now with the actual coding. Thanks!
Great! I just noticed you weren't using raw strings with the regexes either - it's a good idea to get into the habit of always using raw strings with Python regexes because even though it may work in some cases without them, it might fail unexpectedly in others (for example the regex "\b" will match a backspace character and not (like r"\b" would) a word boundary anchor)...
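The "\b" case from the comment above can be seen with a short experiment. This is a minimal sketch; the sample text and variable names are made up for illustration.

```python
import re

text = "cat concatenate"

# In a normal string literal, "\b" is the backspace character (code 8).
assert "\b" == "\x08"

# Raw string: the regex engine receives \b and treats it as a word boundary.
word_hits = re.findall(r"\bcat\b", text)

# Non-raw string: the pattern is backspace + "cat" + backspace,
# which does not occur anywhere in the text.
backspace_hits = re.findall("\bcat\b", text)

print(word_hits)
print(backspace_hits)
```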
