Python re.sub returning binary characters

Question

I'm trying to repair a JSON feed using re.sub() in Python. (I'm also working with the feed provider to fix it.) I have two problems to fix:

1.

      "milepost":       "
      "milepost":       "723.46

which are missing an end quote, and

2.

},
 }

which shouldn't have the comma. Note that there is no blank line between the two braces; the text is just "},\n }".

I have a short snippet of the feed, located at: http://hardhat.ahmct.ucdavis.edu/tmp/test.txt

Sample code below. Here, I have tests for finding the patterns, and then for doing the replacements. The match for #2 gives some odd results, but I can't see why: Brace matches found: [('}', '\r\n }')]

The match for #1 seems good.

The main problem is that when I do the re.sub, my resulting string has "\x01\x02" in it. I have no clue where this is coming from. Any advice is greatly appreciated.

Sample code:

import urllib2
import json
import re

if __name__ == "__main__":
    # wget version of real feed:
    # url = "http://hardhat.ahmct.ucdavis.edu/tmp/test.json"
    # Short text, for milepost and brace substitution test:
    url = "http://hardhat.ahmct.ucdavis.edu/tmp/test.txt"
    request = urllib2.urlopen(url)
    rawResponse = request.read()
    # print("Raw response:")
    # print(rawResponse)

    # Find extra comma after end of records:
    p1 = re.compile('(}),(\r?\n *})')
    l1 = p1.findall(rawResponse)
    print("Brace matches found:")
    print(l1)

    # Check milepost:
    #p2 = re.compile('( *\"milepost\": *\")')
    p2 = re.compile('( *\"milepost\": *\")([0-9]*\.?[0-9]*)\r?\n')
    l2 = p2.findall(rawResponse)
    print("Milepost matches found:")
    print(l2)

    # Do brace substitutions:
    subst = "\1\2"
    response = re.sub(p1, subst, rawResponse)

    # Do milepost substitutions:
    subst = "\1\2\""
    response = re.sub(p2, subst, response)
    print(response)

Tim Pietzcker · Accepted Answer · 2014-11-18 17:46:49Z

3

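You need to use raw strings, or "\1\2" will be interpreted by the Python string parser as two octal escapes, that is, the characters ASCII 01 and ASCII 02 ("\x01\x02"), instead of backslash 1 backslash 2. The regex engine then never sees the group references and inserts those two control characters literally.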

Instead of

subst = "\1\2"

use

subst = r"\1\2" # or subst = "\\1\\2"
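A quick way to see the difference is to compare the two literals directly. This is only a small sketch; the variable names are made up for illustration:

```python
# Non-raw literal: Python reads \1 and \2 as octal escapes.
plain = "\1\2"

# Raw literal: the backslashes are kept, so re.sub sees group references.
raw = r"\1\2"

print(repr(plain), len(plain))
print(repr(raw), len(raw))
```

The non-raw literal holds two control characters, while the raw one holds the four characters backslash, 1, backslash, 2.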

Things get a bit trickier with the second replacement:

subst = "\1\2\""

needs to become

subst = r'\1\2"' # or subst = "\\1\\2\""
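Putting both replacements together, a minimal sketch could look like the following. The sample string is a made-up stand-in for the feed, not the real data, and the patterns are the ones from the question rewritten as raw strings:

```python
import json
import re

# Hypothetical sample with both defects: missing end quote and extra comma.
sample = '{\r\n "rec": {\r\n  "milepost": "723.46\r\n },\r\n}'

# Patterns from the question, written as raw strings.
brace_re = re.compile(r'(}),(\r?\n *})')
milepost_re = re.compile(r'( *"milepost": *")([0-9]*\.?[0-9]*)\r?\n')

# Raw replacement strings, so \1 and \2 reach the regex engine intact.
fixed = brace_re.sub(r'\1\2', sample)
fixed = milepost_re.sub(r'\1\2"', fixed)

# The repaired text should now parse as JSON.
data = json.loads(fixed)
print(data)
```

Note that the milepost pattern consumes the line break after the value and the replacement does not put it back, so the closing quote is followed directly by the next line's content. That is still valid JSON.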


2 Comments

Ty Lasky Over a year ago

Outstanding (and quick), Tim! That resolved it. I think I'm off and running now with the actual coding. Thanks!

Tim Pietzcker Over a year ago

Great! I just noticed you weren't using raw strings with the regexes either - it's a good idea to get into the habit of always using raw strings with Python regexes because even though it may work in some cases without them, it might fail unexpectedly in others (for example the regex "\b" will match a backspace character and not (like r"\b" would) a word boundary anchor)...
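The "\b" case from this comment can be checked with a short sketch; the test string and variable names are assumptions chosen for illustration:

```python
import re

text = "cat concat"

# Raw string: the regex engine receives \b and treats it as a word boundary.
word_hits = re.findall(r"\bcat", text)

# Non-raw string: Python converts "\b" to a backspace (\x08) before re sees it,
# so the pattern only matches text containing a literal backspace.
backspace_hits = re.findall("\bcat", text)
literal_hits = re.findall("\bcat", "\x08cat")

print(word_hits, backspace_hits, literal_hits)
```

Only the raw-string pattern finds the word "cat" in ordinary text.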
