I'm trying to repair a JSON feed using re.sub() regex expressions in Python. (I'm also working with the feed provider to fix it). I have two expressions to fix:
1.
"milepost": "
"milepost": "723.46
which are missing an end quote, and
2.
},
}
which shouldn't have the comma. Note, there is no blank line between them, it's just "},\n }" (trouble with this editor...)
I have a short snippet of the feed, located at: http://hardhat.ahmct.ucdavis.edu/tmp/test.txt
Sample code below. Here, I have tests for finding the patterns, and then for doing the replacements. The match for #2 gives some odd results, but I can't see why: Brace matches found: [('}', '\r\n }')]
The match for #1 seems good.
Main problem is, when I do the re.sub, my resulting string has "\x01\x02" in it. I have no clue where this is coming from. Any advice greatly appreciated.
Sample code:
import urllib2
import json
import re
if __name__ == "__main__":
# wget version of real feed:
# url = "http://hardhat.ahmct.ucdavis.edu/tmp/test.json"
# Short text, for milepost and brace substitution test:
url = "http://hardhat.ahmct.ucdavis.edu/tmp/test.txt"
request = urllib2.urlopen(url)
rawResponse = request.read()
# print("Raw response:")
# print(rawResponse)
# Find extra comma after end of records:
p1 = re.compile('(}),(\r?\n *})')
l1 = p1.findall(rawResponse)
print("Brace matches found:")
print(l1)
# Check milepost:
#p2 = re.compile('( *\"milepost\": *\")')
p2 = re.compile('( *\"milepost\": *\")([0-9]*\.?[0-9]*)\r?\n')
l2 = p2.findall(rawResponse)
print("Milepost matches found:")
print(l2)
# Do brace substitutions:
subst = "\1\2"
response = re.sub(p1, subst, rawResponse)
# Do milepost substitutions:
subst = "\1\2\""
response = re.sub(p2, subst, response)
print(response)