15

I'm trying to use regex to parse an XML file (in my case this seems the simplest way).

For example a line might be:

line='<City_State>PLAINSBORO, NJ 08536-1906</City_State>'

To access the text for the tag City_State, I'm using:

attr = re.match('>.*<', line)

but nothing is being returned.

Can someone point out what I'm doing wrong?

7
  • 11
    I am compelled to link this answer. Commented Aug 11, 2013 at 4:21
  • Using a proper XML library isn't hard once you find one you like. I found ElementTree the nicest to use in the standard library, and untangle the easiest overall (it converts XML into regular dictionaries, lists, etc.). Commented Aug 11, 2013 at 4:32
  • Dang, @Johnsyweb beat me to it! Commented Aug 11, 2013 at 4:58
  • 1
    >Can someone point out what I'm doing wrong? A: you are trying to parse XML using regular expressions. Commented Aug 11, 2013 at 12:10
  • I have tried ElementTree before and ran into memory issues. The file size is 250 MB. Since the XML file I am parsing is very simple, I figured it would be easier to use regex. Commented Aug 11, 2013 at 12:38

3 Answers

23

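You normally don't want to use re.match, because it only tries the pattern at the very beginning of the string. Your string starts with '<', not '>', so the pattern can never match there. Quoting from the docs: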

If you want to locate a match anywhere in string, use search() instead (see also search() vs. match()).

Note:

>>> print re.match('>.*<', line)
None
>>> print re.search('>.*<', line)
<_sre.SRE_Match object at 0x10f666238>
>>> print re.search('>.*<', line).group(0)
>PLAINSBORO, NJ 08536-1906<
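
If you only want the text between the tags, without the surrounding '>' and '<', a capture group is one way to do it. This is a minimal sketch in Python 3 syntax (the session above is Python 2); the names found and text are just illustrative, and the non-greedy .*? is my choice so the match stops at the first '<'.

```python
import re

line = '<City_State>PLAINSBORO, NJ 08536-1906</City_State>'

# re.match only tries the pattern at position 0, where the string has '<', not '>'.
assert re.match(r'>.*<', line) is None

# re.search scans forward through the string; the capture group keeps only
# what sits between '>' and the next '<'.
found = re.search(r'>(.*?)<', line)
text = found.group(1) if found else None
print(text)
```

This only holds for lines as simple as the one in the question; it knows nothing about nested tags, entities, or CDATA.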

Also, why parse XML with regex when you can use something like BeautifulSoup :).

>>> from bs4 import BeautifulSoup as BS
>>> line='<City_State>PLAINSBORO, NJ 08536-1906</City_State>'
>>> soup = BS(line)
>>> print soup.find('city_state').text
PLAINSBORO, NJ 08536-1906


9

Please, just use an XML parser like ElementTree

>>> from xml.etree import ElementTree as ET
>>> line='<City_State>PLAINSBORO, NJ 08536-1906</City_State>'
>>> ET.fromstring(line).text
'PLAINSBORO, NJ 08536-1906'
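
If memory is the concern with a large file, ElementTree also has iterparse, which reads the document incrementally. Below is a rough sketch, not a tuned solution. The helper name iter_tag_text, the Addresses/Address wrapper elements, and the second record are all made up for illustration, since the question doesn't show the real file layout. In practice you would pass a filename instead of the in-memory sample.

```python
import io
from xml.etree import ElementTree as ET

def iter_tag_text(source, tag):
    """Yield the text of each `tag` element while keeping the tree small."""
    context = ET.iterparse(source, events=('start', 'end'))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == 'end' and elem.tag == tag:
            yield elem.text
            root.clear()  # drop what has been parsed so far to free memory

# Hypothetical sample data standing in for the real 250 MB file.
sample = io.BytesIO(
    b'<Addresses>'
    b'<Address><City_State>PLAINSBORO, NJ 08536-1906</City_State></Address>'
    b'<Address><City_State>PRINCETON, NJ 08540</City_State></Address>'
    b'</Addresses>'
)

cities = list(iter_tag_text(sample, 'City_State'))
print(cities)
```

Clearing the root after each hit is what keeps memory roughly flat; without it the parsed elements pile up just as they would with a full parse.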


0

re.match returns a match only if the pattern matches the entire string. To find substrings matching the pattern, use re.search.

And yes, this is a simple way to parse XML, but I would highly encourage you to use a library specifically designed for the task.

2 Comments

It would only be "a simple way to parse XML" if it actually did parse XML. Which it doesn't. (See: lack of support for detecting comment or CDATA blocks; for handling character entities; etc etc etc).
Minor point: re.match is left side anchored but does not have to consume the entire string. Very loosely, given regexp X, re.match is like re.search using ^X (but not ^X$). There are other differences, particularly with strings containing newlines; see documentation link in Haidro's answer.
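
The distinction in the last comment can be checked directly. This is a small Python 3 sketch (re.fullmatch needs 3.4 or later); the variable names are only for illustration.

```python
import re

line = '<City_State>PLAINSBORO, NJ 08536-1906</City_State>'

# re.match is anchored at the start but does not need to consume the whole string.
prefix = re.match(r'<City_State>', line)

# re.fullmatch is the one that requires the entire string to match.
whole = re.fullmatch(r'<City_State>', line)

# For a single-line string, re.search with ^ behaves like re.match.
anchored = re.search(r'^<City_State>', line)

# With re.MULTILINE the two differ: ^ can match after a newline for search,
# while match still only tries position 0.
multi = 'a\nb'
match_ml = re.match(r'^b', multi, re.MULTILINE)
search_ml = re.search(r'^b', multi, re.MULTILINE)
```

Here prefix and anchored both match the opening tag, whole is None, match_ml is None, and search_ml matches the 'b' on the second line.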
