Parsing XML in Python with regex

Question

I'm trying to use regex to parse an XML file (in my case this seems the simplest way).

For example a line might be:

line='<City_State>PLAINSBORO, NJ 08536-1906</City_State>'

To access the text for the tag City_State, I'm using:

attr = re.match('>.*<', line)

but nothing is being returned.

Can someone point out what I'm doing wrong?

Using a proper XML library isn't hard once you find a library you like. I found ElementTree the nicest one to use in the standard library, and untangle the easiest (it converts XML into regular dictionaries, lists, etc.).
– dbr, Commented Aug 11, 2013 at 4:32
> Can someone point out what I'm doing wrong?
A: You are trying to parse XML using regular expressions.
– Michael Kay, Commented Aug 11, 2013 at 12:10
I have tried ElementTree before and I am getting memory issues. The file size is 250 MB. Since the XML file I am parsing is very simple, I figured it is easier to use regex.
– user2671656, Commented Aug 11, 2013 at 12:38

TerryA · Accepted Answer · 2013-08-11 04:25:09Z

23

You normally don't want to use re.match. Quoting from the docs:

If you want to locate a match anywhere in string, use search() instead (see also search() vs. match()).

Note:

>>> import re
>>> print(re.match('>.*<', line))
None
>>> print(re.search('>.*<', line))
<re.Match object; span=(11, 38), match='>PLAINSBORO, NJ 08536-1906<'>
>>> print(re.search('>.*<', line).group(0))
>PLAINSBORO, NJ 08536-1906<
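
If you only want the text between the tags, a capturing group saves you from stripping the surrounding > and < afterwards. This is a minimal sketch, assuming one simple element per line as in the question (no attributes, nested tags, entities or CDATA). The pattern and the helper name extract_text are my own illustration, not anything from the docs.

```python
import re

line = '<City_State>PLAINSBORO, NJ 08536-1906</City_State>'

# Assumed shape: <tag>text</tag> with no attributes and no nested markup.
# [^<]* stops at the first '<', so it cannot run past the closing tag,
# and \1 requires the closing tag name to equal the opening one.
PATTERN = re.compile(r'<(\w+)>([^<]*)</\1>')

def extract_text(s):
    """Return (tag, text) for the first simple element in s, or None."""
    m = PATTERN.search(s)
    return (m.group(1), m.group(2)) if m else None

print(re.match('>.*<', line))  # None: match() only tries position 0
print(extract_text(line))
```

This still is not an XML parser; it only covers input that really is this regular.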

Also, why parse XML with regex when you can use something like BeautifulSoup? :)

>>> from bs4 import BeautifulSoup as BS
>>> line='<City_State>PLAINSBORO, NJ 08536-1906</City_State>'
>>> soup = BS(line, 'html.parser')  # HTML parsers lowercase tag names
>>> print(soup.find('city_state').text)
PLAINSBORO, NJ 08536-1906

edited Aug 11, 2013 at 4:25

answered Aug 11, 2013 at 4:19


Viktor Kerkez · Answer · 2013-08-11 09:43:50Z

9

Please, just use an XML parser like ElementTree

>>> from xml.etree import ElementTree as ET
>>> line='<City_State>PLAINSBORO, NJ 08536-1906</City_State>'
>>> ET.fromstring(line).text
'PLAINSBORO, NJ 08536-1906'
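
For a big file like the 250 MB one mentioned in the question comments, ET.fromstring (or ET.parse) builds the whole tree in memory. A possible workaround is ET.iterparse, which hands you each element as it is completed so you can discard it right away. This is only a sketch: the helper name iter_tag_text, the Addresses/Row layout and the second sample value are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

def iter_tag_text(source, tag):
    """Yield the text of each <tag> element, discarding elements as we go."""
    for event, elem in ET.iterparse(source, events=('end',)):
        if elem.tag == tag:
            yield elem.text
        # Drop this element's text, attributes and children now that its
        # end tag has been seen, so the tree does not keep growing.
        elem.clear()

# Small in-memory stand-in for the real file (layout and data are assumed).
sample = io.BytesIO(
    b'<Addresses>'
    b'<Row><City_State>PLAINSBORO, NJ 08536-1906</City_State></Row>'
    b'<Row><City_State>EXAMPLETOWN, NJ 00000</City_State></Row>'
    b'</Addresses>'
)

cities = list(iter_tag_text(sample, 'City_State'))
print(cities)
```

With a real file you would pass the filename instead of the BytesIO object. Note that the emptied elements stay attached to the root until parsing ends, so memory still grows a little with very many rows.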

answered Aug 11, 2013 at 9:43


Kyle · Answer · 2013-08-11 04:26:52Z

0

re.match returns a match only if the pattern matches the entire string. To find substrings matching the pattern, use re.search.

And yes, this is a simple way to parse XML, but I would highly encourage you to use a library specifically designed for the task.

answered Aug 11, 2013 at 4:26


2 Comments

Charles Duffy Over a year ago

It would only be "a simple way to parse XML" if it actually did parse XML. Which it doesn't. (See: lack of support for detecting comment or CDATA blocks; for handling character entities; etc etc etc).
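
To make the point above concrete, here is a small comparison on a made-up input containing a character entity and a CDATA section. The regex is the one from the question with a capturing group added; the sample string is an assumption, not data from the asker's file.

```python
import re
import xml.etree.ElementTree as ET

# Made-up input: one entity (&amp;) and one CDATA section
doc = '<City_State>A &amp; B <![CDATA[<NJ>]]></City_State>'

# The regex returns the raw markup between the first '>' and the last '<'
regex_text = re.search(r'>(.*)<', doc).group(1)

# The parser decodes the entity and unwraps the CDATA section
parser_text = ET.fromstring(doc).text

print(regex_text)
print(parser_text)
```

The regex result still contains `&amp;` and the CDATA wrapper, while the parser returns the decoded character data.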

torek Over a year ago

Minor point: re.match is left side anchored but does not have to consume the entire string. Very loosely, given regexp X, re.match is like re.search using ^X (but not ^X$). There are other differences, particularly with strings containing newlines; see documentation link in Haidro's answer.
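
A short sketch of the anchoring behaviour described in the comment above, using a made-up two-line string. re.fullmatch (Python 3.4+) is the function that actually requires the whole string to match.

```python
import re

s = 'abc\nabd'

# match() is anchored at the start but need not consume the whole string
prefix = re.match('ab', s)

# fullmatch() is the one that requires the entire string to match
whole = re.fullmatch('ab', s)

# Without MULTILINE, search('^...') only succeeds at the very start
plain = re.search('^abd', s)

# With MULTILINE, '^' also matches right after a newline, so search()
# finds the second line ...
m_search = re.search('^abd', s, re.MULTILINE)

# ... while match() still only tries position 0, even with MULTILINE
m_match = re.match('abd', s, re.MULTILINE)

print(prefix.group(0), whole, plain, m_search.start(), m_match)
```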
