Python regex issue

Question

I'm trying to extract ALL phone screen resolutions from the WURFL XML file with the below Python script. The problem is that I only get the first match, though. Why? How could I get all matches?

The WURFL XML file can be found at http://sourceforge.net/projects/wurfl/files/WURFL/latest/wurfl-latest.zip/download?use_mirror=freefr

def read_file(file_name):
    f = open(file_name, 'rb')
    data = f.read()
    f.close()
    return data

text = read_file('wurfl.xml')

import re
pattern = '<device id="(.*?)".*actual_device_root="true">.*<capability name="resolution_width" value="(\d+)"/>.*<capability name="resolution_height" value="(\d+)"/>.*</device>'
for m in re.findall(pattern, text, re.DOTALL):
    print(m)

Tim Pietzcker · Accepted Answer · 2011-05-11 12:12:24Z

1

First, use an XML parser instead of regular expressions. You'll be happier in the long run.

Second, if you insist on using regexes, use finditer() instead of findall().

Third, your regex matches from the first entry to the last one (the .* is greedy, and you have set DOTALL mode), so either see the first paragraph or at least change your regex to

pattern = r'<device id="(.*?)".*?actual_device_root="true">.*?<capability name="resolution_width" value="(\d+)"/>.*?<capability name="resolution_height" value="(\d+)"/>.*?</device>'

Also, always use raw strings with regexes. \d happens to work, \b will behave unexpectedly in a "normal" string, though.

edited May 11, 2011 at 12:12

answered May 11, 2011 at 11:29

Tim Pietzcker

337k59 gold badges520 silver badges572 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

AOO Over a year ago

Thanks for your input but I still only get one match

Tim Pietzcker Over a year ago

Oops, I overlooked one greedy quantifier. Please try again with the edited regex.

Community · Accepted Answer · 2017-05-23 12:04:19Z

0

This is an oddness in the behaviour of findall, specifically findall only returns the first matching group from each pattern match. See this question.

edited May 23, 2017 at 12:04

CommunityBot

11 silver badge

answered May 11, 2011 at 11:10

verdesmarald

11.9k2 gold badges46 silver badges60 bronze badges

1 Comment

AOO Over a year ago

Clarification: I get the groups I'm interested in but findall only returns ONE match (the first match) for some reason.

Dan · Accepted Answer · 2011-05-11 11:38:13Z

0

You are using "greedy" matches: .* will match as much text as it can grab, which means the .* before <capabilities> matches most of the file.

text = open('wurfl.xml').read()
pattern = r'<device id="(.*?)".*?actual_device_root="true">.*?<capability name="resolution_width" value="(\d+)"/>.*?<capability name="resolution_height" value="(\d+)"/>.*?</device>'
for m in re.findall(pattern, text, re.DOTALL):
    print m

answered May 11, 2011 at 11:38

Dan

1811 silver badge7 bronze badges

Comments

Rob Cowie · Accepted Answer · 2011-05-11 12:39:48Z

I'm certainly not averse to handling xml with a regexp if the requirements are simple, but perhaps in this case using a real xml parser would be better. Using the stdlib etree module and a sprinkling of (imho) hideous xpaths:

import xml.etree.ElementTree as ET

def capability_value(cap_elem):
    if cap_elem is None:
        return None
    return int(cap_elem.attrib.get('value'))

def devices(wurfl_doc):
    for el in wurfl_doc.findall("/devices/device[@actual_device_root='true']"):
        width = el.find("./group[@id='display']/capability[@name='resolution_width']")
        width = capability_value(width)
        height = el.find("./group[@id='display']/capability[@name='resolution_height']")
        height = capability_value(height)
        device = {
            'id' : el.attrib.get('id'), 
            'resolution' : {'width': width, 'height': height}
        }
        yield device

doc = ET.ElementTree(file='wurfl.xml')
for device in devices(doc):
    print device

Collectives™ on Stack Overflow

Python regex issue

4 Answers 4

2 Comments

1 Comment

Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

2 Comments

1 Comment

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related