
I'm trying to extract ALL phone screen resolutions from the WURFL XML file with the Python script below, but I only get the first match. Why? How can I get all matches?

The WURFL XML file can be found at http://sourceforge.net/projects/wurfl/files/WURFL/latest/wurfl-latest.zip/download?use_mirror=freefr

def read_file(file_name):
    f = open(file_name, 'rb')
    data = f.read()
    f.close()
    return data

text = read_file('wurfl.xml')

import re
pattern = '<device id="(.*?)".*actual_device_root="true">.*<capability name="resolution_width" value="(\d+)"/>.*<capability name="resolution_height" value="(\d+)"/>.*</device>'
for m in re.findall(pattern, text, re.DOTALL):
    print(m)

4 Answers

1

First, use an XML parser instead of regular expressions. You'll be happier in the long run.

Second, if you insist on using regexes, use finditer() instead of findall().

Third, your regex matches from the first entry to the last one (the .* is greedy, and you have set DOTALL mode), so either see the first paragraph or at least change your regex to

pattern = r'<device id="(.*?)".*?actual_device_root="true">.*?<capability name="resolution_width" value="(\d+)"/>.*?<capability name="resolution_height" value="(\d+)"/>.*?</device>'

Also, always use raw strings for regexes. \d happens to work in a "normal" string, but \b does not: there it is a backspace character rather than a word boundary.
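To make the greedy vs. non-greedy difference concrete, here is a minimal sketch. The two-device SAMPLE string is a made-up stand-in for wurfl.xml (the real file has more nesting and attributes), and the names GREEDY and LAZY are only for illustration.

```python
import re

# Hypothetical two-device sample standing in for wurfl.xml.
SAMPLE = '''
<device id="dev_a" actual_device_root="true">
  <capability name="resolution_width" value="240"/>
  <capability name="resolution_height" value="320"/>
</device>
<device id="dev_b" actual_device_root="true">
  <capability name="resolution_width" value="480"/>
  <capability name="resolution_height" value="800"/>
</device>
'''

# The question's pattern: every bare .* is greedy.
GREEDY = (r'<device id="(.*?)".*actual_device_root="true">'
          r'.*<capability name="resolution_width" value="(\d+)"/>'
          r'.*<capability name="resolution_height" value="(\d+)"/>'
          r'.*</device>')

# The same pattern with every quantifier made non-greedy.
LAZY = (r'<device id="(.*?)".*?actual_device_root="true">'
        r'.*?<capability name="resolution_width" value="(\d+)"/>'
        r'.*?<capability name="resolution_height" value="(\d+)"/>'
        r'.*?</device>')

# Greedy: one match that runs from the first <device> to the last </device>.
greedy_matches = re.findall(GREEDY, SAMPLE, re.DOTALL)

# Non-greedy, via finditer: one match per device.
lazy_matches = [m.groups() for m in re.finditer(LAZY, SAMPLE, re.DOTALL)]

print(greedy_matches)  # [('dev_a', '480', '800')]
print(lazy_matches)    # [('dev_a', '240', '320'), ('dev_b', '480', '800')]

# Raw strings: '\b' is a single backspace character, r'\b' is backslash + b.
print(len('\b'), len(r'\b'))  # 1 2
```

Note that the greedy version does not just return a single match: it pairs the first device's id with the last device's resolution.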


2 Comments

Thanks for your input but I still only get one match
Oops, I overlooked one greedy quantifier. Please try again with the edited regex.
0

This is an oddness in the behaviour of findall, specifically findall only returns the first matching group from each pattern match. See this question.

1 Comment

Clarification: I get the groups I'm interested in but findall only returns ONE match (the first match) for some reason.
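For reference, a small sketch of how re.findall treats capture groups, using a made-up input string rather than the WURFL data. With no groups it returns the whole matches, with one group it returns that group, and with several groups it returns a tuple of all groups per match, so the number of groups should not limit the number of matches.

```python
import re

# Hypothetical input, not taken from wurfl.xml.
text = 'w=240 h=320; w=480 h=800'

# No groups: each item is the whole matched string.
no_groups = re.findall(r'w=\d+ h=\d+', text)

# One group: each item is that group's text.
one_group = re.findall(r'w=(\d+) h=\d+', text)

# Several groups: each item is a tuple holding every group.
two_groups = re.findall(r'w=(\d+) h=(\d+)', text)

print(no_groups)   # ['w=240 h=320', 'w=480 h=800']
print(one_group)   # ['240', '480']
print(two_groups)  # [('240', '320'), ('480', '800')]
```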
0

You are using "greedy" matches: .* will match as much text as it can grab, which means the .* before the first <capability ...> matches most of the file. Use the non-greedy form .*? instead:

import re

with open('wurfl.xml') as f:
    text = f.read()
pattern = r'<device id="(.*?)".*?actual_device_root="true">.*?<capability name="resolution_width" value="(\d+)"/>.*?<capability name="resolution_height" value="(\d+)"/>.*?</device>'
for m in re.findall(pattern, text, re.DOTALL):
    print(m)
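One caveat with a single non-greedy pattern: .*? can still run across a </device> boundary when the device it started in lacks actual_device_root="true", so an id may get paired with a later device's resolution. Below is a rough two-step sketch that avoids this by matching one <device> element at a time and then searching inside it. The SAMPLE data and the names DEVICE, WIDTH, HEIGHT and resolutions are assumptions for illustration, and the sketch assumes every device has an explicit </device> closing tag (self-closing <device .../> elements would need extra handling).

```python
import re

# Hypothetical sample: the first device is not an actual device root.
SAMPLE = '''
<device id="generic" fall_back="root">
  <capability name="resolution_width" value="90"/>
  <capability name="resolution_height" value="35"/>
</device>
<device id="dev_a" fall_back="generic" actual_device_root="true">
  <capability name="resolution_width" value="240"/>
  <capability name="resolution_height" value="320"/>
</device>
'''

# The single non-greedy pattern from above.
LAZY = (r'<device id="(.*?)".*?actual_device_root="true">'
        r'.*?<capability name="resolution_width" value="(\d+)"/>'
        r'.*?<capability name="resolution_height" value="(\d+)"/>'
        r'.*?</device>')

# Step 1: one match per <device ...>...</device> element.
DEVICE = re.compile(r'<device id="([^"]*)"([^>]*)>(.*?)</device>', re.DOTALL)
# Step 2: patterns applied only inside a single device body.
WIDTH = re.compile(r'<capability name="resolution_width" value="(\d+)"/>')
HEIGHT = re.compile(r'<capability name="resolution_height" value="(\d+)"/>')

def resolutions(text):
    """Yield (id, width, height) for each actual device root."""
    for dev in DEVICE.finditer(text):
        dev_id, attrs, body = dev.groups()
        if 'actual_device_root="true"' not in attrs:
            continue  # skip devices that are not actual device roots
        w = WIDTH.search(body)
        h = HEIGHT.search(body)
        if w and h:
            yield dev_id, int(w.group(1)), int(h.group(1))

lazy_result = re.findall(LAZY, SAMPLE, re.DOTALL)
two_step_result = list(resolutions(SAMPLE))

print(lazy_result)      # [('generic', '240', '320')]  <- wrong id
print(two_step_result)  # [('dev_a', 240, 320)]
```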


0

I'm certainly not averse to handling XML with a regexp if the requirements are simple, but perhaps in this case using a real XML parser would be better. Using the stdlib etree module and a sprinkling of (imho) hideous XPaths:

import xml.etree.ElementTree as ET

def capability_value(cap_elem):
    if cap_elem is None:
        return None
    return int(cap_elem.attrib.get('value'))

def devices(wurfl_doc):
    for el in wurfl_doc.findall("./devices/device[@actual_device_root='true']"):
        width = el.find("./group[@id='display']/capability[@name='resolution_width']")
        width = capability_value(width)
        height = el.find("./group[@id='display']/capability[@name='resolution_height']")
        height = capability_value(height)
        device = {
            'id' : el.attrib.get('id'), 
            'resolution' : {'width': width, 'height': height}
        }
        yield device

doc = ET.ElementTree(file='wurfl.xml')
for device in devices(doc):
    print(device)
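To try the parser approach without downloading the full file, here is a self-contained sketch that runs the same XPath expressions against a small inline document. The SAMPLE XML is a made-up approximation of the WURFL layout (devices holding groups that hold capabilities), so treat the structure as an assumption and check it against the real file.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the WURFL structure.
SAMPLE = '''<wurfl>
  <devices>
    <device id="generic" fall_back="root">
      <group id="display">
        <capability name="resolution_width" value="90"/>
        <capability name="resolution_height" value="35"/>
      </group>
    </device>
    <device id="dev_a" fall_back="generic" actual_device_root="true">
      <group id="display">
        <capability name="resolution_width" value="240"/>
        <capability name="resolution_height" value="320"/>
      </group>
    </device>
    <device id="dev_b" fall_back="generic" actual_device_root="true"/>
  </devices>
</wurfl>'''

root = ET.fromstring(SAMPLE)
found = []
# Only devices flagged as actual device roots are selected.
for el in root.findall("./devices/device[@actual_device_root='true']"):
    width = el.find("./group[@id='display']/capability[@name='resolution_width']")
    height = el.find("./group[@id='display']/capability[@name='resolution_height']")
    found.append((
        el.get('id'),
        # find() returns None when the capability is absent.
        int(width.get('value')) if width is not None else None,
        int(height.get('value')) if height is not None else None,
    ))

print(found)  # [('dev_a', 240, 320), ('dev_b', None, None)]
```

Devices that do not define a resolution themselves (dev_b here) come back with None; in WURFL those values would presumably be inherited through fall_back, which this sketch does not resolve.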
