Regular Expressions in data parsing Python

Question

I'm relatively new to regular expressions and am amazed at how powerful they are. I have this project and was wondering whether regular expressions would be appropriate and, if so, how to use them.

In this project I am given a file with a bunch of data. Here's a bit of it:

* File "miles.dat" from the Stanford GraphBase (C) 1993 Stanford University
* Revised mileage data for highways in the United States and Canada, 1949
* This file may be freely copied but please do not change it in any way!
* (Checksum parameters 696,295999341)

Youngstown, OH[4110,8065]115436
Yankton, SD[4288,9739]12011
966
Yakima, WA[4660,12051]49826
1513 2410

Each city line has the city name and state, then the latitude and longitude in brackets, then the population. The following line gives the distance from that city to each of the cities listed before it in the data. The data goes on for 180 cities.

My job is to create 4 lists: one for the cities, one for the coordinates, one for population, and one for distances between cities. I know this is possible without regular expressions (I have written it), but the code is clunky and not as efficient as it could be. What do you think would be the best way to approach this?

Oh! So you are amazed at how powerful regexps are and you want someone to write the code for you?
– devnull, Commented Mar 18, 2014 at 2:47
I don't understand the relevance of the "File 'miles.dat'...(Checksum parameters ...)" section.
– aliteralmind, Commented Mar 18, 2014 at 2:53
More like tell me what patterns I can use etc. I don't need to do this. I've already finished my project. I just thought this would be a good place to learn how to use them. Guess you're not the right teacher.
– Shahaed, Commented Mar 18, 2014 at 2:54

Hugh Bothwell · Accepted Answer · 2014-03-18 15:27:16Z

I would recommend a regex for the city lines and a list comprehension for the distances (a regex would be overkill and slower as well).

Something like

import re

CITY_REG = re.compile(r"([^[]+)\[([0-9.]+),([0-9.]+)\](\d+)")
CITY_TYPES = (str, float, float, int)

def get_city(line):
    match = CITY_REG.match(line)
    if match:
        return [type(dat) for dat,type in zip(match.groups(), CITY_TYPES)]
    else:
        raise ValueError("Failed to parse {} as a city".format(line))

def get_distances(line):
    return [int(i) for i in line.split()]

then

>>> get_city("Youngstown, OH[4110.83,8065.14]115436")
['Youngstown, OH', 4110.83, 8065.14, 115436]

>>> get_distances("1513 2410")
[1513, 2410]

and you can use it like

# This code assumes Python 3.x
from itertools import zip_longest

def file_data_lines(fname, comment_start="* "):
    """
    Return lines of data from file
     (strip out blank lines and comment lines)
    """
    with open(fname) as inf:
        for line in inf:
            line = line.rstrip()
            if line and not line.startswith(comment_start):
                yield line

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(fillvalue=fillvalue, *args)

def city_data(fname):
    data = file_data_lines(fname)

    # city 0 has no distances line
    city_line = next(data)
    city, lat, lon, pop = get_city(city_line)
    yield city, (lat, lon), pop, []

    # all remaining cities
    for city_line, dist_line in grouper(data, 2, ''):
        city, lat, lon, pop = get_city(city_line)
        dists = get_distances(dist_line)
        yield city, (lat, lon), pop, dists
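To see how grouper pairs each city line with the distance line that follows it, here is a small self-contained sketch. It uses an in-memory list in place of the file; the name `lines` is made up for this illustration and the sample lines come from the question.

```python
from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    # The same iterator object is repeated n times, so zip_longest
    # pulls n consecutive items into each tuple.
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

# Data lines that remain after the first city has been consumed
lines = [
    "Yankton, SD[4288,9739]12011",
    "966",
    "Yakima, WA[4660,12051]49826",
    "1513 2410",
]

pairs = list(grouper(lines, 2, ""))
for city_line, dist_line in pairs:
    print(city_line, "->", dist_line.split())
```

If the file ended with a city line that had no distances line after it, the fill value `""` would be paired with it, and splitting an empty string gives an empty list of distances.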

and finally

def main():
    # load per-city data
    city_info = list(city_data("miles.dat"))
    # transpose into separate lists
    cities, coords, pops, dists = list(zip(*city_info))

if __name__=="__main__":
    main()
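The transpose step can be tried on its own. This is a minimal sketch with hand-written tuples standing in for the output of city_data (the values are taken from the sample in the question, converted to the types the answer produces):

```python
# Stand-in for list(city_data("miles.dat")), one tuple per city
city_info = [
    ("Youngstown, OH", (4110.0, 8065.0), 115436, []),
    ("Yankton, SD", (4288.0, 9739.0), 12011, [966]),
    ("Yakima, WA", (4660.0, 12051.0), 49826, [1513, 2410]),
]

# zip(*rows) regroups the rows by column: all names, all coords, ...
cities, coords, pops, dists = zip(*city_info)

print(cities)
print(dists)
```

Note that each of the four results is a tuple; if real lists are needed, something like `[list(col) for col in zip(*city_info)]` would do the conversion.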

Edit:

How it works:

CITY_REG = re.compile(r"([^[]+)\[([0-9.]+),([0-9.]+)\](\d+)")

[^[] matches any character except [; so ([^[]+) gets one or more characters up to (but not including) the first [; this gets "City Name, State", and returns it as the first group.

\[ matches a literal [ character; we have to escape it with a backslash to make it clear that we are not starting another character-group.

[0-9.] matches 0, 1, 2, 3, ... 9, or a period character. So ([0-9.]+) gets one or more digits or periods, i.e. any integer or floating-point number without an exponent, and returns it as the second group. This is under-constrained: it would accept something like 0.1.2.3, which is not a valid float. An expression which only matched valid floats would be quite a bit more complicated, though, and this is sufficient for this purpose, assuming we will not run into anomalous input.
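If anomalous input were a concern, a somewhat stricter number pattern could be swapped in. The sketch below is not a full float grammar (no signs, no exponents, no leading-dot forms); it only requires digits with at most one decimal point. The names STRICT_NUM and STRICT_CITY_REG, and the "Nowhere, XX" line, are invented for this illustration.

```python
import re

# Hypothetical stricter alternative to [0-9.]+ : one or more digits,
# optionally followed by a single decimal point and more digits.
STRICT_NUM = r"\d+(?:\.\d+)?"
STRICT_CITY_REG = re.compile(
    r"([^[]+)\[(" + STRICT_NUM + r"),(" + STRICT_NUM + r")\](\d+)"
)

ok = STRICT_CITY_REG.match("Youngstown, OH[4110.83,8065.14]115436")
bad = STRICT_CITY_REG.match("Nowhere, XX[0.1.2.3,8065]100")

print(ok.groups())  # the four captured strings
print(bad)          # None, because "0.1.2.3" is rejected
```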

We get the comma, match another number as group 3, get the closing square-bracket; then \d matches any digit (same as [0-9]), so (\d+) matches one or more digits, ie an integer, and returns it as the fourth group.
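The same breakdown can be written into the pattern itself. As a sketch, here is an equivalent pattern using re.VERBOSE (which ignores whitespace and allows comments inside the pattern) and named groups; the name CITY_REG_VERBOSE and the group names are choices made for this illustration, not part of the code above.

```python
import re

CITY_REG_VERBOSE = re.compile(r"""
    (?P<name>[^[]+)      # everything before the opening bracket: City, State
    \[                   # a literal opening square bracket
    (?P<lat>[0-9.]+)     # first number
    ,                    # the separating comma
    (?P<lon>[0-9.]+)     # second number
    \]                   # a literal closing square bracket
    (?P<pop>\d+)         # population: one or more digits
""", re.VERBOSE)

m = CITY_REG_VERBOSE.match("Yakima, WA[4660,12051]49826")
print(m.group("name"))
print(m.groupdict())
```

Named groups can still be read positionally, so `m.groups()` returns the same four strings as the original pattern would.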

match = CITY_REG.match(line)

We run the regular expression against a line of input; if it matches, we get back a Match object containing the matched data, otherwise we get None.

if match:

... this is a short-form way of saying if bool(match) == True. Objects are truthy by default (unless their class overrides this, as lists and dicts do so that empty ones are falsy), and bool(None) is always False, so this effectively means "if the regular expression successfully matched the string".
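A quick way to see both outcomes is to run the pattern against a city line and a distances line. This sketch repeats the CITY_REG definition so that it runs on its own:

```python
import re

CITY_REG = re.compile(r"([^[]+)\[([0-9.]+),([0-9.]+)\](\d+)")

for line in ["Yankton, SD[4288,9739]12011", "966"]:
    match = CITY_REG.match(line)
    if match:   # a Match object is truthy
        print("city line:", match.group(1))
    else:       # match is None, which is falsy
        print("not a city line:", line)
```

Writing `if match is not None:` would be a more explicit spelling of the same test.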

CITY_TYPES = (str, float, float, int)

Regular expressions only return strings; you want different data types, so we have to convert, which is what

[type(dat) for dat,type in zip(match.groups(), CITY_TYPES)]

does; match.groups() is the four pieces of matched data, and CITY_TYPES is the desired data-type for each, so zip(data, types) returns something like [("Youngstown, OH", str), ("4110.83", float), ("8065.14", float), ("115436", int)]. We then apply the data type to each piece, ending up with ["Youngstown, OH", 4110.83, 8065.14, 115436].
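The conversion step can be tried in isolation. In this sketch the tuple `groups` stands in for match.groups(), and the loop variable is called `convert` rather than `type` only to avoid shadowing the built-in; the behaviour is the same.

```python
groups = ("Youngstown, OH", "4110.83", "8065.14", "115436")
CITY_TYPES = (str, float, float, int)

# Pair each captured string with the callable that should convert it
pairs = list(zip(groups, CITY_TYPES))

# Call each type on its string: str("..."), float("..."), int("...")
converted = [convert(text) for text, convert in pairs]
print(converted)  # ['Youngstown, OH', 4110.83, 8065.14, 115436]
```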

Hope that helps!

Thank you so much! Your work is well beyond what I needed and expected! The rest of your program is easy enough to follow since I am familiar enough with Python. However, the beginning still confuses me slightly. The pattern you passed to re.compile at the beginning is very hard to follow. Could you please explain it? And how does the return statement inside the if statement work, especially the zip()? I've used it once before to make 2 lists into a dict, but this usage is foreign to me. Thanks again! You seem to be an extremely smart programmer.

fotocoder · Accepted Answer · 2014-03-18 03:57:25Z

S = """Youngstown, OH[4110,8065]115436
    Yankton, SD[4288,9739]12011
    966
    Yakima, WA[4660,12051]49826
    1513 2410"""

import re

def get_4_list():
    city_list = []
    coordinate_list = []
    population_list = []
    distance_list = []

    line_list = S.split('\n')
    line_pattern = re.compile(r'(\w+).+(\[[\d,]+\])(\d+)')
    for each_line in line_list:
        match_list = line_pattern.findall(each_line)
        if match_list:
            print(match_list)
            city_list.append(match_list[0][0])
            coordinate_list.append(match_list[0][1])
            population_list.append(match_list[0][2])
        else:
            distance_list.extend(each_line.split())

    return city_list, coordinate_list, population_list, distance_list
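To check what this pattern actually captures, it can be run against single lines from the sample. This is only a sketch for inspection; it repeats the pattern so that it runs on its own.

```python
import re

line_pattern = re.compile(r'(\w+).+(\[[\d,]+\])(\d+)')

city_match = line_pattern.findall("Youngstown, OH[4110,8065]115436")
dist_match = line_pattern.findall("1513 2410")

print(city_match)  # [('Youngstown', '[4110,8065]', '115436')]
print(dist_match)  # [] because a distances line has no brackets
```

Two things are worth noticing: (\w+) stops at the first non-word character, so the state (and any later word of a multi-word city name) is not part of the first group, and all three groups stay strings, with the coordinates kept as one bracketed string.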

Shahaed Over a year ago

Thank you for your response. I just wanted to see what regular expressions would work. You went well beyond that and completed my problem in a quarter of the lines my answer took. My question here is what the pattern inside re.compile means. From what I understand, it means: read a raw string, then a word with more than one character, then any digits within brackets, then another string of digits. Is that right?
