I would recommend a regex for the city lines and a list comprehension for the distances (for the distances, a regex would be overkill and slower as well).
Something like
import re
CITY_REG = re.compile(r"([^[]+)\[([0-9.]+),([0-9.]+)\](\d+)")
CITY_TYPES = (str, float, float, int)
def get_city(line):
    match = CITY_REG.match(line)
    if match:
        return [type(dat) for dat,type in zip(match.groups(), CITY_TYPES)]
    else:
        raise ValueError("Failed to parse {} as a city".format(line))
def get_distances(line):
    return [int(i) for i in line.split()]
then
>>> get_city("Youngstown, OH[4110.83,8065.14]115436")
['Youngstown, OH', 4110.83, 8065.14, 115436]
>>> get_distances("1513 2410")
[1513, 2410]
and you can use it like
# This code assumes Python 3.x
from itertools import zip_longest
def file_data_lines(fname, comment_start="* "):
    """
    Return lines of data from file
    (strip out blank lines and comment lines)
    """
    with open(fname) as inf:
        for line in inf:
            line = line.rstrip()
            if line and not line.startswith(comment_start):
                yield line
def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)
def city_data(fname):
    data = file_data_lines(fname)
    # city 0 has no distances line
    city_line = next(data)
    city, lat, lon, pop = get_city(city_line)
    yield city, (lat, lon), pop, []
    # all remaining cities
    for city_line, dist_line in grouper(data, 2, ''):
        city, lat, lon, pop = get_city(city_line)
        dists = get_distances(dist_line)
        yield city, (lat, lon), pop, dists
and finally
def main():
    # load per-city data
    city_info = list(city_data("miles.dat"))
    # transpose into separate lists
    cities, coords, pops, dists = list(zip(*city_info))
if __name__=="__main__":
    main()
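To see how grouper pairs each city line with the distances line that follows it, here is a small standalone sketch. The sample lines are made up for illustration and are not taken from miles.dat; the first line is pulled off separately, just as city_data does for city 0.

```python
from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # n references to the SAME iterator, so each tuple consumes n items
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

# Hypothetical sample data: city 0 has no distances line
lines = iter([
    "City A[1.0,2.0]100",
    "City B[3.0,4.0]200",
    "10",
    "City C[5.0,6.0]300",
    "20 30",
])

first = next(lines)                  # city 0, handled on its own
pairs = list(grouper(lines, 2, ""))  # (city_line, dist_line) tuples

print(first)
for city_line, dist_line in pairs:
    print(city_line, "->", dist_line)
```

If the file ended with a city line that had no distances line after it, the fill value '' would stand in for it, and get_distances('') would return an empty list.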
Edit:
How it works:
CITY_REG = re.compile(r"([^[]+)\[([0-9.]+),([0-9.]+)\](\d+)")
[^[] matches any character except [; so ([^[]+) gets one or more characters up to (but not including) the first [; this gets "City Name, State", and returns it as the first group.
\[ matches a literal [ character; we have to escape it with a backslash to make it clear that we are not starting another character-group.
[0-9.] matches 0, 1, 2, 3, ... 9, or a period character. So ([0-9.]+) gets one or more digits or periods - i.e. any unsigned integer or floating-point number without an exponent - and returns it as the second group. This is under-constrained - it would accept something like 0.1.2.3, which is not a valid float - but an expression which only matched valid floats would be quite a bit more complicated, and this is sufficient for this purpose, assuming we will not run into anomalous input.
We match the comma, match another number as group 3, and match the closing square bracket; then \d matches any digit (for ASCII input, the same as [0-9]), so (\d+) matches one or more digits, i.e. an integer, and returns it as the fourth group.
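If anomalous input such as 0.1.2.3 is a concern, a somewhat stricter pattern is possible. The following is only a sketch, not part of the code above: STRICT_FLOAT, STRICT_CITY_REG and strict_city_match are hypothetical names, and the float pattern still assumes unsigned numbers with no exponent.

```python
import re

# Hypothetical stricter number pattern: digits, optionally followed by
# a single decimal point and more digits (no sign, no exponent).
STRICT_FLOAT = r"\d+(?:\.\d+)?"

# Same shape as CITY_REG, with the stricter number pattern substituted.
# The inner (?:...) groups are non-capturing, so there are still four groups.
STRICT_CITY_REG = re.compile(
    r"([^\[]+)\[(" + STRICT_FLOAT + r"),(" + STRICT_FLOAT + r")\](\d+)"
)

def strict_city_match(line):
    """Return the four captured strings, or None if the line does not match."""
    match = STRICT_CITY_REG.match(line)
    return match.groups() if match else None

print(strict_city_match("Youngstown, OH[4110.83,8065.14]115436"))
print(strict_city_match("Nowhere[0.1.2.3,8065.14]1"))
```

The first call returns the four strings as before; the second returns None, because 0.1.2.3 no longer fits the number pattern.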
match = CITY_REG.match(line)
We run the regular expression against a line of input; if it matches, we get back a Match object containing the matched data, otherwise we get None.
if match:
... this is shorthand for if bool(match) == True. An instance of a class is truthy by default (unless the class specifically overrides this, as empty lists and dicts do), and bool(None) is always False, so this effectively says "if the regular expression successfully matched the string:".
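A quick way to convince yourself of the truthiness rules is to try them out. This is a small sketch; the pattern and the Plain and Sized classes are made up for the demonstration.

```python
import re

pattern = re.compile(r"\d+")

hit = pattern.match("123abc")   # a Match object
miss = pattern.match("abc")     # None: match() only tries the start of the string

print(bool(hit), bool(miss))

class Plain:
    """No __bool__ or __len__, so instances are truthy by default."""
    pass

class Sized:
    """Defines __len__; an instance with length 0 is falsy, like [] or {}."""
    def __len__(self):
        return 0

print(bool(Plain()), bool(Sized()), bool([]), bool({}))
```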
CITY_TYPES = (str, float, float, int)
Regular expressions only return strings; you want different data types, so we have to convert, which is what
[type(dat) for dat,type in zip(match.groups(), CITY_TYPES)]
does; match.groups() is the four pieces of matched data, and CITY_TYPES is the desired data-type for each, so zip(data, types) returns something like [("Youngstown, OH", str), ("4110.83", float), ("8065.14", float), ("115436", int)]. We then apply the data type to each piece, ending up with ["Youngstown, OH", 4110.83, 8065.14, 115436].
Hope that helps!