I am writing a script to report statistics from a text file in Markdown. The file contains book titles and dates. Each date belongs to the titles that follow, until a new date appears. Here is a sample:
#### 8/23/05
Defining the World (Hitchings)
#### 8/26/05
Lost Japan
#### 9/5/05
The Kite Runner
*The Dark Valley (Brendon)*
#### 9/9/05
Active Liberty
I iterate over lines in the file with a for loop and examine each line to see if it's a date. If it's a date, I set a variable this_date. If it's a title, I make it into a dict with the current value of this_date.
There are two exceptions. The file starts with titles, not a date, so I set an initial value for this_date before the for loop. And halfway through the file there is a region where the dates were lost, so I set a specific date (8/31/10) for those titles.
But in the resulting list of dicts, every title before the lost-data region is given the lost-data date (8/31/10), and every title from that point on is given the date that appears last in the file (12/31/14). What is most confusing: when I print the contents of this_date right before appending the new dict, it contains the correct value on every iteration.
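One possible explanation, offered as a minimal sketch rather than a confirmed diagnosis: if every title dict stores a reference to the same date dict, then updating that dict's keys in place would change the date seen through all of those titles at once, while assigning a brand-new dict would start a second shared group. The names `current` and `entries` below are hypothetical stand-ins, not names from the script.

```python
# Hypothetical stand-in for this_date; illustrative names only.
current = {'month': 8, 'day': 23, 'year': 2005}
entries = []

entries.append({'title': 'Defining the World (Hitchings)', 'date': current})

# Updating a key in place changes the one dict that the entry above also refers to.
current['day'] = 26
entries.append({'title': 'Lost Japan', 'date': current})

# Assigning a new dict rebinds the name; earlier entries keep pointing at the old object.
current = {'month': 9, 'day': 5, 'year': 2005}
entries.append({'title': 'The Kite Runner', 'date': current})

print([e['date']['day'] for e in entries])        # [26, 26, 5]
print(entries[0]['date'] is entries[1]['date'])   # True: one shared object
```

Under this assumption, a print placed just before each append would still show the right date, because the shared dict does hold the right value at that moment; it is only overwritten later.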
I expect this_date to be visible at all levels of the loop. I know I need to break this up into functions, and passing results explicitly between functions will probably fix the issue, but I'd like to know why this approach didn't work. Thank you very much.
import re

result = []

# regex patterns
ddp = re.compile(r'\d+')  # extract digits
mp = re.compile(r'^#+\s*\d+')  # captures hashes, spaces, and the month digits
dp = re.compile(r'/\d+/')  # captures the day between the slashes
yp = re.compile(r'\d+$')  # captures the trailing year digits
sp = re.compile(r'^\*')  # a leading asterisk marks a title as read

# initialize
this_date = {
    'month': 4,
    'day': 30,
    'year': 2005
}
# print('this_date initialized')

for line in text:
    if line == '':
        pass
    else:
        if '#' in line:  # markdown header format - line is a new date
            if 'Reconstructing lost data' in line:  # handle exception
                # titles after this line are given 12/31/14 (the last date in the file) instead of 8/31/10
                # all prior dates are overwritten with 8/31/10
                # but the intent is that titles after this line have date 8/31/10, until the next date
                this_date = {
                    'month': 8,
                    'day': 31,
                    'year': 2010
                }
                # print('set this_date to handle exception')
            else:  # get the date from the header
                month = ddp.search(mp.search(line).group())  # digits only
                day = ddp.search(dp.search(line).group())  # digits only
                year = yp.search(line)
                if month and day and year:
                    # print('setting this_date within header parse')
                    this_date['month'] = int(month.group())
                    this_date['day'] = int(day.group())
                    this_date['year'] = int(year.group()) + 2000
                else:
                    pass
        else:  # line is a title
            x = {
                'date': this_date,
                'read': False
            }
            if sp.match(line):  # starts with asterisk - has been read
                x['read'] = True
                x['title'] = line[1:-3]  # trim leading asterisk, trailing asterisk and spaces
            else:
                x['title'] = line
            # this_date is correct when printed here
            # print('this_date is ' + str(this_date['month']) + '/' + str(this_date['day']) + '/' + str(this_date['year']))
            result.append(x)
            # x has correct date when printed here
            # print(x)

# print("Done; found %d titles." % len(result))
# elements of result have wrong dates (either 8/31/10 or 12/31/14, no other values) when printed here
# print(result[0::20])
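For comparison, here is a hedged sketch of the same loop broken into a function, in which each title gets its own copy of the date. It is not a drop-in replacement: the function name `parse_titles`, its parameters, the single `DATE_HEADER` pattern, and the use of `startswith('#')` in place of `'#' in line` are all assumptions made for illustration, and two-digit years are assumed to mean 20xx as in the script above.

```python
import re

# Assumed header shape: one or more '#', optional spaces, then m/d/yy.
DATE_HEADER = re.compile(r'^#+\s*(\d+)/(\d+)/(\d+)\s*$')


def parse_titles(lines, initial_date, lost_data_date):
    """Return one dict per title, each holding its own copy of the date."""
    this_date = dict(initial_date)
    result = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith('#'):
            if 'Reconstructing lost data' in line:
                this_date = dict(lost_data_date)
                continue
            m = DATE_HEADER.match(line)
            if m:
                month, day, year = (int(g) for g in m.groups())
                # Build a new dict per header instead of updating keys in place.
                this_date = {'month': month, 'day': day, 'year': year + 2000}
            continue
        read = line.startswith('*')
        title = line.strip('* ') if read else line
        # dict(this_date) stores a shallow copy, not a shared reference.
        result.append({'title': title, 'date': dict(this_date), 'read': read})
    return result


sample = [
    '#### 8/23/05',
    'Defining the World (Hitchings)',
    '#### 8/26/05',
    'Lost Japan',
    '#### 9/5/05',
    'The Kite Runner',
    '*The Dark Valley (Brendon)*',
]
books = parse_titles(
    sample,
    initial_date={'month': 4, 'day': 30, 'year': 2005},
    lost_data_date={'month': 8, 'day': 31, 'year': 2010},
)
for book in books:
    print(book)
```

Either change on its own (a new dict per header, or a copy at append time) should be enough if shared references are indeed the cause; the sketch does both only to make the intent explicit.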