I have a text file in the following format:

1. AUTHOR1

(blank line, with a carriage return)

Citation1

2. AUTHOR2

(blank line, with a carriage return)

Citation2

(...)

That is, some lines in this file begin with an integer, followed by a dot, a space, and text giving an author's name. Each of these lines is followed by a blank line (containing only a carriage return) and then by a line of text beginning with an alphabetic character (an article or book citation).

What I want is to read this file into a Python list, joining each author's name with its citation, so that the list has the form:

['AUTHOR1 Citation1', 'AUTHOR2 Citation2', '...']

It looks like a simple programming problem, but I could not figure out a solution to it. What I attempted was as follows:

articles = []
with open("sample.txt", "rb") as infile:
    while True:
        text = infile.readline()
        if not text: break
        authors = ""
        citation = ""
        if text == '\n': continue
        if text[0].isdigit():
           authors = text.strip('\n')
        else:
           citation = text.strip('\n')
        articles.append(authors+' '+citation)

but the articles list gets authors and citations stored as separate elements!

Thanks in advance for any help in solving this vexing problem... :-(

6 Answers

2

Assuming your input file structure:

"""
1. AUTHOR1

Citation1
2. AUTHOR2

Citation2
"""

is not going to change, I would use readlines() and slicing:

with open('sample.txt', 'r') as infile:
    lines = infile.readlines()

# Remove blank lines so that authors and citations strictly alternate
lines = [line.strip() for line in lines if line.strip()]
# Authors sit at even indices; drop the leading "N." prefix
auth = [line.split('.', 1)[-1].strip() for line in lines[0::2]]
# Citations sit at odd indices
cita = lines[1::2]
result = ['%s %s' % (a, c) for a, c in zip(auth, cita)]
print(result)

# ['AUTHOR1 Citation1', 'AUTHOR2 Citation2']

1

The problem is that each loop iteration reads only one line, so it holds either an author or a citation, never both. Because both variables are reset to empty strings at the top of the loop, every append stores just one of the two.

One way to fix this is to read both the author and the citation in each loop iteration.
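
As a minimal sketch of that idea (not the asker's code; the function name read_articles and the in-memory sample list are assumptions made for illustration), one iterator over the non-blank lines can be consumed two items at a time, so each step sees an author line and its citation together:

```python
def read_articles(lines):
    """Join each author line with the citation line that follows it."""
    # Iterator over non-blank lines only, with surrounding whitespace removed
    nonblank = (line.strip() for line in lines if line.strip())
    articles = []
    # zip(nonblank, nonblank) pulls two items from the same iterator per step:
    # first the author line, then its citation
    for author_line, citation in zip(nonblank, nonblank):
        # Drop the leading "N." numbering; keep everything after the first space
        author = author_line.split(None, 1)[-1]
        articles.append(author + ' ' + citation)
    return articles

# Hypothetical stand-in for the lines of sample.txt
sample = ["1. AUTHOR1\n", "\n", "Citation1\n", "2. AUTHOR2\n", "\n", "Citation2\n"]
print(read_articles(sample))
```

An open file object can be passed in place of the list, since both yield lines when iterated.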


1

This should work:

articles = []
with open("sample.txt") as infile:
    for raw_line in infile:
        line = raw_line.strip()
        if not line:
            continue
        if line[0].isdigit():
            author = line.split(None, 1)[-1]
        else:
            articles.append('{} {}'.format(author, line))


1

Solution processing a full entry in each loop iteration:

citations = []
with open('sample.txt') as file:
    for author in file:                  # Reads an author line
        next(file)                       # Reads and ignores the empty line
        citation = next(file).strip()    # Reads the citation line
        author = author.strip().split(' ', 1)[1]
        citations.append(author + ' ' + citation)
print(citations)

Solution first reading all lines and then going through them:

citations = []
with open('sample.txt') as file:
    lines = list(map(str.strip, file))
    for author, citation in zip(lines[::3], lines[2::3]):
        author = author.split(' ', 1)[1]
        citations.append(author + ' ' + citation)
print(citations)


1

The solutions based on slicing are pretty neat, but if there's just one blank line out of place, it throws the whole thing off. Here's a solution using regex which should work even if there's a variation in the structure:

import re

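# An author line ("N. NAME"), any number of empty lines, then a citation line;
# the groups capture the author name without its numbering and the citation
pattern = re.compile(r'^\d+\.[ \t]*(.*)$\n*^(\w.*)$', re.MULTILINE)
with open("sample.txt") as infile:
    text = infile.read()  # findall() needs a single string, not a list of lines
matches = pattern.findall(text)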
formatted_output = [author + ' ' + citation for author, citation in matches]
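
To illustrate the tolerance to misplaced blank lines, here is a small sketch that runs on an in-memory string instead of sample.txt. The pattern is a variant chosen for this sketch (an assumption, not necessarily the exact one above): it accepts multi-digit numbering and captures the author name without the leading number.

```python
import re

# Hypothetical pattern for this sketch: an author line ("N. NAME"), any number
# of empty lines, then a citation line starting with a word character
pattern = re.compile(r'^\d+\.[ \t]*(.*)$\n*^(\w.*)$', re.MULTILINE)

# Sample text with an irregular number of empty lines between entries
text = "1. AUTHOR1\n\nCitation1\n\n\n2. AUTHOR2\nCitation2\n"

formatted_output = [author + ' ' + citation
                    for author, citation in pattern.findall(text)]
print(formatted_output)
```

Note that this only absorbs completely empty lines; a "blank" line containing spaces would need something like \s* in place of \n*.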


1

You can use readline to skip empty lines. Here's your loop body:

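author_line = infile.readline()
if not author_line:
    break                                      # end of file
author = author_line.strip().split(' ', 1)[1]  # drop the leading "N."
infile.readline()                              # skip the blank line
citation = infile.readline().strip()
articles.append("{} {}".format(author, citation))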
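
For context, a complete loop around such a body might look like the sketch below. The function name read_with_readline and the io.StringIO stand-in for the opened file are assumptions for illustration, and the end-of-file check is added so the loop terminates cleanly.

```python
import io

def read_with_readline(infile):
    articles = []
    while True:
        author_line = infile.readline()
        if not author_line:          # readline() returns '' at end of file
            break
        if not author_line.strip():  # tolerate stray blank lines between entries
            continue
        # Keep everything after the first space, i.e. drop the "N." numbering
        author = author_line.strip().split(' ', 1)[1]
        infile.readline()            # skip the blank line after the author
        citation = infile.readline().strip()
        articles.append("{} {}".format(author, citation))
    return articles

# Hypothetical in-memory stand-in for open("sample.txt")
sample = io.StringIO("1. AUTHOR1\n\nCitation1\n2. AUTHOR2\n\nCitation2\n")
result = read_with_readline(sample)
print(result)
```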
