Parse a plain text file into a CSV file using Python

Question

I have a series of HTML files that are parsed into a single text file using Beautiful Soup. The HTML files are formatted such that their output is always three lines within the text file, so the output will look something like:

Hello!
How are you?
Well, Bye!

But it could just as easily be

83957
And I ain't coming back!
hgu39hgd

In other words, the contents of the HTML files are not really standard across each of them, but they do always produce three lines.

So, I was wondering where I should start if I want to then take the text file that is produced from Beautiful Soup and parse that into a CSV file with columns such as (using the above examples):

Title   Intro   Tagline
Hello!    How are you?    Well, Bye!
83957    And I ain't coming back!    hgu39hgd

The Python code for extracting the text from the HTML files is this:

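import os
import glob
import codecs
import csv
from bs4 import BeautifulSoup

path = "c:\\users\\me\\downloads\\"

# Append the text of every HTML file in the folder to extracted.txt.
for infile in glob.glob(os.path.join(path, "*.html")):
    with codecs.open(infile, "r", "utf-8") as html_file:
        # Name the parser explicitly so the result does not depend on what is installed.
        soup = BeautifulSoup(html_file.read(), "html.parser")
    with codecs.open("extracted.txt", "a", "utf-8") as myfile:
        myfile.write(soup.get_text())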

And I gather I can use this to set up the columns in my CSV file:

csv.put_HasColumnNames(True)

csv.SetColumnName(0,"title")
csv.SetColumnName(1,"intro")
csv.SetColumnName(2,"tagline")

Where I'm drawing a blank is how to iterate through the text file (extracted.txt) one line at a time and, as I get to each new line, put it in the correct cell of the CSV file. The first several lines of the file are blank, and there are many blank lines between each grouping of text. So, first I would need to open the file and read it:

file = open("extracted.txt")

for line in file:
    pass  # csv.SetCell(0, 0, X) (obviously, I don't know what to put in X)

Also, I don't know how to tell Python to just keep reading the file and adding to the CSV file until it's finished. In other words, there's no way to know exactly how many total lines will be in the HTML files, so I can't just go from csv.SetCell(0,0) to csv.SetCell(999,999).

I'm not sure I understand what you're trying to do. Are you trying to read the extracted.txt file, ignore empty lines, and place each group of three lines into a single row in a CSV file?
– icktoofay, Commented Apr 27, 2013 at 4:58
Ah, almost. I'm trying to read the first of three lines and set it to "title", the second of three lines and set it to "intro", and the third of three lines and set it to "tagline", then skip the whitespace until I get to the next three lines and do it again.
– user1183556, Commented Apr 27, 2013 at 5:01
Also, there is whitespace between the very first "title" and the top of the file.
– user1183556, Commented Apr 27, 2013 at 5:01
I'm thinking I need to use fileIN = open(sys.argv[1], "r") and line = fileIN.readline(). But I can't figure out how to skip the whitespace, or what to do with the text once I get it.
– user1183556, Commented Apr 27, 2013 at 5:04

icktoofay · Accepted Answer · 2013-04-27 05:57:44Z

6

I'm not entirely sure what CSV library you're using, but it doesn't look like Python's built-in one. Anyway, here's how I'd do it:

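import csv
import itertools

with open('extracted.txt', 'r') as in_file:
    # Strip surrounding whitespace, then drop lines that are now empty.
    stripped = (line.strip() for line in in_file)
    lines = (line for line in stripped if line)
    # Zipping one iterator with itself three times yields groups of three lines.
    # itertools.izip is Python 2 only; on Python 3, use the built-in zip instead.
    grouped = itertools.izip(*[lines] * 3)
    with open('extracted.csv', 'w') as out_file:
        writer = csv.writer(out_file)
        writer.writerow(('title', 'intro', 'tagline'))
        writer.writerows(grouped)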

This sort of makes a pipeline. It first reads lines from the file, then strips leading and trailing whitespace from each line, then drops any lines that are now empty, then collects the remaining lines into groups of three, and then (after writing the CSV header) writes those groups to the CSV file as rows.
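On Python 3, where itertools.izip no longer exists, the same pipeline can be sketched with the built-in zip. This is only an illustration under a few assumptions: the function name text_to_csv and its parameters are made up for this example, and the files are assumed to be UTF-8.

```python
import csv


def text_to_csv(in_path, out_path, header=("title", "intro", "tagline")):
    """Write each run of three non-empty lines in in_path as one CSV row."""
    with open(in_path, "r", encoding="utf-8") as in_file:
        stripped = (line.strip() for line in in_file)
        lines = (line for line in stripped if line)
        # Zipping the same iterator with itself three times yields consecutive
        # triples; a trailing group of fewer than three lines is dropped.
        grouped = zip(*[lines] * 3)
        # newline="" lets the csv module control line endings itself.
        with open(out_path, "w", newline="", encoding="utf-8") as out_file:
            writer = csv.writer(out_file)
            writer.writerow(header)
            writer.writerows(grouped)
```

Calling text_to_csv("extracted.txt", "extracted.csv") would then mirror the snippet above. Values that contain commas, such as "Well, Bye!", are quoted by csv.writer automatically.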

To combine the last two columns as you mentioned in the comments, you could change the writerow call so that it writes a two-column header, and change the writerows call to:

writer.writerows((title, intro + tagline) for title, intro, tagline in grouped)
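Note that intro + tagline joins the two strings with nothing between them. If a separator is wanted, a small generator along these lines might help. The name merge_last_two and the default single-space separator are assumptions made for this sketch, not part of the answer.

```python
def merge_last_two(rows, sep=" "):
    """Yield (title, intro + sep + tagline) for each three-item row."""
    for title, intro, tagline in rows:
        yield (title, intro + sep + tagline)
```

It would be used as writer.writerows(merge_last_two(grouped)), in place of the generator expression above.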

edited Apr 27, 2013 at 5:57

answered Apr 27, 2013 at 5:09

icktoofay


9 Comments

Oscar Mederos Over a year ago

In my opinion, I think the generator is more clear (as you had before the edit).

icktoofay Over a year ago

@OscarMederos: It had a bug: it didn't strip the newlines off before grouping. Nevertheless, I guess I can rewrite it with generator comprehensions again.

user1183556 Over a year ago

@icktoofay I've never heard of itertools, thanks for pointing me that way. When I run this, I get the error: File "csvify.py", line 5, in <module> lines = itertools.ifilter(bool, itertools.imap(str.strip, in_file)) AttributeError: 'module' object has no attribute 'ifilter'

icktoofay Over a year ago

@ZacBrown: That's kind of odd. itertools.ifilter has no "New in version X" thing, so that would make me believe that it existed when itertools was introduced in version 2.3, but obviously it imported successfully, so I don't really know what's going on there. Anyway, you might want to try my edited version which uses generator comprehensions for that part instead, although it still uses itertools.izip.

user1183556 Over a year ago

I've been having some other issues today with Python. I'm on version 3.3.1 on Windows 7 running in a VM on a Mac. I'll try it out with the version of Python running in OSX and see how it works.


Oscar Mederos · Accepted Answer · 2013-04-27 05:19:05Z

2

Perhaps I didn't understand you correctly, but you can do:

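file = open("extracted.txt")

# Strip each line and keep only the non-empty ones, so that .strip()
# does not have to be called again later.
lines = [line.strip() for line in file if line.strip()]

# `csv` here is the asker's CSV object (the one with SetCell), not Python's csv module.
# Assuming SetCell takes (row, column, value): every three lines form one row.
for i, line in enumerate(lines):
    csv.SetCell(i // 3, i % 3, line)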
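The index arithmetic behind this loop (line i belongs to row i // 3, column i % 3) can also be expressed with list slicing and Python's standard csv module. The following is a rough sketch. The helper names chunk_rows and write_rows are invented for illustration, and the header names are taken from the question.

```python
import csv


def chunk_rows(lines, size=3):
    """Split a list of values into consecutive rows of `size` items."""
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def write_rows(lines, out_path):
    """Write a header plus the chunked lines to out_path as CSV."""
    with open(out_path, "w", newline="", encoding="utf-8") as out_file:
        writer = csv.writer(out_file)
        writer.writerow(["title", "intro", "tagline"])
        writer.writerows(chunk_rows(lines))
```

Unlike grouping with zip, slicing keeps a trailing incomplete group as a shorter row instead of dropping it, which may or may not be what is wanted.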

edited Apr 27, 2013 at 5:19

answered Apr 27, 2013 at 4:59

Oscar Mederos


2 Comments

user1183556 Over a year ago

This was pretty close, but @icktoofay got it. Still, thanks for your help!

Oscar Mederos Over a year ago

@ZacBrown What do you mean by pretty close? Did you try it? I just tried to keep it as similar as what you were trying (using csv.SetCell, etc). By the way, I upvoted his answer ;)
