
I have a large CSV file with one column and line breaks in some of its rows. I want to read the content of each cell and write it to a text file, but the CSV reader is splitting the cells that contain line breaks into multiple rows and writing each part to a separate text file.

Using Python 3.6.2 on macOS Sierra.

Here is an example:

"content of row 1"
"content of row 2 
 continues here"
"content of row 3"

And here is how I am reading it:

import csv

with open(csvFileName, 'r') as csvfile:
    lines = csv.reader(csvfile)

    i = 0
    for row in lines:
        i += 1
        content = row[0]  # single-column file, so take the first (only) field

        outFile = open("output" + str(i) + ".txt", 'w')
        outFile.write(content)
        outFile.close()

This is creating 4 files instead of 3, one per row. Any suggestions on how to keep the line break in the second cell from splitting it into two rows?

  • That source CSV doesn't seem properly formatted as a CSV. Try using an editor like Microsoft Excel or Google Sheets. They'll output the CSV correctly, with cells containing special characters wrapped in quotation marks. See stackoverflow.com/questions/566052 Commented Sep 5, 2017 at 18:42
  • Is the row delimiter literally "row #"? How can you tell when something is not a new row? Commented Sep 5, 2017 at 18:42
  • Strip the row and check whether it is equal to the empty string before creating files, like this: content = row.strip() Commented Sep 5, 2017 at 18:43
  • It still works fine here (Python 3.4, Windows); sorry, I cannot reproduce it. Maybe it's an issue with invisible characters. Can you open the file in a hex editor? Try the input you posted (in a new file) to convince yourself that your original input file has a problem, because that input works fine, as I said. Commented Sep 5, 2017 at 19:06
  • Macs have a strange way of terminating lines. Compare your simple file and your big file in a hex editor, or make small extracts of your big file, and check whether lines end with 0D, 0D 0A, or 0A (a quick way to check is sketched after these comments). That's all the help I can offer, sorry. Commented Sep 5, 2017 at 19:30
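Following up on that last comment, here is a minimal sketch (not from the thread) for counting the raw line terminators in a file; the file name big_file.csv is just a placeholder:

# Hypothetical helper: count raw line terminators without any newline translation.
def count_line_endings(path):
    with open(path, 'rb') as f:          # binary mode, so \r and \n reach us untouched
        data = f.read()
    crlf = data.count(b'\r\n')
    cr_only = data.count(b'\r') - crlf   # \r not followed by \n (classic Mac style)
    lf_only = data.count(b'\n') - crlf   # \n not preceded by \r (Unix style)
    return {'CRLF (0D 0A)': crlf, 'CR (0D)': cr_only, 'LF (0A)': lf_only}

print(count_line_endings('big_file.csv'))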

2 Answers


You could define a regular expression pattern to help you iterate over the rows.

Read the entire file contents - if possible.

s = '''"content of row 1"
"content of row 2 
 continues here"
"content of row 3"'''

Pattern: a double-quote, followed by anything that isn't a double-quote, followed by a double-quote:

import re

row_pattern = '''"[^"]*"'''
row = re.compile(row_pattern, flags=re.DOTALL | re.MULTILINE)

Iterate the rows:

for r in row.finditer(s):
    print(r.group())
    print('******')

>>> 
"content of row 1"
******
"content of row 2 
 continues here"
******
"content of row 3"
******
>>>
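Applied to the question's setup, a sketch along these lines (the quote-stripping step and the reuse of csvFileName from the question are my assumptions, not part of the original answer) could write each matched row to its own file:

import re

row = re.compile(r'"[^"]*"', flags=re.DOTALL)

with open(csvFileName, 'r') as f:        # csvFileName as defined in the question
    s = f.read()                         # read the whole file, if it fits in memory

for i, match in enumerate(row.finditer(s), start=1):
    content = match.group().strip('"')   # drop the surrounding double quotes
    with open("output" + str(i) + ".txt", 'w') as outFile:
        outFile.write(content)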


The file you describe is NOT a CSV (comma-separated values) file. A CSV file is a list of records, one per line, where the fields within each record are separated by commas. There are various "flavors" of CSV which support various features for quoting fields (in case fields have embedded commas in them, for example).

I think your best bet would be to create an adapter class/instance which would pre-process the raw file, find and merge the continuation lines into records, and then pass those to your instance of csv.reader. You could model your class after StringIO from the Python standard library.

The point is that you create something which processes data but behaves enough like a file object that it can be used, transparently, as the input source for something like csv.reader().

(Done properly, you can even implement the Python context manager protocol. io.StringIO supports this protocol and could be used as a reference. This would allow you to use instances of this hypothetical "line merging" adapter class in a Python with statement, just as you're doing with your open file object in your example code.)

from io import StringIO
import csv
data = u'1,"a,b",2\n2,ab,2.1\n'
with StringIO(data) as infile:
    reader = csv.reader(infile, quotechar='"')
    for rec in reader:
        print(rec[0], rec[2], rec[1])

That's just a simple example of using io.StringIO in a with statement. Note that io.StringIO requires Unicode data, while io.BytesIO requires "bytes" or string data (at least in 2.7.x). Your adapter class can do whatever you like.
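For what it's worth, a rough sketch of such a line-merging adapter (my own illustration, with an assumed class name and a quote-balancing rule as the merge criterion; csvFileName is the variable from the question) might look like this:

import csv

class QuoteMergingReader:
    """Hypothetical adapter: buffers physical lines until the double quotes
    in the buffer are balanced, then yields the buffer as one logical record."""

    def __init__(self, fileobj):
        self.fileobj = fileobj

    def __iter__(self):
        buffer = ''
        for line in self.fileobj:
            buffer += line
            if buffer.count('"') % 2 == 0:    # quotes balanced: record complete
                yield buffer
                buffer = ''
        if buffer:                            # any trailing partial record
            yield buffer

    # Context-manager support, as mentioned above.
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.fileobj.close()
        return False

# csv.reader accepts any iterable of strings, so the adapter can stand in
# for the open file object from the question.
with QuoteMergingReader(open(csvFileName, 'r')) as merged:
    for i, rec in enumerate(csv.reader(merged), start=1):
        print(i, rec[0])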

2 Comments

  • Yes, it is a valid CSV file; check the spec at section 2.6 of tools.ietf.org/html/rfc4180
  • It is very much valid CSV. A CSV file can contain CR/LF characters inside a field (column), but the field needs to be quoted in that case. As Wikipedia puts it: fields containing a line break, double quote, or commas should be quoted. (If they are not, the file will likely be impossible to process correctly.)
