
I have a large CSV file with one column and line breaks in some of its rows. I want to read the content of each cell and write it to a text file, but the CSV reader is splitting the cells that contain line breaks into multiple rows and writing each part to a separate text file.

Using Python 3.6.2 on macOS Sierra.

Here is an example:

"content of row 1"
"content of row 2 
 continues here"
"content of row 3"

And here is how I am reading it:

import csv

with open(csvFileName, 'r') as csvfile:
    lines = csv.reader(csvfile)

    i = 0
    for row in lines:
        i += 1
        content = row[0]  # single-column file, so take the first (only) field

        outFile = open("output" + str(i) + ".txt", 'w')
        outFile.write(content)
        outFile.close()

This is creating 4 files instead of 3, one per row. Any suggestions on how to keep the line break in the second cell from splitting it into two rows?

  • That source CSV doesn't seem properly formatted as a CSV. Try using an editor like Microsoft Excel or Google Sheets. They'll output the CSV correctly, with cells containing special characters wrapped in quotation marks. See stackoverflow.com/questions/566052 Commented Sep 5, 2017 at 18:42
  • Is the row delimiter literally "row #"? How can you tell when something is not a new row? Commented Sep 5, 2017 at 18:42
  • Strip the row and check whether it is equal to the empty string before creating files, like this: content = row.strip() Commented Sep 5, 2017 at 18:43
  • It still works fine here (Python 3.4, Windows); sorry, I cannot reproduce it. Maybe it's an issue with invisible characters. Can you open the file in a hex editor? Try the input you posted (in a new file) to convince yourself that your original input file has a problem, because that input works fine, as I said. Commented Sep 5, 2017 at 19:06
  • Macs have a strange way of terminating lines. Compare your simple file and your big file in a hex editor, or make small extracts of your big file, and check whether lines end with 0D, 0D 0A, or 0A (a quick way to check is sketched after these comments). That's all the help I can offer, sorry. Commented Sep 5, 2017 at 19:30
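Following up on that last comment, here is a minimal sketch (not from the thread) for counting the raw line terminators in a file; the file name big_file.csv is just a placeholder:

# Hypothetical helper: count raw line terminators without any newline translation.
def count_line_endings(path):
    with open(path, 'rb') as f:          # binary mode, so \r and \n reach us untouched
        data = f.read()
    crlf = data.count(b'\r\n')
    cr_only = data.count(b'\r') - crlf   # \r not followed by \n (classic Mac style)
    lf_only = data.count(b'\n') - crlf   # \n not preceded by \r (Unix style)
    return {'CRLF (0D 0A)': crlf, 'CR (0D)': cr_only, 'LF (0A)': lf_only}

print(count_line_endings('big_file.csv'))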

2 Answers


You could define a regular expression pattern to help you iterate over the rows.

Read the entire file contents - if possible.

s = '''"content of row 1"
"content of row 2 
 continues here"
"content of row 3"'''

Pattern: a double-quote, followed by anything that isn't a double-quote, followed by a double-quote:

import re

row_pattern = '''"[^"]*"'''
row = re.compile(row_pattern, flags=re.DOTALL | re.MULTILINE)

Iterate the rows:

for r in row.finditer(s):
    print(r.group())
    print('******')

>>> 
"content of row 1"
******
"content of row 2 
 continues here"
******
"content of row 3"
******
>>>
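Applied to the question's setup, a sketch along these lines (the quote-stripping step and the reuse of csvFileName from the question are my assumptions, not part of the original answer) could write each matched row to its own file:

import re

row = re.compile(r'"[^"]*"', flags=re.DOTALL)

with open(csvFileName, 'r') as f:        # csvFileName as defined in the question
    s = f.read()                         # read the whole file, if it fits in memory

for i, match in enumerate(row.finditer(s), start=1):
    content = match.group().strip('"')   # drop the surrounding double quotes
    with open("output" + str(i) + ".txt", 'w') as outFile:
        outFile.write(content)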


The file you describe is NOT a CSV (comma-separated values) file. A CSV file is a list of records, one per line, where the fields within each record are separated by commas. There are various "flavors" of CSV which support various features for quoting fields (in case fields have embedded commas in them, for example).

I think your best bet would be to create an adapter class/instance which would pre-process the raw file, find and merge the continuation lines into records, and then pass those to your instance of csv.reader. You could model your class after StringIO from the Python standard library.

The point is that you create something which processes data but behaves enough like a file object that it can be used, transparently, as the input source for something like csv.reader().

(Done properly, you can even implement the Python context manager protocol. io.StringIO supports this protocol and could be used as a reference. This would allow you to use instances of this hypothetical "line merging" adapter class in a Python with statement, just as you're doing with your open file object in your example code.)

from io import StringIO
import csv
data = u'1,"a,b",2\n2,ab,2.1\n'
with StringIO(data) as infile:
    reader = csv.reader(infile, quotechar='"')
    for rec in reader:
        print(rec[0], rec[2], rec[1])

That's just a simple example of using io.StringIO in a with statement. Note that io.StringIO requires Unicode data, while io.BytesIO requires "bytes" or string data (at least in 2.7.x). Your adapter class can do whatever you like.
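For what it's worth, a rough sketch of such a line-merging adapter (my own illustration, with an assumed class name and a quote-balancing rule as the merge criterion; csvFileName is the variable from the question) might look like this:

import csv

class QuoteMergingReader:
    """Hypothetical adapter: buffers physical lines until the double quotes
    in the buffer are balanced, then yields the buffer as one logical record."""

    def __init__(self, fileobj):
        self.fileobj = fileobj

    def __iter__(self):
        buffer = ''
        for line in self.fileobj:
            buffer += line
            if buffer.count('"') % 2 == 0:    # quotes balanced: record complete
                yield buffer
                buffer = ''
        if buffer:                            # any trailing partial record
            yield buffer

    # Context-manager support, as mentioned above.
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.fileobj.close()
        return False

# csv.reader accepts any iterable of strings, so the adapter can stand in
# for the open file object from the question.
with QuoteMergingReader(open(csvFileName, 'r')) as merged:
    for i, rec in enumerate(csv.reader(merged), start=1):
        print(i, rec[0])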

2 Comments

  • Yes, it is a valid CSV file; check the spec at section 2.6 of tools.ietf.org/html/rfc4180
  • It is very much valid CSV. A CSV file can contain CR/LF characters inside a field (column), but the field needs to be quoted in that case. As Wikipedia puts it: fields containing a line break, double quote, or commas should be quoted. (If they are not, the file will likely be impossible to process correctly.)
