7

I'm getting CSV formatted data piped from an external source, and the data should not be written to a file, that would open a number of different maintenance tasks I would like to avoid. I am getting the data as a string.

So now I want to interpret the data as CSV. The python module csv is suited for that, so I am using it.

However, constructing a CSV parser using csv.reader (or dictreader) requires the data to support an iterator. A list or a reader would already do that. However, my data contains line breaks (\n) inside quotes strings (which is totally fine for the CSV format), and also line breaks as record end indicators/line terminators.

If I construct it like this:

csvreader = csv.reader(data.split('\n'))

then I have already broken the valid CSV formatting by interpreting the line breaks as record end indicators, even if they were inside quoted strings. For me, the interpretation of the line breaks should be done by the CSV parser, as it is aware of quotes fields containing CSV relevant formatting characters like line breaks.

Do I have a misunderstanding with the way the csv package parser works? Or how would I parse a CSV type string using the CSV parser while the data contains line breaks that do not indicate the end of a record?

Edit 1: It might not have been wise to add context to my question, as the context seems to have overtaken my actual question/problem. To clarify, my problem is not parsing CSV from memory. I have found StringIO already. When using StringIO, all that does is use the whole string/file as the first record.

Edit 2: Maybe sample data will help. My data looks like this:

"value1";"value2";"value3
stillvalue3"
"value4";"value5";"value6"

So inside the 3rd entry in the first record I have a line break character, which should be interpreted as part of the third entry in the first record.

Edit 3: The way I currently use the parser is this:

csvreader = csv.reader(StringIO(result), quotechar='"', delimiter=';', lineterminator='\n', escapechar='"')

because the spec says that a double quote (") inside a field is escaped by a quote char (""), so the data might look like this:

"value1";"value2";"value3
stillvalue3
a ""quote"" inside the quote"
"value4";"value5";"value6"

It seems that using these settings the StringIO input always results in one record with all the data from the string inside it. Removing the

escapechar='"'

fixes this problem, however, I cannot support escaped double quote characters inside the quotes. So the nature of my question changed. I'll keep this open if anybody has an idea and will accept the answer of Martin Evans later.

3
  • see stackoverflow.com/questions/17767405/… Commented Mar 9, 2021 at 15:17
  • The easiest way is to use StringIO (now from io import StringIO) as shown here: stackoverflow.com/a/18724978/3001761 Commented Mar 9, 2021 at 15:20
  • 1
    The problem is that (apart from the escapechar) a csv.reader reads correctly the only data you have shown. So we have to guess what the problem can be. You have to show some data exhibiting the problem, with what you get and what you want. Commented Mar 10, 2021 at 8:20

1 Answer 1

8

You can use Python's StringIO() to give your data string a stream interface.

Do not attempt to use data.split('\n') as this will fail to split the lines correctly, the csv.reader() will correctly read the lines in, even if cells contain newlines:

from io import StringIO
import csv

data = '''"value1";"value2";"value3
stillvalue3"
"value4";"value5";"value6"'''

csv_input = csv.reader(StringIO(data, newline=''), delimiter=';')
rows = list(csv_input)
print("Rows", rows)    

Giving you:

Rows [['value1', 'value2', 'value3\nstillvalue3'], ['value4', 'value5', 'value6']]

This could then be loaded into Pandas, or directly using StringIO():

columns = ['col1', 'col2', 'col3']

df = pd.DataFrame(rows, columns=columns)

df = pd.read_csv(StringIO(data, newline=''), names=columns, sep=';')

print(df)

Giving:

     col1    col2                 col3
0  value1  value2  value3\nstillvalue3
1  value4  value5               value6

Your second example gives:

Rows [['value1', 'value2', 'value3\nstillvalue3\na "quote" inside the quote'], ['value4', 'value5', 'value6']]

     col1    col2                                             col3
0  value1  value2  value3\nstillvalue3\na "quote" inside the quote
1  value4  value5                                           value6
Sign up to request clarification or add additional context in comments.

5 Comments

Please see my edit. Thank you for the example, but the problem I tried to describe above is not reflected in your example.
I have updated it to use your example, the main point is not to attempt to split the lines yourself, but to rely on csv.reader() to parse it correctly.
As described above, the StringIO does work, just not with the parameters I am using. I will keep this open for a bit, maybe somebody has an idea, and accept your answer later.
Try adding the newline='' parameter as in my example to your StringIO() call. Your second example seems to parse fine for me.
Using the suggested newline='' for the StringIO (instead of what I was doing, for the CSV parser) worked like a charm. It parses now completely and correctly, as you already saw. Thank you for your time.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.