Parsing CSV string with CSV module

Question

I'm getting CSV formatted data piped from an external source, and the data should not be written to a file, that would open a number of different maintenance tasks I would like to avoid. I am getting the data as a string.

So now I want to interpret the data as CSV. The python module csv is suited for that, so I am using it.

However, constructing a CSV parser using csv.reader (or dictreader) requires the data to support an iterator. A list or a reader would already do that. However, my data contains line breaks (\n) inside quotes strings (which is totally fine for the CSV format), and also line breaks as record end indicators/line terminators.

If I construct it like this:

csvreader = csv.reader(data.split('\n'))

then I have already broken the valid CSV formatting by interpreting the line breaks as record end indicators, even if they were inside quoted strings. For me, the interpretation of the line breaks should be done by the CSV parser, as it is aware of quotes fields containing CSV relevant formatting characters like line breaks.

Do I have a misunderstanding with the way the csv package parser works? Or how would I parse a CSV type string using the CSV parser while the data contains line breaks that do not indicate the end of a record?

Edit 1: It might not have been wise to add context to my question, as the context seems to have overtaken my actual question/problem. To clarify, my problem is not parsing CSV from memory. I have found StringIO already. When using StringIO, all that does is use the whole string/file as the first record.

Edit 2: Maybe sample data will help. My data looks like this:

"value1";"value2";"value3
stillvalue3"
"value4";"value5";"value6"

So inside the 3rd entry in the first record I have a line break character, which should be interpreted as part of the third entry in the first record.

Edit 3: The way I currently use the parser is this:

csvreader = csv.reader(StringIO(result), quotechar='"', delimiter=';', lineterminator='\n', escapechar='"')

because the spec says that a double quote (") inside a field is escaped by a quote char (""), so the data might look like this:

"value1";"value2";"value3
stillvalue3
a ""quote"" inside the quote"
"value4";"value5";"value6"

It seems that using these settings the StringIO input always results in one record with all the data from the string inside it. Removing the

escapechar='"'

fixes this problem, however, I cannot support escaped double quote characters inside the quotes. So the nature of my question changed. I'll keep this open if anybody has an idea and will accept the answer of Martin Evans later.

The easiest way is to use StringIO (now from io import StringIO) as shown here: stackoverflow.com/a/18724978/3001761 — jonrsharpe
– jonrsharpe, Commented Mar 9, 2021 at 15:20
The problem is that (apart from the escapechar) a csv.reader reads correctly the only data you have shown. So we have to guess what the problem can be. You have to show some data exhibiting the problem, with what you get and what you want. — Serge Ballesta
– Serge Ballesta, Commented Mar 10, 2021 at 8:20

Martin Evans · Accepted Answer · 2021-03-10 08:04:28Z

8

You can use Python's StringIO() to give your data string a stream interface.

Do not attempt to use data.split('\n') as this will fail to split the lines correctly, the csv.reader() will correctly read the lines in, even if cells contain newlines:

from io import StringIO
import csv

data = '''"value1";"value2";"value3
stillvalue3"
"value4";"value5";"value6"'''

csv_input = csv.reader(StringIO(data, newline=''), delimiter=';')
rows = list(csv_input)
print("Rows", rows)

Giving you:

Rows [['value1', 'value2', 'value3\nstillvalue3'], ['value4', 'value5', 'value6']]

This could then be loaded into Pandas, or directly using StringIO():

columns = ['col1', 'col2', 'col3']

df = pd.DataFrame(rows, columns=columns)

df = pd.read_csv(StringIO(data, newline=''), names=columns, sep=';')

print(df)

Giving:

     col1    col2                 col3
0  value1  value2  value3\nstillvalue3
1  value4  value5               value6

Your second example gives:

Rows [['value1', 'value2', 'value3\nstillvalue3\na "quote" inside the quote'], ['value4', 'value5', 'value6']]

     col1    col2                                             col3
0  value1  value2  value3\nstillvalue3\na "quote" inside the quote
1  value4  value5                                           value6

edited Mar 10, 2021 at 8:04

answered Mar 9, 2021 at 16:07

Martin Evans

46.9k17 gold badges88 silver badges104 bronze badges

Sign up to request clarification or add additional context in comments.

5 Comments

0xCAFEBABE Over a year ago

Please see my edit. Thank you for the example, but the problem I tried to describe above is not reflected in your example.

Martin Evans Over a year ago

I have updated it to use your example, the main point is not to attempt to split the lines yourself, but to rely on csv.reader() to parse it correctly.

0xCAFEBABE Over a year ago

As described above, the StringIO does work, just not with the parameters I am using. I will keep this open for a bit, maybe somebody has an idea, and accept your answer later.

Martin Evans Over a year ago

Try adding the newline='' parameter as in my example to your StringIO() call. Your second example seems to parse fine for me.

0xCAFEBABE Over a year ago

Using the suggested newline='' for the StringIO (instead of what I was doing, for the CSV parser) worked like a charm. It parses now completely and correctly, as you already saw. Thank you for your time.

Collectives™ on Stack Overflow

Parsing CSV string with CSV module

1 Answer 1

5 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

5 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related