4

I have a csv file of such structure:

Id,Country,Cities
1,Canada,"['Toronto','Ottawa','Montreal']"
2,Italy,"['Rome','Milan','Naples', 'Palermo']"
3,France,"['Paris','Cannes','Lyon']"
4,Spain,"['Seville','Alicante','Barcelona']"

The last column contains a list, but it is represented as a string so that it is treated as a single element. When parsing the file, I need to have this element as a list, not a string. So far I've found the way to convert it:

L = "['Toronto','Ottawa','Montreal']"
seq = ast.literal_eval(L)

Since I'm a newbie in python, my question is -- is this normal way of doing this, or there's a right way to represent lists in CSV so that I don't have to do conversions, or there's a simpler way to convert?

Thanks!

1

3 Answers 3

2

Using ast.literal_eval(...) will work, but it requires special syntax that other CSV-reading software won't recognize, and uses an eval statement which is a red flag.

Using eval can be dangerous, even though in this case you're using the safer literal_eval option which is more restrained than the raw eval function.

Usually what you'll see in CSV files that have many values in a single column is that they'll use a simple delimiter and quote the field.

For instance:

ID,Country,Cities
1,Canada,"Toronto;Ottawa;Montreal"

Then in python, or any other language, it becomes trivial to read without having to resort to eval:

import csv

with open("data.csv") as fobj:
    reader = csv.reader(fobj)
    field_names = next(reader)

    rows = []
    for row in reader:
        row[-1] = row[-1].split(";")
        rows.append(row)

Issues with ast.literal_eval

Even though the ast.literal_eval function is much safer than using a regular eval on user input, it still might be exploitable. The documentation for literal_eval has this warning:

Warning: It is possible to crash the Python interpreter with a sufficiently large/complex string due to stack depth limitations in Python’s AST compiler.

A demonstration of this can be found here:

>>> import ast
>>> ast.literal_eval("()" * 10 ** 6)
[1]    48513 segmentation fault  python

I'm definitely not an expert, but giving a user the ability to crash a program and potentially exploit some obscure memory vulnerability is bad, and in this use-case can be avoided.

If the reason you want to use literal_eval is to get proper typing, and you're positive that the input data is 100% trusted, then I suppose it's fine to use. But, you could always wrap the function to perform some sanity checks:

def sanely_eval(value: str, max_size: int = 100_000) -> object:
    if len(value) > max_size:
        raise ValueError(f"len(value) is greater than the max_size={max_size!r}")
    return ast.literal_eval(value)

But, depending on how you're creating and using the CSV files, this may make the data less portable, since it's a python-specific format.

Sign up to request clarification or add additional context in comments.

4 Comments

thanks for feedback. Did you probably mean rows.append(row[-1].split(";")) because otherwise it complains AttributeError: 'list' object has no attribute 'split'
@Mark Thank you for spotting that! I've updated my example. The code is slightly different from yours so that it preserves the whole row, and not just the cities column.
could you explain why eval statement is a big red flag? Is it considered dangerous?
@Mark I've added detail to my answer. I'm assuming you meant literal_eval, rather than eval. If you want info about why calling eval("user input...") is dangerous, look here: nedbatchelder.com/blog/201206/eval_really_is_dangerous.html
2

If you can control the CSV, you could separate the items with some other known character that isn't going to be in a city and isn't a comma. Say colon (:).

Then row one, for example, would look like this:

1,Canada,Toronto:Ottawa:Montreal

When it comes to processing the data, you'll have that whole element, and you can just do

cities.split(':')

If you want to go the other way (you have the cities in a Python list, and you want to create this string) you can use join()

':'.join(['Toronto', 'Ottawa', 'Montreal'])

1 Comment

Separating with special characters is inventing a wheel. CSV standard has notion for double-quotes that handles delimiters inside. For TS there's no difference if it's .split or deserializing json
0

For the specific structure of the csv, you could convert cities to list like this:

cities = '''"['Rome','Milan','Naples', 'Palermo']"'''

cities = cities[2:-2]  # remove "[ and ]"

print(cities)  # 'Rome','Milan','Naples', 'Palermo'

cities = cities.split(',')  # convert to list

print(cities)  # ["'Rome'", "'Milan'", "'Naples'", " 'Palermo'"]

cities = [x.strip() for x in cities]  # remove leading or following spaces (if exists)

print(cities)  # ["'Rome'", "'Milan'", "'Naples'", "'Palermo'"]

cities = [x[1:-1] for x in cities]  # remove quotes '' from each city

print(cities)  # ['Rome', 'Milan', 'Naples', 'Palermo']

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.