I have what I assumed would be a super basic problem, but I'm unable to find a solution. In short, I have a column in a CSV that holds a list of numbers. The CSV was generated by pandas with to_csv. When I read it back in with read_csv, the list of numbers comes back as a string.

When I then try to use it, I obviously get errors. When I try the to_numeric function I get errors as well, because each value is a list, not a single number.

Is there any way to solve this? Posting code below for form, but probably not extremely helpful:

def write_func(dataset):
    features = featurize_list(dataset[column])  # Returns numpy array
    new_dataset = dataset.copy()  # Don't want to modify the underlying dataframe
    new_dataset['Text'] = features
    new_dataset.rename(columns={'Text': 'Features'}, inplace=True)
    write(new_dataset, dataset_name)

def write(new_dataset, dataset_name):
    dump_location = feature_set_location(dataset_name, self)
    new_dataset.to_csv(dump_location)

def read_func(read_location):
    df = pd.read_csv(read_location)
    df['Features'] = df['Features'].apply(pd.to_numeric)

The Features column is the one in question. When I attempt to run the apply currently in read_func I get this error:

ValueError: Unable to parse string "[0.019636873200000002, 0.10695576670000001,...]" at position 0

I can't be the first person to run into this issue, is there some way to handle this at read/write time?

2 Answers

2

You want to use literal_eval as a converter passed to pd.read_csv. Below is an example of how that works.

from ast import literal_eval
from io import StringIO
import pandas as pd

txt = """col1|col2
a|[1,2,3]
b|[4,5,6]"""

df = pd.read_csv(StringIO(txt), sep='|', converters=dict(col2=literal_eval))
print(df)

  col1       col2
0    a  [1, 2, 3]
1    b  [4, 5, 6]
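
The same converter should also cover the round trip from the question. Below is a minimal sketch, assuming the Features column holds plain Python lists of floats (as the error message suggests). The sample data, the Name column, and the in-memory buffer are made up for illustration; a real file path would work the same way.

```python
from ast import literal_eval
from io import StringIO

import pandas as pd

# Hypothetical data standing in for the question's 'Features' column.
original = pd.DataFrame({
    "Name": ["a", "b"],
    "Features": [[0.5, 1.25], [2.0, 3.75]],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)

# Each 'Features' cell is stored as text such as "[0.5, 1.25]";
# literal_eval parses that text back into a Python list.
restored = pd.read_csv(buffer, converters={"Features": literal_eval})

print(type(restored["Features"].iloc[0]))
```

Note that this relies on the cells being written as comma-separated lists. If the column holds numpy arrays instead, their text form has no commas and literal_eval will not parse it, so converting with .tolist() before writing is probably the easier route.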

9 Comments

Is this secure? literal_eval sketches me out quite a bit, and I don't have complete control over the input files here. They get pulled down from a remote server.
I'm equally sketched out by eval... literal_eval is intended to alleviate that fear by safely parsing only literals. See this post
This seems... doable, but is this really the only way to do it? It's pretty damn arcane for something that feels like a very basic use case. To be clear this does work though.
No, it isn't... the other way is more painful. You can parse the string yourself.
@SlaterTyranus It is not that it's not a common use case, but pandas mainly deals with numbers and strings. It doesn't support these kinds of structures very well. If they are all lists, you can just use json to parse them (i.e. json.loads('[1.0, 2.0]')). I am not sure if this can be passed as a converter like piRSquared did, but it seems doable.
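For what it's worth, json.loads can be passed as a converter in the same way, since read_csv calls each converter with the cell's string value. A minimal sketch, reusing the pipe-separated sample from the answer with made-up float values:

```python
import json
from io import StringIO

import pandas as pd

# Made-up sample; col2 holds JSON-style lists of floats.
txt = """col1|col2
a|[1.0, 2.0]
b|[3.5, 4.5]"""

# json.loads receives each col2 cell as a string and returns a list.
df = pd.read_csv(StringIO(txt), sep="|", converters={"col2": json.loads})

print(df["col2"].tolist())
```

json.loads only accepts valid JSON, so it never evaluates Python expressions, but it will reject cells that are not JSON (tuples or single-quoted strings, for example).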
1

I have modified your last function a bit and it works fine.

def read_func(read_location):
    df = pd.read_csv(read_location)
    df['Features'] = df['Features'].apply(lambda x : pd.to_numeric(x))

1 Comment

This is not tractable for me due to performance reasons. It's quite a large file I'm converting and this iterates through every entry in every list.
