I have some CSV files that contain array columns. For example:

```
a,b,c
1,1|2|3,4.5|5.5|6.5
2,7|8|9,10.5|11.5|12.5
```

Delimiter 1 is `,` and separates the fields a, b and c. Delimiter 2 is `|` in this case, but it can vary between files.
Is there a way in Python to read this directly into a pandas DataFrame? Fields b and c should each become an array/Series inside the DataFrame.
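Ideally something like this would work; a sketch using the `converters` parameter of `read_csv`, with the example data inlined via `io.StringIO` instead of my real files:

```python
import io

import numpy as np
import pandas as pd

csv_text = "a,b,c\n1,1|2|3,4.5|5.5|6.5\n2,7|8|9,10.5|11.5|12.5"

# Split the |-delimited cells into numpy arrays while parsing.
converters = {
    "b": lambda s: np.array(s.split("|"), dtype=int),
    "c": lambda s: np.array(s.split("|"), dtype=float),
}
df = pd.read_csv(io.StringIO(csv_text), converters=converters)
```

This gives column a as plain integers and columns b and c as numpy arrays, but the converter is still called once per cell, so I am unsure whether it would be any faster on large files.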
What I do now is read the CSV as strings:

```python
import numpy as np
import pandas as pd

df = pd.read_csv('data.csv', dtype='str')
```

Then I use np.fromstring to convert each string to a numpy array:
```python
type_dict = {
    "a": "int",
    "b": "int",
    "c": "float",
}

def make_split(text, dt):
    return np.fromstring(text, sep="|", dtype=dt)

df = df.apply(lambda x: x.apply(make_split, dt=type_dict[x.name]))
```
But this takes several minutes for my files. Is there a faster option?
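One idea I am considering, as a sketch: it assumes every cell in a column holds the same number of |-separated values, so a whole column can be split into a single 2-D array at once instead of calling np.fromstring per cell (the hypothetical helper `split_column` and the inline DataFrame are just for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "b": ["1|2|3", "7|8|9"],
    "c": ["4.5|5.5|6.5", "10.5|11.5|12.5"],
})

def split_column(col, dtype):
    # Split every cell, then build one 2-D array for the whole column
    # (requires each cell to contain the same number of values).
    mat = np.array(col.str.split("|").tolist(), dtype=dtype)
    # Store one row-array per cell.
    return pd.Series(list(mat), index=col.index)

df["b"] = split_column(df["b"], int)
df["c"] = split_column(df["c"], float)
```

Would this kind of column-at-a-time split be expected to beat the per-cell apply, or is there a better built-in option?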