Import CSV into pandas dataframe with list as column

Question

Hi I have a "sudo" csv file that looks something like this:

id, Wave ID, Time Stamp, Number of Samples, Sample Data Array
123, 317, 1567191561.8044672, 128, 79, 17, 162, 165, 66, 3, 40, 191, 68, 56, 59, 142, 143, 7, 150, 14, 120, 172, 76, 167, 55, 27, 198, 115, 50, 87, 38, 185, 199, 74, 43, 4, 133, 114, 89, 10, 136, 46, 85, 187, 182, 170, 149, 9, 25, 128, 39, 175, 102, 45, 33, 35, 129, 156, 20, 118, 108, 72, 111, 99, 122, 140, 93, 155, 54, 63, 189, 173, 171, 134, 163, 159, 91, 193, 64, 8, 97, 34, 80, 11, 121, 145, 190, 135, 144, 31, 29, 179, 125, 116, 196, 67, 152, 112, 148, 103, 132, 106, 78, 75, 28, 174, 119, 98, 110, 86, 123, 141, 84, 83, 178, 12, 169, 113, 48, 131, 52, 180, 100, 117, 6, 77, 69, 146, 18, 157, 127, 164
123, 20,  1567191562.0020044, 16, 779, 788, 801, 817, 835, 855, 875, 895, 916, 933, 946, 956, 963, 965, 962, 952
123, 20,  1567191561.8064446, 0,
123, 317, 1567191561.8044672, 100, 132, 48, 195, 78, 190, 124, 38, 99, 87, 1, 66, 6, 106, 18, 180, 197, 59, 148, 41, 128, 125, 194, 175, 81, 21, 115, 184, 30, 71, 77, 166, 3, 107, 114, 52, 55, 186, 5, 103, 145, 19, 8, 69, 64, 122, 90, 129, 83, 165, 79, 178, 2, 14, 74, 25, 133, 147, 158, 75, 146, 20, 140, 101, 97, 10, 143, 88, 50, 168, 112, 118, 9, 137, 155, 24, 89, 144, 16, 13, 156, 196, 113, 183, 34, 120, 142, 130, 49, 86, 46, 138, 191, 192, 189, 70, 123, 159, 108, 7, 95

So the first 4 columns are a normal csv then whatever remains is a list of some length. The "Number of Samples" column denotes how long the list is and each line is ended with a new line character.

The Final dataframe would look something like:

id, Wave ID, Time Stamp, Sample Data Array
123, 317, 1567191561.8044672, [1,2,3,4,5,...]
123, 317, 1567191561.8044672, [1,2,3,4,5,...]
123, 20, 1567191561.8044672, []
123, 317, 1567191563.8044672, [1,2,3,4]

Is there some way to import this using read_csv in pandas or something else? I wrote a simple parser that reads the file line by line but its pretty slow. Would prefer to have a pandas dataframe at the end so I can do some group by/sorting on the columns.

Thanks

Stef · Accepted Answer · 2019-12-09 20:30:56Z

You can read in the data as one column (using a separator that is guaranteed to not exist in the data) and then split into 5 columns. You can then remove the last but one column and convert the last column into a list:

import pandas as pd
import io
import datetime

s="""id, Wave ID, Time Stamp, Number of Samples, Sample Data Array
123, 317, 1567191561.8044672, 128, 79, 17, 162, 165, 66, 3, 40, 191, 68, 56, 59, 142, 143, 7, 150, 14, 120, 172, 76, 167, 55, 27, 198, 115, 50, 87, 38, 185, 199, 74, 43, 4, 133, 114, 89, 10, 136, 46, 85, 187, 182, 170, 149, 9, 25, 128, 39, 175, 102, 45, 33, 35, 129, 156, 20, 118, 108, 72, 111, 99, 122, 140, 93, 155, 54, 63, 189, 173, 171, 134, 163, 159, 91, 193, 64, 8, 97, 34, 80, 11, 121, 145, 190, 135, 144, 31, 29, 179, 125, 116, 196, 67, 152, 112, 148, 103, 132, 106, 78, 75, 28, 174, 119, 98, 110, 86, 123, 141, 84, 83, 178, 12, 169, 113, 48, 131, 52, 180, 100, 117, 6, 77, 69, 146, 18, 157, 127, 164
123, 20,  1567191562.0020044, 16, 779, 788, 801, 817, 835, 855, 875, 895, 916, 933, 946, 956, 963, 965, 962, 952
123, 20,  1567191561.8064446, 0,
123, 317, 1567191561.8044672, 100, 132, 48, 195, 78, 190, 124, 38, 99, 87, 1, 66, 6, 106, 18, 180, 197, 59, 148, 41, 128, 125, 194, 175, 81, 21, 115, 184, 30, 71, 77, 166, 3, 107, 114, 52, 55, 186, 5, 103, 145, 19, 8, 69, 64, 122, 90, 129, 83, 165, 79, 178, 2, 14, 74, 25, 133, 147, 158, 75, 146, 20, 140, 101, 97, 10, 143, 88, 50, 168, 112, 118, 9, 137, 155, 24, 89, 144, 16, 13, 156, 196, 113, 183, 34, 120, 142, 130, 49, 86, 46, 138, 191, 192, 189, 70, 123, 159, 108, 7, 95"""

tmp = pd.read_csv(io.StringIO(s), sep='§', engine='python')
df = tmp.iloc[:,0].str.split(', *', 4, expand=True)
df.columns = [c.strip() for c in tmp.columns[0].split(',')]
df = df.drop('Number of Samples', 1)

df.id = df.id.astype(int)
df['Wave ID'] = df['Wave ID'].astype(int)
df['Time Stamp'] = df['Time Stamp'].astype(float).map(datetime.datetime.fromtimestamp)
df['Sample Data Array'] = df['Sample Data Array'].str.split(', *')

Result:

    id  Wave ID                 Time Stamp                                  Sample Data Array
0  123      317 2019-08-30 20:59:21.804467  [79, 17, 162, 165, 66, 3, 40, 191, 68, 56, 59,...
1  123       20 2019-08-30 20:59:22.002004  [779, 788, 801, 817, 835, 855, 875, 895, 916, ...
2  123       20 2019-08-30 20:59:21.806444                                                 []
3  123      317 2019-08-30 20:59:21.804467  [132, 48, 195, 78, 190, 124, 38, 99, 87, 1, 66...

Works for the most part. Having some issues handling lines that may be malformed but I think this will let me do what I want until I make data output better

Collectives™ on Stack Overflow

Import CSV into pandas dataframe with list as column

1 Answer 1

1 Comment

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Linked

Related