Hi, I have a pseudo-CSV file that looks something like this:

id, Wave ID, Time Stamp, Number of Samples, Sample Data Array
123, 317, 1567191561.8044672, 128, 79, 17, 162, 165, 66, 3, 40, 191, 68, 56, 59, 142, 143, 7, 150, 14, 120, 172, 76, 167, 55, 27, 198, 115, 50, 87, 38, 185, 199, 74, 43, 4, 133, 114, 89, 10, 136, 46, 85, 187, 182, 170, 149, 9, 25, 128, 39, 175, 102, 45, 33, 35, 129, 156, 20, 118, 108, 72, 111, 99, 122, 140, 93, 155, 54, 63, 189, 173, 171, 134, 163, 159, 91, 193, 64, 8, 97, 34, 80, 11, 121, 145, 190, 135, 144, 31, 29, 179, 125, 116, 196, 67, 152, 112, 148, 103, 132, 106, 78, 75, 28, 174, 119, 98, 110, 86, 123, 141, 84, 83, 178, 12, 169, 113, 48, 131, 52, 180, 100, 117, 6, 77, 69, 146, 18, 157, 127, 164
123, 20,  1567191562.0020044, 16, 779, 788, 801, 817, 835, 855, 875, 895, 916, 933, 946, 956, 963, 965, 962, 952
123, 20,  1567191561.8064446, 0,
123, 317, 1567191561.8044672, 100, 132, 48, 195, 78, 190, 124, 38, 99, 87, 1, 66, 6, 106, 18, 180, 197, 59, 148, 41, 128, 125, 194, 175, 81, 21, 115, 184, 30, 71, 77, 166, 3, 107, 114, 52, 55, 186, 5, 103, 145, 19, 8, 69, 64, 122, 90, 129, 83, 165, 79, 178, 2, 14, 74, 25, 133, 147, 158, 75, 146, 20, 140, 101, 97, 10, 143, 88, 50, 168, 112, 118, 9, 137, 155, 24, 89, 144, 16, 13, 156, 196, 113, 183, 34, 120, 142, 130, 49, 86, 46, 138, 191, 192, 189, 70, 123, 159, 108, 7, 95

The first 4 columns are normal CSV fields; everything after them is a list of variable length. The "Number of Samples" column gives the length of that list, and each record ends with a newline character.

The final DataFrame would look something like this:

id, Wave ID, Time Stamp, Sample Data Array
123, 317, 1567191561.8044672, [1,2,3,4,5,...]
123, 317, 1567191561.8044672, [1,2,3,4,5,...]
123, 20, 1567191561.8044672, []
123, 317, 1567191563.8044672, [1,2,3,4]

Is there some way to import this using read_csv in pandas, or with something else? I wrote a simple parser that reads the file line by line, but it's pretty slow. I would prefer to end up with a pandas DataFrame so I can do some group-by and sorting on the columns.

Thanks

1 Answer

You can read in the data as one column (using a separator that is guaranteed not to exist in the data) and then split it into 5 columns. You can then drop the "Number of Samples" column and convert the last column into a list:

import pandas as pd
import io
import datetime

s="""id, Wave ID, Time Stamp, Number of Samples, Sample Data Array
123, 317, 1567191561.8044672, 128, 79, 17, 162, 165, 66, 3, 40, 191, 68, 56, 59, 142, 143, 7, 150, 14, 120, 172, 76, 167, 55, 27, 198, 115, 50, 87, 38, 185, 199, 74, 43, 4, 133, 114, 89, 10, 136, 46, 85, 187, 182, 170, 149, 9, 25, 128, 39, 175, 102, 45, 33, 35, 129, 156, 20, 118, 108, 72, 111, 99, 122, 140, 93, 155, 54, 63, 189, 173, 171, 134, 163, 159, 91, 193, 64, 8, 97, 34, 80, 11, 121, 145, 190, 135, 144, 31, 29, 179, 125, 116, 196, 67, 152, 112, 148, 103, 132, 106, 78, 75, 28, 174, 119, 98, 110, 86, 123, 141, 84, 83, 178, 12, 169, 113, 48, 131, 52, 180, 100, 117, 6, 77, 69, 146, 18, 157, 127, 164
123, 20,  1567191562.0020044, 16, 779, 788, 801, 817, 835, 855, 875, 895, 916, 933, 946, 956, 963, 965, 962, 952
123, 20,  1567191561.8064446, 0,
123, 317, 1567191561.8044672, 100, 132, 48, 195, 78, 190, 124, 38, 99, 87, 1, 66, 6, 106, 18, 180, 197, 59, 148, 41, 128, 125, 194, 175, 81, 21, 115, 184, 30, 71, 77, 166, 3, 107, 114, 52, 55, 186, 5, 103, 145, 19, 8, 69, 64, 122, 90, 129, 83, 165, 79, 178, 2, 14, 74, 25, 133, 147, 158, 75, 146, 20, 140, 101, 97, 10, 143, 88, 50, 168, 112, 118, 9, 137, 155, 24, 89, 144, 16, 13, 156, 196, 113, 183, 34, 120, 142, 130, 49, 86, 46, 138, 191, 192, 189, 70, 123, 159, 108, 7, 95"""

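# Read each line as a single column; '§' never occurs in the data
tmp = pd.read_csv(io.StringIO(s), sep='§', engine='python')
# Split on the first 4 commas only, so the sample list stays in one piece
df = tmp.iloc[:,0].str.split(', *', n=4, expand=True)
df.columns = [c.strip() for c in tmp.columns[0].split(',')]
df = df.drop(columns='Number of Samples')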

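# Convert the fixed columns from strings to proper types
df.id = df.id.astype(int)
df['Wave ID'] = df['Wave ID'].astype(int)
# fromtimestamp interprets the epoch seconds in the local time zone
df['Time Stamp'] = df['Time Stamp'].astype(float).map(datetime.datetime.fromtimestamp)
# Split the remaining text into a list of sample strings (not yet ints)
df['Sample Data Array'] = df['Sample Data Array'].str.split(', *')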

Result:

    id  Wave ID                 Time Stamp                                  Sample Data Array
0  123      317 2019-08-30 20:59:21.804467  [79, 17, 162, 165, 66, 3, 40, 191, 68, 56, 59,...
1  123       20 2019-08-30 20:59:22.002004  [779, 788, 801, 817, 835, 855, 875, 895, 916, ...
2  123       20 2019-08-30 20:59:21.806444                                                 []
3  123      317 2019-08-30 20:59:21.804467  [132, 48, 195, 78, 190, 124, 38, 99, 87, 1, 66...
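
One caveat about the result above: str.split yields lists of strings rather than numbers, and as far as I can tell the row with zero samples ends up holding a single empty string, which pandas happens to display as []. If you need real integers and a truly empty list, a small helper along these lines may do. The name to_int_list and the sample Series are made up for illustration, and this is only a sketch:

```python
import pandas as pd

def to_int_list(text):
    """Turn '79, 17, 162' into [79, 17, 162]; blank or missing text gives []."""
    if not isinstance(text, str):
        return []
    # int() tolerates surrounding spaces; empty tokens are skipped
    return [int(tok) for tok in text.split(',') if tok.strip()]

# Stand-in for the unsplit 'Sample Data Array' column
raw = pd.Series(['79, 17, 162', '', '779, 788'])
samples = raw.map(to_int_list)
```

Applied to the answer's code, this would replace the final str.split line, e.g. df['Sample Data Array'] = df['Sample Data Array'].map(to_int_list).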

1 Comment

Works for the most part. I'm having some issues handling lines that may be malformed, but I think this will let me do what I want until I improve the data output.
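
For the malformed lines, one possible approach is to check each record against its own "Number of Samples" value and set aside anything that does not add up. The sketch below is an assumption about what "malformed" means here (non-numeric fields, too few fields, or a sample count that does not match); parse_wave_lines and the sample text are hypothetical, and it reads line by line, so it trades speed for the ability to report rejects:

```python
import io
import pandas as pd

def parse_wave_lines(handle):
    """Return (DataFrame of valid rows, list of (line_number, text) rejects)."""
    next(handle)  # skip the header line
    good, bad = [], []
    for lineno, line in enumerate(handle, start=2):
        if not line.strip():
            continue  # ignore blank lines
        parts = [p.strip() for p in line.strip().split(',')]
        try:
            rec_id, wave_id = int(parts[0]), int(parts[1])
            stamp, count = float(parts[2]), int(parts[3])
            samples = [int(p) for p in parts[4:] if p]
        except (IndexError, ValueError):
            bad.append((lineno, line.rstrip('\n')))
            continue
        # The declared count must match the number of samples actually present
        if len(samples) != count:
            bad.append((lineno, line.rstrip('\n')))
            continue
        good.append((rec_id, wave_id, stamp, samples))
    columns = ['id', 'Wave ID', 'Time Stamp', 'Sample Data Array']
    return pd.DataFrame(good, columns=columns), bad

# Made-up input: two valid rows, one short row, one non-numeric field
text = """id, Wave ID, Time Stamp, Number of Samples, Sample Data Array
123, 20, 1567191562.0020044, 3, 779, 788, 801
123, 20, 1567191561.8064446, 0,
123, 317, 1567191561.8044672, 4, 132, 48
123, oops, 1567191561.8044672, 1, 5
"""
df, rejects = parse_wave_lines(io.StringIO(text))
```

Here rejects keeps the line number and raw text of each skipped record, so the bad lines can be inspected or logged separately.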
