Reading a variable column multiple header csv file in python

Question

I have a CSV file similar to this example below, and I am having trouble trying to read it into python. This file comes from an x-ray reader. It creates mini-tables each time you use it, and groups them into one CSV.

The problem is that each mini-table can come from 2 different methods, and have different number of headers.

ID Method a1 a2 a3 a4 a6 a7 a8 a9
01 Soil   2  3  4  5  3  6  4  8
02 Soil   6  9  2  7  4  3  4  4
03 Soil   5  2  4  9  1  1  2  3

ID Method a1 a5 a6 a7 
01 Datas   5  4  7  8
02 Datas   4  6  3  8
03 Datas   6  3  2  8

ID Method a1 a5 a6 a7 a8 a9 a10 a11
01 Soil   5  4  7  8  2  1   3   4
02 Soil   4  6  3  8  7  2   2   1

ID Method a1 a5 a6 a7 a8 a9
01 Datas   5  4  7  8  2  1
02 Datas   4  6  3  8  7  2
03 Datas   6  3  2  8  9  9
03 Datas   5  3  2  2  7  7

My goal is to split this messy CSV file into two new CSVs depending on the "method" column, containing only one header. Filling with 0s where necessary. Something similar to this:

ID Method a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11
01 Soil   2  3  4  5  0  3  6  4  8  0   0
02 Soil   6  9  2  7  0  4  3  4  4  0   0
03 Soil   5  2  4  9  0  1  1  2  3  0   0
01 Soil   5  0  0  0  4  7  8  2  1  3   4
02 Soil   4  0  0  0  6  3  8  7  2  2   1

Any help or ideas on how to approach this problem would be very welcome! I am truly lost on how to solve this.

The original is similar to this:

CSV file GDrive

Thanks in advance!

Have you try pd.concat ? with fillna(0)

BENY
– BENY

2018-05-20 15:38:33 +00:00
Commented May 20, 2018 at 15:38 — BENY
– BENY, Commented May 20, 2018 at 15:38

jpp · Accepted Answer · 2018-05-20 18:12:46Z

You can use pandas together with collections.defaultdict.

It is good practice to use a dictionary for a variable number of variables.

Solution

from collections import defaultdict

# initialise defaultdict of lists
d = defaultdict(list)

# iterate list of dictionaries and append to appropriate value
for item in [df1, df2, df3]:
    d[item['Method'].iloc[0]].append(item)

# convert to integer
def formatter(df):
    df.iloc[:, 2:] = df.iloc[:, 2:].apply(pd.to_numeric, downcast='integer')
    return df

# use dictionary comprehension to process results for each Method
res = {k: formatter(pd.concat(v).fillna(0)) for k, v in d.items()}

print(res['Soil'])

   ID Method  a1  a10  a11  a2  a3  a4  a5  a6  a7  a8  a9
0   1   Soil   2    0    0   3   4   5   0   3   6   4   8
1   2   Soil   6    0    0   9   2   7   0   4   3   4   4
2   3   Soil   5    0    0   2   4   9   0   1   1   2   3
0   1   Soil   5    3    4   0   0   0   4   7   8   2   1
1   2   Soil   4    2    1   0   0   0   6   3   8   7   2

Setup

You can replace strings with file1.csv, file2.csv, etc.

from io import StringIO

df1_str = StringIO("""ID Method a1 a2 a3 a4 a6 a7 a8 a9
01 Soil   2  3  4  5  3  6  4  8
02 Soil   6  9  2  7  4  3  4  4
03 Soil   5  2  4  9  1  1  2  3""")

df2_str = StringIO("""ID Method a1 a5 a6 a7 
01 Datas   5  4  7  8
02 Datas   4  6  3  8
03 Datas   6  3  2  8""")

df2_str = StringIO("""ID Method a1 a5 a6 a7 a8 a9 a10 a11
01 Soil   5  4  7  8  2  1   3   4
02 Soil   4  6  3  8  7  2   2   1""")

df3_str = StringIO("""ID Method a1 a5 a6 a7 a8 a9
01 Datas   5  4  7  8  2  1
02 Datas   4  6  3  8  7  2
03 Datas   6  3  2  8  9  9
03 Datas   5  3  2  2  7  7""")

df1 = pd.read_csv(df1_str, delim_whitespace=True)
df2 = pd.read_csv(df2_str, delim_whitespace=True)
df3 = pd.read_csv(df3_str, delim_whitespace=True)

Thank you very much for the great ideas you shared! I will try to implement what you have showed me, I will have to adapt the loading, because in my case, the tables do not come from different csv files, they are glued together on one unique big csv, I will try to split the tables inside the csv in some way... Anyway, thank you very much for showing me the path!

phi · Accepted Answer · 2018-05-20 15:56:36Z

1

We can process the file line by line

import pandas as pd

data = []
with open('pxrf.csv', 'r') as f:
    for s in f.readlines():
        s = s.strip()
        if not s:  # empty line
            header = None
            continue 
        if s.startswith('File #'):  # Header line
            header = s.split(',')
        else:  # Content line
            data.append(dict(zip(header, s.split(','))))
df = pd.DataFrame(data).fillna('0')

You may have to change data types of some columns.

answered May 20, 2018 at 15:56

phi

11.9k3 gold badges28 silver badges34 bronze badges

1 Comment

WillAgri Over a year ago

I tried your approach, and it worked incredibly well at first! I was trying to load the entire csv into a pandas dataframe, it was very confusing trying to solve the mess after it was loaded into a dataframe, the line by line process was way more efficient... Thank you very much!

Collectives™ on Stack Overflow

Reading a variable column multiple header csv file in python

2 Answers 2

1 Comment

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

1 Comment

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related