2

I have a CSV file similar to this example below, and I am having trouble trying to read it into python. This file comes from an x-ray reader. It creates mini-tables each time you use it, and groups them into one CSV.

The problem is that each mini-table can come from 2 different methods, and have different number of headers.

ID Method a1 a2 a3 a4 a6 a7 a8 a9
01 Soil   2  3  4  5  3  6  4  8
02 Soil   6  9  2  7  4  3  4  4
03 Soil   5  2  4  9  1  1  2  3

ID Method a1 a5 a6 a7 
01 Datas   5  4  7  8
02 Datas   4  6  3  8
03 Datas   6  3  2  8

ID Method a1 a5 a6 a7 a8 a9 a10 a11
01 Soil   5  4  7  8  2  1   3   4
02 Soil   4  6  3  8  7  2   2   1

ID Method a1 a5 a6 a7 a8 a9
01 Datas   5  4  7  8  2  1
02 Datas   4  6  3  8  7  2
03 Datas   6  3  2  8  9  9
03 Datas   5  3  2  2  7  7

My goal is to split this messy CSV file into two new CSVs depending on the "method" column, containing only one header. Filling with 0s where necessary. Something similar to this:

ID Method a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11
01 Soil   2  3  4  5  0  3  6  4  8  0   0
02 Soil   6  9  2  7  0  4  3  4  4  0   0
03 Soil   5  2  4  9  0  1  1  2  3  0   0
01 Soil   5  0  0  0  4  7  8  2  1  3   4
02 Soil   4  0  0  0  6  3  8  7  2  2   1

Any help or ideas on how to approach this problem would be very welcome! I am truly lost on how to solve this.

The original is similar to this:

CSV file GDrive

Thanks in advance!

1
  • 1
    Have you try pd.concat ? with fillna(0) Commented May 20, 2018 at 15:38

2 Answers 2

3

You can use pandas together with collections.defaultdict.

It is good practice to use a dictionary for a variable number of variables.

Solution

from collections import defaultdict

# initialise defaultdict of lists
d = defaultdict(list)

# iterate list of dictionaries and append to appropriate value
for item in [df1, df2, df3]:
    d[item['Method'].iloc[0]].append(item)

# convert to integer
def formatter(df):
    df.iloc[:, 2:] = df.iloc[:, 2:].apply(pd.to_numeric, downcast='integer')
    return df

# use dictionary comprehension to process results for each Method
res = {k: formatter(pd.concat(v).fillna(0)) for k, v in d.items()}

print(res['Soil'])

   ID Method  a1  a10  a11  a2  a3  a4  a5  a6  a7  a8  a9
0   1   Soil   2    0    0   3   4   5   0   3   6   4   8
1   2   Soil   6    0    0   9   2   7   0   4   3   4   4
2   3   Soil   5    0    0   2   4   9   0   1   1   2   3
0   1   Soil   5    3    4   0   0   0   4   7   8   2   1
1   2   Soil   4    2    1   0   0   0   6   3   8   7   2

Setup

You can replace strings with file1.csv, file2.csv, etc.

from io import StringIO

df1_str = StringIO("""ID Method a1 a2 a3 a4 a6 a7 a8 a9
01 Soil   2  3  4  5  3  6  4  8
02 Soil   6  9  2  7  4  3  4  4
03 Soil   5  2  4  9  1  1  2  3""")

df2_str = StringIO("""ID Method a1 a5 a6 a7 
01 Datas   5  4  7  8
02 Datas   4  6  3  8
03 Datas   6  3  2  8""")

df2_str = StringIO("""ID Method a1 a5 a6 a7 a8 a9 a10 a11
01 Soil   5  4  7  8  2  1   3   4
02 Soil   4  6  3  8  7  2   2   1""")

df3_str = StringIO("""ID Method a1 a5 a6 a7 a8 a9
01 Datas   5  4  7  8  2  1
02 Datas   4  6  3  8  7  2
03 Datas   6  3  2  8  9  9
03 Datas   5  3  2  2  7  7""")

df1 = pd.read_csv(df1_str, delim_whitespace=True)
df2 = pd.read_csv(df2_str, delim_whitespace=True)
df3 = pd.read_csv(df3_str, delim_whitespace=True)
Sign up to request clarification or add additional context in comments.

1 Comment

Thank you very much for the great ideas you shared! I will try to implement what you have showed me, I will have to adapt the loading, because in my case, the tables do not come from different csv files, they are glued together on one unique big csv, I will try to split the tables inside the csv in some way... Anyway, thank you very much for showing me the path!
1

We can process the file line by line

import pandas as pd

data = []
with open('pxrf.csv', 'r') as f:
    for s in f.readlines():
        s = s.strip()
        if not s:  # empty line
            header = None
            continue 
        if s.startswith('File #'):  # Header line
            header = s.split(',')
        else:  # Content line
            data.append(dict(zip(header, s.split(','))))
df = pd.DataFrame(data).fillna('0')

You may have to change data types of some columns.

1 Comment

I tried your approach, and it worked incredibly well at first! I was trying to load the entire csv into a pandas dataframe, it was very confusing trying to solve the mess after it was loaded into a dataframe, the line by line process was way more efficient... Thank you very much!

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.