1

This question follows my previous question output multiple files based on column value python pandas but this time i want to go a bit further.

so this time i have a small sample data set:

import pandas as pd

df = {'ID': ['H900','H901','H902','M1436','M1435','M149','M157','M213','M699','M920','M871','M789','M617','M991','H903','M730','M191'],
  'CloneID': [0,1,2,2,2,2,2,2,3,3,3,4,4,4,5,5,6],
  'Length': [48,42  ,48,48,48,48,48,48,48,48,48,48,48,48,48,48,48]}

df = pd.DataFrame(df)

it looks like:

df
Out[6]: 
    CloneID   ID  Length
0       0   H900      48
1       1   H901      42
2       2   H902      48
3       2   M1436     48
4       2   M1435     48
5       2   M149      48
6       2   M157      48
7       2   M213      48
8       3   M699      48
9       3   M920      48
10      3   M871      48
11      4   M789      48
12      4   M617      48
13      4   M991      48
14      5   H903      48
15      5   M730      48
16      6   M191      48

I want to output each 'cloneID' to a different output file, but this time ONLY the ones that contains IDs that starts with "H".

so my desired output, 4 output files:

first file would be 'cloneID0.txt'

    CloneID   ID  Length
      0      H900      48

second file would be 'CloneID1.txt'

    CloneID   ID  Length
      1      H901      42

third file would be 'CloneID2.txt'

    CloneID   ID  Length
       2     H902      48
       2     M1436     48
       2     M1435     48
       2     M149      48
       2     M157      48
       2     M213      48

second file would be 'CloneID5.txt'

    CloneID   ID  Length
      5     H903      48
      5     M730      48

so there would be no 'CloneID3.txt', 'CloneID4.txt' and 'CloneID6.txt' because those clones do not have any ID that starts with"H".

my code:

import pandas as pd
data = pd.read_csv('data.txt', sep = '\t')
gp = data.groupby('CloneID')
for g in gp.groups:
    for s in data.ID:
        if s.startswith("H"):
           path = 'IgHCloneID' + str(g) + '.xlsx'
           gp.get_group(g).to_excel(path, index=False)

It still gave all clone files, not just the ones that contains the IDs that starts with "H".

2 Answers 2

3

You can first filter by condition in any value in column ID startswith 'H' and last groupby with to_csv:

df1 = (df.groupby('CloneID').filter(lambda x: (x.ID.str.startswith("H").any())))

df1.groupby('CloneID').apply(lambda x: x.to_csv('CloneID{}.txt'.format(x.name), index=False))
Sign up to request clarification or add additional context in comments.

Comments

0

You can groupby CloneID and write to csv directly in the apply method:

df.groupby('CloneID').apply(lambda gp: gp.to_csv('CloneID{}.txt'.format(gp.name)))

This will retain the original index, but it can be fixed by .set_index('CloneID') before to_csv call.

Edit: To retain only the groups where corresponding ID starts with H:

This requires a check for each group; here is one approach:

df.groupby('CloneID').apply(
    lambda gp: gp.to_csv('CloneID{}.txt'.format(gp.name))
    if any(gp.ID.str.startswith('H'))
    else None)

1 Comment

This seems to replicate your desired output.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.