1

I have 25 data frames which I need to merge and find recurrently occurring rows from all 25 data frames, For example, my data frame looks like following,

df1
chr start   end     name
1   12334   12334   AAA
1   2342    2342    SAP
2   3456    3456    SOS
3   4537    4537    ABR
df2
chr start   end     name
1   12334   12334   DSF
1   3421    3421    KSF
2   7689    7689    LUF
df3 
chr start   end     name
1   12334   12334   DSF
1   3421    3421    KSF
2   4537    4537    LUF
3   8976    8976    BAR
4   6789    6789    AIN

And In the end, I am aiming to have an output data frame like following,

chr start   end     name    Sample
1   12334   12334   AAA df1
1   12334   12334   AAA df2
1   12334   12334   AAA df3

I can get there with the following solution, By dictionary which adds all these three data frames into one bigger data frame dfs

dfs = {'df1': df1, 'df2': df2}

Then further,

common_tups = set.intersection(*[set(df[['chr', 'start', 'end']].drop_duplicates().apply(tuple, axis=1).values) for df in dfs.values()])
pd.concat([df[df[['chr', 'start', 'end']].apply(tuple, axis=1).isin(common_tups)].assign(Sample=name) for (name, df) in dfs.items()])

This gives out the resulting data frame with matching rows from all three data frames, but I have 25 data frames which I am calling as list from the directory as following,

path         = 'Fltered_vcfs/' 
files        = os.listdir(path)
results      = [os.path.join(path,i) for i in files if i.startswith('vcf_filtered')]

And so how can I show the list 'results' in the dictionary and proceed further to get the desired output. Any help or suggestions are greatly appreciated.

Thank you

1 Answer 1

1

Using the glob module, you can use

import os
from glob import glob

path = 'Fltered_vcfs' 
f_names = glob(os.path.join(path, 'vcf_filtered*.*')) 

Then, your dictionary can be created with dictionary comprehension using

import pandas as pd

 {os.path.splitext(os.path.split(f_name)[1])[0]: pd.read_csv(f_name,sep='\t') for f_name in f_names}
Sign up to request clarification or add additional context in comments.

7 Comments

There is a problem, my actual datafarmes has more columns, say like 18 for each of them.
@user1017373 Not sure what the problem is. If you'd like to load only a specific subset of the columns, you can use usecols=['start', 'end', 'name']. Is that your problem?
No, loading whole columns and then merging based on matching rows from those three columns : [['chr', 'start', 'end']., the question is same, but the input slightly differs, and with the above solution its throwing index error message even after stripping the whitespace for all data frames as, KeyError: "['chr' 'start' 'end'] not in index"
@user1017373 But then it looks like they're just not in the index, no? So it doesn't look like your input matches the the specifications of your algorithm.
Actually the name of the column is different from what I specify in the demo algorithm and that I have changed when I am using the same for real data frames. Like the real dfs have columns which I needed are as , chrom, start and end
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.