
So I am back with another question about Python and pandas. I have table1 with the following columns:

ID;COUNT;FOREIGN_ID
1;3;xyz1
2;1;xyz2
3;1;xyz3

table2

ID;FOREIGN_ID;OTHER_DATA
1;xyz1;000001
2;xyz1;000002
3;xyz1;000003
4;xyz1;000004
5;xyz1;000005
6;xyz2;000000
7;xyz2;000000
8;xyz3;000000
9;xyz3;000000

Both tables are stored as CSV files. I load both of them into DataFrames and then iterate through table1. For each row I must find all records in table2 with the same FOREIGN_ID and randomly select some of them.

df_table1 = pd.read_csv(table1, delimiter=';')
df_table2 = pd.read_csv(table2, delimiter=';')

results = []
for index, row in df_table1.iterrows():
    df_candidates = df_table2[df_table2['FOREIGN_ID'] == row['FOREIGN_ID']]
    random_numbers = np.random.choice(len(df_candidates), row['COUNT'], replace=False)
    results.append(df_candidates.iloc[random_numbers])
df_result = pd.concat(results)

In my earlier question I got an answer saying that a for loop is a big time-waster, but for this problem I can't find a solution that doesn't need one.

EDIT: I am sorry for editing my question so late; I was busy with other stuff. As requested, below is how the result table is built. Please note that my real tables are slightly different from those below: in my real use case I join the tables on 3 foreign keys, but for demonstration I am using tables with fake data.

So the logic should be something like this: Read the first line of table1.

1;3;xyz1

Find all records in table2 with the same FOREIGN_ID (count = 3, foreign_id = xyz1). The rows with foreign_id = xyz1 are:

1;xyz1;000001
2;xyz1;000002
3;xyz1;000003
4;xyz1;000004
5;xyz1;000005

Because count = 3, I must randomly choose 3 of those records. I do this with the following line, where df_candidates is the table of all suitable records (shown above):

random_numbers = np.random.choice(len(df_candidates), row['COUNT'], replace=False)

Then I store the randomly chosen records in df_result; after parsing all rows from table1, I write df_result to a CSV.

The problem is that my tables are 0.5–1 million rows big, so iterating through every row in table1 is really slow. I am sure there is a better way of doing this, but I've been stuck on it for the past 2 days.

  • Can you show us the expected output? Commented May 24, 2018 at 14:11
  • you may use pandas merge Commented May 24, 2018 at 14:12
  • table1.merge(table2, on='FOREIGN_ID') and then randomly select rows. Commented May 24, 2018 at 14:13
  • What do you mean by "same record"? Commented May 24, 2018 at 14:13
  • I edited my question to further clarify my problem. Please take a look. Commented May 25, 2018 at 10:33

2 Answers


To select the rows of table2 whose FOREIGN_ID values appear in table1, you can use, for example, pd.merge:

col = "FOREIGN_ID"

left = df_table2
right = df_table1[[col]]
filtered = pd.merge(left=left, right=right, on=col, how="inner")

Or df.isin():

ix = df_table2[col].isin(df_table1[col])
filtered = df_table2[ix]

Then to select random sample per group:

import numpy as np

def select_random_row(grp):
    # pick one random row from this group
    choice = np.random.randint(len(grp))
    return grp.iloc[choice]

filtered.groupby(col).apply(select_random_row)

1 Comment

Thanks for your answer. It does exactly what I wanted; I just had to change the select_random_row(grp) function to select a number of rows instead of only 1.
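The tweak the commenter describes might look like the sketch below: look up each group's COUNT from table1 and pass it to DataFrame.sample. The sample data and the name select_random_rows are made up to mirror the question's tables.

```python
import pandas as pd

# Hypothetical data mirroring the question's tables
df_table1 = pd.DataFrame({"FOREIGN_ID": ["xyz1", "xyz2"], "COUNT": [3, 1]})
df_table2 = pd.DataFrame({
    "ID": range(1, 8),
    "FOREIGN_ID": ["xyz1"] * 5 + ["xyz2"] * 2,
})

# Map each FOREIGN_ID to its requested sample size
counts = df_table1.set_index("FOREIGN_ID")["COUNT"]

def select_random_rows(grp):
    # sample COUNT rows for this group's key, without replacement
    return grp.sample(n=counts[grp.name])

result = df_table2.groupby("FOREIGN_ID", group_keys=False).apply(select_random_rows)
```

With group_keys=False the sampled rows keep their original table2 index instead of gaining an extra group level.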

Have you looked into using pd.merge()?

Your call would look something like:

results=pd.merge(table1, table2, how='inner', on='FOREIGN_ID')
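After the merge, every candidate row carries its COUNT, so the remaining random-selection step can be done per group. A minimal sketch with made-up data following the question's layout:

```python
import pandas as pd

# Hypothetical data following the question's layout
table1 = pd.DataFrame({"FOREIGN_ID": ["xyz1", "xyz3"], "COUNT": [2, 1]})
table2 = pd.DataFrame({
    "ID": [1, 2, 3, 8, 9],
    "FOREIGN_ID": ["xyz1", "xyz1", "xyz1", "xyz3", "xyz3"],
})

# Inner merge keeps only table2 rows whose FOREIGN_ID appears in table1
# and attaches that key's COUNT to every candidate row.
merged = pd.merge(table1, table2, how="inner", on="FOREIGN_ID")

# Within each FOREIGN_ID group, sample COUNT rows without replacement
result = merged.groupby("FOREIGN_ID", group_keys=False).apply(
    lambda g: g.sample(n=g["COUNT"].iloc[0])
)
```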

