
So I am back with another question about Python and pandas. I have table1 with the following columns:

ID;COUNT;FOREIGN_ID
1;3;xyz1
2;1;xyz2
3;1;xyz3

table2

ID;FOREIGN_ID;OTHER_DATA
1;xyz1;000001
2;xyz1;000002
3;xyz1;000003
4;xyz1;000004
5;xyz1;000005
6;xyz2;000000
7;xyz2;000000
8;xyz3;000000
9;xyz3;000000

Both tables are stored as CSV files. I load both of them into DataFrames and then iterate through table1. For each row I must find all records in table2 with the same FOREIGN_ID and randomly select some of them.

df_table1 = pd.read_csv(table1, delimiter=';')
df_table2 = pd.read_csv(table2, delimiter=';')

results = []
for index, row in df_table1.iterrows():
    df_candidates = df_table2[df_table2['FOREIGN_ID'] == row['FOREIGN_ID']]
    random_numbers = np.random.choice(len(df_candidates), row['COUNT'], replace=False)
    results.append(df_candidates.iloc[random_numbers])
df_result = pd.concat(results)

In my earlier question I got an answer saying that a for loop is a big time-waster, but for this problem I can't find a solution that doesn't need one.

EDIT: I am sorry for editing my question so late; I was busy with other stuff. As requested, below is how the result table is built. Please note that my real tables are slightly different from those below: in my real use case I join the tables on 3 foreign keys, but for demonstration I am using tables with fake data.

So the logic should be something like this: Read the first line of table1.

1;3;xyz1

Find all records in table2 with the same FOREIGN_ID (count = 3, foreign_id = xyz1). The rows with foreign_id = xyz1 are:

1;xyz1;000001
2;xyz1;000002
3;xyz1;000003
4;xyz1;000004
5;xyz1;000005

Because count = 3, I must randomly choose 3 of those records. I do this with the following line, where df_candidates is the table of all suitable records (shown above):

random_numbers = np.random.choice(len(df_candidates), row['COUNT'], replace=False)

Then I store the randomly chosen records in df_result; after parsing all rows from table1, I write df_result to a CSV.

The problem is that my tables are 0.5–1 million rows big, so iterating through every row in table1 is really slow. I am sure there is a better way of doing this, but I've been stuck on it for the past 2 days.

  • Can you show us the expected output? Commented May 24, 2018 at 14:11
  • you may use pandas merge Commented May 24, 2018 at 14:12
  • table1.merge(table2, on='FOREIGN_ID') and then randomly select rows. Commented May 24, 2018 at 14:13
  • What do you mean by "same record"? Commented May 24, 2018 at 14:13
  • I edited my question to further clarify my problem. Please take a look. Commented May 25, 2018 at 10:33

2 Answers


To select the rows of table2 whose FOREIGN_ID values appear in table1, you can use, for example, pd.merge:

col = "FOREIGN_ID"

left = df_table2
right = df_table1[[col]]
filtered = pd.merge(left=left, right=right, on=col, how="inner")

Or df.isin():

ix = df_table2[col].isin(df_table1[col])
filtered = df_table2[ix]

Then to select random sample per group:

import numpy as np

def select_random_row(grp):
    # pick one random row from this group
    choice = np.random.randint(len(grp))
    return grp.iloc[choice]

filtered.groupby(col).apply(select_random_row)

1 Comment

Thanks for your answer. It does exactly what I wanted; I just had to change the select_random_row(grp) function to select a number of rows instead of only 1.
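The tweak the commenter describes might look like the sketch below: look up each group's COUNT from table1 and pass it to DataFrame.sample. The sample data and the name select_random_rows are made up to mirror the question's tables.

```python
import pandas as pd

# Hypothetical data mirroring the question's tables
df_table1 = pd.DataFrame({"FOREIGN_ID": ["xyz1", "xyz2"], "COUNT": [3, 1]})
df_table2 = pd.DataFrame({
    "ID": range(1, 8),
    "FOREIGN_ID": ["xyz1"] * 5 + ["xyz2"] * 2,
})

# Map each FOREIGN_ID to its requested sample size
counts = df_table1.set_index("FOREIGN_ID")["COUNT"]

def select_random_rows(grp):
    # sample COUNT rows for this group's key, without replacement
    return grp.sample(n=counts[grp.name])

result = df_table2.groupby("FOREIGN_ID", group_keys=False).apply(select_random_rows)
```

With group_keys=False the sampled rows keep their original table2 index instead of gaining an extra group level.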

Have you looked into using pd.merge()?

Your call would look something like:

results=pd.merge(table1, table2, how='inner', on='FOREIGN_ID')
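After the merge, every candidate row carries its COUNT, so the remaining random-selection step can be done per group. A minimal sketch with made-up data following the question's layout:

```python
import pandas as pd

# Hypothetical data following the question's layout
table1 = pd.DataFrame({"FOREIGN_ID": ["xyz1", "xyz3"], "COUNT": [2, 1]})
table2 = pd.DataFrame({
    "ID": [1, 2, 3, 8, 9],
    "FOREIGN_ID": ["xyz1", "xyz1", "xyz1", "xyz3", "xyz3"],
})

# Inner merge keeps only table2 rows whose FOREIGN_ID appears in table1
# and attaches that key's COUNT to every candidate row.
merged = pd.merge(table1, table2, how="inner", on="FOREIGN_ID")

# Within each FOREIGN_ID group, sample COUNT rows without replacement
result = merged.groupby("FOREIGN_ID", group_keys=False).apply(
    lambda g: g.sample(n=g["COUNT"].iloc[0])
)
```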

