Python & Pandas - merging csv's based on string search

Question

With Python I'm seeking to create a script that compares data in two different csvs. The first csv, filedata.csv, contains strings of filepaths containing information on user names and user ids. The second csv, roster.csv, contains those same fields broken up into different columns. I would like to search through the filepath string in filedata.csv for matches in roster.csv, and then write the columns from roster.csv into filedata.csv. Below are the csv structures, and the desired output.

filedata.csv

filename
C:\johndoe_0001_paper1.doc
C:\janedoe_0002_paper2.doc
C:\johnsmith_0003_paper3.pdf

roster.csv

first_name, last_name, user_id
john, doe, 0001
jane, doe, 0002
john, smith, 0003

Desired output for filedata.csv:

filename, first_name, last_name, user_id
C:\johndoe_0001_paper1.doc, john, doe, 0001
C:\janedoe_0002_paper2.doc, jane, doe, 0002
C:\johnsmith_0003_paper3.pdf, john, smith, 0003

I attempted the following code with Pandas to see if I can search through the strings in filenames.csv for matches from roster.csv:

import pandas as pd

df = pd.read_csv('filenames.csv')
filenames = str(df['filename'])

roster = pd.read_csv('roster.csv')
roster_last_name = str(roster['last_name'])
roster_first_name = str(roster['first_name'])
roster_user_id = str(roster['user_id'])

print(df.loc([filenames]).str.contains([roster_last_name]))

But get the following error:

TypeError: unhashable type: 'list'

Likewise I've tried something simpler, but with no success, as "False" is always returned:

if roster_last_name in filenames:
    print("True")
else:
    print("False")

I'm sure I'm missing something simple, but unsure how to proceed. All suggestions are greatly appreciated.

Many thanks. While helpful for this case, future iterations of this problem won't have the data in the two csvs lining up exactly as in this case. A search through the string will be necessary.
– Daniel Hutchinson, Commented Nov 15, 2021 at 13:09
Well, I believe the exception occurs because you are using .loc([filenames]), but the actual syntax is .loc[filenames]. However, because filenames = str(df['filename']), filenames is actually a single string representing the Series object df.filename, not a list of filenames. df.filename, however, is.
– Thomas Hilger, Commented Nov 15, 2021 at 13:10

Steele Farnsworth · Accepted Answer · 2021-11-15 16:26:36Z

1

# `filename` is the dataframe read from filenames.csv (`df` in the question).
filename['user_id'] = filename['filename'].str.extract(r'(\d{4})')
new_df = filename.merge(roster, on='user_id')

This solution adds a column to filename that is the four-digit ID (as a string) extracted from the filename, and then merges rows from the two dataframes where the user id is the same.
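For reference, here is a minimal self-contained sketch of the extract-then-merge approach, using the sample rows from the question. The dataframe names `files` and `merged` are illustrative, and the sketch assumes user_id is stored as a string in both frames.

```python
import pandas as pd

# Sample data copied from the question; user_id is kept as a string
# so the leading zeros survive.
files = pd.DataFrame({"filename": [
    r"C:\johndoe_0001_paper1.doc",
    r"C:\janedoe_0002_paper2.doc",
    r"C:\johnsmith_0003_paper3.pdf",
]})
roster = pd.DataFrame({
    "first_name": ["john", "jane", "john"],
    "last_name": ["doe", "doe", "smith"],
    "user_id": ["0001", "0002", "0003"],
})

# Pull the first run of four digits out of each filename as a Series.
files["user_id"] = files["filename"].str.extract(r"(\d{4})", expand=False)

# Inner merge on the extracted key, then reorder to match the desired output.
merged = files.merge(roster, on="user_id")
merged = merged[["filename", "first_name", "last_name", "user_id"]]
print(merged)
```

Note that the pattern matches the first four-digit run anywhere in the path, so a filename containing another four-digit number before the ID would need a stricter pattern.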

Your solution does not work because expressions like str(roster['last_name']) take a whole series and return one string (its printed representation), not the individual values.
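A small sketch of why the `in` check in the question always prints "False", followed by an element-wise alternative. The variable names here are illustrative, not from the question.

```python
import pandas as pd

df = pd.DataFrame({"filename": [
    r"C:\johndoe_0001_paper1.doc",
    r"C:\janedoe_0002_paper2.doc",
    r"C:\johnsmith_0003_paper3.pdf",
]})
roster = pd.DataFrame({"last_name": ["doe", "doe", "smith"]})

# str() on a Series gives its printed form: one multi-line string that
# includes the index labels and a trailing "Name: ..." summary line.
roster_text = str(roster["last_name"])
filenames_text = str(df["filename"])

# The whole printed roster column is never a substring of the printed
# filename column, so this test is False regardless of the data.
print(roster_text in filenames_text)

# The .str accessor tests each filename separately instead.
contains_doe = df["filename"].str.contains("doe", regex=False)
print(contains_doe.tolist())
```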

Update:

The above solution assumes that the user_id column in roster contains strings. If they are ints, do this:

filename['user_id'] = filename['filename'].str.extract(r'(\d{4})').astype(int)
new_df = filename.merge(roster, on='user_id')

The only difference is .astype(int).
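As an alternative to casting the extracted IDs to int, it may be preferable to keep the IDs as strings when reading roster.csv, so the leading zeros are preserved. The sketch below uses io.StringIO as a stand-in for the file on disk, and assumes the file looks like the sample in the question (including the spaces after the commas).

```python
import io
import pandas as pd

# Stand-in for roster.csv, copied from the question.
roster_text = (
    "first_name, last_name, user_id\n"
    "john, doe, 0001\n"
    "jane, doe, 0002\n"
    "john, smith, 0003\n"
)

# Default parsing infers integers, which drops the leading zeros.
roster_int = pd.read_csv(io.StringIO(roster_text), skipinitialspace=True)

# Forcing user_id to str keeps "0001" intact, so it can be merged
# directly with IDs extracted from the filenames.
roster_str = pd.read_csv(
    io.StringIO(roster_text),
    skipinitialspace=True,
    dtype={"user_id": str},
)

print(roster_int["user_id"].tolist())
print(roster_str["user_id"].tolist())
```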

Please let me know if this is not what was wanted.

edited Nov 15, 2021 at 16:26

answered Nov 15, 2021 at 13:16

Steele Farnsworth


3 Comments

Daniel Hutchinson Over a year ago

Thanks very much for the helpful info. However, I'm getting an error message from Pandas indicating I should use pd.concat instead of pd.merge (specifically - "ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat"). What path might you suggest?

Steele Farnsworth Over a year ago

My solution assumes that the user_id column in each dataframe is a string, as your examples show the IDs as four-digit numbers with leading 0s. Leading 0s would not display if the values are encoded as ints. I will update my answer in a moment.

Daniel Hutchinson Over a year ago

Many thanks for the assistance, and persistence through my incomplete description. All works great! Thanks again.

Albo · Answer · 2021-11-15 13:29:47Z

0

With the following (df1 from filenames.csv and df2 from roster.csv):

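for i in df2.index:
    # Rows whose filename contains both this roster row's last_name and its user_id.
    has_name = df1.filename.str.contains(df2.loc[i, "last_name"], regex=False)
    has_id = df1.filename.str.contains(str(df2.loc[i, "user_id"]), regex=False)
    for c in df2.columns:
        df1.loc[has_name & has_id, c] = df2.loc[i, c]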

This checks for last_name and user_id, because jane and john both have doe as last_name. This gives you the following:

|    | filename                     | first_name   | last_name   |   user_id |
|---:|:-----------------------------|:-------------|:------------|----------:|
|  0 | C:\johndoe_0001_paper1.doc   | john         | doe         |         1 |
|  1 | C:\janedoe_0002_paper2.doc   | jane         | doe         |         2 |
|  2 | C:\johnsmith_0003_paper3.pdf | john         | smith       |         3 |

answered Nov 15, 2021 at 13:29

Albo


1 Comment

Steele Farnsworth Over a year ago

Iterative solutions should be avoided in pandas as much as possible. It is possible to solve this using the merge method without any Python loops.
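One loop-free way to "search through the string" when the ID can appear anywhere in the path is to build a regular expression from the roster's own IDs and extract whichever one matches. This is only a sketch: the file paths below are hypothetical, and it assumes the IDs are unique strings that do not occur inside other numbers in the path.

```python
import re
import pandas as pd

roster = pd.DataFrame({
    "first_name": ["john", "jane", "john"],
    "last_name": ["doe", "doe", "smith"],
    "user_id": ["0001", "0002", "0003"],
})

# Hypothetical paths: different layouts, different order, one unknown ID.
files = pd.DataFrame({"filename": [
    r"D:\essays\0003-smith-final.pdf",
    r"C:\janedoe_0002_paper2.doc",
    r"C:\notes_9999.txt",
]})

# Alternation of all known IDs, escaped in case they contain regex characters.
pattern = "(" + "|".join(re.escape(uid) for uid in roster["user_id"]) + ")"

# Each filename gets the roster ID it contains, or NaN if none matches.
files["user_id"] = files["filename"].str.extract(pattern, expand=False)

# A left merge keeps files without a match, with empty roster columns.
merged = files.merge(roster, on="user_id", how="left")
print(merged)
```

The same idea could be applied to names instead of IDs, though names such as "doe" that are shared by several roster rows would then produce multiple matches per file.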
