0

With Python I'm seeking to create a script that compares data in two different csvs. The first csv, filedata.csv, contains strings of filepaths containing information on user names and user ids. The second csv, roster.csv, contains those same fields broken up into different columns. I would like to search through the filepath string in filedata.csv for matches in roster.csv, and then write the columns from roster.csv into filedata.csv. Below are the csv structures, and the desired output.

filedata.csv

filename
C:\johndoe_0001_paper1.doc
C:\janedoe_0002_paper2.doc
C:\johnsmith_0003_paper3.pdf

roster.csv

first_name, last_name, user_id
john, doe, 0001
jane, doe, 0002
john, smith, 0003

Desired output for filedata.csv:

filename, first_name, last_name, user_id
C:\johndoe_0001_paper1.doc, john, doe, 0001
C:\janedoe_0002_paper2.doc, jane, doe, 0002
C:\johnsmith_0003_paper3.pdf, john, smith, 0003

I attempted the following code with Pandas to see if I can search through the strings in filenames.csv for matches from roster.csv:

import pandas as pd

df = pd.read_csv('filenames.csv')
filenames = str(df['filename'])

roster = pd.read_csv('roster.csv')
roster_last_name = str(roster['last_name'])
roster_first_name = str(roster['first_name'])
roster_user_id = str(roster['user_id'])

print(df.loc([filenames]).str.contains([roster_last_name]))

But get the following error:

TypeError: unhashable type: 'list'

Likewise I've tried something simpler, but with no success, as "False" is always returned:

if roster_last_name in filenames:
    print("True")
else:
    print("False")

I'm sure I'm missing something simple, but unsure how to proceed. All suggestions are greatly appreciated.

2
  • Many thanks. While helpful for this case, future iterations of this problem won't have the data in the two csvs lining up exactly as in this case. A search through the string will be necessary. Commented Nov 15, 2021 at 13:09
  • 1
    Well I belive, the exception occurs, because you are using .loc([filename]), but the actual syntax is .loc[filename]. However, because filename = str(df['filename']), filenameis actually the string representing the series object df.filename, which is not a list of filenames. df.filename however is. Commented Nov 15, 2021 at 13:10

2 Answers 2

1
filename['user_id'] = filename['filename'].str.extract(r'(\d{4})')
new_df = filename.merge(roster, on='user_id')

This solution adds a column to filename that is the four-digit ID (as a string) extracted from the filename, and then merges rows from the two dataframes where the user id is the same.

Your solution does not work because expressions like str(roster['last_name']) take a series and returns one string.

Update:

The above solution assumes that the user_id column in roster contains strings. If they are ints, do this:

filename['user_id'] = filename['filename'].str.extract(r'(\d{4})').astype(int)
new_df = filename.merge(roster, on='user_id')

The only difference is .astype(int).

Please let me know if this is not what was wanted.

Sign up to request clarification or add additional context in comments.

3 Comments

Thanks very much for the helpful info. However, I'm getting an error message from Pandas indicating I should pd.concat instead of pd.merge (specifically - "ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat"). What path might you suggest?
My solution assumes that the user_id column in each dataframe is a string, as your examples show the IDs as four-digit numbers with leading 0s. Leading 0s would not display if the values are encoded as ints. I will update my answer in a moment.
Many thanks for the assistance, and persistence through my incomplete description. All works great! Thanks again.
0

With the following (df1 from filenames.csv and df2 from roster.csv):

for i in df1.index:
    for c in df2.columns:
        df1.loc[df1.filename.str.contains(df2.loc[i, "last_name"] and df2.loc[i, "user_id"].astype(str)), c] = df2.loc[i, c]

This checks for last_name and user_id, because jane and john both have doe as last_name. This gives you the following:

|    | filename                     | first_name   | last_name   |   user_id |
|---:|:-----------------------------|:-------------|:------------|----------:|
|  0 | C:\johndoe_0001_paper1.doc   | john         | doe         |         1 |
|  1 | C:\janedoe_0002_paper2.doc   | jane         | doe         |         2 |
|  2 | C:\johnsmith_0003_paper3.pdf | john         | smith       |         3 |

1 Comment

Iterative solutions should be avoided in pandas as much as possible. It is possible to solve this using the merge method without any Python loops.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.