Objective: create a new column in a CSV if an exact match to a search entry (from a list of search entries) is found. I am new to Python, so I apologise if this is confusing.

Below I create some mock CSV files as an example. I have a "database" CSV file, and I use the entries under each column header to build string patterns:

#!/usr/bin/env python

import pandas as pd
import re
import numpy as np

#creating database example for stack overflow
data = [['Chicken','Chicken Breast'],
        ['Cattle', 'Beef'],
        ['Bird']]
database = pd.DataFrame(data, columns = ['Animal', 'Meat'])
database.to_csv('db.csv')
db = pd.read_csv('db.csv')

I also have csv files of data that include a source column which I want to search.

data_to_search = [['ID1', 'Chicken'],
                  ['ID2', 'Chicken Breast'],
                  ['ID3', 'Cat'],
                  ['ID4', 'Unknown']]
search_df = pd.DataFrame(data_to_search, columns=['Identifier','Source'])
search_df.to_csv('info.csv')

Below is an example of my ugly code:

#use the column headers in the source database csv file to create lists and patterns
Animal = db.Animal.tolist()
Animalpattern = "|".join(str(v) for v in Animal)

Meat = db.Meat.tolist()
Meatpattern = "|".join(str(v) for v in Meat)


#read the input file whose Source column will be searched
search_data = pd.read_csv('info.csv')

#search through the source column in the input file, and identify matches to the patterns from the database csv, then create new columns for matches
search_data['Animal'] = search_data['Source'].str.match(Animalpattern)
search_data['Animal'] = search_data['Animal'].map({True: 'Animal', False: ''})

search_data['Meat'] = search_data['Source'].str.match(Meatpattern)
search_data['Meat'] = search_data['Meat'].map({True: 'Meat', False: ''})

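#replace empty cells with NaN so the labels can be concatenated without extra commas
#(assign the result back; inplace=True on a selected column may not update search_data in newer pandas)
search_data['Animal'] = search_data['Animal'].replace('', np.nan)
search_data['Meat'] = search_data['Meat'].replace('', np.nan)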

#create a new column that concatenates all of the parsed source information into one
search_data['Source'] = search_data[['Animal', 'Meat']].apply(lambda x: ','.join(x[x.notnull()]), axis=1)

#output a new csv file with source data
search_data.to_csv('output.csv')

The output looks like this:

Unnamed: 0,Identifier,Source,Animal,Meat
0,ID1,Animal,Animal,
1,ID2,"Animal,Meat",Animal,Meat
2,ID3,,,
3,ID4,,,

But I would like to prevent it from outputting "Animal,Meat" where "Chicken Breast" was an entry, as it should only be a match to "Meat" but is also detecting "Chicken":

Unnamed: 0,Identifier,Source,Animal,Meat
0,ID1,Animal,Animal,
1,ID2,Meat,,Meat
2,ID3,,,
3,ID4,,,

I have it working, but I cannot figure out how to get an exact match to work: where it should be just 'Meat' for 'Chicken Breast', I end up with 'Animal,Meat' because 'Chicken' is in 'Chicken Breast'.
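To illustrate the behaviour described here, the following is a minimal sketch rather than a definitive fix. It assumes pandas 1.1 or later, where `Series.str.fullmatch` is available, and uses small inline lists in place of the CSV files. `str.match` only anchors the pattern at the start of each string, so 'Chicken Breast' counts as a hit for 'Chicken'; `str.fullmatch` requires the pattern to cover the whole string. The `re.escape` call is an added precaution in case real entries contain regex metacharacters.

```python
import re
import pandas as pd

# Inline stand-ins for the Source column and the Animal column of the database.
source = pd.Series(['Chicken', 'Chicken Breast', 'Cat', 'Unknown'])
animals = ['Chicken', 'Cattle', 'Bird']

# Escape each entry so regex metacharacters are treated literally.
animal_pattern = '|'.join(re.escape(a) for a in animals)

# str.match anchors only at the start of the string (prefix match).
prefix_hits = source.str.match(animal_pattern)

# str.fullmatch requires the whole string to match one of the entries.
exact_hits = source.str.fullmatch(animal_pattern)

print(prefix_hits.tolist())  # [True, True, False, False]
print(exact_hits.tolist())   # [True, False, False, False]
```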

My source/database file has hundreds of entries for some columns, so I need a way to read the columns in as lists, then search for the values in those columns.
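One possible way to handle many columns without regular expressions is sketched below, under a few assumptions: the database is read so that its columns are only the category columns (for example with `pd.read_csv('db.csv', index_col=0)`), the frames here are inline mock data standing in for the CSV files, and an exact match means the whole cell equals a database entry. `Series.isin` compares complete values, so no pattern building is needed, and a loop over `db.columns` replaces the repeated per-column code.

```python
import pandas as pd

# Mock data mirroring the question; real data would come from pd.read_csv.
db = pd.DataFrame({'Animal': ['Chicken', 'Cattle', 'Bird'],
                   'Meat': ['Chicken Breast', 'Beef', None]})
search_data = pd.DataFrame({'Identifier': ['ID1', 'ID2', 'ID3', 'ID4'],
                            'Source': ['Chicken', 'Chicken Breast', 'Cat', 'Unknown']})

original = search_data['Source']
for category in db.columns:
    # Drop missing entries so NaN never counts as a search term.
    terms = db[category].dropna()
    # isin compares whole cell values, so 'Chicken Breast' is not an 'Animal'.
    search_data[category] = original.isin(terms).map({True: category, False: ''})

# Join the non-empty labels of each row into one comma-separated string.
search_data['Source'] = search_data[list(db.columns)].apply(
    lambda row: ','.join(label for label in row if label), axis=1)

print(search_data)
```

With the mock data, ID1 is labelled 'Animal', ID2 is labelled 'Meat' only, and ID3 and ID4 stay empty.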

I have tried to understand if I can use info from: How to match any string from a list of strings in regular expressions in python?

But I am still very new to coding (hence why my code is long and ugly where I'm sure for-loops or something would simplify it).

Any help is appreciated!

  • It would be extremely helpful if you could show the desired outcome using the example data provided. For example, do you want a dataframe ordered in some way, a list of entries, or a dictionary of some ordered sets of values? Commented Apr 21, 2021 at 18:27
  • I edited the post to include the output and desired change, thank you! Commented Apr 21, 2021 at 18:36
