
My question is about efficiently using a regex pattern to find matches between two pandas DataFrames extracted from Excel files.

First, thanks for taking the time to look at my issue. I am very new to python, and even more so when using it to deal with large datasets. I know that I should use stack as a very last resort.

Now I'm hitting my head against a wall for something that's probably simple.

The problem: I have two Excel files:

One is only 1 column × 300 rows of original urls.

The other can be very large: 20k translated urls and up.

The purpose: for each original url, I have to find its translated counterpart in the big Excel file.

The only common point between the two urls is an 8-digit number somewhere in the url.
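For concreteness, the shared key can be pulled out of a single url with a standard regex (the urls below are made up for illustration):

```python
import re

# Hypothetical example urls -- the only shared token is the 8-digit number
original = "https://example.com/en/products/item-12345678-red-widget"
translated = "https://example.com/fr/produits/article-12345678-widget-rouge"

pattern = re.compile(r"\d{8}")
# search() returns a match object (or None); group(0) is the matched text
assert pattern.search(original).group(0) == pattern.search(translated).group(0)
```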

I have thought about merging, but to the best of my knowledge it won't work since it is a partial match. I have found a potentially interesting approach, .where, which could let me do exactly what I want. However, I get this error:

master_list["translated"] = crawlfr.url.where(number_search.search(master_list).group(0) == number_search.search(crawl_fr).group(0), master_list.url) TypeError: expected string or buffer

As I understand this error, the issue could be that regex doesn't accept anything but a string.

In fact, when I compare two strings by extracting the match objects in a function like this, it works:
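That reading of the error is correct: `re` functions expect a plain string, not a Series or DataFrame, which is exactly what reproduces the TypeError. The vectorized pandas string methods are the element-wise equivalent (sample urls below are made up):

```python
import re
import pandas as pd

urls = pd.Series(["https://example.com/a-12345678",
                  "https://example.com/b-87654321"])
pattern = re.compile(r"\d{8}")

# Passing a whole Series to re.search raises the TypeError from the question
try:
    pattern.search(urls)
except TypeError as e:
    print(e)  # expected string or bytes-like object

# The pandas way: .str.extract applies the regex to each element
# (note the capturing group, which str.extract requires)
ids = urls.str.extract(r"(\d{8})", expand=False)
print(ids.tolist())
```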

def skl_finder(master_list, crawl_fr):
    skl_search = re.compile(r"\d{8}")
    if skl_search.search(master_list).group(0) == skl_search.search(crawl_fr).group(0):
        return skl_search.search(master_list).group(0)

I think this question is very close to what I want to do, but it did not get any replies: Pandas: Comparing two dataframes with identical data structure but differences in data using pattern matching

import regex as re
import pandas as pd
crawl_loc="translated_file_set.xlsx"

master_list_loc="original_fileset.xlsx"

crawlfr=pd.read_excel(crawl_loc,parse_cols="E")

master_list=pd.read_excel(master_list_loc)

number_search = re.compile(r"\d{8}")

master_list["translated"] = crawlfr.url.where(number_search.search(master_list).group(0) == number_search.search(crawl_fr).group(0), master_list.url)


master_list.to_excel("result_file.xlsx")
  • It's not clear what your issue is: speed? IMO you could pre-compute a search key, that is, extract the digits in each file and convert them to integers as a new column, then search on that. Commented Sep 15, 2017 at 10:19
  • Both speed and how to apply a regex between two columns. As mentioned, I get errors with regex and I can't find a working solution for pattern matching between two columns in this case. Commented Sep 15, 2017 at 10:24
  • Transfer the Excel data into lists of strings and compute the search key on those; that will be easier to compare than doing it in pandas, which you are not familiar with, IMO. Commented Sep 15, 2017 at 12:20
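The pre-computed-key idea from the comments can be sketched in plain Python, without pandas: index the translated urls by their 8-digit number once, then look each original up in O(1). The urls below are hypothetical stand-ins for the two Excel columns:

```python
import re

pattern = re.compile(r"\d{8}")

# Hypothetical url lists standing in for the two Excel files
originals = ["https://example.com/en/page-12345678"]
translated = ["https://example.com/fr/page-12345678",
              "https://example.com/fr/page-99999999"]

# Build the lookup once: 8-digit key -> translated url
by_key = {}
for url in translated:
    m = pattern.search(url)
    if m:
        by_key[m.group(0)] = url

# Each original is matched by a single dict lookup, not a scan of 20k rows
matches = {url: by_key.get(pattern.search(url).group(0)) for url in originals}
print(matches)
```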

1 Answer


Make sure your url columns are dtype str.

Try creating new columns with the 8-digit id for each (note that str.extract requires a capturing group):

crawlfr['url_id'] = crawlfr.url.str.extract(r"(\d{8})", expand=False)
master_list['url_id'] = master_list.url.str.extract(r"(\d{8})", expand=False)

Then combine the frames on url_id (merge matches on the column in both frames, whereas join would align against the other frame's index):

master_list.merge(crawlfr, on='url_id')
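An end-to-end sketch of this approach, with made-up urls standing in for the two Excel files (how="left" keeps every original url even when no translation is found):

```python
import pandas as pd

# Hypothetical frames standing in for the two Excel files
master_list = pd.DataFrame(
    {"url": ["https://example.com/en/page-12345678",
             "https://example.com/en/page-22334455"]})
crawlfr = pd.DataFrame(
    {"url": ["https://example.com/fr/page-22334455",
             "https://example.com/fr/page-12345678",
             "https://example.com/fr/autre-99999999"]})

# str.extract needs a capturing group; expand=False returns a Series
master_list["url_id"] = master_list["url"].str.extract(r"(\d{8})", expand=False)
crawlfr["url_id"] = crawlfr["url"].str.extract(r"(\d{8})", expand=False)

# Left merge keeps every original url, matched or not
result = master_list.merge(
    crawlfr.rename(columns={"url": "translated"}), on="url_id", how="left")
print(result[["url", "translated"]])
```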