
My question is about efficiently using a regex pattern to find matches between two pandas DataFrames extracted from Excel files.

First, thanks for taking the time to look at my issue. I am very new to python, and even more so when using it to deal with large datasets. I know that I should use stack as a very last resort.

Now I'm hitting my head against a wall for something that's probably simple.

The problem: I have two Excel files:

One is only 1 column × 300 rows of original urls.

The other can be very large: 20k translated urls and up.

The purpose: for each original url, I have to find its translated counterpart in the big Excel file.

The only common point between the two urls is an 8-digit number somewhere in the url.
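For concreteness, the shared key can be pulled out of a single url with a standard regex (the urls below are made up for illustration):

```python
import re

# Hypothetical example urls -- the only shared token is the 8-digit number
original = "https://example.com/en/products/item-12345678-red-widget"
translated = "https://example.com/fr/produits/article-12345678-widget-rouge"

pattern = re.compile(r"\d{8}")
# search() returns a match object (or None); group(0) is the matched text
assert pattern.search(original).group(0) == pattern.search(translated).group(0)
```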

I have thought about merging, but to the best of my knowledge it won't work since it is a partial match. I have found a potentially interesting approach, .where, which could let me do exactly what I want. However, I get this error:

master_list["translated"] = crawlfr.url.where(number_search.search(master_list).group(0) == number_search.search(crawl_fr).group(0), master_list.url) TypeError: expected string or buffer

As I understand this error, the issue could be that regex doesn't accept anything but a string.

In fact, when I compare two strings by extracting the match objects in a function like this, it works:
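That reading of the error is correct: `re` functions expect a plain string, not a Series or DataFrame, which is exactly what reproduces the TypeError. The vectorized pandas string methods are the element-wise equivalent (sample urls below are made up):

```python
import re
import pandas as pd

urls = pd.Series(["https://example.com/a-12345678",
                  "https://example.com/b-87654321"])
pattern = re.compile(r"\d{8}")

# Passing a whole Series to re.search raises the TypeError from the question
try:
    pattern.search(urls)
except TypeError as e:
    print(e)  # expected string or bytes-like object

# The pandas way: .str.extract applies the regex to each element
# (note the capturing group, which str.extract requires)
ids = urls.str.extract(r"(\d{8})", expand=False)
print(ids.tolist())
```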

def skl_finder(master_list, crawl_fr):
    skl_search = re.compile(r"\d{8}")
    if skl_search.search(master_list).group(0) == skl_search.search(crawl_fr).group(0):
        return skl_search.search(master_list).group(0)

I think this question is very close to what I want to do, but it did not get any replies: Pandas: Comparing two dataframes with identical data structure but differences in data using pattern matching

import regex as re
import pandas as pd
crawl_loc="translated_file_set.xlsx"

master_list_loc="original_fileset.xlsx"

crawlfr=pd.read_excel(crawl_loc,parse_cols="E")

master_list=pd.read_excel(master_list_loc)

number_search = re.compile(r"\d{8}")

master_list["translated"] = crawlfr.url.where(number_search.search(master_list).group(0) == number_search.search(crawl_fr).group(0), master_list.url)


master_list.to_excel("result_file.xlsx")
  • It's not clear what your issue is: speed? IMO you could pre-compute a search key, that is, extract the digits in each file and convert them to integers as a new column, then search on that. Commented Sep 15, 2017 at 10:19
  • Both speed and how to apply a regex between two columns. As mentioned, I get errors with regex and I can't find a working solution for pattern matching between two columns in this case. Commented Sep 15, 2017 at 10:24
  • Transfer the Excel data into lists of strings and compute the search key on those; that will be easier to compare than doing it in pandas, which you are not familiar with, IMO. Commented Sep 15, 2017 at 12:20
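The pre-computed-key idea from the comments can be sketched in plain Python, without pandas: index the translated urls by their 8-digit number once, then look each original up in O(1). The urls below are hypothetical stand-ins for the two Excel columns:

```python
import re

pattern = re.compile(r"\d{8}")

# Hypothetical url lists standing in for the two Excel files
originals = ["https://example.com/en/page-12345678"]
translated = ["https://example.com/fr/page-12345678",
              "https://example.com/fr/page-99999999"]

# Build the lookup once: 8-digit key -> translated url
by_key = {}
for url in translated:
    m = pattern.search(url)
    if m:
        by_key[m.group(0)] = url

# Each original is matched by a single dict lookup, not a scan of 20k rows
matches = {url: by_key.get(pattern.search(url).group(0)) for url in originals}
print(matches)
```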

1 Answer


Make sure your url columns are dtype str.

Try creating new columns with the 8-digit id for each (note that str.extract requires a capturing group):

crawlfr['url_id'] = crawlfr.url.str.extract(r"(\d{8})", expand=False)
master_list['url_id'] = master_list.url.str.extract(r"(\d{8})", expand=False)

Then combine the frames on url_id (merge matches on the column in both frames, whereas join would align against the other frame's index):

master_list.merge(crawlfr, on='url_id')
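An end-to-end sketch of this approach, with made-up urls standing in for the two Excel files (how="left" keeps every original url even when no translation is found):

```python
import pandas as pd

# Hypothetical frames standing in for the two Excel files
master_list = pd.DataFrame(
    {"url": ["https://example.com/en/page-12345678",
             "https://example.com/en/page-22334455"]})
crawlfr = pd.DataFrame(
    {"url": ["https://example.com/fr/page-22334455",
             "https://example.com/fr/page-12345678",
             "https://example.com/fr/autre-99999999"]})

# str.extract needs a capturing group; expand=False returns a Series
master_list["url_id"] = master_list["url"].str.extract(r"(\d{8})", expand=False)
crawlfr["url_id"] = crawlfr["url"].str.extract(r"(\d{8})", expand=False)

# Left merge keeps every original url, matched or not
result = master_list.merge(
    crawlfr.rename(columns={"url": "translated"}), on="url_id", how="left")
print(result[["url", "translated"]])
```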