I'm having trouble matching data against regexes when the data and the regexes live in two different dataframes, linked by type and country. Below is a sample of the data df and the regex df. Note that the two dataframes have different shapes, because the regex df contains only unique values.

**Data df**

  **Country    Type      Data**
      MY       ABC     MY1234567890
      IT       ABC     IT1234567890
      PL       PQR     PL123456
      MY       XYZ     456792abc
      IT       ABC     MY45889976
      IT       ABC     IT567888976

**Regex df**

  **Country    Type      Regex**
      MY       ABC     ^MY[0-9]{10}
      IT       ABC     ^IT[0-9]{10}
      PL       PQR     ^PL
      MY       XYZ     ^\w{6,10}$
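
For anyone who wants to reproduce this, here is a minimal sketch that builds the two sample frames. The variable names `df` and `df_regex` match the code in this post; the values are copied from the tables above, and nothing else is assumed.

```python
import pandas as pd

# Data df: one row per value to validate (values copied from the sample table).
df = pd.DataFrame({
    "Country": ["MY", "IT", "PL", "MY", "IT", "IT"],
    "Type": ["ABC", "ABC", "PQR", "XYZ", "ABC", "ABC"],
    "Data": ["MY1234567890", "IT1234567890", "PL123456",
             "456792abc", "MY45889976", "IT567888976"],
})

# Regex df: one pattern per (Country, Type) pair.
df_regex = pd.DataFrame({
    "Country": ["MY", "IT", "PL", "MY"],
    "Type": ["ABC", "ABC", "PQR", "XYZ"],
    "Regex": [r"^MY[0-9]{10}", r"^IT[0-9]{10}", r"^PL", r"^\w{6,10}$"],
})

print(df.shape, df_regex.shape)  # the shapes differ: (6, 3) vs (4, 3)
```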

I have tried merging them and then using a lambda to do the matching. Below is my code:

import re
df = df.merge(df_regex, left_on='Country', right_on="Country")
df['Data Quality'] = df.apply(lambda r: re.match(r['Regex'], r['Data']) and 1 or 0, axis=1)

But this adds another row for each different type within a country, so there is a lot of duplication, which is inefficient and time-consuming.

Is there a Pythonic way to match each value against the regex for its country and type when the reference is in another dataframe, without merging the two dataframes? If the value matches its own regex, the result should be 1, otherwise 0.

  • Could you include your desired output? A new column in Data df holding ones and zeros? Commented Apr 20, 2020 at 9:10
  • Yes, @JvdV. I would love to know if there is another way than concatenating. Commented Apr 20, 2020 at 14:38

1 Answer

To avoid the repetition caused by Type, include Type in the join keys as well, then apply the lambda:

import re
df2 = df.merge(df_regex, on=['Country', 'Type'])
df2['Data Quality'] = df2.apply(lambda r: re.match(r['Regex'], r['Data']) and 1 or 0, axis=1)
df2

It will give you the following output.

  Country Type          Data         Regex  Data Quality
0      MY  ABC  MY1234567890  ^MY[0-9]{10}             1
1      IT  ABC  IT1234567890  ^IT[0-9]{10}             1
2      IT  ABC    MY45889976  ^IT[0-9]{10}             0
3      IT  ABC   IT567888976  ^IT[0-9]{10}             0
4      PL  PQR      PL123456           ^PL             1
5      MY  XYZ     456792abc    ^\w{6,10}$             1
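
One caveat: `merge` defaults to an inner join, so any row of `df` whose (Country, Type) pair has no entry in `df_regex` is dropped. Below is a hedged sketch of a variant that keeps every row by using `how="left"`. The extra `SG` row and the helper name `matches` are hypothetical, added only to show the effect, and treating a missing pattern as 0 is my assumption, not something stated in the question.

```python
import re
import pandas as pd

df = pd.DataFrame({
    # The last row ("SG") is hypothetical: it has no pattern in df_regex.
    "Country": ["MY", "IT", "PL", "MY", "IT", "IT", "SG"],
    "Type": ["ABC", "ABC", "PQR", "XYZ", "ABC", "ABC", "ABC"],
    "Data": ["MY1234567890", "IT1234567890", "PL123456",
             "456792abc", "MY45889976", "IT567888976", "SG123"],
})
df_regex = pd.DataFrame({
    "Country": ["MY", "IT", "PL", "MY"],
    "Type": ["ABC", "ABC", "PQR", "XYZ"],
    "Regex": [r"^MY[0-9]{10}", r"^IT[0-9]{10}", r"^PL", r"^\w{6,10}$"],
})

# how="left" keeps every row of df. validate="many_to_one" raises an error
# if a (Country, Type) pair occurs more than once in df_regex, which is the
# situation that would duplicate rows.
merged = df.merge(df_regex, on=["Country", "Type"], how="left",
                  validate="many_to_one")

def matches(row):
    # Rows without a pattern carry NaN in "Regex"; count them as 0 (assumption).
    if pd.isna(row["Regex"]):
        return 0
    return int(re.match(row["Regex"], row["Data"]) is not None)

merged["Data Quality"] = merged.apply(matches, axis=1)
print(merged)
```

With the left join, the result has the same number of rows as `df`, in the same order.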

1 Comment

Thanks, but after the merge some data is omitted, as the shape is reduced. Anyway, is there any other way to do this without merging the two dataframes?
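
One possible way to do this without a merge, offered as a sketch rather than a definitive answer: build a dictionary that maps each (Country, Type) pair to its compiled pattern, then look the pattern up row by row. The names `patterns` and `data_quality` are hypothetical, and returning 0 for a pair with no pattern is an assumption.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "Country": ["MY", "IT", "PL", "MY", "IT", "IT"],
    "Type": ["ABC", "ABC", "PQR", "XYZ", "ABC", "ABC"],
    "Data": ["MY1234567890", "IT1234567890", "PL123456",
             "456792abc", "MY45889976", "IT567888976"],
})
df_regex = pd.DataFrame({
    "Country": ["MY", "IT", "PL", "MY"],
    "Type": ["ABC", "ABC", "PQR", "XYZ"],
    "Regex": [r"^MY[0-9]{10}", r"^IT[0-9]{10}", r"^PL", r"^\w{6,10}$"],
})

# Compile each pattern once, keyed by its (Country, Type) pair.
patterns = {
    (country, typ): re.compile(regex)
    for country, typ, regex in
    df_regex[["Country", "Type", "Regex"]].itertuples(index=False, name=None)
}

def data_quality(country, typ, value):
    # Return 1 if value matches the pattern for (country, typ), else 0.
    # Pairs with no pattern also return 0 (assumption).
    pattern = patterns.get((country, typ))
    return int(pattern is not None and pattern.match(value) is not None)

# df keeps its original shape; only one new column is added.
df["Data Quality"] = [
    data_quality(c, t, v)
    for c, t, v in zip(df["Country"], df["Type"], df["Data"])
]
print(df)
```

Because nothing is joined, no row of `df` can be duplicated or dropped, and each regex is compiled only once instead of once per row.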
