1

I'm trying to use the re library to use regular expression to substitute text. However, even after I use astype(str), the column that I want to change is still being stored as an object.

import os
import re
import pandas as pd


#https://www.geeksforgeeks.org/update-column-value-of-csv-in-python/


# reading the csv file
df = pd.read_excel('Upload to dashboard - Untitled (1).xlsx', skiprows = 7)
df.drop(df.tail(7).index,inplace=True) # drop last n rows
print(df.head(3))

df['Publisher URL'] = df['Publisher URL'].fillna('')
df['Publisher URL'] = df['Publisher URL'].astype(str)
df['Publisher URL'] = df['Publisher URL'].str.replace('gerd#','')
print(df.dtypes)
df['Publisher URL2'] = df['Publisher URL'].str.replace('www.','')
trim = re.sub('(.*)(?:\bm\.)(.*)|(.*)','',df['Publisher URL'])
#https://docs.python.org/3/library/re.html#re.sub

print(df['trim'])
df.to_csv("C:/out.csv")

this is my output for print(df.dtypes)

[3 rows x 8 columns]
Unnamed: 0        object
Buying Agency     object
Advertiser        object
Publisher URL     object
Date              object
Buy Type          object
Ad Type           object
No. of Ads       float64

And my error is

  File ~\.spyder-py3\temp.py:30 in <module>
    trim = re.sub('(.*)(?:\bm\.)(.*)|(.*)','',df['Publisher URL'])

  File ~\Anaconda3\lib\re.py:210 in sub
    return _compile(pattern, flags).sub(repl, string, count)

TypeError: expected string or bytes-like object

How can I change my script so I can use the re.sub function?

The regex is correct, I had some help from this question:Regex Exclusions

6
  • 1
    You'll also need to deal with nulls. So do df['Publisher URL2'] = df['Publisher URL2'].fillna('') or somethin g like that. Followed by df['Publisher URL2'] = df['Publisher URL2'].astype(str). Commented Oct 12, 2022 at 10:36
  • 1
    You probably wanted df['trim'] = df['Publisher URL'].str.replace(r'(.*)(?:\bm\.)(.*)', r'\1\2', regex=True) Commented Oct 12, 2022 at 10:37
  • hi @WiktorStribiżew, I want to replace mobile version of websites, so I want to find and replace the "m." version of websites ``` m.gsmarena.com/sony_xperia_5_iv-price-11838.php``` should be ``` gsmarena.com/sony_xperia_5_iv-price-11838.php``` However, when I try and find an replace "m." this includes websites like this ``` redfm.ie``` becomes ``` redfie``` which changes the website name. Commented Oct 12, 2022 at 16:16
  • 1
    So, df['trim'] = df['Publisher URL'].str.replace(r'^(.*?)\bm\.', r'\1', regex=True) will work as you need. Commented Oct 12, 2022 at 17:26
  • 1
    It seems you have been tricked by regex101 regex flavor setting, too. You are using a PCRE option, and the replacement pattern is $1, in Python, it is \1. Commented Oct 12, 2022 at 17:33

1 Answer 1

1

You can use

df['trim'] = df['Publisher URL'].str.replace(r'^(.*?)\bm\.', r'\1', regex=True)

Or, a better idea is to use

df['trim'] = df['Publisher URL'].str.replace(r'^(.*?[/.])m\.', r'\1', regex=True)

See regex demo #1 and regex demo #2

Details:

  • ^ - start of string
  • (.*?[/.]) - Group 1 (in Python, \1, not $1, refers to this group pattern match): any chars other than line break chars, as few as possible, and then a / or .
  • m\. - m. substring.
Sign up to request clarification or add additional context in comments.

1 Comment

that's worked, thank you so much for the indepth answer! I didn't know that regex101 had a separate python option as well

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.