
I have a dataframe of movies with a synopsis column.

Title        Synopsis
Movie1       Old Macdonald had a farm         [Written by ABC rewrite] 
Movie2       Wheels on the bus                 (Source: Melon)
Movie3       Tayo the bus                      [Produced by Wills Garage]
Movie4       James and Giant Apple             (Source: Kismet)

I'd like to remove the trailing words that are not required for NLP, so that I get the dataframe below:

Title        Synopsis
Movie1       Old Macdonald had a farm         
Movie2       Wheels on the bus                
Movie3       Tayo the bus                      
Movie4       James and Giant Apple            

I've tried the following code, but my Synopsis column ends up with a string like "0"Iodfosomhgooad,somh...\n1GaBauadFal...". How can I resolve this? I'd appreciate any help, thank you.

removelist = [('[Written by]', '') ,('(Source:)', '')]
               
for old, new in removelist:
    df['Synopsis'] = re.sub(old, new, str(df['Synopsis']))



  • Is that unnecessary data present in every row? Commented Feb 10, 2021 at 13:01
  • @RishabhKumar, not necessarily, the unnecessary data can appear in any row. Commented Feb 10, 2021 at 13:03

2 Answers

1

You can use

df['Synopsis'] = df['Synopsis'].str.replace(r'\s*(?:\[[^][]*]|\([^()]*\))\s*$', '', regex=True)

See the regex demo.

Details:

  • \s* - zero or more whitespaces
  • (?:\[[^][]*]|\([^()]*\)) - either
    • \[[^][]*] - a [, any zero or more chars other than [ and ] and then a ] char
    • | - or
    • \([^()]*\) - a (, any zero or more chars other than ( and ) and then a ) char
  • \s* - zero or more whitespaces
  • $ - end of string.
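
As a quick check of the pattern described above, here is a minimal sketch that applies it with Python's re module to the sample synopses from the question. The names TRAILING_TAG, synopses and cleaned are made up for illustration. The same pattern string is what gets passed to Series.str.replace with regex=True, so the results there should match.

```python
import re

# Pattern from the answer: optional whitespace, a trailing [...] or (...)
# group, optional whitespace, then end of string.
TRAILING_TAG = re.compile(r'\s*(?:\[[^][]*]|\([^()]*\))\s*$')

# Sample synopses copied from the question (hypothetical variable name).
synopses = [
    'Old Macdonald had a farm         [Written by ABC rewrite] ',
    'Wheels on the bus                 (Source: Melon)',
    'Tayo the bus                      [Produced by Wills Garage]',
    'James and Giant Apple             (Source: Kismet)',
]

# Strip the trailing tag, plus the whitespace around it, from each synopsis.
cleaned = [TRAILING_TAG.sub('', s) for s in synopses]

for line in cleaned:
    print(repr(line))
```

Because the pattern is anchored with $, a synopsis without a trailing bracketed tag is left unchanged.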

0

You can use the str.replace method that pandas provides for string columns, with regex=True.

data['Synopsis'] = data['Synopsis'].str.replace(r'\[.*\]$|\(.*\)$', '', regex=True)

match anything between [] at end of string

\[.*\]$

alternation between the two patterns (match either one)

|

match anything between () at end of string

\(.*\)$

The result on your sample data is:

                         Synopsis
Title                            
Movie1  Old Macdonald had a farm 
Movie2         Wheels on the bus 
Movie3              Tayo the bus 
Movie4     James and Giant Apple 
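
One caveat worth checking before relying on this pattern: .* is greedy, so if a synopsis contains an earlier bracket, the match starts at that first bracket. The sketch below uses a made-up synopsis (not from the question) to compare the greedy pattern with a variant that excludes brackets inside the tag. The names text, greedy and narrow are hypothetical.

```python
import re

# Hypothetical synopsis with a bracket in the middle and a tag at the end.
text = 'Tayo [the] bus [Produced by Wills Garage]'

# Greedy version: .* runs from the first '[' to the last ']' in the string.
greedy = re.sub(r'\[.*\]$|\(.*\)$', '', text)

# Narrower version: a negated character class keeps the match inside one
# bracket pair, and \s* also removes the whitespace around the tag.
narrow = re.sub(r'\s*(?:\[[^][]*]|\([^()]*\))\s*$', '', text)

print(repr(greedy))  # 'Tayo '
print(repr(narrow))  # 'Tayo [the] bus'
```

If your synopses never contain brackets except in the trailing tag, both patterns remove the same tag. The greedy one leaves the whitespace in front of it, which a follow-up .str.strip() would remove.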

1 Comment

I upvoted, not sure who downvoted though.
