2

I have a data frame with a column named "New", shown below:

import pandas as pd

df = pd.DataFrame({'New' : ['emerald shines bright(happy)(ABCED ID - 1234556)', 'honey in the bread(ABCED ID - 123467890)','http/ABCED/id/234555', 'healing strenght(AxYBD ID -1234556)', 'this is just a text'],
'UI': ['AOT', 'BOT', 'LOV', 'HAP', 'NON']})

Now I want to extract the various IDs (for example, the ones following ABCED and AxYBD, and the id in the 'http' string) into another column.

But when I use

df['New_col'] = df['New'].str.extract(r'.*\((.*)\).*',expand=True)

I can't get it to work well, as the whole parenthesized text, for instance (ABCED ID - 1234556), is returned. Moreover, the http id 234555 is not returned.

Also, how can I clean the first column to remove the ID in parentheses, so that the result looks like this:

                            New   UI    New_col
0  emerald shines bright(happy)  AOT    1234556
1            honey in the bread  BOT  123467890
2          http/ABCED/id/234555  LOV     234555
3              healing strenght  HAP    1234556
4           this is just a text  NON
  • You have to enclose your regex in quotes: extract(r'.*\((.*)\).*',expand=True) Commented Oct 27, 2022 at 10:15
  • Try df['New_col'] = df['New'].str.extract(r'.*(?:\(\D*|http\S*/id/)(\d+)',expand=False) Commented Oct 27, 2022 at 10:16
  • @WiktorStribiżew thank you, I got it now. Nick, thanks for pointing that out. Commented Oct 27, 2022 at 12:35
  • Looks like my answer is yielding the expected output. Commented Oct 27, 2022 at 12:42
  • Ok - I have updated my answer to suit your new specifications Commented Oct 27, 2022 at 13:59

4 Answers

2

Probably not the most elegant answer, but I think this does what you want
based on the new criteria.

import re
import pandas as pd

df = pd.DataFrame({'New' : ['emerald shines bright(happy)(ABCED ID - 1234556)', 'honey in the bread(ABCED ID - 123467890)','http/ABCED/id/234555', 'healing strenght(AxYBD ID -1234556)', 'this is just a text'],
'UI': ['AOT', 'BOT', 'LOV', 'HAP', 'NON']})

def grab_id(row):
    text = re.findall(r'\(([A-Za-z]+)\sID\s-\s?(\d+)\)|/([0-9]+)', row)
    if text:
        if text[0][0]:
            return text[0][1]
        else:
            return text[0][2]
    else:
        return ""
    
    
def remove_ID_in_brackets(row):
    # Drop only the "(XXXXX ID - 123)" parenthesis; other parentheses such as "(happy)" stay.
    return re.sub(r'\(([A-Za-z]+)\sID\s-\s?(\d+)\)', '', row)

df['New_Col'] = df['New'].apply(grab_id)
df['New'] = df['New'].apply(remove_ID_in_brackets)

This is what df looks like now:

[image: screenshot of the resulting DataFrame]


2 Comments

The raw string works; however, it didn't keep the (happy) in row 0, i.e. it extracts every parenthesis. It should extract only the parenthesis with the ID in row 0, as specified in the required output.
@Kcndze - have updated to deal with the (happy) requirement
1

You can do it with the following code:

reg_expression = r'.*\(.*ID\s*-\s*(.*)\)|http\/.*\/id\/(\d*)'
# findall gives a list of (group1, group2) tuples; rows without an ID give an empty list
extract_text = lambda row: (row[0][0] or row[0][1]) if row else ''

df['New_col'] = df['New'].str.findall(reg_expression).apply(extract_text)

Output:

[image: screenshot of the output DataFrame]

Explanation:

Based on your dummy example you have to capture two patterns:

  • HTTP cases pattern http\/.*\/id\/(\d*)

    e.g http/ABCED/id/234555

  • NO HTTP cases pattern: .*\(.*ID\s*-\s*(.*)\)

    e.g emerald shines bright(ABCED ID - 1234556)

and combine them in one regular expression by using the or (|) operator.

Then, because findall returns one tuple per match with one slot per capture group, we take the non-empty value from the first match by using a lambda function.
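
To make the tuple shape concrete, here is a minimal sketch using only the standard re module on three of the sample strings. The helper name first_id is hypothetical (it is not part of the answer), and the empty-list guard is an assumption added so that rows without an ID do not raise an IndexError.

```python
import re

# Pattern copied from the answer: two alternatives, one capture group each.
PATTERN = r'.*\(.*ID\s*-\s*(.*)\)|http\/.*\/id\/(\d*)'

samples = [
    'honey in the bread(ABCED ID - 123467890)',
    'http/ABCED/id/234555',
    'this is just a text',
]

def first_id(text):
    """Hypothetical helper: return the ID from the first match, or '' if none."""
    matches = re.findall(PATTERN, text)
    if not matches:          # no alternative matched, e.g. plain text
        return ''
    group1, group2 = matches[0]
    return group1 or group2  # exactly one of the two groups is non-empty

# One list of (group1, group2) tuples per sample string.
results = [re.findall(PATTERN, s) for s in samples]
ids = [first_id(s) for s in samples]

print(results)
print(ids)
```

Whichever alternative matched fills its own slot and leaves the other as an empty string, which is why the lambda has to choose between the two positions.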

Comments

0

You can use

import pandas as pd
df = pd.DataFrame({'New' : ['emerald shines bright(ABCED ID - 1234556)', 'honey in the bread(ABCED ID - 123467890)','http/ABCED/id/234555', 'healing strenght(AxYBD ID -1234556)'], 'UI': ['AOT', 'BOT', 'LOV', 'HAP']})
df['New_col'] = df['New'].str.extract(r'.*(?:\(\D*|http\S*/id/)(\d+)',expand=False)

Output:

>>> print(df.to_string())
                                         New   UI    New_col
0  emerald shines bright(ABCED ID - 1234556)  AOT    1234556
1   honey in the bread(ABCED ID - 123467890)  BOT  123467890
2                       http/ABCED/id/234555  LOV     234555
3        healing strenght(AxYBD ID -1234556)  HAP    1234556

See the regex demo. Details:

  • .* - any zero or more chars other than line break chars as many as possible
  • (?:\(\D*|http\S*/id/) - either ( + zero or more non-digit chars, or http followed with zero or more non-whitespaces, and then /id/
  • (\d+) - Group 1: one or more digits.
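
As a rough check of the pattern described above, the sketch below runs it with the standard re module on all five strings from the question. The names extract_id and strip_id_parens are hypothetical, and the cleanup pattern ID_PARENS is an assumption of mine, not part of this answer; it is one possible way to drop only the parenthesis that holds the ID while leaving (happy) in place.

```python
import re

# Extraction pattern from the answer above.
ID_PATTERN = re.compile(r'.*(?:\(\D*|http\S*/id/)(\d+)')

# Assumed cleanup pattern (not from the answer): a parenthesis with no nested
# parentheses that contains "ID", a hyphen, and digits before the closing ")".
ID_PARENS = re.compile(r'\s*\([^()]*\bID\s*-\s*\d+\)')

rows = [
    'emerald shines bright(happy)(ABCED ID - 1234556)',
    'honey in the bread(ABCED ID - 123467890)',
    'http/ABCED/id/234555',
    'healing strenght(AxYBD ID -1234556)',
    'this is just a text',
]

def extract_id(text):
    """Return the captured digits, or None when the pattern does not match."""
    match = ID_PATTERN.search(text)
    return match.group(1) if match else None

def strip_id_parens(text):
    """Remove only the parenthesis that carries the ID."""
    return ID_PARENS.sub('', text)

ids = [extract_id(r) for r in rows]
cleaned = [strip_id_parens(r) for r in rows]

for original, new_id, new_text in zip(rows, ids, cleaned):
    print(f'{original!r} -> id={new_id!r}, cleaned={new_text!r}')
```

In pandas, the same two steps would presumably be df['New'].str.extract(pattern, expand=False) followed by df['New'].str.replace(cleanup_pattern, '', regex=True), with the extraction done first so the ID is still present in the text.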

Comments

-1

This pattern will probably help: r'[i,d,I,D]{2}.*?(\d.*?)\D'

Edited: it looks like you need letters, not digits: /?\(?(\w{5}) ?/?[i,d,I,D]{2}

1 Comment

Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.
