1

I'm trying to extract some values from 1 column. Some rows have 3 values I want to with different setups.

My dataset:

     col1
0    1001100100 / hello street 2 a town1
1    1001100102 ;hello 3 towns2
2    STRZ19-0072 DT001001-0100 location1 town4
3    1001100103_hello street 3, town5
4    DT001002-0100 street  78
5    1001100107 DT001002-0102 street 6a town7

I need:

I need 3 new columns with different values focused on the 10 digit number starting with "100", the 'DT'number with 6 digits-4 digits and the street + town combined. Other values like 'STRZ19-0072' are not relevant.

I tried this:

df2 = df['col1'].str.extract(r"(?<col2>\d{10})|(?<col3>PR\d{6}-\d{4})|(?<col4>\w.*)")

This does not get me the DT-number from row 3.

Expected result:

     col2        col3           address 
0    1001100100  NaN            hello street 2 a town1
1    1001100102  NaN            hello 3 towns2
2    NaN         DT001001-0100  location1 town4
3    1001100103  NaN            hello street 3, town5
4    NaN         DT001002-0100  street  78
5    1001100107  DT001002-0102  street 6a town7

Appreciate the help and effort. Thank you!

1
  • 1
    Try ^(?:(?:(?!DT)\w+-\d+|(?P<col1>\d{10}))[\W_]+)?(?:(?P<col2>DT\d{6}-\d{4})[\W_]+)?(?P<col3>\w.*). See regex101.com/r/APgcbk/1 Commented Oct 28, 2022 at 18:56

2 Answers 2

2

I'd do it in 3 separate steps:

df["col2"] = df["col1"].str.extract(r"(100\d{7})")
df["col3"] = df["col1"].str.extract(r"(DT\d{6}-\d{4})")
df["col4"] = df["col1"].str.extract(r"([A-Za-z][A-Za-z\d\s,]+)$")
print(df)

Prints:

                                        col1        col2           col3                    col4
0        1001100100 / hello street 2 a town1  1001100100            NaN  hello street 2 a town1
1                 1001100102 ;hello 3 towns2  1001100102            NaN          hello 3 towns2
2  STRZ19-0072 DT001001-0100 location1 town4         NaN  DT001001-0100         location1 town4
3           1001100103_hello street 3, town5  1001100103            NaN   hello street 3, town5
4                    DT001002-0100 street 78         NaN  DT001002-0100               street 78
5   1001100107 DT001002-0102 street 6a town7  1001100107  DT001002-0102         street 6a town7
Sign up to request clarification or add additional context in comments.

Comments

2

Instead of using an alternation for all 3 groups, you can using single pattern with the 3 named capture groups after each other:

^(?:(?P<col2>\d{10})|(?![A-Z]{2}\d)\S+)?\s?(?P<col3>[A-Z]{2}\d{6}-\d{4})?[\W_]*(?P<col4>\w.*)

The pattern matches:

  • ^ Start of string
  • (?: Non capture group for the alternation
    • (?P<col2>\d{10})|(?![A-Z]{2}\d)\S+ Capture 10 digits in group col2 or match 1+ non whitspace chars if string does not start with 2 digits and a char A-Z
  • )? Close the non capture group
  • \s* Match optional whitespace chars
  • (?P<col3>[A-Z]{2}\d{6}-\d{4})? Optionally capture 2 chars A-Z, 6 digits - and 4 digits in group col3
  • [\W_]* Match optional non word chars and _
  • (?P<col4>\w.*) Capture a word character and match the rest of the line in group col4

Regex demo

import pandas as pd

data = [
    "1001100100 / hello street 2 a town1",
    "1001100102 ;hello 3 towns2",
    "STRZ19-0072 DT001001-0100 location1 town4",
    "1001100103_hello street 3, town5",
    "DT001002-0100 street  78",
    "1001100107 DT001002-0102 street 6a town7"
]

df = pd.DataFrame(data, columns=["col1"])
df2 = pd.concat(
    [
        df['col1'],
        df['col1'].str.extract(r"^(?:(?P<col2>\d{10})|(?![A-Z]{2}\d)\S+)?\s*(?P<col3>[A-Z]{2}\d{6}-\d{4})?[\W_]*(?P<col4>\w.*)")
    ], axis=1
)

print(df2)

Output
                                        col1        col2           col3                    col4
0        1001100100 / hello street 2 a town1  1001100100            NaN  hello street 2 a town1
1                 1001100102 ;hello 3 towns2  1001100102            NaN          hello 3 towns2
2  STRZ19-0072 DT001001-0100 location1 town4         NaN  DT001001-0100         location1 town4
3           1001100103_hello street 3, town5  1001100103            NaN   hello street 3, town5
4                   DT001002-0100 street  78         NaN  DT001002-0100              street  78
5   1001100107 DT001002-0102 street 6a town7  1001100107  DT001002-0102         street 6a town7

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.