I have the following data set:

column1

HL111
PG3939HL11
HL339PG
RC--HL--PG

I am attempting to write a function that does the following:

  1. Loop through each row of column1
  2. Pull only the letters and put them into an array
  3. If the array has "HL" in it, remove it from the array UNLESS HL is the only word in the array.
  4. Take the first word in the array and output results.

So for the above example, my array (step2) would look like this:

[HL]
[PG,HL]
[HL,PG]
[RC,HL,PG]

and my desired final output (step4) would look like this:

desired_column

HL
PG
PG
RC

I have the code for step 2, and it seems to work fine:

df['array_column'] = (df.column1.str.extractall('([A-Z]+)')
                    .unstack()
                    .values.tolist())

But I don't know how to get from here to my final output (step4).

3
  • What do you expect if the cell has no letters? !!!!! or 11111? Commented Oct 12, 2018 at 6:33
  • if the cell has no letters, then results can be blank or null Commented Oct 12, 2018 at 6:34
  • I added an answer that handles cells with no letters and follows your initial logic. Commented Oct 12, 2018 at 7:20

3 Answers

1

You can get there by removing all non-letters first, then extracting pairs of letters, and finally applying some custom logic to pick the required value from each list:

>>> df['column1'].str.replace('[^A-Z]+', '', regex=True).str.findall('([A-Z]{2})').apply(lambda d: [''] if len(d) == 0 else d).apply(lambda x: 'HL' if len(x) == 1 and x[0] == 'HL' else [m for m in x if m != 'HL'][0])
0    HL
1    PG
2    PG
3    RC
Name: column1, dtype: object

Details

  • .str.replace('[^A-Z]+', '') - removes all characters other than uppercase letters
  • .str.findall('([A-Z]{2})') - extracts pairs of letters
  • .apply(lambda d: [''] if len(d) == 0 else d) - adds an empty item if the previous step found no regex match
  • .apply(lambda x: 'HL' if len(x) == 1 and x[0] == 'HL' else [m for m in x if m != 'HL'][0]) - custom logic: if the list has exactly one item and that item is HL, keep it; otherwise remove every HL and take the first remaining element
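For readability, the same chain can be split into named steps. This is only a sketch under a few assumptions: the raw strings live in column1 (as in the question), pick_code is a hypothetical helper name, and a 11111 row is added to show the no-letters case. regex=True is passed explicitly because recent pandas versions treat the pattern as a literal string by default.

```python
import pandas as pd

# Sample data from the question plus one cell without letters (added for illustration).
df = pd.DataFrame(
    {"column1": ["HL111", "PG3939HL11", "HL339PG", "RC--HL--PG", "11111"]}
)

# Step 1: drop everything except uppercase letters.
letters = df["column1"].str.replace(r"[^A-Z]+", "", regex=True)

# Step 2: split the remaining letters into two-letter codes (a list per row).
pairs = letters.str.findall(r"[A-Z]{2}")


def pick_code(codes):
    """Hypothetical helper: first code that is not HL, else HL, else blank."""
    others = [c for c in codes if c != "HL"]
    if others:
        return others[0]
    return "HL" if codes else ""


# Step 3: reduce each list to a single value.
df["desired_column"] = pairs.apply(pick_code)
print(df)
```

Compared with the one-liner, the helper also covers a row such as HLHL, where filtering out HL would otherwise leave an empty list to index into.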

0

This is one approach using apply

Demo:

import re
import pandas as pd

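def checkValue(value):
    # Collect every two-letter uppercase code, e.g. "PG3939HL11" -> ["PG", "HL"]
    codes = re.findall(r"[A-Z]{2}", value)
    if not codes:
        return None  # no letters in the cell
    # Prefer the first code that is not "HL"; fall back to "HL" if it is the only one
    others = [c for c in codes if c != "HL"]
    return others[0] if others else "HL"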

df = pd.DataFrame({"column1": ["HL111", "PG3939HL11", "HL339PG", "RC--HL--PG"]})
print(df.column1.apply(checkValue))

Output:

0    HL
1    PG
2    PG
3    RC
Name: column1, dtype: object
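Note that the pattern here, [A-Z]{2}, is not the same as the [A-Z]+ used in the question. On the sample data both give the same words, but they differ once codes sit next to each other or are not exactly two letters long. A small standard-library check, with made-up sample strings, shows the difference:

```python
import re

# Made-up inputs: separated codes, adjacent codes, and a three-letter run.
samples = ["HL339PG", "HLPG", "ABC1"]

# Map each sample to (two-letter chunks, whole uppercase runs).
results = {
    text: (re.findall(r"[A-Z]{2}", text), re.findall(r"[A-Z]+", text))
    for text in samples
}

for text, (pairs, runs) in results.items():
    print(text, pairs, runs)
```

Which pattern is right depends on whether the codes are always exactly two letters; that is an assumption the question's data supports but does not guarantee.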

0

You can do something like this (there is probably a more elegant way). What you already have gets you to a fairly nice structure, and you can use groupby on it to complete your solution:

def extract_relevant_str(grp):
    ret_val = None
    if "HL" in grp[0].tolist() and len(grp) == 1:
        ret_val = "HL"
    elif len(grp) >= 1:
        ret_val = grp.loc[grp[0] != "HL", 0].iloc[0]
    return ret_val

items = df.column1.str.extractall('([A-Z]+)')
items.reset_index().groupby("level_0").apply(extract_relevant_str)

Output:

level_0
0    HL
1    PG
2    PG
3    RC
dtype: object
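One thing to watch with extractall: rows without any match are dropped, so a cell with no letters simply disappears from the grouped result. A possible way around it, sketched under the assumption that NaN is acceptable for such rows (the asker said blank or null is fine), is to reindex against the original frame. first_non_hl is a hypothetical helper name and the 11111 row is added for illustration.

```python
import pandas as pd

# Sample data from the question plus one cell without letters (added for illustration).
df = pd.DataFrame(
    {"column1": ["HL111", "PG3939HL11", "HL339PG", "RC--HL--PG", "11111"]}
)

# One entry per match, indexed by (original row, match number).
# The "11111" row produces no matches, so it is absent here.
items = df["column1"].str.extractall(r"([A-Z]+)")[0]


def first_non_hl(words):
    """Hypothetical helper: first word that is not HL, else HL."""
    others = words[words != "HL"]
    return others.iloc[0] if len(others) else "HL"


# Group the matches by original row and reduce each group to one word.
picked = items.groupby(level=0).apply(first_non_hl)

# Reindex so rows without matches come back as NaN instead of vanishing.
df["desired_column"] = picked.reindex(df.index)
print(df)
```

Assigning picked directly would also align on the index, but the explicit reindex makes the missing-row handling visible.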
