Loop and arrays of strings in Python

Question

I have the following data set:

column1

HL111
PG3939HL11
HL339PG
RC--HL--PG

I am attempting to write a function that does the following:

1. Loop through each row of column1.
2. Pull only the alphabetic characters and put them into an array.
3. If the array has "HL" in it, remove it from the array UNLESS HL is the only word in the array.
4. Take the first word in the array and output the result.

So for the above example, my array (step2) would look like this:

[HL]
[PG,HL]
[HL,PG]
[RC,HL,PG]

and my desired final output (step4) would look like this:

desired_column

HL
PG
PG
RC

I have the code for step 2, and it seems to work fine:

df['array_column'] = (df.column1.str.extractall('([A-Z]+)')
                    .unstack()
                    .values.tolist())

But I don't know how to get from here to my final output (step4).

What do you expect if the cell has no letters, e.g. !!!!! or 11111?
– Wiktor Stribiżew, Commented Oct 12, 2018 at 6:33
If the cell has no letters, then the result can be blank or null.
– pynewbee, Commented Oct 12, 2018 at 6:34
I added an answer that handles cells with no letters and follows your initial logic.
– Wiktor Stribiżew, Commented Oct 12, 2018 at 7:20

Wiktor Stribiżew · Accepted Answer · 2018-10-12 07:19:54Z

1

You may achieve what you need by replacing all non-letters first, then extracting pairs of letters and then applying some custom logic to extract the necessary value from the array:

>>> df['array_column'].str.replace('[^A-Z]+', '', regex=True).str.findall('([A-Z]{2})').apply(lambda d: [''] if len(d) == 0 else d).apply(lambda x: 'HL' if len(x) == 1 and x[0] == 'HL' else [m for m in x if m != 'HL'][0])
0    HL
1    PG
2    PG
3    RC
Name: array_column, dtype: object
>>>

Details

.str.replace('[^A-Z]+', '') - removes all characters other than uppercase letters (the pattern is a regular expression, so pandas 2.0 and later need regex=True here)
.str.findall('([A-Z]{2})') - extracts pairs of letters
.apply(lambda d: [''] if len(d) == 0 else d) - adds an empty item if there was no regex match in the previous step
.apply(lambda x: 'HL' if len(x) == 1 and x[0] == 'HL' else [m for m in x if m != 'HL'][0]) - custom logic: if the list has a single item and it is HL, keep it; otherwise remove every HL and take the first remaining element
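
The chained one-liner can be hard to follow, so here is a minimal sketch of the same steps as a named function. It assumes the codes are always two uppercase letters, as in the question; first_code is a hypothetical name, not part of the original answer. Unlike the one-liner, it also returns HL when HL occurs more than once and nothing else is present.

```python
import re

def first_code(cell):
    """Return the first code that is not 'HL', 'HL' if it is the only code, '' if there is none."""
    # Drop everything except uppercase letters, then split what is left into pairs
    letters = re.sub(r"[^A-Z]+", "", cell)
    codes = re.findall(r"[A-Z]{2}", letters)
    if not codes:
        return ""  # no letters in the cell
    # Keep every code except HL; fall back to HL when nothing else remains
    others = [c for c in codes if c != "HL"]
    return others[0] if others else "HL"

samples = ["HL111", "PG3939HL11", "HL339PG", "RC--HL--PG", "11111"]
results = [first_code(s) for s in samples]
print(results)
```

With pandas, the function could be applied row by row with df['column1'].apply(first_code).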

edited Oct 12, 2018 at 7:19

answered Oct 12, 2018 at 6:54


Rakesh · Accepted Answer · 2018-10-12 06:51:47Z

0

This is one approach using apply

Demo:

import re
import pandas as pd

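def checkValue(value):
    # Extract two-letter uppercase codes, e.g. "PG3939HL11" -> ["PG", "HL"]
    codes = re.findall(r"[A-Z]{2}", value)
    if not codes:
        return None  # the cell has no letters
    # Prefer the first code that is not "HL"; fall back to "HL" if nothing else is present
    others = [i for i in codes if i != "HL"]
    return others[0] if others else codes[0]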

df = pd.DataFrame({"column1": ["HL111", "PG3939HL11", "HL339PG", "RC--HL--PG"]})
print(df.column1.apply(checkValue))

Output:

0    HL
1    PG
2    PG
3    RC
Name: column1, dtype: object

answered Oct 12, 2018 at 6:51


Sven Harris · Accepted Answer · 2018-10-12 07:11:02Z

0

You can do something like this (there is probably a more elegant way). What you already have gets you to a fairly nice structure, from which you can use groupby to complete your solution:

def extract_relevant_str(grp):
    ret_val = None
    if "HL" in grp[0].tolist() and len(grp) == 1:
        ret_val = "HL"
    elif len(grp) >= 1:
        ret_val = grp.loc[grp[0] != "HL", 0].iloc[0]
    return ret_val

items = df.column1.str.extractall('([A-Z]+)')
items.reset_index().groupby("level_0").apply(extract_relevant_str)

Output:

level_0
0    HL
1    PG
2    PG
3    RC
dtype: object
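
As a variation on the same idea, here is a minimal sketch that avoids the per-group apply by taking the first non-HL match per row and falling back to the first match of any kind. The names first_other, first_any and result are illustrative, and the extra "11111" row is added only to show the no-letters case. It assumes, as in the sample data, that codes are separated by non-letters, since ([A-Z]+) would capture adjacent codes such as "HLPG" as a single run.

```python
import pandas as pd

df = pd.DataFrame({"column1": ["HL111", "PG3939HL11", "HL339PG", "RC--HL--PG", "11111"]})

# One row per run of uppercase letters, indexed by (original row, match number)
items = df["column1"].str.extractall(r"([A-Z]+)")[0]

# First run per row that is not "HL"; rows containing only "HL" are absent here
first_other = items[items != "HL"].groupby(level=0).first()
# First run per row of any kind, used as the fallback
first_any = items.groupby(level=0).first()

# Align both to the original index; rows with no letters stay NaN
result = first_other.reindex(df.index).fillna(first_any.reindex(df.index))
print(result.tolist())
```

Rows without any letters end up as NaN, which matches the "blank or null" behaviour requested in the comments.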

answered Oct 12, 2018 at 7:11
