I have the following data set:
column1
HL111
PG3939HL11
HL339PG
RC--HL--PG
I am attempting to write a function that does the following:
- Loop through each row of column1
- Extract only the runs of letters and put them into an array
- If the array has "HL" in it, remove it from the array UNLESS HL is the only word in the array.
- Take the first remaining word in the array and output it as the result.
So for the above example, my array (step 2) would look like this:
[HL]
[PG,HL]
[HL,PG]
[RC,HL,PG]
and my desired final output (step 4) would look like this:
desired_column
HL
PG
PG
RC
I have the code for step 2, and it seems to work fine:
df['array_column'] = (df.column1.str.extractall('([A-Z]+)')
.unstack()
.values.tolist())
But I don't know how to get from this array column to my final output (step 4).
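To make the intended logic concrete, here is a rough sketch of all four steps in plain Python. The helper name first_word is hypothetical, and it uses re.findall on each value rather than the extractall/unstack code above, because unstack pads shorter rows with NaN. Treat it as an illustration of the rules I described, under the assumption that only uppercase letters matter, not as a finished solution.

```python
import re

def first_word(value):
    """Return the first letter run that is not 'HL', or 'HL' if nothing else exists."""
    # Step 2: collect every run of uppercase letters, e.g. "PG3939HL11" -> ["PG", "HL"]
    words = re.findall(r"[A-Z]+", value)
    if not words:
        return None  # assumption: rows without any letters yield no result
    # Step 3: drop "HL" unless that would leave the list empty
    kept = [w for w in words if w != "HL"]
    # Step 4: take the first remaining word, falling back to the original first word
    return kept[0] if kept else words[0]

column1 = ["HL111", "PG3939HL11", "HL339PG", "RC--HL--PG"]
desired_column = [first_word(v) for v in column1]
print(desired_column)
```

On a DataFrame, the same helper could presumably be applied with df['desired_column'] = df['column1'].apply(first_word).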