How to modify a column of strings in pandas using a few conditionals (regex?)

Question

I have a df that looks like this

    Identifier      Ticker      ISIN        Relationship Country .......
0   5.037663e+09    BTGGg.F     D10080162   Supplier     Germany .......
1   4.295870e+09    IVXG.DE     NaN         Supplier     Germany .......
2   5.043321e+09    SAPG.DE     D66992104   Customer     Germany
3   4.295869e+09    BMWG.DE     D12096109   Customer     Germany    
4   4.295870e+09    DTEGn.DE    D2035M136   Customer     Germany
5   4.295870e+09    IFXGn.DE    D35415104   Supplier     Germany
6   4.295869e+09    NSUG.DE     D04312100   Customer     Germany
7   5.000074e+09    EVKn.DE     D2R90Y117   Customer     Germany
8   4.295869e+09    LHAG.DE     D1908N106   Customer     Germany
9   4.295869e+09    MTXGn.DE    D5565H104   Supplier     Germany
10  4.295869e+09    SIEGn.DE    D69671218   Supplier     Germany
11  4.295870e+09    TKAG.DE     D8398Q119   Supplier     Germany
12  5.059963e+09    BNRGn.DE    D12459117   Customer     Germany
13  4.295869e+09    RHMG.DE     D65111102   Supplier     Germany
14  5.001195e+09    GBFG.DE     D11648108   Supplier     Germany
15  4.295869e+09    NXUG.DE     D5650J106   Customer     Germany
16  4.295870e+09    DPWGn.DE    D19225107   Supplier     Germany
17  4.295870e+09    ILM1k.DE    D22430116   Supplier     Germany
18  4.295869e+09    ADSGn.DE    D0066B185   Customer     Germany
19. 5.125125e+12    DBS.SG      D12300523.  Supplier     SG
........................................................................

In the rows where the df['Country'] == 'Germany',

I have 2 functions I want to perform.

function 1: I want to isolate the rows where there is a smaller case letter, any smaller case letter before the ".", I want to drop it, so BTGGg.F would become BTGG.F, 1LM1k.DE would become 1LM1.DE but NXUG.DE would be unaffected.

Using that new dataframe after the first function.

function 2: Then for the rows where there is an upper case G before the ".", I want to drop the G so RHMG.DE would become RHM.DE but 1LM1.DE would be unaffected.

If I wanted to drop just the letter before the ".", the function would be simple like x = x.replace(x[x.find(".")-1],"").

But I don't know how to do this in a dataframe, and I don't know how to apply the conditions I mentioned. Can it be done and if so, how?

I was thinking it might look something like this but this obviously didn't work I already tried it.

df.loc[df['Country'].eq('Germany'),'Ticker'] = df.loc[df['Country'].eq('Germany'),'Ticker'].str.replace((df['Ticker'][df['Ticker'].find(".")-1],"") if df['Ticker'][df['Ticker'].find(".")-1] == '([a-z])')

followed by

df.loc[df['Country'].eq('Germany'),'Ticker'] = df.loc[df['Country'].eq('Germany'),'Ticker'].str.replace((df['Ticker'][df['Ticker'].find(".")-1],"") if df['Ticker'][df['Ticker'].find(".")-1] == 'G')

Here is what the output should look like after the first round, from Country == Germany, remove the first lower case letter preceding "." :

    Identifier      Ticker      ISIN        Relationship Country .......
0   5.037663e+09    BTGG.F      D10080162   Supplier     Germany .......
1   4.295870e+09    IVXG.DE     NaN         Supplier     Germany .......
2   5.043321e+09    SAPG.DE     D66992104   Customer     Germany
3   4.295869e+09    BMWG.DE     D12096109   Customer     Germany    
4   4.295870e+09    DTEG.DE     D2035M136   Customer     Germany
5   4.295870e+09    IFXG.DE     D35415104   Supplier     Germany
6   4.295869e+09    NSUG.DE     D04312100   Customer     Germany
7   5.000074e+09    EVK.DE      D2R90Y117   Customer     Germany
8   4.295869e+09    LHAG.DE     D1908N106   Customer     Germany
9   4.295869e+09    MTXG.DE     D5565H104   Supplier     Germany
10  4.295869e+09    SIEG.DE     D69671218   Supplier     Germany
11  4.295870e+09    TKAG.DE     D8398Q119   Supplier     Germany
12  5.059963e+09    BNRG.DE     D12459117   Customer     Germany
13  4.295869e+09    RHMG.DE     D65111102   Supplier     Germany
14  5.001195e+09    GBFG.DE     D11648108   Supplier     Germany
15  4.295869e+09    NXUG.DE     D5650J106   Customer     Germany
16  4.295870e+09    DPWG.DE     D19225107   Supplier     Germany
17  4.295870e+09    ILM1.DE     D22430116   Supplier     Germany
18  4.295869e+09    ADSG.DE     D0066B185   Customer     Germany
19. 5.125125e+12    DBS.SG      D12300523   Supplier     SG
........................................................................

this is after the second round, from Country == Germany, remove the first upper case "G" preceding the ".":

    Identifier      Ticker      ISIN        Relationship Country .......
0   5.037663e+09    BTG.F       D10080162   Supplier     Germany .......
1   4.295870e+09    IVX.DE      NaN         Supplier     Germany .......
2   5.043321e+09    SAP.DE      D66992104   Customer     Germany
3   4.295869e+09    BMW.DE      D12096109   Customer     Germany    
4   4.295870e+09    DTE.DE      D2035M136   Customer     Germany
5   4.295870e+09    IFX.DE      D35415104   Supplier     Germany
6   4.295869e+09    NSU.DE      D04312100   Customer     Germany
7   5.000074e+09    EVK.DE      D2R90Y117   Customer     Germany
8   4.295869e+09    LHA.DE      D1908N106   Customer     Germany
9   4.295869e+09    MTX.DE      D5565H104   Supplier     Germany
10  4.295869e+09    SIE.DE      D69671218   Supplier     Germany
11  4.295870e+09    TKA.DE      D8398Q119   Supplier     Germany
12  5.059963e+09    BNR.DE      D12459117   Customer     Germany
13  4.295869e+09    RHM.DE      D65111102   Supplier     Germany
14  5.001195e+09    GBF.DE      D11648108   Supplier     Germany
15  4.295869e+09    NXU.DE      D5650J106   Customer     Germany
16  4.295870e+09    DPW.DE      D19225107   Supplier     Germany
17  4.295870e+09    ILM1.DE     D22430116   Supplier     Germany
18  4.295869e+09    ADS.DE      D0066B185   Customer     Germany
19. 5.125125e+12    DBS.SG      D12300523   Supplier     SG
........................................................................

kindly post the expected output

sammywemmy
– sammywemmy

2020-05-21 00:11:10 +00:00
Commented May 21, 2020 at 0:11 — sammywemmy
– sammywemmy, Commented May 21, 2020 at 0:11
okay I've updated the question with the expected output

anarchy
– anarchy

2020-05-21 04:56:26 +00:00
Commented May 21, 2020 at 4:56 — anarchy
– anarchy, Commented May 21, 2020 at 4:56

Andrej Kesely · Accepted Answer · 2020-05-21 07:47:06Z

1

df.loc[df.Country == 'Germany', 'Ticker'] = df[df.Country == 'Germany']['Ticker'].str.replace(r'[a-z](?=\.)', '').str.replace(r'G(?=\.)', '')
print(df)

Prints:

      Identifier   Ticker        ISIN Relationship  Country
0   5.037663e+09    BTG.F   D10080162     Supplier  Germany
1   4.295870e+09   IVX.DE         NaN     Supplier  Germany
2   5.043321e+09   SAP.DE   D66992104     Customer  Germany
3   4.295869e+09   BMW.DE   D12096109     Customer  Germany
4   4.295870e+09   DTE.DE   D2035M136     Customer  Germany
5   4.295870e+09   IFX.DE   D35415104     Supplier  Germany
6   4.295869e+09   NSU.DE   D04312100     Customer  Germany
7   5.000074e+09   EVK.DE   D2R90Y117     Customer  Germany
8   4.295869e+09   LHA.DE   D1908N106     Customer  Germany
9   4.295869e+09   MTX.DE   D5565H104     Supplier  Germany
10  4.295869e+09   SIE.DE   D69671218     Supplier  Germany
11  4.295870e+09   TKA.DE   D8398Q119     Supplier  Germany
12  5.059963e+09   BNR.DE   D12459117     Customer  Germany
13  4.295869e+09   RHM.DE   D65111102     Supplier  Germany
14  5.001195e+09   GBF.DE   D11648108     Supplier  Germany
15  4.295869e+09   NXU.DE   D5650J106     Customer  Germany
16  4.295870e+09   DPW.DE   D19225107     Supplier  Germany
17  4.295870e+09  ILM1.DE   D22430116     Supplier  Germany
18  4.295869e+09   ADS.DE   D0066B185     Customer  Germany
19  5.125125e+12   DBS.SG  D12300523.     Supplier       SG

answered May 21, 2020 at 7:47

Andrej Kesely

196k15 gold badges60 silver badges105 bronze badges

Sign up to request clarification or add additional context in comments.

5 Comments

anarchy Over a year ago

wow thank you! I've been trying to figure this out for ages! do you know where I can learn the regex syntax for stuff like this?

Andrej Kesely Over a year ago

@anarchy Good starting place is re module documentation and you can try regexes on regex101 for example.

anarchy Over a year ago

thanks! does using (r'G?[a-z]?(?=\.)', '') achieve the same result as using both those methods above.

Andrej Kesely Over a year ago

@anarchy Yes, you can try it. But I separated it to be more explicit.

anarchy Over a year ago

okay thanks, just wondering it there would be an error that's all

OhadP · Accepted Answer · 2020-05-21 07:51:23Z

0

For dealing with the functions you mentioned, you can use some regex to perform them both

re.sub(r'G?[a-z]?(?=\.)', '', 'HERE_YOUR_COUNTRY_THINGY')

You didn't given more rules

answered May 21, 2020 at 7:51

OhadP

815 bronze badges

Collectives™ on Stack Overflow

How to modify a column of strings in pandas using a few conditionals (regex?)

2 Answers 2

5 Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

5 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related