How to remove a certain set of strings from a column output via pandas

Question

I'm trying to remove certain strings from a DataFrame column and would like to know a better way to do it. One way is to chain multiple replace calls, but I want to avoid that.

Raw_Data

ctflex08 | SUCCESS | rc=0 | (stdout) server ntp-tichmond minpoll 4 maxpoll 10\nserver ntp-tichmond-b minpoll 4 maxpoll 10\nserver 127.127.1.0
ctfclx806 | SUCCESS | rc=0 | (stdout) server ntp-mary.example.com
ctfclx802 | SUCCESS | rc=0 | (stdout) server ntp-mary.example.com
ti-goyala | SUCCESS | rc=0 | (stdout) server ntp-tichmond minpoll 4 maxpoll 10\nserver ntp-tichmond-b minpoll 4 maxpoll 10

Data-frame Structure:

import pandas as pd
matchObj = ['(stdout)', 'server', 'minpoll', 'maxpoll' ]

df = pd.read_csv('ntp_server.txt', sep="|" , names=['Linux_Hosts', 'Host_Dist_version'])

df['Host_Dist_version'] =  df['Host_Dist_version'].replace("server", '',regex=True).replace("minpoll", '',regex=True)
print(df)

Current Output:

                      Linux_Hosts                                  Host_Dist_version
ctflex08      SUCCESS        rc=0    (stdout)  ntp-tichmond  4 maxpoll 10\n ntp-ti...
ctfclx806     SUCCESS        rc=0                      (stdout)  ntp-mary.example.com
ctfclx802     SUCCESS        rc=0                      (stdout)  ntp-mary.example.com
ti-goyala     SUCCESS        rc=0    (stdout)  ntp-tichmond  4 maxpoll 10\n ntp-ti...

Expected Output:

Linux_Hosts               Host_Dist_version
ctflex08                  ntp-tichmond  ntp-tichmond-b
ctfclx806                 ntp-mary.example.com
ctfclx802                 ntp-mary.example.com
ti-goyala                 ntp-tichmond ntp-tichmond-b

Is there an efficient way to pick only the required strings and remove or mask the rest? For example, given ['ntp-mary', 'ntp-tichmond', 'ntp-tichmond-b'], keep only these values and drop everything else.

Replacing some of the special characters and strings is also not working:

SUCCESS is treated as a keyword, and \n is not being removed.

user294110 · Accepted Answer · 2019-07-19 09:36:24Z

1

See the updated code:

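import pandas as pd
# Each line has four '|'-separated fields, so name all four columns.
df = pd.read_csv('ntp_server.txt', sep="|", names=['Linux_Hosts', 'Status', 'RC', 'Host_Dist_version'])
# Match every token that starts with 'ntp' and runs up to the next whitespace.
pattern = r'(ntp+[^\s]+)'
# findall returns a list of matches per row; join them into one space-separated string.
df['Host_Dist_version'] = df['Host_Dist_version'].str.findall(pattern).str.join(' ')
# Drop the columns that are not needed in the final output.
df = df.drop(['Status', 'RC'], axis=1)
print(df)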

Resulted Output:

  Linux_Hosts            Host_Dist_version
0   ctflex08   ntp-tichmond ntp-tichmond-b
1  ctfclx806          ntp-mary.example.com
2  ctfclx802          ntp-mary.example.com
3  ti-goyala   ntp-tichmond ntp-tichmond-b

Explanation: pattern is a regex that matches a substring starting with 'ntp' and captures everything up to the next whitespace character (which I think is the requirement). If you don't want to capture anything after the first '.', use the regex (ntp+[^\s.]+) instead.
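To see the difference between the two patterns, here is a small sketch using only the standard `re` module on two strings shaped like the sample data. The variable names `full` and `short` are just for illustration, and the multi-server string is written as a raw string on the assumption that the file contains a literal backslash followed by `n` rather than a real newline.

```python
import re

full = r"(ntp+[^\s]+)"    # pattern from the answer: stop at whitespace
short = r"(ntp+[^\s.]+)"  # variant: stop at whitespace or at the first '.'

single = "(stdout) server ntp-mary.example.com"
# Assumption: the raw file holds the two characters '\' and 'n', not a newline.
multi = r"(stdout) server ntp-tichmond minpoll 4 maxpoll 10\nserver ntp-tichmond-b minpoll 4 maxpoll 10"

full_single = re.findall(full, single)    # ['ntp-mary.example.com']
short_single = re.findall(short, single)  # ['ntp-mary']
full_multi = re.findall(full, multi)      # ['ntp-tichmond', 'ntp-tichmond-b']

print(full_single, short_single, full_multi)
```

`Series.str.findall` applies the same kind of search to each element of the column, which is why the result can be joined with `.str.join(' ')`.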

I also created the DataFrame with four columns, because the '|' separator indicates four fields per line in the text file. You can drop 'Status' and 'RC' afterwards if you don't need them. Hope this helps.
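The question also asks about keeping only names from a known list. A minimal sketch of that idea, without pandas, might look like the following. `parse_line` and `keep_allowed` are hypothetical helpers, `ALLOWED` is copied from the list in the question, and comparing only the part of each token before the first '.' is an assumption made so that `ntp-mary.example.com` still matches `ntp-mary`.

```python
import re

# Names taken from the list in the question.
ALLOWED = {"ntp-mary", "ntp-tichmond", "ntp-tichmond-b"}

def parse_line(line):
    """Split one raw line into its four '|'-separated fields, stripped of spaces."""
    host, status, rc, output = (field.strip() for field in line.split("|", 3))
    return host, status, rc, output

def keep_allowed(output):
    """Keep only tokens whose short name (text before the first '.') is allowed."""
    # Split on whitespace (including real newlines) and on a literal
    # backslash + 'n', in case the file stores that two-character sequence.
    tokens = re.split(r"\\n|\s+", output)
    return " ".join(t for t in tokens if t.split(".")[0] in ALLOWED)

line = r"ctflex08 | SUCCESS | rc=0 | (stdout) server ntp-tichmond minpoll 4 maxpoll 10\nserver ntp-tichmond-b minpoll 4 maxpoll 10\nserver 127.127.1.0"
host, status, rc, output = parse_line(line)
print(host, keep_allowed(output))
```

With a DataFrame, a function like `keep_allowed` could be passed to `df['Host_Dist_version'].apply(...)`. The regex approach above is shorter when every wanted name starts with 'ntp'; an explicit allow-list is stricter when it does not.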

edited Jul 19, 2019 at 9:36

user294110

answered Jul 18, 2019 at 14:18

ManojK

2 Comments

user294110 Over a year ago

Thanks ManojK, but I'm looking for the expected output.

ManojK Over a year ago

Can you explain in a bit more detail what the difference is between the solution and the expected output?
