I'm trying to remove certain strings from a DataFrame column and would like to know a better way to do it. One way is to chain multiple replace calls, but I want to avoid that.

Raw_Data

ctflex08 | SUCCESS | rc=0 | (stdout) server ntp-tichmond minpoll 4 maxpoll 10\nserver ntp-tichmond-b minpoll 4 maxpoll 10\nserver 127.127.1.0
ctfclx806 | SUCCESS | rc=0 | (stdout) server ntp-mary.example.com
ctfclx802 | SUCCESS | rc=0 | (stdout) server ntp-mary.example.com
ti-goyala | SUCCESS | rc=0 | (stdout) server ntp-tichmond minpoll 4 maxpoll 10\nserver ntp-tichmond-b minpoll 4 maxpoll 10

Data-frame Structure:

import pandas as pd
matchObj = ['(stdout)', 'server', 'minpoll', 'maxpoll' ]

df = pd.read_csv('ntp_server.txt', sep="|" , names=['Linux_Hosts', 'Host_Dist_version'])

df['Host_Dist_version'] =  df['Host_Dist_version'].replace("server", '',regex=True).replace("minpoll", '',regex=True)
print(df)

Current Output:

                      Linux_Hosts                                  Host_Dist_version
ctflex08      SUCCESS        rc=0    (stdout)  ntp-tichmond  4 maxpoll 10\n ntp-ti...
ctfclx806     SUCCESS        rc=0                      (stdout)  ntp-mary.example.com
ctfclx802     SUCCESS        rc=0                      (stdout)  ntp-mary.example.com
ti-goyala     SUCCESS        rc=0    (stdout)  ntp-tichmond  4 maxpoll 10\n ntp-ti...

Expected Output:

Linux_Hosts               Host_Dist_version
ctflex08                  ntp-tichmond  ntp-tichmond-b
ctfclx806                 ntp-mary.example.com
ctfclx802                 ntp-mary.example.com
ti-goyala                 ntp-tichmond ntp-tichmond-b

Is there an efficient way to pick only the required strings and remove or mask the rest? For example, given ['ntp-mary', 'ntp-tichmond', 'ntp-tichmond-b'], keep only the values in this list and drop everything else.

Replacing some of the special characters and strings does not work as expected: SUCCESS seems to be treated as a keyword, and \n is not removed either.

1 Answer

See the updated code:

import pandas as pd
df = pd.read_csv('ntp_server.txt', sep="|" , names=['Linux_Hosts','Status','RC','Host_Dist_version'])
pattern = r'(ntp+[^\s]+)'
df['Host_Dist_version'] = df['Host_Dist_version'].str.findall(pattern).str.join(' ')
df = df.drop(['Status','RC'], axis =1)
print(df)

Resulted Output:

  Linux_Hosts            Host_Dist_version
0   ctflex08   ntp-tichmond ntp-tichmond-b
1  ctfclx806          ntp-mary.example.com
2  ctfclx802          ntp-mary.example.com
3  ti-goyala   ntp-tichmond ntp-tichmond-b

Explanation: pattern is a regex that matches a substring starting with 'ntp' and captures everything up to the next whitespace character (which I think is the requirement). If you don't want to capture anything after the first dot, use (ntp+[^\s.]+) instead.
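To see the difference between the two patterns without pandas, here is a minimal sketch using only the standard re module. The sample strings are copied from the question; it assumes the \n in the file is stored as two literal characters (a backslash followed by n), which is why a raw string is used.

```python
import re

# Sample values copied from the question. The raw string keeps "\n" as a
# literal backslash + n, which is an assumption about the file contents.
line = r"(stdout) server ntp-tichmond minpoll 4 maxpoll 10\nserver ntp-tichmond-b minpoll 4 maxpoll 10"
fqdn_line = "(stdout) server ntp-mary.example.com"

full = re.compile(r"(ntp+[^\s]+)")    # capture up to the next whitespace
short = re.compile(r"(ntp+[^\s.]+)")  # additionally stop at the first dot

names = full.findall(line)            # both server names on the line
full_fqdn = full.findall(fqdn_line)   # keeps the domain part
short_fqdn = short.findall(fqdn_line) # keeps only the host part

print(names)       # ['ntp-tichmond', 'ntp-tichmond-b']
print(full_fqdn)   # ['ntp-mary.example.com']
print(short_fqdn)  # ['ntp-mary']
```

Note that the first pattern only stops at whitespace, so a server name followed directly by a literal \n (with no minpoll in between) would be captured together with the text after it.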

I also created the DataFrame with four columns, because the '|' separator suggests the text file has four fields per line. You can drop 'Status' and 'RC' afterwards if you don't need them. Hope this helps.
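If you would rather keep only names from a fixed list, as the question suggests, the four-column read can be combined with a pattern built from that list. This is only a sketch under some assumptions: the data is held in memory via io.StringIO instead of ntp_server.txt, the variable wanted is a name introduced here for the list from the question, and the \n in the file is assumed to be a literal backslash followed by n.

```python
import io
import re

import pandas as pd

# Two lines copied from the question; "\\n" is a literal backslash + n,
# while the trailing "\n" is the real end of each line.
raw = (
    "ctflex08 | SUCCESS | rc=0 | (stdout) server ntp-tichmond minpoll 4 maxpoll 10"
    "\\nserver ntp-tichmond-b minpoll 4 maxpoll 10\\nserver 127.127.1.0\n"
    "ctfclx806 | SUCCESS | rc=0 | (stdout) server ntp-mary.example.com\n"
)

df = pd.read_csv(
    io.StringIO(raw),
    sep="|",
    names=["Linux_Hosts", "Status", "RC", "Host_Dist_version"],
)

# Hypothetical whitelist taken from the question text.
wanted = ["ntp-mary", "ntp-tichmond", "ntp-tichmond-b"]

# Longest names first, so 'ntp-tichmond-b' wins over its prefix 'ntp-tichmond'.
pattern = "|".join(re.escape(name) for name in sorted(wanted, key=len, reverse=True))

df["Linux_Hosts"] = df["Linux_Hosts"].str.strip()  # fields keep the spaces around '|'
df["Host_Dist_version"] = df["Host_Dist_version"].str.findall(pattern).str.join(" ")
result = df.drop(["Status", "RC"], axis=1)
print(result)
```

With this list the second row becomes ntp-mary rather than ntp-mary.example.com, because only the listed text is kept; add the full name to the list if you need the domain part as well.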

2 Comments

Thanks Manojk, but I'm looking for the expected output.
Can you explain in a bit more detail what the difference is between this solution and the expected output?
