
I have a data frame with one column:

DF = pd.DataFrame({'files': ["S18-000344PAS", "S18-001850HE1", "S18-00344HE1"]})

I want to add another column containing a substring of files. The final dataframe should look like this:

DF = pd.DataFrame({'files': ["S18-000344PAS", "S18-001850HE1", "S18-00344HE1"], 'stain': ["PAS", "HE1", "HE1"]})

I tried:

DF["Stain"] = DF.apply(lambda row: row.files[re.search(r'[a-zA-Z]{2,}', row.files).start():], axis=1)

But it raised:

AttributeError: 'NoneType' object has no attribute 'start'

What should I do?

2 Answers

1

If you want to extract the last three characters from the files column, you can do:

DF["stain"] = DF["files"].str[-3:]
print(DF)

Prints:

           files stain
0  S18-000344PAS   PAS
1  S18-001850HE1   HE1
2   S18-00344HE1   HE1

EDIT: Using regular expression to extract the stain:

DF["stain"] = DF["files"].str.extract(r"^(?:.{2,})-\d*(.+)")
print(DF)
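
As a hedged alternative sketch, the rule can also be written directly as "at least two letters, optionally followed by digits, at the end of the string". That rule and the fourth sample value (a name with no stain) are assumptions added for illustration. Rows that do not match become NaN instead of raising an error.

```python
import pandas as pd

# The last entry is a hypothetical file name with no stain suffix.
DF = pd.DataFrame(
    {"files": ["S18-000344PAS", "S18-001850HE1", "S18-00344HE1", "S18-000344"]}
)

# Assumed rule: two or more letters, then optional digits, anchored at the end.
# expand=False makes str.extract return a Series for a single capture group.
DF["stain"] = DF["files"].str.extract(r"([A-Za-z]{2,}\d*)$", expand=False)
print(DF)
```

Because a non-matching row yields NaN, DF["stain"].isna() can be used afterwards to find file names that do not follow the pattern.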

4 Comments

Thank you. But is there a method that can also extract a string longer than 3 characters?
@pill45 What is the condition by which the stain string should be obtained?
It starts with at least two letters and may or may not be followed by digits (of any length).
@pill45 See my edit.
1

Here's one approach using the str accessor:

DF[["files", "stain"]] = DF["files"].str.extract(pat=r"(.+\d)(\D.+)")

        files stain
0  S18-000344   PAS
1  S18-001850   HE1
2   S18-00344   HE1

If you need to keep the original value in the first column, you can assign only the second capture group:

DF["stain"] = DF["files"].str.extract(pat=r"(.+\d)(\D.+)")[1]

           files stain
0  S18-000344PAS   PAS
1  S18-001850HE1   HE1
2   S18-00344HE1   HE1
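
For reference, re.search returns None when nothing matches, so calling .start() on the result raises the AttributeError shown in the question for any row without two consecutive letters. A minimal sketch of the apply-based approach with a guard follows. The helper name stain_of and the fourth sample value are hypothetical, added only to show the non-matching case.

```python
import re
import pandas as pd

# The last entry is a hypothetical file name with no run of two letters.
DF = pd.DataFrame(
    {"files": ["S18-000344PAS", "S18-001850HE1", "S18-00344HE1", "S18-000344"]}
)

def stain_of(name):
    """Return the text from the first run of 2+ letters onward, or None."""
    match = re.search(r"[a-zA-Z]{2,}", name)
    # Guard against None so rows without a match do not raise.
    return name[match.start():] if match else None

DF["stain"] = DF["files"].apply(stain_of)
print(DF)
```

The vectorised str.extract calls above handle the missing case on their own, so this guard is only needed when staying with apply and re.search.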
