Adding column by substring from another column in Pandas

Question

I have a DataFrame with one column:

DF = pd.DataFrame({'files': ["S18-000344PAS", "S18-001850HE1", "S18-00344HE1"]})

I want to add another column containing a substring of files. The final DataFrame should look like:

DF = pd.DataFrame({'files': ["S18-000344PAS", "S18-001850HE1", "S18-00344HE1"], 'stain': ["PAS", "HE1", "HE1"]})

I tried:

DF["Stain"] = DF.apply(lambda row: row.files[re.search(r'[a-zA-Z]{2,}', row.files).start():], axis=1)

But it raised:

AttributeError: 'NoneType' object has no attribute 'start'

What should I do?

Andrej Kesely · Accepted Answer · 2022-09-04 21:06:47Z

1

If you want to extract the last 3 characters from the files column, you can do:

DF["stain"] = DF["files"].str[-3:]
print(DF)

Prints:

           files stain
0  S18-000344PAS   PAS
1  S18-001850HE1   HE1
2   S18-00344HE1   HE1

EDIT: Using a regular expression to extract the stain:

DF["stain"] = DF["files"].str.extract(r"^(?:.{2,})-\d*(.+)")
print(DF)
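As a hedged sketch, the pattern can also be anchored to the end of the string. It assumes the rule the asker states in the comments: two or more letters, optionally followed by digits. The `expand=False` argument and the extra file name without a stain are illustrative additions, not part of the original answer.

```python
import pandas as pd

# The last file name is a made-up example with no stain suffix.
DF = pd.DataFrame(
    {"files": ["S18-000344PAS", "S18-001850HE1", "S18-00344HE1", "S18-000999"]}
)

# Two or more letters, then optional digits, anchored at the end of the string.
# expand=False returns a Series rather than a one-column DataFrame.
DF["stain"] = DF["files"].str.extract(r"([A-Za-z]{2,}\d*)$", expand=False)
print(DF)
```

Rows that do not match the pattern get a missing value in the stain column instead of raising an error.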

answered Sep 4, 2022 at 20:50

Andrej Kesely


4 Comments

Thank you. But is there a method that can also extract strings longer than 3 characters?

@pill45 What is the condition by which the stain string should be obtained?

It starts with letters (two or more) and may or may not be followed by digits (of any length).

@pill45 See my edit.

Just James · Accepted Answer · 2022-09-04 21:09:16Z

1

Here's one approach using the str accessor:

DF[["files", "stain"]] = DF["files"].str.extract(pat=r"(.+\d)(\D.+)")

        files stain
0  S18-000344   PAS
1  S18-001850   HE1
2   S18-00344   HE1

If you need to keep the original value in the first column, you can do:

DF["stain"] = DF["files"].str.extract(pat=r"(.+\d)(\D.+)")[1]

           files stain
0  S18-000344PAS   PAS
1  S18-001850HE1   HE1
2   S18-00344HE1   HE1
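For comparison, the `AttributeError` in the question arises because `re.search` returns `None` when a value has no match, so `.start()` is called on `None`. A minimal sketch of a guarded version of the asker's approach follows. The helper name `stain_from` and the last file name are hypothetical, added only to show the no-match case.

```python
import re
import pandas as pd

# The last file name is a made-up example with no run of two or more letters.
DF = pd.DataFrame({"files": ["S18-000344PAS", "S18-001850HE1", "S18-000999"]})

def stain_from(name):
    """Hypothetical helper: text from the first run of 2+ letters, or None."""
    match = re.search(r"[a-zA-Z]{2,}", name)
    # re.search returns None when nothing matches, so check before using it.
    return name[match.start():] if match else None

DF["stain"] = DF["files"].map(stain_from)
print(DF)
```

The `str.extract` calls above handle the no-match case on their own, so this guard is only needed when staying with `re.search`.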

answered Sep 4, 2022 at 20:57

Just James
