2

A pandas column contains a series of URLs. I'd like to extract a substring from each URL. MRE code below.

import pandas as pd

s = pd.Series(['https://url-location/img/xxxyyy_image1.png'])

s.apply(lambda x: x[x.find("/")+1:x.find("_")])

I'd like to extract xxxyyy from each URL and store it in a new column.

2 Answers

3

You can use

>>> s.str.extract(r'.*/([^_]+)')
        0
0  xxxyyy

Details:

  • .* - zero or more chars other than line break chars as many as possible
  • / - a slash
  • ([^_]+) - Capturing group 1 (the value captured into this group will be the actual return value of Series.str.extract): one or more chars other than _ char.
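Since the question asks for a new column, here is a minimal sketch of how the same pattern could be assigned back to a DataFrame. The frame and the column names `url` and `prefix` are assumptions made for illustration, not part of the original post. Passing `expand=False` makes `str.extract` return a Series rather than a one-column DataFrame when the pattern has a single capturing group.

```python
import pandas as pd

# Hypothetical frame; the column names "url" and "prefix" are illustrative only.
df = pd.DataFrame({"url": ["https://url-location/img/xxxyyy_image1.png"]})

# With one capturing group and expand=False, str.extract returns a Series,
# which can be assigned directly to a new column.
df["prefix"] = df["url"].str.extract(r".*/([^_]+)", expand=False)

print(df["prefix"].tolist())  # ['xxxyyy']
```

Rows where the pattern does not match at all would get a missing value in the new column.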

2 Comments

how does it skip over /img and know to look at the one _ in the substring?
@kms .* is greedily quantified, so it grabs the whole string at first. The engine then backtracks, trying to match the remaining text with the subsequent patterns. So the / that ends up matched is the last / that is followed by one or more chars other than _. [^_] is a negated character class: it matches any char other than _, so it cannot match across a _ and stops before the first _ or at the end of the string.
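The backtracking behaviour described in this comment can be checked with the standard `re` module. The sketch below compares the greedy `.*` with a lazy `.*?` on the URL from the question. The second URL, with no underscore, is a made-up example added only to show the "end of string" case.

```python
import re

url = "https://url-location/img/xxxyyy_image1.png"

# Greedy: .* first consumes the whole string, then backtracks to the LAST "/".
greedy = re.search(r".*/([^_]+)", url).group(1)

# Lazy: .*? expands as little as possible, so the FIRST "/" is used instead.
lazy = re.search(r".*?/([^_]+)", url).group(1)

# Hypothetical URL without "_": [^_]+ simply runs to the end of the string.
no_underscore = re.search(r".*/([^_]+)", "https://host/img/plain.png").group(1)

print(greedy)         # xxxyyy
print(lazy)           # /url-location/img/xxxyyy
print(no_underscore)  # plain.png
```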
1

Also possible:

s.str.split('/').str[-1].str.split('_').str[0]
# Returns a Series whose only element is 'xxxyyy'

This works because the .str accessor supports indexing and slice notation. So .str[-1], for example, returns the last element of each list produced by the split.
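To make the chain easier to follow, here is the same approach broken into steps, as a sketch. The intermediate variable names are assumptions introduced for illustration.

```python
import pandas as pd

s = pd.Series(["https://url-location/img/xxxyyy_image1.png"])

# Step 1: split each URL on "/" and keep the last piece (the file name).
file_name = s.str.split("/").str[-1]

# Step 2: split the file name on "_" and keep the first piece.
prefix = file_name.str.split("_").str[0]

print(file_name.tolist())  # ['xxxyyy_image1.png']
print(prefix.tolist())     # ['xxxyyy']
```

Unlike the regex version, this returns the whole file name when it contains no underscore, since splitting on a missing separator yields a one-element list.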
