1

I have many URLs of this form:

http://www.example.com/some-text-to-get/jkl/another-text-to-get

I want to extract this:

["some-text-to-get", "another-text-to-get"]

I tried this:

re.findall(".*([[a-z]*-[a-z]*]*).*", "http://www.example.com/some-text-to-get/jkl/another-text-to-get")

but it doesn't work. Any ideas?

4 Answers

2

You could capture the two parts in two capturing groups:

^https?://[^/]+/([^/]+).*/(.*)$

That would match:

  • ^ asserts the start of the string
  • https?:// matches http, an optional s, then ://
  • [^/]+/ matches one or more characters that are not a forward slash (the host), using a negated character class, followed by a forward slash
  • ([^/]+) captures the first path segment in group 1
  • .* matches any character zero or more times
  • / matches a literal forward slash (the last one in the string, because the preceding .* is greedy)
  • (.*)$ captures the remaining characters in group 2 and asserts the end of the string

Your matches are in the first and second capturing group.
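As a minimal sketch of how the pattern above could be used from Python (the variable names url, pattern and parts are just illustrative):

```python
import re

url = "http://www.example.com/some-text-to-get/jkl/another-text-to-get"

# Group 1 is the first path segment, group 2 is the last one.
pattern = re.compile(r"^https?://[^/]+/([^/]+).*/(.*)$")

match = pattern.match(url)
# match is None when the URL has fewer than two path segments.
parts = list(match.groups()) if match else []
print(parts)  # ['some-text-to-get', 'another-text-to-get']
```

Note that this always returns the first and last segments, whatever they contain; it does not check for hyphens.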

Or you could parse the URL, take its path, split it on / and pick the parts by index:

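from urllib.parse import urlparse  # Python 2: from urlparse import urlparse

o = urlparse('http://www.example.com/some-text-to-get/jkl/another-text-to-get')
# Drop the empty string produced by the leading slash; a list (unlike
# Python 3's filter object) can be indexed.
parts = [p for p in o.path.split('/') if p]
print(parts[0])
print(parts[2])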

Or if you want to get the parts that contain a - you could use:

# A list comprehension prints as a list on both Python 2 and Python 3
parts = [p for p in o.path.split('/') if '-' in p]
print(parts)

1 Comment

Python 3: from urllib.parse import urlparse
1

You can use a lookbehind and lookahead:

import re
s = 'http://www.example.com/some-text-to-get/jkl/another-text-to-get'
final_result = re.findall(r'(?<=\.\w{3}/)[a-z\-]+|[a-z\-]+(?=$)', s)

Output:

['some-text-to-get', 'another-text-to-get']

1 Comment

I want only lowercase words. Is that possible? I can't make it work with [a-z].
0

Given:

>>> s
"http://www.example.com/some-text-to-get/jkl/another-text-to-get"

You can use this regex (it skips jkl only because the previous match has already consumed the / in front of it):

>>> re.findall(r"/([a-z-]+)(?:/|$)", s)
['some-text-to-get', 'another-text-to-get']

Of course you can do this with Python string methods and a list comprehension:

>>> [e for e in s.split('/') if '-' in e]
['some-text-to-get', 'another-text-to-get']
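If only fully lowercase, hyphenated segments are wanted, one possible variant is to split the path and test each segment as a whole. This is a sketch under that assumption; the names SLUG and lowercase_slugs are made up for illustration:

```python
import re
from urllib.parse import urlparse

# Two or more lowercase words joined by single hyphens, e.g. "some-text-to-get".
SLUG = re.compile(r"[a-z]+(?:-[a-z]+)+")

def lowercase_slugs(url):
    """Return the path segments that consist entirely of a lowercase slug."""
    segments = urlparse(url).path.split("/")
    # fullmatch rejects segments that only partly match (uppercase, digits, ...).
    return [seg for seg in segments if SLUG.fullmatch(seg)]

s = "http://www.example.com/some-text-to-get/jkl/another-text-to-get"
print(lowercase_slugs(s))  # ['some-text-to-get', 'another-text-to-get']
```

Because each segment is tested independently, adjacent hyphenated segments are all returned, and a hyphen in the host name is ignored.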

0

You could capture it using this regular expression:

((?:[a-z]+-)+[a-z]+)

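  • [a-z]+ matches one or more lowercase letters

  • (?:[a-z]+-) is a non-capturing group that matches lowercase letters followed by a hyphen; the + after it repeats the group one or more times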
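A short sketch of this pattern with re.findall, which returns the contents of the single capturing group for every match (the variable names are illustrative only):

```python
import re

pattern = re.compile(r"((?:[a-z]+-)+[a-z]+)")

s = "http://www.example.com/some-text-to-get/jkl/another-text-to-get"
result = pattern.findall(s)
print(result)  # ['some-text-to-get', 'another-text-to-get']

# Caveat: the pattern is not anchored to segment boundaries, so a segment
# containing uppercase letters can still yield a partial match.
partial = pattern.findall("Some-Text-to-get")
print(partial)  # ['ext-to-get']
```

The pattern also scans the host name, so a hyphenated domain would produce an extra match unless only the path is searched.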
