Regex: get part of text from URL data

Question

I have many URLs of this form:

http://www.example.com/some-text-to-get/jkl/another-text-to-get

I want to be able to get this:

["some-text-to-get", "another-text-to-get"]

I tried this:

re.findall(".*([[a-z]*-[a-z]*]*).*", "http://www.example.com/some-text-to-get/jkl/another-text-to-get")

but it doesn't work. Any ideas?

The fourth bird · Accepted Answer · 2018-07-08 16:21:47Z

2

You could capture the two parts in two capturing groups:

^https?://[^/]+/([^/]+).*/(.*)$

That would match:

^ Match from the start of the string
https?:// Match http with an optional s, followed by ://
[^/]+/ Match one or more characters that are not a forward slash (negated character class), followed by a forward slash
([^/]+) Capture in group 1 one or more characters that are not a forward slash
.* Match any character zero or more times
/ Match a literal forward slash (this is the last slash, because the preceding .* is greedy)
(.*)$ Capture in group 2 any character zero or more times, then assert the end of the line with $

Your matches are in the first and second capturing group.
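
As a minimal sketch of how the pattern above might be used from Python: the helper name `first_and_last_segment` is an assumption made for illustration, not part of the answer.

```python
import re

# The pattern from the answer: group 1 is the first path segment,
# group 2 is everything after the last slash.
PATTERN = re.compile(r"^https?://[^/]+/([^/]+).*/(.*)$")

def first_and_last_segment(url):
    """Return [first_segment, last_segment], or [] if the URL does not match."""
    m = PATTERN.match(url)
    return list(m.groups()) if m else []

url = "http://www.example.com/some-text-to-get/jkl/another-text-to-get"
print(first_and_last_segment(url))
# ['some-text-to-get', 'another-text-to-get']
```

Note that the pattern needs at least two path segments: a URL such as http://www.example.com/only-one has no slash after the first segment, so it does not match at all.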

Or you could parse the URL, take the path, split it on / and pick your parts by index:

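from urllib.parse import urlparse  # Python 3; on Python 2 use: from urlparse import urlparse

o = urlparse('http://www.example.com/some-text-to-get/jkl/another-text-to-get')
# Drop empty segments; build a list so it can be indexed (filter() is lazy in Python 3)
parts = [p for p in o.path.split('/') if p]
print(parts[0])  # some-text-to-get
print(parts[2])  # another-text-to-get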

Or if you want to get the parts that contain a - you could use:

# A list comprehension, because filter() returns a lazy iterator in Python 3
parts = [p for p in o.path.split('/') if '-' in p]
print(parts)
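
One possible reason to prefer parsing over a single regex is that `urlparse` separates the path from any query string or fragment. The sketch below assumes Python 3; the function name `hyphenated_segments` and the query and fragment in the sample URL are made up for illustration.

```python
from urllib.parse import urlparse

def hyphenated_segments(url):
    """Return the path segments that contain a hyphen, ignoring query and fragment."""
    path = urlparse(url).path
    return [seg for seg in path.split('/') if '-' in seg]

# Hypothetical URL: the original one with a query string and fragment appended.
url = 'http://www.example.com/some-text-to-get/jkl/another-text-to-get?ref=a-b#top-of-page'
print(hyphenated_segments(url))
# ['some-text-to-get', 'another-text-to-get']
```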

edited Jul 8, 2018 at 16:21

answered Jul 8, 2018 at 15:20


1 Comment

VISWESWARAN NAGASIVAM Over a year ago

Python 3: from urllib.parse import urlparse

Ajax1234 · Accepted Answer · 2018-07-08 15:31:07Z

1

You can use a lookbehind and lookahead:

import re
s = 'http://www.example.com/some-text-to-get/jkl/another-text-to-get'
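# Raw string, so \. and \w reach the regex engine without invalid-escape warnings
final_result = re.findall(r'(?<=\.\w{3}/)[a-z\-]+|[a-z\-]+(?=$)', s)
print(final_result)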

Output:

['some-text-to-get', 'another-text-to-get']
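
One thing worth checking before reusing this pattern: the lookbehind `(?<=\.\w{3}/)` assumes the host ends in a dot followed by exactly three word characters (such as .com). The sketch below compares the original URL with a hypothetical one on a two-letter domain; the second URL is invented for illustration.

```python
import re

pattern = re.compile(r'(?<=\.\w{3}/)[a-z\-]+|[a-z\-]+(?=$)')

original = 'http://www.example.com/some-text-to-get/jkl/another-text-to-get'
short_tld = 'http://example.io/some-text/jkl/other-text'  # hypothetical URL

print(pattern.findall(original))   # ['some-text-to-get', 'another-text-to-get']
# ".io/" is only two characters after the dot, so the lookbehind never succeeds
# and only the end-of-string alternative matches.
print(pattern.findall(short_tld))  # ['other-text']
```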

edited Jul 8, 2018 at 15:31

answered Jul 8, 2018 at 15:19


1 Comment

Mohamed AL ANI Over a year ago

I want only lowercase words. Is that possible? I can't make it work with [a-z].

dawg · Accepted Answer · 2018-07-08 15:41:41Z

0

Given:

>>> s
"http://www.example.com/some-text-to-get/jkl/another-text-to-get"

You can use this regex:

>>> import re
>>> re.findall(r"/([a-z-]+)(?:/|$)", s)
['some-text-to-get', 'another-text-to-get']

Of course you can do this with Python string methods and a list comprehension:

>>> [e for e in s.split('/') if '-' in e]
['some-text-to-get', 'another-text-to-get']
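
A caveat on the regex version, offered as a sketch rather than a correction: `[a-z-]+` does not itself require a hyphen, and each match consumes the slash that follows it. In the example URL, `jkl` is skipped only because its leading slash was already consumed by the previous match. The variant below is an assumption of mine, not part of the answer: it requires at least one hyphen and uses a lookahead so the trailing slash is left for the next match. The second URL is hypothetical.

```python
import re

# Variant: letters, then one or more "-letters" groups; the lookahead checks
# for "/" or end of string without consuming it.
HYPHENATED = re.compile(r"/([a-z]+(?:-[a-z]+)+)(?=/|$)")

s = "http://www.example.com/some-text-to-get/jkl/another-text-to-get"
extra = "http://www.example.com/some-text/jkl/mno/other-text"  # hypothetical URL

# The answer's pattern picks up "mno" and misses "other-text" on the second URL.
print(re.findall(r"/([a-z-]+)(?:/|$)", extra))  # ['some-text', 'mno']

print(HYPHENATED.findall(s))      # ['some-text-to-get', 'another-text-to-get']
print(HYPHENATED.findall(extra))  # ['some-text', 'other-text']
```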

edited Jul 8, 2018 at 15:41

answered Jul 8, 2018 at 15:32


leamon · Accepted Answer · 2018-07-08 15:48:04Z

0

You could capture it using this regular expression:

((?:[a-z]+-)+[a-z]+)

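[a-z]+ matches one or more lowercase letters
(?:[a-z]+-) is a non-capturing group: one or more lowercase letters followed by a hyphen; the + after it repeats the group one or more times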
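
A short sketch of how this pattern might be applied with `re.findall`. The name `SLUG` and the second URL are assumptions made for illustration. Because the pattern is not anchored to the path, it also matches hyphenated words in the host name.

```python
import re

SLUG = re.compile(r"((?:[a-z]+-)+[a-z]+)")

s = "http://www.example.com/some-text-to-get/jkl/another-text-to-get"
print(SLUG.findall(s))
# ['some-text-to-get', 'another-text-to-get']

# Hypothetical URL with a hyphen in the host name: the host part matches too.
with_host_hyphen = "http://my-site.example.com/some-text"
print(SLUG.findall(with_host_hyphen))
# ['my-site', 'some-text']
```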

answered Jul 8, 2018 at 15:48
