Troubles with regex in Python

Question

I'm writing a short Python script that finds all the URLs that point to pictures hosted on Photobucket in a phpBB forum database dump and passes them to a download manager (in my case Free Download Manager), so that I can save the images to my local computer and then move them to another host (Photobucket now asks for a yearly subscription to embed pictures hosted on its servers in other sites). I've managed to find all the pictures using a regex with lookarounds. When I tested my regex in two text editors with regex search support, I found what I wanted, but in my script it gives me trouble.

import re
import os

main_path = input("Enter a path to the input file:")
with open(main_path, 'r', encoding="utf8") as file:
    file_cont = file.read()
pattern = re.compile(r'(?!(<IMG src=""))http:\/\/i[0-9][0-9][0-9]\.photobucket\.com\/albums\/[^\/]*\/[^\/]*\/[^\/]*(?=("">))')
findings = pattern.findall(file_cont)
for finding in findings:
    print(finding)
os.system("pause")

I tried to debug it by removing the download part and printing all the matches, and I get a long list of ('', '"">') instead of URLs similar to this one: http://i774.photobucket.com/albums/myalbum/Emi998/mypicture.jpg Where am I going wrong?

Python's regex engine is probably different from theirs. I'd recommend testing it with regex101, which you can switch into Python mode.
– TemporalWolf, Commented Aug 27, 2017 at 10:28
You're right: in the other testing systems it worked, while regex101 in Python mode failed to match the strings. I will use it in future.
– Emiliano S., Commented Aug 27, 2017 at 13:53

Arount · Accepted Answer · 2017-08-27 10:41:21Z

1

Your regex pattern is not good.

I'm not sure what you tried to do, and I would advise you to use BeautifulSoup instead of playing with regex if you need to parse HTML (because regex cannot really parse HTML).
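
As a rough illustration of the parser-based route, here is a minimal sketch. It swaps in the standard library's html.parser.HTMLParser for BeautifulSoup so that it has no third-party dependency. The class name PhotobucketImgCollector and the sample markup are made up for illustration, and it assumes ordinary HTML with single double quotes around attribute values (a dump that doubles its quotes would need them un-doubled first).

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class PhotobucketImgCollector(HTMLParser):
    """Hypothetical helper: collects src URLs of <img> tags hosted on photobucket.com."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <IMG> arrives as "img".
        if tag != "img":
            return
        for name, value in attrs:
            if name == "src" and value:
                host = urlparse(value).hostname or ""
                if host.endswith(".photobucket.com"):
                    self.urls.append(value)


# Illustrative input, not taken from a real dump.
sample = (
    '<p>hi</p>'
    '<IMG src="http://i774.photobucket.com/albums/myalbum/Emi998/mypicture.jpg" foo="bar"/>'
    '<img src="http://example.com/other.png">'
)

collector = PhotobucketImgCollector()
collector.feed(sample)
collector.close()
print(collector.urls)
```

Filtering on the parsed hostname rather than on the raw text keeps the check independent of attribute order and quoting style.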

But anyway, with regex, this should work:

r'<IMG src=\"(https?:\/\/i[0-9]{3}\.photobucket\.com\/albums[^\"]+)\"[^>]+\/>'

The https?:\/\/i[0-9]{3}\.photobucket\.com\/albums part filters out non-Photobucket images; [^\"]+ is more generic and simply extracts everything up to the closing " character of the attribute.

Example:

<IMG src="http://i774.photobucket.com/albums/myalbum/Emi998/mypicture.jpg" foo="bar"/>

Gives at .group(1):

http://i774.photobucket.com/albums/myalbum/Emi998/mypicture.jpg
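
A minimal sketch of how the pattern above could be used from Python, run against the example tag shown earlier. The variable names are illustrative only. Because the pattern has exactly one capturing group, re.findall returns the captured URLs directly.

```python
import re

# The pattern from this answer, unchanged.
pattern = re.compile(
    r'<IMG src=\"(https?:\/\/i[0-9]{3}\.photobucket\.com\/albums[^\"]+)\"[^>]+\/>'
)

sample = '<IMG src="http://i774.photobucket.com/albums/myalbum/Emi998/mypicture.jpg" foo="bar"/>'

# search() returns a match object (or None); group(1) is the captured URL.
match = pattern.search(sample)
url = match.group(1) if match else None

# With a single capturing group, findall() returns a list of that group's text.
urls = pattern.findall(sample)

print(url)
print(urls)
```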

edited Aug 27, 2017 at 10:41

answered Aug 27, 2017 at 10:35

Arount


shA.t · Accepted Answer · 2017-08-27 11:02:24Z

0

I think the version of your regex below should work.
Note that I use \" instead of "",
I replace img src with img.+src to also support img alt="" src,
instead of [^\/]* I use [^\/]+ so that empty path segments (//) are not accepted,
for the last part of the URL I also exclude the " character,
and instead of requiring > immediately after ", I allow optional other characters after " with .*.

(?!(<img.+src=\"))http:\/\/i\d{3}\.photobucket\.com\/albums\/[^\/]+\/[^\/]+\/[^\/\"]+(?=\".*/>)
                                                                                   ^^       ^^^

You can use \d\d\d, [0-9]{3}, or \d{3} instead of [0-9][0-9][0-9].

[Regex Demo]
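
A possible explanation for the tuples in the question's output, sketched under the assumption that the dump lines look like the sample below (doubled quotes, as in the question's pattern): in Python, re.findall returns the contents of the capturing groups rather than the whole match whenever the pattern contains groups, and groups inside lookarounds count too. Using re.finditer and taking group(0) yields the full matched text instead.

```python
import re

# Illustrative line using the doubled quotes seen in the question's pattern.
sample = '<IMG src=""http://i774.photobucket.com/albums/myalbum/Emi998/mypicture.jpg"">'

# The question's pattern, unchanged: both lookarounds contain a capturing group.
pattern = re.compile(
    r'(?!(<IMG src=""))http:\/\/i[0-9][0-9][0-9]\.photobucket\.com'
    r'\/albums\/[^\/]*\/[^\/]*\/[^\/]*(?=("">))'
)

# findall() returns one tuple of group contents per match, not the matched URL.
groups_only = pattern.findall(sample)

# finditer() yields match objects; group(0) is the whole matched text.
full_matches = [m.group(0) for m in pattern.finditer(sample)]

print(groups_only)   # [('', '"">')]
print(full_matches)  # ['http://i774.photobucket.com/albums/myalbum/Emi998/mypicture.jpg']
```

Writing the inner groups as non-capturing, (?:...), or dropping the inner parentheses altogether, would be another way to make findall return the full matches.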

edited Aug 27, 2017 at 11:02

answered Aug 27, 2017 at 10:56

shA.t
