0

I'm writing a short Python script that finds all the URLs that point to pictures hosted on Photobucket in a phpBB forum database dump and passes them to a download manager (in my case Free Download Manager), so that I can save the images on my local computer and then move them to another host (Photobucket now asks for a yearly subscription to embed pictures hosted on its servers in other sites). I've managed to find all the pictures using a regex with lookarounds. When I tested my regex in two text editors with regex search support I found what I wanted, but in my script it gives me trouble.

import re
import os

main_path = input("Enter a path to the input file:")
with open(main_path, 'r', encoding="utf8") as file:
    file_cont = file.read()
pattern = re.compile(r'(?!(<IMG src=""))http:\/\/i[0-9][0-9][0-9]\.photobucket\.com\/albums\/[^\/]*\/[^\/]*\/[^\/]*(?=("">))')
findings = pattern.findall(file_cont)
for finding in findings:
    print(finding)
os.system("pause")

I tried to debug it by removing the download part and printing all the matches, and I get a long list of ('', '"">') instead of URLs similar to this one: http://i774.photobucket.com/albums/myalbum/Emi998/mypicture.jpg What am I doing wrong?

2
  • Python's regex engine is probably different from theirs. I'd recommend testing it with regex101, which you can switch into Python mode. Commented Aug 27, 2017 at 10:28
  • You're right. In the other testing tools it worked, but regex101 in Python mode failed to match the strings. I will use it in future. Commented Aug 27, 2017 at 13:53

2 Answers

1

Your regex pattern is not good.

I'm not sure what you were trying to do, and I would advise you to use BeautifulSoup instead of regex if you need to parse HTML (regular expressions cannot really parse HTML).
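
As a rough illustration of the parser-based approach, here is a minimal sketch. It swaps in the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages; the class and function names are made up for this example. It assumes the markup uses ordinary double quotes around attribute values. If the dump escapes quotes as "" (as the output in the question suggests), the text would probably need to be unescaped first.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class PhotobucketImageCollector(HTMLParser):
    """Collect src URLs of <img> tags hosted on photobucket.com (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <IMG> arrives as "img".
        # Self-closing tags (<img ... />) are routed here as well by default.
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if not src:
            return
        host = urlparse(src).hostname or ""
        if host.endswith(".photobucket.com"):
            self.urls.append(src)


def find_photobucket_urls(html_text):
    """Return the Photobucket image URLs found in html_text."""
    collector = PhotobucketImageCollector()
    collector.feed(html_text)
    collector.close()
    return collector.urls


sample = (
    '<p>post text</p>'
    '<IMG src="http://i774.photobucket.com/albums/myalbum/Emi998/mypicture.jpg" foo="bar"/>'
    '<img src="http://example.com/other.png">'
)
print(find_photobucket_urls(sample))
```

The host check replaces the i[0-9]{3} part of the regex with a looser "any subdomain of photobucket.com" test; tighten it if only the iNNN hosts should match.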


But anyway, with a regex, this should work:

r'<IMG src=\"(https?:\/\/i[0-9]{3}\.photobucket\.com\/albums[^\"]+)\"[^>]+\/>'

The https?:\/\/i[0-9]{3}\.photobucket\.com\/albums part filters out non-Photobucket images; [^\"]+ is more generic and simply captures everything up to the closing " of the attribute.

Example:

<IMG src="http://i774.photobucket.com/albums/myalbum/Emi998/mypicture.jpg" foo="bar"/>

This gives, in .group(1):

http://i774.photobucket.com/albums/myalbum/Emi998/mypicture.jpg
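
For completeness, here is a small sketch of how that pattern could be used from Python, run against the example line above (the variable names are only illustrative). It also hints at why the script in the question printed tuples: when a pattern contains capture groups, findall() returns the groups rather than the whole match, as a tuple per match if there is more than one group. The original pattern had two groups, one inside each lookaround. With a single group, as here, findall() returns plain strings.

```python
import re

# Pattern from the answer above; it has exactly one capture group (the URL).
IMG_PATTERN = re.compile(
    r'<IMG src=\"(https?:\/\/i[0-9]{3}\.photobucket\.com\/albums[^\"]+)\"[^>]+\/>'
)

sample = '<IMG src="http://i774.photobucket.com/albums/myalbum/Emi998/mypicture.jpg" foo="bar"/>'

# search() returns the first match (or None); group(1) is the captured URL.
match = IMG_PATTERN.search(sample)
first_url = match.group(1) if match else None

# With a single capture group, findall() returns a list of that group's text.
all_urls = IMG_PATTERN.findall(sample)

print(first_url)
print(all_urls)
```

Note that this pattern expects single double quotes and a closing />, so it may need adjusting for a dump that stores the tags differently.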

0

I think the version of your regex below should work.
Note that I use \" instead of "",
I replace img src with img.+src to also support img alt="" src,
I use [^\/]+ instead of [^\/]* so that empty path segments (//) are not accepted,
for the last part of the URL I also exclude ",
and instead of requiring > immediately after " I allow optional other characters after " with .*.

(?!(<img.+src=\"))http:\/\/i\d{3}\.photobucket\.com\/albums\/[^\/]+\/[^\/]+\/[^\/\"]+(?=\".*/>)
                                                                                   ^^       ^^^

You can use \d\d\d, [0-9]{3}, or \d{3} instead of [0-9][0-9][0-9].

[Regex Demo]
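
One caveat when using this pattern from Python: it still contains a capture group inside the first lookahead, so findall() would return that group's (empty) text rather than the URL. A possible workaround, sketched below with illustrative names, is to iterate with finditer() and take group(0), which is always the whole match. Making the group non-capturing with (?:...) should have the same effect.

```python
import re

# Pattern from this answer. It contains one capture group (inside the
# lookahead), so findall() would return that group rather than the URL.
URL_PATTERN = re.compile(
    r'(?!(<img.+src=\"))http:\/\/i\d{3}\.photobucket\.com\/albums\/[^\/]+\/[^\/]+\/[^\/\"]+(?=\".*/>)'
)

sample = '<IMG src="http://i774.photobucket.com/albums/myalbum/Emi998/mypicture.jpg" foo="bar"/>'

# group(0) is always the whole match, regardless of capture groups.
urls = [m.group(0) for m in URL_PATTERN.finditer(sample)]
print(urls)
```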
