Bash/PHP extract URL from HTML via regex

Question

Is there an easy way to extract this URL in Bash or PHP?

http://shop.image-site.com/images/2/format2013/fullies/kju_product.png

From this HTML code?

<a href="javascript: open_window_zoom('http://shop.image-site.com/image.php?image=http://shop.image-site.com/images/2/format2013/fullies/kju_product.png&pID=31777&download=kju.png&name=13011 KELLYS Kju: 490mm (19.5&quot;)',550,366);">

Olaf Dietsche · Accepted Answer · 2013-02-27 22:54:27Z

2

With Perl, you can do a match with a capture group:

perl -n -e 'print "$1\n" if (m/image=(.*?)\&/);'

This captures everything between image= and the next & and prints the captured text, which Perl stores in $1.

For more on regular expressions, see perlre or http://www.regular-expressions.info/

answered Feb 27, 2013 at 22:54

Olaf Dietsche


3 Comments

Adrian Over a year ago

You guys rock! Works like a charm. Is this regex? It seems easier. Sometimes I need regex, but it is really hard to learn. :)

Olaf Dietsche Over a year ago

@Adrian It's a skill well worth learning. Start with simple regular expressions and expand on that.

L0j1k Over a year ago

To second what Olaf said, it's one of the most powerful tools a programmer has.

jitendra · Accepted Answer · 2013-02-27 23:49:51Z

2

In bash, you can try the following:

sed 's/.*image=\(http:\/\/[^&]*\).*/\1/g'

Update:
The solution above performs a substitution rather than an extraction: each line containing the pattern is replaced by the URL it matched. The substitution is not in-place, though; sed writes the result to standard output and leaves the input unchanged.
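As a minimal sketch of the point above: with -n and the p flag, sed prints only the lines where the substitution succeeded, which makes the command behave more like an extraction. The function name extract_image_url is hypothetical, and the input is the line from the question.

```shell
#!/usr/bin/env bash
# Read the sample line from the question into a variable, verbatim.
IFS= read -r html <<'EOF'
<a href="javascript: open_window_zoom('http://shop.image-site.com/image.php?image=http://shop.image-site.com/images/2/format2013/fullies/kju_product.png&pID=31777&download=kju.png&name=13011 KELLYS Kju: 490mm (19.5&quot;)',550,366);">
EOF

# extract_image_url is a hypothetical helper name, not part of the answer.
# -n suppresses sed's default output; the trailing p prints a line only
# when the substitution was actually made, so non-matching lines are dropped.
extract_image_url() {
    sed -n 's/.*image=\(http:\/\/[^&]*\).*/\1/p'
}

url=$(printf '%s\n' "$html" | extract_image_url)
printf '%s\n' "$url"
```

Lines without an image= parameter produce no output at all, instead of being passed through unchanged.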

edited Feb 27, 2013 at 23:49

answered Feb 27, 2013 at 22:50

jitendra


5 Comments

L0j1k Over a year ago

Do you really need to match the beginning of the line and the end of the line?

jitendra Over a year ago

@L0j1k I don't understand what you mean by matching the beginning and end of the line. I didn't use ^ or $ in my solution.

L0j1k Over a year ago

Aloha. That's exactly right. And if you're going to use a substitution match (which will destroy the original data, something the asker may not know about), you should be using ^ and $. It all comes down to greedy matching, like sputnick said.

jitendra Over a year ago

@L0j1k Ok. Now, I understand what you meant.

L0j1k Over a year ago

Since you know what's up now, I want to undownvote this, but I can't unless you edit your answer. If you edit your answer to include a disclaimer that your answer will perform a substitution (and therefore destroy the original data), then I'll upvote you.

L0j1k · Accepted Answer · 2013-02-27 22:52:32Z

1

Whichever way you decide to dress it up, you could simply split with the delimiter equal to ?image= and then split the second token you receive (i.e. result[1]) with a simple & delimiter. The first result from that split is your answer.

However, a pure regex match would look something like: m#image=([a-z0-9_:/.\-]+)&#i. You can use that regex wherever you like, and the result is stored in $1. Despite what a lot of people think, you do not have to match the beginning and the end of a line to match a result.
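The two-step split described above can be sketched in plain Bash with parameter expansion, without any regex. This is one possible reading of the idea, not the only one; the variable names html, rest and url are assumptions, and the input is the line from the question.

```shell
#!/usr/bin/env bash
# Read the sample line from the question into a variable, verbatim.
IFS= read -r html <<'EOF'
<a href="javascript: open_window_zoom('http://shop.image-site.com/image.php?image=http://shop.image-site.com/images/2/format2013/fullies/kju_product.png&pID=31777&download=kju.png&name=13011 KELLYS Kju: 490mm (19.5&quot;)',550,366);">
EOF

url=""
case $html in
    *'?image='*)
        # First split: drop everything up to and including the first "?image=".
        # The "?" is escaped because it is a wildcard in shell patterns.
        rest=${html#*\?image=}
        # Second split: drop everything from the first "&" to the end.
        url=${rest%%&*}
        ;;
esac

printf '%s\n' "$url"
```

The case guard matters because, without the delimiter present, the first expansion would return the whole line unchanged.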

answered Feb 27, 2013 at 22:52

L0j1k


Gilles Quénot · Accepted Answer · 2013-02-27 23:03:38Z

1

Try this:

xmllint --html --xpath '//a/@href' file://file.html |
    grep -oP 'image=\Khttp://.*?\.png'

You can use a URL instead of a local file:

http://domain.tld/path

Or, if you have already extracted the line to parse into the $string variable:

grep -oP 'image=\Khttp://.*?\.png' <<< "$string"
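Where grep -P is unavailable (BSD grep, for example, does not support it), a similar match can be sketched with Bash's own =~ operator. This is an alternative technique, not what the answer uses; the variable names re and url are assumptions, and $string holds the line from the question.

```shell
#!/usr/bin/env bash
# Read the sample line from the question into $string, verbatim.
IFS= read -r string <<'EOF'
<a href="javascript: open_window_zoom('http://shop.image-site.com/image.php?image=http://shop.image-site.com/images/2/format2013/fullies/kju_product.png&pID=31777&download=kju.png&name=13011 KELLYS Kju: 490mm (19.5&quot;)',550,366);">
EOF

# Keep the pattern in a variable so the shell does not try to interpret
# "&" or the parentheses. Bash uses POSIX extended regex here, which has
# no lazy quantifier, so [^&]* is used to stop before the next parameter.
re='image=(http://[^&]*\.png)'

url=""
if [[ $string =~ $re ]]; then
    # BASH_REMATCH[1] holds the text matched by the first capture group.
    url=${BASH_REMATCH[1]}
fi

printf '%s\n' "$url"
```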

edited Feb 27, 2013 at 23:03

answered Feb 27, 2013 at 22:55

Gilles Quénot
