1

I've been thinking about this all day and I need help solving it.

I've got the HTML below and want to extract all values of the query parameter matching "?imgurl=". Can anybody help me out with the regex for this?

</script></div><div id=nr_container><div id=center_col><div id=tbbcc><div id=tbbc style="background:#ebeff9;margin-bottom:4px;padding:8px;display:none"></div></div><div id=res class=med role=main><div id=topstuff></div><!--a--><h2 class=hd>Søgeresultater</h2><div id=ires><ol><script>google.isr.fillCanvas=function(i){var c=document.getElementById('cvs_'+i.id);try{c&&(c.getContext('2d').drawImage(i,0,0,c.offsetWidth,c.offsetHeight));}catch(e){c.style.display='none';i.style.display='block';}}</script><div id=rgsh_s></div><li><div id=rg><div id=rg_s><div id=rg_hp><a id=rg_hpl></a></div><div class=rg_h id=rg_h><div class=rg_hc><a class=rg_hl id=rg_hl><img class=rg_hi id=rg_hi></a><div class=std id=rg_hx><p class=rg_ht id=rg_ht><a id=rg_hta></a></p><p class=rg_hn id=rg_hn></p><p class=rg_hr><span id=rg_hr></span></p><p class=rg_ha><span id=rg_ha><a class=rg_hal id=rg_hals></a><span id=rg_has>&nbsp;&#8209;&nbsp;</span><a class=rg_hal id=rg_haln></a><span id=rg_has2>&nbsp;&#8209;&nbsp;</span><a class=rg_hal id=rg_halm></a></span></p></div></div></div><span class=rg_ctlv><ul class=rg_ul data-pg=1 data-cnt=44><li class=rg_li data-row=1 style="width:193px;height:145px" ><a class=rg_l style="width:193px;height:145px;margin-top:0px;margin-left:0px" href="/imgres?imgurl=http://www.eecs.berkeley.edu/~loarie/test.colors.gif&amp;imgrefurl=http://s1mon.smartlog.dk/test-post37556&amp;usg=__xdES-qA3W9Np6DMNDs0HPTe2Bn8=&amp;h=606&amp;w=807&amp;sz=18&amp;hl=da&amp;start=1&amp;zoom=1&amp;tbnid=sFzpf2rpdeVHLM:&amp;tbnh=107&amp;tbnw=143&amp;ei=Q9k-TYLkEob0swOzpdH0BA&amp;prev=/images%3Fq%3Dtest%26hl%3Dda%26safe%3Doff%26sa%3DG%26as_st%3Dy%26biw%3D1680%26bih%3D897%26tbs%3Disch:1&amp;itbs=1"><script>google.stb.csi.stTbn()</script><canvas id="cvs_sFzpf2rpdeVHLM:l" style="display:block" width=193 height=145></canvas><img class=rg_i id=sFzpf2rpdeVHLM:l height=145 width=193 style="width:193px;height:145px" onload="google.isr.fillCanvas(this);google.stb.csi.onTbn(1, this)"></a></li><li class=rg_li 
style="width:154px;height:145px" ><a class=rg_l style="width:160px;height:145px;margin-top:0px;margin-left:-2px" href="/imgres?imgurl=http://www.krymmel.dk/dev/media/.jkforum/test-pilot.png&amp;imgrefurl=http://www.krymmel.dk/dev/pages/forum.php&amp;usg=__a-KJQiDnKKy8LxlCV-d3XZpKGuw=&amp;h=327&amp;w=360&amp;sz=110&amp;hl=da&amp;start=2&amp;zoom=1&amp;tbnid=KLm4Rocmahp8wM:&amp;tbnh=110&amp;tbnw=121&amp;ei=Q9k-TYLkEob0swOzpdH0BA&amp;prev=/images%3Fq%3Dtest%26hl%3Dda%26safe%3Doff%26sa%3DG%26as_st%3Dy%26biw%3D1680%26bih%3D897%26tbs%3Disch:1&amp;itbs=1"><script>google.stb.csi.stTbn()</script><canvas id="cvs_KLm4Rocmahp8wM:l" style="display:block" width=160 height=145></canvas><img class=rg_i id=KLm4Rocmahp8wM:l height=145 width=160 style="width:160px;height:145px" onload="google.isr.fillCanvas(this);google.stb.csi.onTbn(1, this)"></a></li><li class=rg_li style="width:148px;height:145px" ><a class=rg_l style="width:148px;height:145px;margin-top:0px;margin-left:0px" href="/imgres?imgurl=http://colorvisiontesting.com/plate%2520with%25205.jpg&amp;imgrefurl=http://colorvisiontesting.com/ishihara.htm&amp;usg=__UfBI8sd8ldLjjiK3-7aGJo0zKy4=&amp;h=309&amp;w=315&amp;sz=142&amp;hl=da&amp;start=3&amp;zoom=1&amp;tbnid=2_UMDol8AQhejM:&amp;tbnh=115&amp;tbnw=117&amp;ei=Q9k-TYLkEob0swOzpdH0BA&amp;prev=/images%3Fq%3Dtest%26hl%3Dda%26safe%3Doff%26sa%3DG%26as_st%3Dy%26biw%3D1680%26bih%3D897%26tbs%3Disch:1&amp;itbs=1"><script>google.stb.csi.stTbn()</script><canvas id="cvs_2_UMDol8AQhejM:l" style="display:block" width=148 height=145></canvas><img class=rg_i id=2_UMDol8AQhejM:l height=145 width=148 style="width:148px;height:145px" onload="google.isr.fillCanvas(this);google.stb.csi.onTbn(1, this)"></a></li><li class=rg_li style="width:193px;height:145px" ><a class=rg_l style="width:193px;height:145px;margin-top:0px;margin-left:0px" 
href="/imgres?imgurl=http://pun.org/josh/archives/04.10.01.GlobalTest-X.gif&amp;imgrefurl=http://hovedstaden.inetgiant.dk/fredensborg/AdDetails/test/3187460&amp;usg=___4P_UDkeMuovXCIjq-PY9WhG1Vw=&amp;h=391&amp;w=520&amp;sz=44&amp;hl=da&amp;start=4&amp;zoom=1&amp;tbnid=l15zkNo3p4iYcM:&amp;tbnh=99&amp;tbnw=131&amp;ei=Q9k-TYLkEob0swOzpdH0BA&amp;prev=/images%3Fq%3Dtest%26hl%3Dda%26safe%3Doff%26sa%3DG%26as_st%3Dy%26biw%3D1680%26bih%3D897%26tbs%3Disch:1&amp;itbs=1"><script>google.stb.csi.stTbn()</script><canvas id="cvs_l15zkNo3p4iYcM:l" style="display:block" width=193 height=145></canvas><img class=rg_i id=l15zkNo3p4iYcM:l height=145 width=193 style="width:193px;height:145px" onload="google.isr.fillCanvas(this);google.stb.csi.onTbn(1, this)"></a></li><li class=rg_li style="width:193px;height:145px" ><a class=rg_l style="width:193px;height:139px;margin-top:3px;margin-left:0px" href="/imgres?imgurl=http://www.daimi.au.dk/~rvinge/Test_daimi.jpg&amp;imgrefurl=http://www.daimi.au.dk/~rvinge/Hot.list.html&amp;usg=__ofrC4G4FpZgXi95enpnIG4Wpdlg=&amp;h=881&amp;w=1223&amp;sz=228&amp;hl=da&amp;start=5&amp;zoom=1&amp;tbnid=WDreIpjcKhg13M:&amp;tbnh=108&amp;tbnw=150&amp;ei=Q9k-TYLkEob0swOzpdH0BA&amp;prev=/images%3Fq%3Dtest%26hl%3Dda%26safe%3Doff%26sa%3DG%26as_st%3Dy%26biw%3D1680%26bih%3D897%26tbs%3Disch:1&amp;itbs=1"><script>google.stb.csi.stTbn()</script><canvas id="cvs_WDreIpjcKhg13M:l" style="display:block" width=193 height=139></canvas><img class=rg_i id=WDreIpjcKhg13M:l height=139 width=193 style="width:193px;height:139px" onload="google.isr.fillCanvas(this);google.stb.csi.onTbn(1, this)"></a></li><li class=rg_li style="width:143px;height:145px" ><a class=rg_l style="width:145px;height:145px;margin-top:0px;margin-left:0px" 
href="/imgres?imgurl=http://www.textually.org/tv/archives/images/set3/test-pattern-clock_4767.jpg&amp;imgrefurl=http://hovedstaden.inetgiant.dk/fredensborg/AdDetails/test/3187460&amp;usg=__BFaPejcst7ygnE72uTI6sJKxmIk=&amp;h=308&amp;w=307&amp;sz=18&amp;hl=da&amp;start=6&amp;zoom=1&amp;tbnid=m1QYUHLkZ-mXCM:&amp;tbnh=117&amp;tbnw=117&amp;ei=Q9k-TYLkEob0swOzpdH0BA&amp;prev=/images%3Fq%3Dtest%26hl%3Dda%26safe%3Doff%26sa%3DG%26as_st%3Dy%26biw%3D1680%26bih%3D897%26tbs%3Disch:1&amp;itbs=1"><script>google.stb.csi.stTbn()</script><canvas id="cvs_m1QYUHLkZ-mXCM:l" style="display:block" width=145 height=145></canvas><img class=rg_i id=m1QYUHLkZ-mXCM:l height=145 width=145 style="width:145px;height:145px" onload="google.isr.fillCanvas(this);google.stb.csi.onTbn(1, this)"></a></li><li class=rg_li style="width:118px;height:145px" ><a class=rg_l style="width:118px;height:145px;margin-top:0px;margin-left:0px" href="/imgres?imgurl=http://imgs.xkcd.com/comics/turing_test.png&amp;imgrefurl=http://xkcd.com/329/&amp;usg=__DdATXOcoguD2UbYUMs_iwi4r54I=&amp;h=394&amp;w=320&amp;sz=22&amp;hl=da&amp;start=7&amp;zoom=1&amp;tbnid=UeYWZFjYErEM6M:&amp;tbnh=124&amp;tbnw=101&amp;ei=Q9k-TYLkEob0swOzpdH0BA&amp;prev=/images%3Fq%3Dtest%26hl%3Dda%26safe%3Doff%26sa%3DG%26as_st%3Dy%26biw%3D1680%26bih%3D897%26tbs%3Disch:1&amp;itbs=1"><script>google.stb.csi.stTbn()</script><canvas id="cvs_UeYWZFjYErEM6M:l" style="display:block" width=118 height=145></canvas><img class=rg_i id=UeYWZFjYErEM6M:l height=145 width=118 style="width:118px;height:145px" onload="google.isr.fillCanvas(this);google.stb.csi.onTbn(1, this)"></a></li><li class=rg_li style="width:133px;height:145px" ><a class=rg_l style="width:149px;height:145px;margin-top:0px;margin-left:-4px" 
href="/imgres?imgurl=http://thomasdamgaard.dk/blog/images/test01.jpg&amp;imgrefurl=http://thomasdamgaard.dk/blog/test-skilt-pa-motorvejen&amp;usg=__quqWeHGs6OFAggLm5DBauetlRQU=&amp;h=487&amp;w=500&amp;sz=22&amp;hl=da&amp;start=8&amp;zoom=1&amp;tbnid=HwAHMYrtavz5IM:&amp;tbnh=127&amp;tbnw=130&amp;ei=Q9k-TYLkEob0swOzpdH0BA&amp;prev=/images%3Fq%3Dtest%26hl%3Dda%26safe%3Doff%26sa%3DG%26as_st%3Dy%26biw%3D1680%26bih%3D897%26tbs%3Disch:1&amp;itbs=1"><script>google.stb.csi.stTbn()</script><canvas id="cvs_HwAHMYrtavz5IM:l" style="display:block" width=149 height=145></canvas><img class=rg_i id=HwAHMYrtavz5IM:l height=145 width=149 style="width:149px;height:145px" onload="google.isr.fillCanvas(this);google.stb.csi.onTbn(1, this)"></a></li><li class=rg_li style="width:100px;height:145px" ><a class=rg_l style="width:102px;height:145px;margin-top:0px;margin-left:0px" href="/imgres?imgurl=http://www.ct4me.net/images/dmbtest.gif

  • What language/platform are you using? Commented Jan 25, 2011 at 21:17
  • Maybe reformat the HTML so it is more readable, and add an example of the result you expect to be extracted. Commented Jan 25, 2011 at 21:17

2 Answers

1

It irritates me that people are so quick to jump to "don't use regex to parse HTML". You're not really parsing HTML here anyway. Even if you use the HTML Agility Pack to extract the URLs from your HTML, you're still going to need to pull the imgurl parameters out of each query string.

Regex is perfect for extracting parameters from a query string, and this will do what you want:

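using System.Text.RegularExpressions;

string input = "your big HTML string";
// Match everything after "?imgurl=" or "&imgurl=" up to the next '&', '#' or quote.
// Inside a verbatim string (@"..."), a double quote has to be written as "".
MatchCollection matches = Regex.Matches(
    input,
    @"(?<=[?&]imgurl=)[^&#'""]*",
    RegexOptions.IgnoreCase // remove this if you don't want to ignore case in "imgurl"
);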

I'm all for using the HTML Agility Pack for actually parsing HTML, but if you just want to strip a few strings (which fit a well-defined pattern) out of a bigger string, there's no better tool for the job than regex. The reason it's bad to use regex to parse HTML tags is that HTML isn't reliably structured. A URL's query string has to be in a particular format, so it's safe to use regex.
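
For completeness, here is a minimal sketch of how the matches could be consumed. The class name ImgUrlSketch, the method name ExtractImgUrls and the shortened sample string are made up for illustration, and the decoding step with Uri.UnescapeDataString is an assumption about what you want to do with the values, not something the question asks for.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ImgUrlSketch
{
    // Collects every imgurl value found in the input, percent-decoded once.
    public static List<string> ExtractImgUrls(string input)
    {
        var urls = new List<string>();
        MatchCollection matches = Regex.Matches(
            input,
            @"(?<=[?&]imgurl=)[^&#'""]*",
            RegexOptions.IgnoreCase);

        foreach (Match match in matches)
        {
            // Query string values are percent-encoded; undo one level of encoding.
            urls.Add(Uri.UnescapeDataString(match.Value));
        }
        return urls;
    }

    static void Main()
    {
        // Shortened stand-in for the response in the question (assumption).
        string sample =
            "<a href=\"/imgres?imgurl=http://imgs.xkcd.com/comics/turing_test.png&amp;imgrefurl=http://xkcd.com/329/\">" +
            "<a href=\"/imgres?imgurl=http://colorvisiontesting.com/plate%2520with%25205.jpg&amp;h=309\">";

        foreach (string url in ExtractImgUrls(sample))
        {
            Console.WriteLine(url);
        }
    }
}
```

Two things to keep in mind: the second sample value is double-encoded (%2520), so a single decoding pass still leaves %20 in it, and the pattern only sees imgurl when it directly follows "?" or a literal "&". In this response imgurl is always the first parameter, so that is enough here.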

1 Comment

This was what I was initially looking for: simply stripping out the image URLs. Considering my comment on Oded's answer, the response is JavaScript, and therefore regex works just fine. I was indeed looking for the pragmatic solution on this one, since it's a prototype.
1

Don't use regex to parse HTML.

See here for a compelling demonstration of why.

Use an HTML parser for your platform/language.


Edit:

As you have indicated use of C#, I suggest using the HTML Agility Pack - it is widely used and can be queried with XPath, like XmlDocument.

For your particular need, I would get all links and for each use string.Split to get the query string parameters you need.
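
A rough sketch of the string.Split step, assuming the href values have already been pulled out of the links (for example with the HTML Agility Pack). The names QueryStringSketch and GetQueryParameter and the sample href are hypothetical; the HTML-decoding step is only needed if the href still contains "&amp;" as it does in the raw source.

```csharp
using System;
using System.Net;

static class QueryStringSketch
{
    // Returns the raw value of the named query parameter, or null if it is absent.
    public static string GetQueryParameter(string href, string name)
    {
        // In HTML source, '&' inside an attribute is written as "&amp;"; undo that first.
        string decoded = WebUtility.HtmlDecode(href);

        int queryStart = decoded.IndexOf('?');
        if (queryStart < 0) return null;

        string query = decoded.Substring(queryStart + 1);

        // Ignore a trailing #fragment, if any.
        int fragmentStart = query.IndexOf('#');
        if (fragmentStart >= 0) query = query.Substring(0, fragmentStart);

        foreach (string pair in query.Split('&'))
        {
            // Split on the first '=' only, because a value may itself contain '='.
            string[] parts = pair.Split(new[] { '=' }, 2);
            if (parts.Length == 2 &&
                string.Equals(parts[0], name, StringComparison.OrdinalIgnoreCase))
            {
                return parts[1];
            }
        }
        return null;
    }

    static void Main()
    {
        // Shortened href in the shape used by the question's markup (assumption).
        string href = "/imgres?imgurl=http://imgs.xkcd.com/comics/turing_test.png&amp;imgrefurl=http://xkcd.com/329/&amp;h=394";
        Console.WriteLine(GetQueryParameter(href, "imgurl"));
    }
}
```

The returned value is still percent-encoded; pass it through Uri.UnescapeDataString if you need the decoded URL.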

4 Comments

Okay. That said, I'm using C#, so could you give me the outline of a parser?
Use the HTML Agility Pack to extract the URLs, then simply use string.Split to get the parameters you need.
What happens if some of the URLs in question are part of JavaScript code in the HTML?
It turns out that it is not HTML. It is the web response from a Google Image Search, which actually is a crapload of JavaScript. The results are only rendered once they reach the browser. The only way I could figure this out was to debug down to the response and then copy it into Notepad. In Fiddler2 the response is encoded (it cannot be seen in clear text).
