Scraping HTML with Python regex

Question

I have a problem with a regex in Python. I have some saved HTML pages that contain useful information. At the time the pages were saved, the charset was some kind of ISO encoding, which stored the typical German letters in encoded form, e.g. "Fr%C3%BCchte" for Früchte, and so on. The HTML is so badly structured that the only reasonable way to scrape it is with a regex.

I have this regex in Python:

re.compile('<a\s+href="javascript.*?\(\'(\w+).*?\s.(\d+.+\d+).*?(.*)\'\)\">')

Unfortunately it is not exactly what I want, because the encoded words are only partially captured. For example, the result is:

[('showSubGroups', "160500', 'Fr%C3", '%BCchte in Alkohol'),
 ('showSubGroups', '160400', "', 'Rumtopf"),
 ('showSubGroups', '160300', "', 'Spirituosen (Bio)"),
 ('showSubGroups', '160200', "', 'Spirituosen zur Verarbeitung in der Confiserie"),
 ('showSubGroups', '160100', "', 'Spirituosen, allgemein")]

Maybe I'm tired, but I can't see where the error is.

Here is the HTML:

<td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160500', 'Fr%C3%BCchte in Alkohol')">Früchte in Alkohol</a></td>
       </tr>
       <tr valign="top">
        <td colspan="3"><img src="NoName_Time_200843_93448%20-Dateien/pix.gif" height="5" width="1"></td>
       </tr>       <tr valign="top">
        <td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160400', 'Rumtopf')">Rumtopf</a></td>
       </tr>
       <tr valign="top">
        <td colspan="3"><img src="NoName_Time_200843_93448%20-Dateien/pix.gif" height="5" width="1"></td>
       </tr>       <tr valign="top">
        <td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160300', 'Spirituosen (Bio)')">Spirituosen (Bio)</a></td>
       </tr>
       <tr valign="top">
        <td colspan="3"><img src="NoName_Time_200843_93448%20-Dateien/pix.gif" height="5" width="1"></td>
       </tr>       <tr valign="top">
        <td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160200', 'Spirituosen zur Verarbeitung in der Confiserie')">Spirituosen zur Verarbeitung in der Confiserie</a></td>
       </tr>
       <tr valign="top">
        <td colspan="3"><img src="NoName_Time_200843_93448%20-Dateien/pix.gif" height="5" width="1"></td>
       </tr>       <tr valign="top">
        <td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160100', 'Spirituosen, allgemein')">Spirituosen, allgemein</a></td>
       </tr>
       <tr valign="top">
        <td colspan="3"><img src="NoName_Time_200843_93448%20-Dateien/pix.gif" height="5" width="1"></td>
       </tr>                </tbody></table>
            </td>
        </tr>

The canonical response to this sort of question: stackoverflow.com/a/1732454/735204. I would recommend against using regex to parse HTML. Consider instead a library like BeautifulSoup or lxml that allows the use of XPaths for HTML parsing.
– Emmett Butler, Commented Aug 29, 2012 at 22:17
Hmm, that canonical response seems overly dramatic. It may even be correct that you can't parse HTML with regex, but you can extract information from it, which is kind of the point here.
– Roland Smith, Commented Aug 29, 2012 at 22:37
@RolandSmith Sure you can (for a limited subset, at least); the point is that there are easier and better ways.
– Hamish, Commented Aug 29, 2012 at 23:31

Roland Smith · Accepted Answer · 2012-08-29 22:51:57Z

1

Try this:

f = re.compile(r"sendForm\('[^']*', '([^']*)', '([^']*)'\)")

With your text as input, it gives the following:

In [7]: f.findall(txt)
Out[7]:  [('160500', 'Fr%C3%BCchte in Alkohol'), ('160400', 'Rumtopf'), ('160300', 'Spirituosen (Bio)'), ('160200', 'Spirituosen zur Verarbeitung in der Confiserie'), ('160100', 'Spirituosen, allgemein')]
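
The partial match in the question comes from the greedy `.+` inside `(\d+.+\d+)`: it runs on to the last digit on the line, which is the `3` in `%C3`. A minimal sketch of that diagnosis follows, assuming Python 3. The names `SEND_FORM` and `parse_send_form` are illustrative and do not come from the posts above; the quote-anchored pattern is one possible fix, not the only one.

```python
import re

# One anchor tag copied from the question's HTML.
line = ("<a href=\"javascript:sendForm('showSubGroups', '160500', "
        "'Fr%C3%BCchte in Alkohol')\">Früchte in Alkohol</a>")

# The pattern from the question, rewritten as a raw string.
original = re.compile(
    r'''<a\s+href="javascript.*?\('(\w+).*?\s.(\d+.+\d+).*?(.*)'\)">''')

# Hypothetical stricter pattern: each argument is anchored on its quotes,
# and [^']* cannot run past the closing quote of that argument.
SEND_FORM = re.compile(
    r"sendForm\('(?P<action>[^']*)', '(?P<code>[^']*)', '(?P<label>[^']*)'\)")


def parse_send_form(html):
    """Return (action, code, label) for every sendForm(...) call in html."""
    return [m.group("action", "code", "label")
            for m in SEND_FORM.finditer(html)]


print(original.findall(line))
# [('showSubGroups', "160500', 'Fr%C3", '%BCchte in Alkohol')]
print(parse_send_form(line))
# [('showSubGroups', '160500', 'Fr%C3%BCchte in Alkohol')]
```

Because each group is limited to non-quote characters, labels that contain commas or parentheses, such as 'Spirituosen, allgemein' or 'Spirituosen (Bio)', are captured whole. A label containing an apostrophe would not match and would need a different approach.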

As for decoding the %C3%BC (for 'ü'): it is the UTF-8 encoding of a character from the Latin-1 block, written as URL-style percent escapes (each byte as '%' followed by two hex digits). It decodes if you replace each '%' with '\x':

In [39]: '\xC3\xBC'.decode('utf-8')
Out[39]: u'\xfc'

U+00FC is the Unicode code point for ü.
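
The `%XX` form is ordinary URL percent-encoding, so in Python 3 the whole string can be decoded in one step with the standard library's `urllib.parse.unquote`, which assumes UTF-8 by default. A minimal sketch, assuming Python 3 (the session above is Python 2, where `.decode` is called on a byte string); the variable names are illustrative:

```python
from urllib.parse import unquote

# A captured value as it appears in the saved HTML.
encoded = "Fr%C3%BCchte in Alkohol"

# unquote turns each %XX escape into a byte and decodes the result as UTF-8.
decoded = unquote(encoded)

print(decoded)               # Früchte in Alkohol
print(hex(ord(decoded[2])))  # 0xfc
```

Values without any escapes, such as 'Rumtopf', pass through unchanged, so the call can be applied to every captured label.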

edited Aug 29, 2012 at 22:51

answered Aug 29, 2012 at 22:26


varunl · Accepted Answer · 2012-08-29 22:24:58Z

0

Beautiful Soup is a great library for parsing HTML.

Once you have extracted the hrefs from the HTML, applying a regex to them should be pretty easy.
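
The two-step idea (collect the hrefs with a parser, then apply a regex to each one) can be sketched as follows. To stay dependency-free, this sketch swaps in the standard library's `html.parser` for Beautiful Soup; with Beautiful Soup the collection step would be roughly `soup.find_all("a", href=True)`. The class `HrefCollector`, the function `extract_groups`, and the pattern `ARGS` are illustrative names, and the sample HTML is shortened from the question.

```python
import re
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


# Hypothetical pattern: skip the first argument, capture the other two.
ARGS = re.compile(r"sendForm\('[^']*', '([^']*)', '([^']*)'\)")


def extract_groups(html):
    """Return (code, label) pairs from sendForm(...) hrefs in html."""
    collector = HrefCollector()
    collector.feed(html)
    pairs = []
    for href in collector.hrefs:
        match = ARGS.search(href)
        if match:
            pairs.append(match.groups())
    return pairs


sample = """
<td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160400', 'Rumtopf')">Rumtopf</a></td>
<td colspan="3"><img src="pix.gif" height="5" width="1"></td>
<td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160300', 'Spirituosen (Bio)')">Spirituosen (Bio)</a></td>
"""

print(extract_groups(sample))
# [('160400', 'Rumtopf'), ('160300', 'Spirituosen (Bio)')]
```

Running the regex on the isolated attribute value means the link text and surrounding markup can no longer leak into a capture group.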

answered Aug 29, 2012 at 22:24
