1

I have a problem with a regex in Python. I have some HTML pages that contain information useful to me. When the pages were saved, the charset was some kind of ISO encoding, so the typical German letters were saved in encoded form, e.g. "Fr%C3%BCchte" for Früchte, and so on. The HTML is so badly structured that the only reasonable way to scrape it is with regex.

I have this regex in Python:

re.compile('<a\s+href="javascript.*?\(\'(\w+).*?\s.(\d+.+\d+).*?(.*)\'\)\">')

Unfortunately it is not exactly what I want, because the encoded words are only captured partially. For example, the result is:

[('showSubGroups', "160500', 'Fr%C3", '%BCchte in Alkohol'),
 ('showSubGroups', '160400', "', 'Rumtopf"),
 ('showSubGroups', '160300', "', 'Spirituosen (Bio)"),
 ('showSubGroups', '160200', "', 'Spirituosen zur Verarbeitung in der Confiserie"),
 ('showSubGroups', '160100', "', 'Spirituosen, allgemein")]

Maybe I'm tired, but I can't see where the error is.

Here is the HTML:

<td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160500', 'Fr%C3%BCchte in Alkohol')">Früchte in Alkohol</a></td>
       </tr>
       <tr valign="top">
        <td colspan="3"><img src="NoName_Time_200843_93448%20-Dateien/pix.gif" height="5" width="1"></td>
       </tr>       <tr valign="top">
        <td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160400', 'Rumtopf')">Rumtopf</a></td>
       </tr>
       <tr valign="top">
        <td colspan="3"><img src="NoName_Time_200843_93448%20-Dateien/pix.gif" height="5" width="1"></td>
       </tr>       <tr valign="top">
        <td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160300', 'Spirituosen (Bio)')">Spirituosen (Bio)</a></td>
       </tr>
       <tr valign="top">
        <td colspan="3"><img src="NoName_Time_200843_93448%20-Dateien/pix.gif" height="5" width="1"></td>
       </tr>       <tr valign="top">
        <td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160200', 'Spirituosen zur Verarbeitung in der Confiserie')">Spirituosen zur Verarbeitung in der Confiserie</a></td>
       </tr>
       <tr valign="top">
        <td colspan="3"><img src="NoName_Time_200843_93448%20-Dateien/pix.gif" height="5" width="1"></td>
       </tr>       <tr valign="top">
        <td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160100', 'Spirituosen, allgemein')">Spirituosen, allgemein</a></td>
       </tr>
       <tr valign="top">
        <td colspan="3"><img src="NoName_Time_200843_93448%20-Dateien/pix.gif" height="5" width="1"></td>
       </tr>                </tbody></table>
            </td>
        </tr>
  • 3
    The canonical response to this sort of question is stackoverflow.com/a/1732454/735204. I would recommend against using regex to parse HTML; consider instead a library like BeautifulSoup or lxml (the latter supports XPath) for HTML parsing. Commented Aug 29, 2012 at 22:17
  • 1
    Hmm, that canonical response seems overly dramatic. And it may even be correct that you can't parse HTML with regex. But you can extract information from it. Which is kind of the point here. Commented Aug 29, 2012 at 22:37
  • 1
    @RolandSmith sure you can (for a limited subset, at least), the point is there are easier and better ways. Commented Aug 29, 2012 at 23:31

2 Answers

1

Try this:

f = re.compile(r"sendForm\('[^']*', '([^']*)', '([^']*)'\)")

With your text as input, it gives the following:

In [7]: f.findall(txt)
Out[7]:  [('160500', 'Fr%C3%BCchte in Alkohol'), ('160400', 'Rumtopf'), ('160300', 'Spirituosen (Bio)'), ('160200', 'Spirituosen zur Verarbeitung in der Confiserie'), ('160100', 'Spirituosen, allgemein')]

As for decoding the %C3%BC (for 'ü'): it appears to be the UTF-8 bytes of the character, percent-encoded as in a URL, because in Python 2 it decodes if you replace each '%' with '\x':

In [39]: '\xC3\xBC'.decode('utf-8')
Out[39]: u'\xfc'

U+00FC is the Unicode code point for ü.
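
Since the '%' prefix is ordinary URL percent-encoding, the standard library can undo it without any manual replacement. A minimal sketch, assuming Python 3 and that the encoded bytes are UTF-8 (the variable names are only illustrative):

```python
from urllib.parse import unquote

encoded = "Fr%C3%BCchte in Alkohol"

# unquote() turns each %XX sequence back into a byte and decodes
# the resulting bytes as UTF-8 by default.
decoded = unquote(encoded)

print(decoded)
```

Strings without any percent sequences, such as 'Rumtopf', pass through unchanged, so it should be safe to apply this to every captured label.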


0

Beautiful Soup is a great library for parsing HTML.

Once you have extracted the hrefs from the HTML, applying a regex to them should be pretty easy.
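
The two-step idea (collect the hrefs first, then run a regex on each one) can be sketched as follows. To keep the example free of third-party dependencies, it uses the standard library's html.parser in place of Beautiful Soup; this is an assumption-laden sketch for Python 3, and the names HrefCollector, SEND_FORM and extract_groups are made up for illustration:

```python
import re
from html.parser import HTMLParser
from urllib.parse import unquote

# Matches the three single-quoted arguments of sendForm(...).
SEND_FORM = re.compile(r"sendForm\('([^']*)', '([^']*)', '([^']*)'\)")


class HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_groups(html):
    """Return (action, code, label) tuples with the label percent-decoded."""
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    rows = []
    for href in collector.hrefs:
        match = SEND_FORM.search(href)
        if match:
            action, code, label = match.groups()
            rows.append((action, code, unquote(label)))
    return rows


sample = """
<td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160500', 'Fr%C3%BCchte in Alkohol')">Früchte in Alkohol</a></td>
<td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160100', 'Spirituosen, allgemein')">Spirituosen, allgemein</a></td>
"""

for row in extract_groups(sample):
    print(row)
```

Because the regex only ever sees the href value, commas or parentheses in the link text cannot confuse it, and matching each argument with [^']* keeps a label like 'Spirituosen, allgemein' in one piece.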
