0
text = u'<a href="#5" accesskey="5"></a><a href="#1" accesskey="1"><font color="#667755">\ue689</font></a><a href="#2" accesskey="2"><font color="#667755">\ue6ec</font></a><a href="#3" accesskey="3"><font color="#667755">\ue6f6</font></a>'

I am new to Python. I want to get \ue6ec, \ue6f6, \ue6ec out of this string. How can I fetch these strings using the re module? Thank you very much!

1
  • wow, this fragment looks intentionally obfuscated. What does this actually come from? Commented Nov 26, 2010 at 7:43

4 Answers

2

Regular expressions are not a good tool for working with HTML. Use Beautiful Soup instead.


2
>>> from BeautifulSoup import BeautifulSoup
>>> text=u'<a href="#5" accesskey="5"></a><a href="#1" accesskey="1"><font color="#667755">\ue689</font></a><a href="#2" accesskey="2"><font color="#667755">\ue6ec</font></a><a href="#3" accesskey="3"><font color="#667755">\ue6f6</font></a>'
>>> t = BeautifulSoup(text)
>>> t.findAll(text=True)
[u'\ue689', u'\ue6ec', u'\ue6f6']

3 Comments

And for reference, that produces u'\ue689\ue6ec\ue6f6'.
The latest BeautifulSoup-3.0.0.py does not have a getText() method. How should I use it? Thank you.
Oops, I did not notice that. Fixed now, and this is actually better since you no longer have to split the result. If you want the characters in a single string, do ''.join(t.findAll(text=True)).
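
For readers on Python 3, where the package is imported as bs4, the same idea might look like the sketch below. It assumes beautifulsoup4 is installed and uses the built-in "html.parser" backend; the variable names chars and joined are only illustrative.

```python
from bs4 import BeautifulSoup  # assumption: pip install beautifulsoup4

text = (
    '<a href="#5" accesskey="5"></a>'
    '<a href="#1" accesskey="1"><font color="#667755">\ue689</font></a>'
    '<a href="#2" accesskey="2"><font color="#667755">\ue6ec</font></a>'
    '<a href="#3" accesskey="3"><font color="#667755">\ue6f6</font></a>'
)

# Parse with the standard-library backend so no extra parser is needed.
soup = BeautifulSoup(text, "html.parser")

# Collect every text node in document order.
chars = soup.find_all(string=True)

# Join them if a single string is wanted, as suggested in the comment above.
joined = "".join(chars)
```

Here chars holds the three text nodes and joined is the single concatenated string.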
1

Don't use regular expressions to parse HTML. Use BeautifulSoup instead; see the BeautifulSoup documentation.
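
If installing a third-party package is not an option, the same parser-instead-of-regex approach can be sketched with the standard library's html.parser module. This is not BeautifulSoup, only a minimal stand-in; the class name TextCollector is made up for this example.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: gathers the text found between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Skip whitespace-only runs between tags.
        if data.strip():
            self.chunks.append(data)


text = (
    '<a href="#5" accesskey="5"></a>'
    '<a href="#1" accesskey="1"><font color="#667755">\ue689</font></a>'
    '<a href="#2" accesskey="2"><font color="#667755">\ue6ec</font></a>'
    '<a href="#3" accesskey="3"><font color="#667755">\ue6f6</font></a>'
)

collector = TextCollector()
collector.feed(text)
collector.close()
# collector.chunks now holds the text nodes in document order.
```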


0

If you know that the page will always have that format, use the BeautifulSoup parser to find what you need in the HTML.

However, BeautifulSoup may sometimes break on malformed HTML. I'd suggest using lxml, which is a Python binding for libxml2. It will parse malformed HTML and usually correct it.
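
A rough sketch of the lxml route, assuming lxml is installed (pip install lxml) and that the wanted characters always sit inside font elements, as they do in the question's snippet:

```python
import lxml.html  # assumption: pip install lxml

text = (
    '<a href="#5" accesskey="5"></a>'
    '<a href="#1" accesskey="1"><font color="#667755">\ue689</font></a>'
    '<a href="#2" accesskey="2"><font color="#667755">\ue6ec</font></a>'
    '<a href="#3" accesskey="3"><font color="#667755">\ue6f6</font></a>'
)

# fromstring() accepts a fragment and wraps sibling elements in a parent.
root = lxml.html.fromstring(text)

# Select the text node of every <font> element below the root.
chars = root.xpath(".//font/text()")
```

The XPath expression is tied to this particular markup; a different page layout would need a different selector.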
