I'm using the Python class SGMLParser (from sgmllib) to parse some HTML. I encounter an HTML tag of the form

<td class="school">Texas A&amp;M</td>

I'd like to read out "Texas A&M". But when handle_data gets called, it gets called with "Texas A", and then, separately, "M" (quotes for clarity).

How do I replace the &amp; string with an & before the call, without replacing all special ampersands in the whole string (some of which I may need)?

Thanks!

3 Answers

4

If you switch from the deprecated SGMLParser to a modern alternative such as lxml (which also handles HTML), this becomes trivial:

>>> from lxml import etree
>>> etree.fromstring('''<td class="school">Texas A&amp;M</td>''').text
'Texas A&M'

2 Comments

SGMLParser is being deprecated because nobody cares about SGML (and most people use it to parse HTML, case in point). XMLParser has the same interface and is not being deprecated. lxml should really go into the stdlib.
Yes, I didn't care about SGML either, it just seemed like an "easy" way to read data from html. I will look into lxml, thanks.
2

SGMLParser has a convert_entityref() method, but instead of the deprecated SGMLParser I would recommend using lxml or Beautiful Soup, which have a better parser API.

1

Entity references like &amp; are handled by handle_entityref. Check that this method knows how to translate &amp;. The default implementation should call handle_data('&'), but you may have accidentally overridden it.

Also, if possible, consider using the far more advanced lxml instead.

2 Comments

I don't think I overwrote that... but then handle_data gets called three times with 'Texas A', '&', and 'M' right? Is there a way to have the data joined (if you know what I mean)? It looks like everyone suggests lxml, so I will look into it.
@mdeland Precisely. You have to join the data yourself; SGMLParser is a very low-level interface.
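
For anyone wondering what "join the data yourself" might look like, here is a minimal sketch. sgmllib no longer exists in Python 3, so this swaps in html.parser.HTMLParser with convert_charrefs=False, which splits the text the same way (data, entity reference, data). The class name CellTextParser and its cells attribute are made up for this example, and collecting only td elements is an assumption based on the tag in the question.

```python
from html import unescape
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Hypothetical helper: collects the full text of each <td> element."""

    def __init__(self):
        # convert_charrefs=False keeps the low-level behaviour discussed
        # above: entity references arrive separately from the plain data.
        super().__init__(convert_charrefs=False)
        self._chunks = None  # None while we are outside a <td>
        self.cells = []      # joined text of every completed <td>

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._chunks = []

    def handle_data(self, data):
        if self._chunks is not None:
            self._chunks.append(data)

    def handle_entityref(self, name):
        # Called with "amp" for "&amp;"; translate it and buffer the result.
        if self._chunks is not None:
            self._chunks.append(unescape(f"&{name};"))

    def handle_charref(self, name):
        # Called with "38" or "x26" for numeric references such as "&#38;".
        if self._chunks is not None:
            self._chunks.append(unescape(f"&#{name};"))

    def handle_endtag(self, tag):
        # The closing tag is where the buffered pieces get joined.
        if tag == "td" and self._chunks is not None:
            self.cells.append("".join(self._chunks))
            self._chunks = None


parser = CellTextParser()
parser.feed('<td class="school">Texas A&amp;M</td>')
parser.close()
print(parser.cells)
```

The idea is simply to buffer every piece between the opening and closing tag and join them once at the end. On recent Python 3 versions, leaving convert_charrefs at its default should already hand the unescaped text to handle_data, so the manual buffering is mostly relevant in this low-level mode.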
