Transitioning HTMLParser in Python3.x from Python2.x

Question

I'm trying to port the pdftable code from
https://github.com/jeremyjbowers/pdftable/blob/master/pdftable.py
to Python 3, but I am running into compatibility issues, especially with HTMLParser and its associated methods.
In the code below, how do I replicate what save_bgn and save_end do, or what replaces them in Python 3.4?

    def __init__(self, extractor, rows, columns):
        self.extractor = extractor
        self.set = extractor.set
        self.rows = rows
        self.columns = columns
        self.html_parser = html.parser.HTMLParser(None)

    def filter(self, str):
        str = re.sub(r'<[^>]+>', '', str)
        self.set.html_parser.save_bgn()
        self.set.html_parser.feed(str)
        return self.set.html_parser.save_end()

Any help will be appreciated. Thanks.

Terry Jan Reedy · Accepted Answer · 2016-02-05 22:02:40Z

As I understand, pdftable.py converts a pdf table to a .csv file by using html as an intermediary.

Since pdftable uses htmllib, which was deprecated in 2.6 in favor of the HTMLParser module, your problem is not with the transition from 2.x HTMLParser.HTMLParser to 3.x html.parser.HTMLParser, but with the transition from 2.x htmllib.HTMLParser to 2.x HTMLParser.HTMLParser. Even though the class name remained HTMLParser, the API is quite different for everything other than the .feed(text) method. In order to rewrite htmllib code, one must understand what it is doing as mechanical replacement is not possible.
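To illustrate how different the 3.x API is, here is a minimal sketch of the html.parser style, where text is collected by overriding handle_data in a subclass and there is no save_bgn()/save_end() pair. The names TextCollector and extract_text are hypothetical and not part of pdftable; convert_charrefs requires Python 3.4 or later.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: gathers character data and ignores tags."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as
        # &amp; before passing the text to handle_data (Python 3.4+).
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # feed() calls this for each run of text found between tags.
        self.chunks.append(data)

    def text(self):
        return "".join(self.chunks)


def extract_text(markup):
    """Feed markup through a fresh parser and return the collected text."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()
```

For example, extract_text("<td>A &amp; B</td>") gives "A & B". Whether pdftable needs anything like this depends on what its htmllib parser actually contributes, which is examined below.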

For htmllib, the signature is HTMLParser(formatter), where formatter is expected to be one of the classes in the formatter module, or a subclass thereof. (The formatter module was deprecated in 3.4, since the removal of htmllib left it pretty much unused.) The intention is that one instantiates a subclass of HTMLParser with added tag methods. However, pdftable uses an empty parser.

    self.html_parser = htmllib.HTMLParser(None)

In the first line of filter,

    str = re.sub(r'<[^>]+>', '', str)

the regex appears to match tags and the substitution removes them. For the next three lines

    self.set.html_parser.save_bgn()
    self.set.html_parser.feed(str)
    return self.set.html_parser.save_end()

save_bgn() says to begin "saving character data in a buffer instead of sending it to the formatter object." (Good thing, since there is no formatter.) With no tags and no tag methods, I don't know what feeding the string through the parser does. I would not be surprised if the answer is "Nothing". If so, your answer would be to remove the three lines, and possibly def filter, replacing the filter() call by the re.sub call.

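To find out, I suggest adding some 2.x print statements to filter and then running it on 2.7 with your example pdf file. Note that the parsed text is whatever save_end() returns, so that is the value to print after feeding.

    def filter(self, str):
        print 'Before replace:', str
        str = re.sub(r'<[^>]+>', '', str)
        print 'After replace:', str
        self.set.html_parser.save_bgn()
        self.set.html_parser.feed(str)
        # save_end() returns the buffered text; print that, not the input.
        result = self.set.html_parser.save_end()
        print 'After parse:', result
        return result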
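If the 2.7 run shows that parsing changes nothing, a Python 3 replacement can be as small as the re.sub call. The following is a sketch under that assumption; strip_tags and strip_tags_normalized are hypothetical names, not pdftable functions. The second variant is only relevant if the printed output shows that the old parser altered the text, for example by decoding entities or collapsing whitespace, which is an assumption here and should be checked against the actual output.

```python
import html
import re

# Same pattern as in the original filter(): anything that looks like a tag.
TAG_RE = re.compile(r'<[^>]+>')


def strip_tags(text):
    """Python 3 stand-in for filter(): remove tags, change nothing else."""
    return TAG_RE.sub('', text)


def strip_tags_normalized(text):
    """Variant that also decodes entities and collapses whitespace.

    Assumption: use this only if the 2.7 run shows such differences
    between the 'After replace' and 'After parse' output.
    """
    stripped = TAG_RE.sub('', text)
    # html.unescape is available from Python 3.4 onwards.
    return ' '.join(html.unescape(stripped).split())
```

In the class from the question, filter() would then simply return strip_tags(str), and the html_parser attribute would no longer be needed.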

Your suggestion to remove those 3 lines and use the re.sub call instead totally worked. Turns out they aren't doing anything that I need to replicate in a 3.x version. Much thanks!
