Spacy - pdf_reader extraction of text only from specific pages

Question

Could you please tell me what is wrong with the function below? I would like to parse only the first two pages of the PDF, but when I call the function with page_numbers=[0,1] it extracts text from all pages anyway.

The function is very slow, so I would like to limit the number of pages parsed.

def spacy_extractor(label, pattern_name, list_name, pdf_path, pdf_name,
                    filtered_list, page_numbers):

    patterns = [{'label': label, 'pattern': pattern_name} for pattern_name in list_name]
    ruler.add_patterns(patterns)
    doc = pdf_reader(os.path.join(pdf_path, pdf_name), nlp, PdfminerParser, page_numbers)
    filtered_list = [ent.text for ent in doc.ents if ent.label_ == label]

    return filtered_list[0] if filtered_list else None

cover_page_legal_form = spacy_extractor(label='LEG', pattern_name='legal_form', list_name=legal_form_list,
                                        pdf_path=fs_path_pdf, pdf_name=fs_name_pdf,
                                        filtered_list='legal_forms_filtered', page_numbers=[0, 1])

Thank you,

Damodhar · Accepted Answer · 2023-10-17 07:38:45Z

Follow the documentation linked below. You can access a particular page of the doc using the doc._.page_range extension.

https://spacy.io/universe/project/spacypdfreader
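
One thing worth checking before anything else: the question passes page_numbers as the fourth positional argument, so the reader will only treat it as a page selection if its fourth parameter happens to be one. A minimal sketch of how to see where a positional argument lands, using Python's inspect module. Note that fake_pdf_reader is a hypothetical stand-in and its parameter names are assumptions, not the library's confirmed signature; run show_binding against the pdf_reader you actually have installed to get the real answer.

```python
import inspect


def fake_pdf_reader(pdf_path, nlp, pdf_parser=None, verbose=False, **kwargs):
    """Hypothetical stand-in for illustration only.

    The parameter names here are assumptions; replace this function with
    the real pdf_reader from your installed package when you run the check.
    """
    return None


def show_binding(func, *args, **kwargs):
    """Return a dict mapping each parameter name to the argument it would receive."""
    bound = inspect.signature(func).bind(*args, **kwargs)
    bound.apply_defaults()  # fill in defaults for parameters not passed
    return dict(bound.arguments)


# Same call shape as in the question: four positional arguments.
binding = show_binding(fake_pdf_reader, "report.pdf", "nlp", "parser", [0, 1])
for name, value in binding.items():
    print(f"{name} = {value!r}")
```

With this stand-in, the list [0, 1] binds to the fourth parameter (here named verbose), which would explain why every page is still parsed. If the signature of your installed version lists a page-selection parameter, pass it by keyword rather than by position. If it does not, the page limit has to be applied some other way, for example by upgrading the package or by splitting the PDF first.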
