Open, save and extract text from PDFs linked in a pandas DataFrame

Question

I would like to iterate through PDF links stored in a pandas DataFrame. The goal is to open each link, save the PDF, extract its text, and store that text in a new column next to the corresponding link.

Dataframe looks like this:

    URL
0   https://westafricatradehub.com/wp-content/uploads/2021/07/RFA-WATIH-1295_Senegal-RMNCAH-Activity_English-Version.pdf
1   https://westafricatradehub.com/wp-content/uploads/2021/07/RFA-WATIH-1295_Activit%C3%A9-RMNCAH-S%C3%A9n%C3%A9gal_Version-Fran%C3%A7aise.pdf
2   https://westafricatradehub.com/wp-content/uploads/2021/07/Attachment-2_Full-Application-Template_Senegal-RMNCAH-Activity_English-Version.docx
3   https://westafricatradehub.com/wp-content/uploads/2021/07/Pi%C3%A8ce-Jointe-2_Mod%C3%A8le-de-Demande-Complet_Activit%C3%A9-RMNCAH-S%C3%A9n%C3%A9gal_Version-Fran%C3%A7aise.docx
4   https://westafricatradehub.com/wp-content/uploads/2021/07/Attachment-3_Trade-Hub-Performance-Indicators-Table.xlsx
5   https://westafricatradehub.com/wp-content/uploads/2021/07/Attachment-10_Project-Budget-Template-RMNCAH.xlsx
6   https://westafricatradehub.com/wp-content/uploads/2021/08/Senegal-Health-RFA-Webinar-QA.pdf
7   https://westafricatradehub.com/wp-content/uploads/2021/02/APS-WATIH-1021_Catalytic-Business-Concepts-Round-2.pdf
8   https://westafricatradehub.com/wp-content/uploads/2021/02/APS-WATIH-1021_Concepts-d%E2%80%99Affaires-Catalytiques-2ieme-Tour.pdf
9   https://westafricatradehub.com/wp-content/uploads/2021/06/APS-WATIH-1247_Research-Development-Round-2.pdf

I was able to do that for one link, but not for the whole DataFrame:

import urllib.request
pdf_link = "https://westafricatradehub.com/wp-content/uploads/2021/07/RFA-WATIH-1295_Senegal-RMNCAH-Activity_English-Version.pdf"

def download_file(download_url, filename):
    # Fetch the URL and write the raw bytes to "<filename>.pdf".
    with urllib.request.urlopen(download_url) as response:
        with open(filename + ".pdf", 'wb') as file:
            file.write(response.read())
 
download_file(pdf_link, "Test")

#Code to extract text from PDF 

import textract
text = textract.process("/Users/fze/Dropbox (LCG Team)/LCG Folder (1)/BD Scan Automation/Python codes/Test.PDF")
print(text)

Thank you!

kubatucka · Accepted Answer · 2021-09-30 14:52:05Z


Here you go:

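import urllib.request
import textract

def download_file(download_url, filename):
    # Save the download as "<filename>.pdf" in the working directory
    # and return the path so the caller reads the same file back.
    path = filename + ".pdf"
    with urllib.request.urlopen(download_url) as response:
        with open(path, 'wb') as file:
            file.write(response.read())
    return path

df['Text'] = ''

for i in range(df.shape[0]):
    pdf_link = df.iloc[i, 0]
    path = download_file(pdf_link, f"pdf_{i}")
    # textract.process returns bytes, so decode to get a str.
    text = textract.process(path).decode('utf-8')
    # Write through .loc; df['Text'][i] = ... is chained assignment
    # and may not update the DataFrame.
    df.loc[df.index[i], 'Text'] = text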
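
The DataFrame in the question also lists .docx and .xlsx links, while the loop above saves every download with a .pdf suffix. As far as I know, textract chooses its parser from the file extension, so a mislabelled file is a likely cause of extraction failures. Below is a minimal sketch, using only the standard library, of deriving the local filename from the link's real extension. The helper names local_name_for and is_pdf and the "file_" prefix are hypothetical, not part of the original answer.

```python
import posixpath
from urllib.parse import unquote, urlparse


def local_name_for(url, index):
    """Build a local filename such as 'file_2.docx' from the link's own extension."""
    # Look only at the URL path, so a query string cannot confuse the suffix.
    path = unquote(urlparse(url).path)
    extension = posixpath.splitext(path)[1].lower()
    return f"file_{index}{extension}"


def is_pdf(url):
    """Return True when the link's path ends in .pdf (case-insensitive)."""
    return local_name_for(url, 0).endswith(".pdf")
```

Inside the loop, one could either skip rows where is_pdf(pdf_link) is false, or save each file under local_name_for(pdf_link, i) so that the extractor sees the correct extension.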


5 Comments

Fatima El Mansouri Over a year ago

Hey, thanks a bunch! But I get the following error: ShellError: The command pdf2txt.py /Users/fatimazahraelmansouri/pdf_2.PDF failed with exit code 1 ------------- stdout ------------- b'' ------------- stderr ------------- Any ideas why?

Fatima El Mansouri Over a year ago

I think I figured it out! The third link is a docx, not a PDF. Any ideas how to extract text from a docx document? Thanks again! You're extremely helpful.

kubatucka Over a year ago

A docx is already a text document, so it should be easy. Check python-docx.readthedocs.io/en/latest.

Fatima El Mansouri Over a year ago

Thanks! This helped me advance tremendously in a very important professional project. Very grateful for people willing to help like yourself! You're the real OG!!

Fatima El Mansouri Over a year ago

Could you please solve this problem as well? stackoverflow.com/questions/69395489/…
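
On the docx question raised in the comments: a .docx file is a zip archive whose main text lives in word/document.xml, so paragraph text can be pulled out with the standard library alone. The sketch below takes that route instead of python-docx (the library suggested above, which offers a higher-level API); the function name docx_to_text is hypothetical, and tables, headers and footnotes are not handled specially.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the paragraph text of a .docx file, one paragraph per line."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's visible text is split across several <w:t> runs.
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return "\n".join(paragraphs)
```

For this to work in the download loop, the file has to be saved with its real content (the downloaded bytes are unchanged either way), and the extractor should be chosen per file type rather than always treating the file as a PDF.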
