
I would like to iterate through PDF links stored in a pandas DataFrame. The goal is to open each link, save the PDF, extract its text, and store that text in a new column next to the corresponding link.

Dataframe looks like this:

    URL
0   https://westafricatradehub.com/wp-content/uploads/2021/07/RFA-WATIH-1295_Senegal-RMNCAH-Activity_English-Version.pdf
1   https://westafricatradehub.com/wp-content/uploads/2021/07/RFA-WATIH-1295_Activit%C3%A9-RMNCAH-S%C3%A9n%C3%A9gal_Version-Fran%C3%A7aise.pdf
2   https://westafricatradehub.com/wp-content/uploads/2021/07/Attachment-2_Full-Application-Template_Senegal-RMNCAH-Activity_English-Version.docx
3   https://westafricatradehub.com/wp-content/uploads/2021/07/Pi%C3%A8ce-Jointe-2_Mod%C3%A8le-de-Demande-Complet_Activit%C3%A9-RMNCAH-S%C3%A9n%C3%A9gal_Version-Fran%C3%A7aise.docx
4   https://westafricatradehub.com/wp-content/uploads/2021/07/Attachment-3_Trade-Hub-Performance-Indicators-Table.xlsx
5   https://westafricatradehub.com/wp-content/uploads/2021/07/Attachment-10_Project-Budget-Template-RMNCAH.xlsx
6   https://westafricatradehub.com/wp-content/uploads/2021/08/Senegal-Health-RFA-Webinar-QA.pdf
7   https://westafricatradehub.com/wp-content/uploads/2021/02/APS-WATIH-1021_Catalytic-Business-Concepts-Round-2.pdf
8   https://westafricatradehub.com/wp-content/uploads/2021/02/APS-WATIH-1021_Concepts-d%E2%80%99Affaires-Catalytiques-2ieme-Tour.pdf
9   https://westafricatradehub.com/wp-content/uploads/2021/06/APS-WATIH-1247_Research-Development-Round-2.pdf

I was able to do this for a single link, but not for the whole DataFrame:

import urllib.request
import textract

pdf_link = "https://westafricatradehub.com/wp-content/uploads/2021/07/RFA-WATIH-1295_Senegal-RMNCAH-Activity_English-Version.pdf"

def download_file(download_url, filename):
    # Download the file and save it as <filename>.pdf in the working directory
    response = urllib.request.urlopen(download_url)
    with open(filename + ".pdf", 'wb') as file:
        file.write(response.read())

download_file(pdf_link, "Test")

# Code to extract text from the PDF
text = textract.process("/Users/fze/Dropbox (LCG Team)/LCG Folder (1)/BD Scan Automation/Python codes/Test.pdf")
print(text)

Thank you!

1 Answer

Here you go:

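import urllib.request
import textract

def download_file(download_url, filename):
    # Download the file and save it as <filename>.pdf in the working directory
    response = urllib.request.urlopen(download_url)
    with open(filename + ".pdf", 'wb') as file:
        file.write(response.read())

df['Text'] = ''

for i in range(df.shape[0]):
    pdf_link = df.iloc[i, 0]
    download_file(pdf_link, f"pdf_{i}")
    # This path must point to the directory where download_file saved the file
    text = textract.process(f"/Users/fze/Dropbox (LCG Team)/LCG Folder (1)/BD Scan Automation/Python codes/pdf_{i}.pdf")
    # textract.process returns bytes; decode to str and avoid chained assignment
    df.loc[df.index[i], 'Text'] = text.decode('utf-8')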
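
Note that the DataFrame in the question also contains .docx and .xlsx links, and the loop above saves every file with a .pdf extension. A minimal sketch of one way to guard against that is to check the extension in the URL before downloading. The helper names `file_extension` and `extract_text_from_url` are made up for illustration, and returning `None` for non-PDF links is just one possible choice:

```python
import os
import urllib.request
from urllib.parse import urlparse, unquote


def file_extension(url):
    """Return the lowercase file extension of the URL path, e.g. '.pdf'."""
    path = unquote(urlparse(url).path)
    return os.path.splitext(path)[1].lower()


def extract_text_from_url(url, basename):
    """Download url and return its text, or None if the link is not a PDF.

    basename is a hypothetical local file name without extension, e.g. 'pdf_0'.
    """
    ext = file_extension(url)
    if ext != ".pdf":
        return None  # skip .docx/.xlsx links here; handle them separately
    import textract  # imported lazily so file_extension works without it
    local_path = basename + ext
    urllib.request.urlretrieve(url, local_path)
    # textract.process returns bytes, so decode to get a str
    return textract.process(local_path).decode("utf-8")


# Usage sketch, assuming df holds the links in its first column:
# df["Text"] = [extract_text_from_url(u, f"pdf_{i}")
#               for i, u in enumerate(df.iloc[:, 0])]
```

With this, rows that point to Word or Excel files end up with an empty value in the Text column instead of raising an error.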

5 Comments

Hey, thanks a bunch! But I get the following error: ShellError: The command pdf2txt.py /Users/fatimazahraelmansouri/pdf_2.PDF failed with exit code 1 ------------- stdout ------------- b''------------- stderr ------------- Any ideas why?
I think I figured it out! The third link is a docx, not a PDF. Any ideas how to extract text from a docx document? Thanks again, you're extremely helpful!
A docx is already a text document, so it should be easy. Check python-docx.readthedocs.io/en/latest.
Thanks! This helped me tremendously to advance a very important professional project. Very grateful for people willing to help like yourself! You're the real OG!
Could you please solve this problem as well? stackoverflow.com/questions/69395489/…
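
On the docx follow-up above: python-docx is the library suggested in the comments and is probably the more complete option. As a rough alternative that needs only the standard library, a .docx file is a ZIP archive whose main text lives in word/document.xml, so the paragraphs can be read directly. This is a sketch under that assumption, not the python-docx approach; `docx_to_text` is a made-up helper name, and it ignores headers, footers, and footnotes:

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by WordprocessingML elements such as <w:p> and <w:t>
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the body text of a .docx file, one paragraph per line."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Each paragraph is split into runs; join their text pieces
        runs = [node.text for node in para.iter(W_NS + "t") if node.text]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

The result is a plain str, so it can be stored in the Text column the same way as the PDF text.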
