Web scraping PDFs in Python across multiple links

Question

I am trying to webscrape this website. To do so, I wrote the following code which works nicely:

from bs4 import BeautifulSoup
import pandas as pd
import requests

payload = 'from=&till=&objid=cbspeeches&page=&paging_length=10&sort_list=date_desc&theme=cbspeeches&ml=false&mlurl=&emptylisttext='
url= 'https://www.bis.org/doclist/cbspeeches.htm'
headers= {
    "content-type": "application/x-www-form-urlencoded",
    "X-Requested-With": "XMLHttpRequest"
    }

req=requests.post(url,headers=headers,data=payload)
soup = BeautifulSoup(req.content, "lxml")
data=[]
for card in soup.select('.documentList tbody tr'):
    r = BeautifulSoup(requests.get(f"https://www.bis.org{card.a.get('href')}").content, "lxml")
    data.append({
        'date': card.select_one('.item_date').get_text(strip=True),
        'title': card.select_one('.title a').get_text(strip=True),
        'author': card.select_one('.authorlnk.dashed').get_text(strip=True),
        'url': f"https://www.bis.org{card.a.get('href')}",
        'text': r.select_one('#cmsContent').get_text('\n\n', strip=True)
        })

pd.DataFrame(data)

However, if you open, for example, the first link on the page, it contains a PDF. Whenever a linked page contains a PDF, I would like to add the content of that PDF to my dataframe.

To do so, I have been looking around and I tried the following only on the first pdf of the first link:

import io
from PyPDF2 import PdfFileReader


def info(pdf_path):
    response = requests.get(pdf_path)
     
    with io.BytesIO(response.content) as f:
        pdf = PdfFileReader(f)
        information = pdf.getDocumentInfo()
        number_of_pages = pdf.getNumPages()
 
    txt = f"""
    Information about {pdf_path}:
 
    Author: {information.author}
    Creator: {information.creator}
    Producer: {information.producer}
    Subject: {information.subject}
    Title: {information.title}
    Number of pages: {number_of_pages}
    """
    print(txt)
    return information
 
info('https://www.bis.org/review/r220708e.pdf')

However, it just gets the info (which I already have from the previous code), while it is missing the text. Ideally, I would like it to be part of the same code as above. I got stuck here.

Can anyone help me with this?

Thanks!

Please use PdfReader instead of PdfFileReader. Also, please use .metadata instead of .getDocumentInfo(), and len(pdf.pages) instead of pdf.getNumPages(). See: pypdf2.readthedocs.io/en/latest/meta/CHANGELOG.html#id55
– Martin Thoma, Commented Jul 10, 2022 at 19:00
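
Applied to the info function from the question, the renames in this comment might look roughly like the sketch below. format_info is a hypothetical helper introduced here to keep the formatting separate from the download, and the timeout value is an assumption. The requests and PyPDF2 imports sit inside info so that format_info can be tried without them.

```python
import io


def format_info(pdf_path, metadata, number_of_pages):
    """Build the summary string; metadata may be None if the PDF carries no info."""
    fields = ("author", "creator", "producer", "subject", "title")
    lines = [f"Information about {pdf_path}:"]
    for name in fields:
        # getattr with a default also copes with metadata being None
        lines.append(f"{name.capitalize()}: {getattr(metadata, name, None)}")
    lines.append(f"Number of pages: {number_of_pages}")
    return "\n".join(lines)


def info(pdf_path):
    # Third-party imports are local so format_info works without them
    import requests
    from PyPDF2 import PdfReader  # replaces PdfFileReader

    response = requests.get(pdf_path, timeout=60)
    response.raise_for_status()
    reader = PdfReader(io.BytesIO(response.content))
    metadata = reader.metadata           # replaces getDocumentInfo()
    number_of_pages = len(reader.pages)  # replaces getNumPages()
    txt = format_info(pdf_path, metadata, number_of_pages)
    print(txt)
    return txt, metadata
```

Calling info('https://www.bis.org/review/r220708e.pdf') would then return both the formatted summary and the metadata object.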

Martin Thoma · Accepted Answer · 2022-07-10 18:56:15Z

3

You need to return it, e.g. as a tuple:

return txt, information

If you want the text inside the pdf:

text = ""
for page in pdf.pages:
    text += page.extract_text() + "\n"
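
Combined with the in-memory download from the question, the loop above could be wrapped as in the following sketch. It is not part of the original answer: join_page_texts and pdf_url_to_text are hypothetical names, the timeout is an assumption, and the `or ""` guard is a defensive choice for pages that yield no text.

```python
import io


def join_page_texts(page_texts):
    """Join per-page strings with newlines, treating None as an empty page."""
    return "\n".join(text or "" for text in page_texts)


def pdf_bytes_to_text(pdf_bytes):
    # Imported here so join_page_texts stays usable without PyPDF2 installed
    from PyPDF2 import PdfReader

    # PdfReader accepts a file-like object, so nothing has to be written to disk
    reader = PdfReader(io.BytesIO(pdf_bytes))
    return join_page_texts(page.extract_text() for page in reader.pages)


def pdf_url_to_text(url):
    import requests

    response = requests.get(url, timeout=60)
    response.raise_for_status()
    return pdf_bytes_to_text(response.content)
```

With this in place, pdf_url_to_text('https://www.bis.org/review/r220708e.pdf') should return the text of every page as one string.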

edited Jul 10, 2022 at 18:56

answered Jul 10, 2022 at 18:22

Martin Thoma


8 Comments

Rollo99 Over a year ago

If I run pdf.extract_text(), I get AttributeError: 'PdfFileReader' object has no attribute 'extract_text'

Martin Thoma Over a year ago

Ah, sorry, pdf.pages[page_number].extract_text()

Rollo99 Over a year ago

Now it gives: AttributeError: '_VirtualList' object has no attribute 'extract_text'

Martin Thoma Over a year ago

You probably forgot to use the index, e.g. pdf.pages[0].extract_text(). Please check the examples: pypdf2.readthedocs.io/en/latest/user/extract-text.html

Rollo99 Over a year ago

it doesn't work in that way


Barry the Platipus · Accepted Answer · 2022-07-17 17:10:19Z

1

I'll allow you the pleasure of adapting this to your requests, sync scraping fashion (really not hard):

from httpx import AsyncClient
from PyPDF2 import PdfReader
...
async def get_full_content(url):
    async with AsyncClient(headers=headers, timeout=60.0, follow_redirects=True) as client:
        if url[-3:] == 'pdf':
            r = await client.get(url)
            filename = url.split("/")[-1]
            # Close the file before reading it back so every byte is flushed to disk
            with open(filename, 'wb') as f:
                f.write(r.content)
            reader = PdfReader(filename)
            pdf_text = ''
            for page in reader.pages:
                pdf_text += page.extract_text()
            # Hand the extracted text back to the caller
            return pdf_text

And then you do something with the pdf_text extracted from .pdf (saving it into a db, reading it with pandas, nlp-ing it with Transformers/torch, etc).

Edit: one more thing: do a pip install -U pypdf2 as the package was recently updated (a few hours ago), just to make sure you're up to date.

Edit 2: A copy/pastable example, for a single .pdf file:

from PyPDF2 import PdfReader
import requests

url = 'https://www.bis.org/review/r220708e.pdf'
filename = url.split("/")[-1]

r = requests.get(url)
# Write the download to disk and close the file before reading it back
with open(filename, 'wb') as f:
    f.write(r.content)

reader = PdfReader(filename)
pdf_text = ''
for page in reader.pages:
    pdf_text += page.extract_text()
print(pdf_text)
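
To fold this into the loop from the question, one option is to look for a link ending in .pdf on each speech page and extract its text when one is found. The sketch below rests on assumptions: pdf_url_from_hrefs and pdf_text_for_page are hypothetical helpers, and it takes the first .pdf link anywhere on the page, which may need narrowing for the real site.

```python
from urllib.parse import urljoin, urlparse

BASE_URL = "https://www.bis.org"


def pdf_url_from_hrefs(hrefs, base_url=BASE_URL):
    """Return the absolute URL of the first href whose path ends in .pdf, else None."""
    for href in hrefs:
        if href and urlparse(href).path.lower().endswith(".pdf"):
            return urljoin(base_url, href)
    return None


def pdf_text_for_page(page_soup):
    """Return the text of the first PDF linked from a parsed page, or None."""
    # Third-party imports are local so pdf_url_from_hrefs works without them
    import io
    import requests
    from PyPDF2 import PdfReader

    # page_soup is assumed to be a BeautifulSoup object for one speech page
    pdf_url = pdf_url_from_hrefs(a.get("href") for a in page_soup.find_all("a"))
    if pdf_url is None:
        return None
    response = requests.get(pdf_url, timeout=60)
    response.raise_for_status()
    reader = PdfReader(io.BytesIO(response.content))
    return "".join(page.extract_text() for page in reader.pages)
```

In the question's loop, r is already the parsed speech page, so adding 'pdf_text': pdf_text_for_page(r) to the dictionary would give a column that is None wherever no PDF link is found.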

edited Jul 17, 2022 at 17:10

answered Jul 10, 2022 at 21:15

Barry the Platipus


2 Comments

Rollo99 Over a year ago

When I run your code on my url, I get the following: <coroutine object get_full_content at 0x00000178294B61C8>. How can I extract text from this then? Thanks

Barry the Platipus Over a year ago

The point is not to copy paste my code, but instead to read, understand and adapt it (that was my hope anyway). I updated my response with a copypastable example for a single .pdf file.
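
On the coroutine question raised in the comments: calling an async def function only creates a coroutine object; its body runs once the coroutine is awaited or handed to an event loop. A minimal sketch follows, using a stand-in get_full_content that does no network access (an assumption made purely for illustration) and a hypothetical get_all wrapper.

```python
import asyncio


async def get_full_content(url):
    # Stand-in for the coroutine in the answer; the real one would
    # download the PDF and extract its text
    await asyncio.sleep(0)
    return f"text extracted from {url}"


async def get_all(urls):
    # gather runs the coroutines concurrently and keeps results in input order
    return await asyncio.gather(*(get_full_content(u) for u in urls))


urls = ["https://www.bis.org/review/r220708e.pdf"]
# asyncio.run starts an event loop, runs the coroutine to completion and returns its result
texts = asyncio.run(get_all(urls))
print(texts[0])
```

In a Jupyter notebook an event loop is already running, so asyncio.run raises a RuntimeError there; writing texts = await get_all(urls) directly in a cell is the usual alternative.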
