I am trying to scrape this website. To do so, I wrote the following code, which works nicely:

from bs4 import BeautifulSoup
import pandas as pd
import requests

payload = 'from=&till=&objid=cbspeeches&page=&paging_length=10&sort_list=date_desc&theme=cbspeeches&ml=false&mlurl=&emptylisttext='
url= 'https://www.bis.org/doclist/cbspeeches.htm'
headers= {
    "content-type": "application/x-www-form-urlencoded",
    "X-Requested-With": "XMLHttpRequest"
    }

req=requests.post(url,headers=headers,data=payload)
soup = BeautifulSoup(req.content, "lxml")
data=[]
for card in soup.select('.documentList tbody tr'):
    r = BeautifulSoup(requests.get(f"https://www.bis.org{card.a.get('href')}").content, "lxml")
    data.append({
        'date': card.select_one('.item_date').get_text(strip=True),
        'title': card.select_one('.title a').get_text(strip=True),
        'author': card.select_one('.authorlnk.dashed').get_text(strip=True),
        'url': f"https://www.bis.org{card.a.get('href')}",
        'text': r.select_one('#cmsContent').get_text('\n\n', strip=True)
        })

pd.DataFrame(data)

However, if you open, for example, the first link on the page, it contains a PDF. Whenever a linked page contains a PDF, I would like to add the content of that PDF to my dataframe.

To do so, I looked around and tried the following on just the first PDF from the first link:

import io
from PyPDF2 import PdfFileReader


def info(pdf_path):
    response = requests.get(pdf_path)
     
    with io.BytesIO(response.content) as f:
        pdf = PdfFileReader(f)
        information = pdf.getDocumentInfo()
        number_of_pages = pdf.getNumPages()
 
    txt = f"""
    Information about {pdf_path}:
 
    Author: {information.author}
    Creator: {information.creator}
    Producer: {information.producer}
    Subject: {information.subject}
    Title: {information.title}
    Number of pages: {number_of_pages}
    """
    print(txt)
    return information
 
info('https://www.bis.org/review/r220708e.pdf')
  

However, it only gets the metadata (which I already have from the previous code); the text is missing. Ideally, I would like this to be part of the same code as above. I am stuck here.

Can anyone help me with this?

Thanks!

2 Answers

You need to return it, e.g. as a tuple:

return txt, information

If you want the text inside the pdf:

text = ""
for page in pdf.pages:
    text += page.extract_text() + "\n"

8 Comments

If I run pdf.extract_text(), I get AttributeError: 'PdfFileReader' object has no attribute 'extract_text'
Ah, sorry, pdf.pages[page_number].extract_text()
Now it gives: AttributeError: '_VirtualList' object has no attribute 'extract_text'
You probably forgot to use the index, e.g. pdf.pages[0].extract_text(). Please check the examples: pypdf2.readthedocs.io/en/latest/user/extract-text.html
it doesn't work in that way

I'll leave you the pleasure of adapting this to your synchronous, requests-based scraping (it's really not hard):

from httpx import AsyncClient  # assumption: AsyncClient here is httpx's async client
from PyPDF2 import PdfReader
...
async def get_full_content(url):
    async with AsyncClient(headers=headers, timeout=60.0, follow_redirects=True) as client:
        if url[-3:] == 'pdf':
            r = await client.get(url)
            filename = url.split("/")[-1]
            with open(filename, 'wb') as f:
                f.write(r.content)
            # Read the file back only after it has been closed (and therefore flushed).
            reader = PdfReader(filename)
            pdf_text = ''
            for page in reader.pages:
                pdf_text += page.extract_text()
            return pdf_text

And then you do something with the pdf_text extracted from .pdf (saving it into a db, reading it with pandas, nlp-ing it with Transformers/torch, etc).


Edit: one more thing: run pip install -U pypdf2, as the package was updated recently (a few hours ago), just to make sure you're up to date.

Edit 2: A copy/pastable example, for a single .pdf file:

from PyPDF2 import PdfReader
import requests

url = 'https://www.bis.org/review/r220708e.pdf'
filename = url.split("/")[-1]

r = requests.get(url)
with open(filename, 'wb') as f:
    f.write(r.content)

# Open the file for reading only after the `with` block has closed (and flushed) it.
reader = PdfReader(filename)
pdf_text = ''
for page in reader.pages:
    pdf_text += page.extract_text()
print(pdf_text)
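
To plug this into the page loop from the question, each speech page first has to be checked for PDF links. The following is a rough sketch of that step only; the function name find_pdf_links and the base_url default are assumptions for illustration, and it simply treats any href ending in .pdf as a PDF.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def find_pdf_links(html, base_url="https://www.bis.org"):
    """Return absolute URLs of all links in `html` whose href ends in .pdf (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.select("a[href]"):
        href = a["href"]
        if href.lower().endswith(".pdf"):
            # Resolve site-relative hrefs such as /review/r220708e.pdf
            links.append(urljoin(base_url, href))
    return links
```

Inside the loop, the HTML of each speech page would be passed to find_pdf_links, and every URL it returns could then be downloaded and read exactly as in the single-file example above, storing the result in an extra 'pdf_text' key of the row (empty when the page has no PDF).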

2 Comments

When I run your code on my url, I get the following: <coroutine object get_full_content at 0x00000178294B61C8>. How can I extract text from this then? Thanks
The point is not to copy paste my code, but instead to read, understand and adapt it (that was my hope anyway). I updated my response with a copypastable example for a single .pdf file.
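
On the coroutine message from the first comment: calling an async function only creates a coroutine object, and nothing runs until it is awaited or handed to an event loop. A small sketch with a stand-in function (get_text is hypothetical and does no real download):

```python
import asyncio


async def get_text(url):
    # Stand-in for an async download-and-extract function (hypothetical).
    await asyncio.sleep(0)
    return f"text of {url}"


# Calling the function returns a coroutine object; the body has not run yet.
coro = get_text("https://www.bis.org/review/r220708e.pdf")

# asyncio.run() drives the coroutine to completion and returns its result.
result = asyncio.run(coro)
print(result)
```

In a plain script asyncio.run() is the usual entry point. In environments that already run an event loop, such as Jupyter, it raises an error, and awaiting the coroutine directly in a cell is the usual alternative.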
