I am trying to load an online PDF with the Python LangChain library from: http://datasheet.octopart.com/CL05B683KO5NNNC-Samsung-Electro-Mechanics-datasheet-136482222.pdf

This is the code that I'm running locally:

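from langchain.document_loaders import PyPDFLoader
datasheet_path = "http://datasheet.octopart.com/CL05B683KO5NNNC-Samsung-Electro-Mechanics-datasheet-136482222.pdf"
loader = PyPDFLoader(datasheet_path)
pages = loader.load_and_split()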
I am getting the following error:
---------------------------------------------------------------------------
PermissionError                           Traceback (most recent call last)
Cell In[4], line 8
      6 datasheet_path = "http://datasheet.octopart.com/CL05B683KO5NNNC-Samsung-Electro-Mechanics-datasheet-136482222.pdf"
      7 loader = PyPDFLoader(datasheet_path)
----> 8 pages = loader.load_and_split()
     11 query = """

File ***\.venv\lib\site-packages\langchain\document_loaders\base.py:36, in BaseLoader.load_and_split(self, text_splitter)
     34 else:
     35     _text_splitter = text_splitter
---> 36 docs = self.load()
     37 return _text_splitter.split_documents(docs)
...
   (...)
    114         for i, page in enumerate(pdf_reader.pages)
    115     ]

PermissionError: [Errno 13] Permission denied: 'C:\\Users\\****\\AppData\\Local\\Temp\\tmpu_59ngam'

Note 1: Running the same code in Google Colab works well.

Note 2: Running the following code in the same notebook works correctly, so I don't think access to the temp folder is the problem:

with open('C:\\Users\\benis\\AppData\\Local\\Temp\\test.txt', 'w') as h:
    h.write("test")

Note 3: I have tested several different online PDFs and got the same error for all of them.

The code should convert the PDF to text and split it into pages using LangChain and pypdf.

4 Answers

2

You will not succeed with this task using LangChain on Windows with its current implementation. You can take a look at the source code here. Consider the following abridged code:

class BasePDFLoader(BaseLoader, ABC):
    def __init__(self, file_path: str):
        ...
        # If the file is a web path, download it to a temporary file, and use that
        if not os.path.isfile(self.file_path) and self._is_valid_url(self.file_path):
            r = requests.get(self.file_path)

            ...
            self.web_path = self.file_path
            self.temp_file = tempfile.NamedTemporaryFile()
            self.temp_file.write(r.content)
            self.file_path = self.temp_file.name
            ...

    def __del__(self) -> None:
        if hasattr(self, "temp_file"):
            self.temp_file.close()

Note that they open the temporary file in the constructor and close it only in the destructor. Now let's look at the Python documentation on NamedTemporaryFile (emphasis mine, docs are for Python 3.9):

This function operates exactly as TemporaryFile() does, except that the file is guaranteed to have a visible name in the file system (on Unix, the directory entry is not unlinked). That name can be retrieved from the name attribute of the returned file-like object. Whether the name can be used to open the file a second time, while the named temporary file is still open, varies across platforms (it can be so used on Unix; it cannot on Windows).
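
As far as I can tell from the abridged code above, the PDF parser later opens self.file_path by name while self.temp_file is still open, which is the case the documentation says cannot work on Windows. Below is a minimal sketch of a pattern that should behave the same on every platform: create the file with delete=False, close it, and only then reopen it by name. The function name reopen_by_name_after_close is hypothetical and is not part of LangChain.

```python
import os
import tempfile


def reopen_by_name_after_close(payload: bytes) -> bytes:
    """Write payload to a named temp file, close it, then reopen it by name."""
    # delete=False keeps the file on disk after close(), so the path
    # stays valid once the original handle has been released.
    tmp = tempfile.NamedTemporaryFile(delete=False)
    try:
        tmp.write(payload)
        tmp.close()  # release the handle before anything else opens the path
        with open(tmp.name, "rb") as fh:
            return fh.read()
    finally:
        tmp.close()  # harmless if the file is already closed
        os.remove(tmp.name)  # with delete=False, cleanup is our job


print(reopen_by_name_after_close(b"%PDF-1.4 demo"))
```

The trade-off is that nothing deletes the file automatically any more, hence the explicit os.remove in the finally block.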

0

Update the pdf.py file (https://github.com/hwchase17/langchain/blob/5cfa72a130f675c8da5963a11d416f553f692e72/langchain/document_loaders/pdf.py#L65) so that the NamedTemporaryFile is not deleted automatically when it is closed (the file then stays on disk until you remove it yourself):

self.temp_file = tempfile.NamedTemporaryFile(delete=False)

Reference: https://stackoverflow.com/questions/3924117/how-to-use-tempfile-namedtemporaryfile-in-python#:~:text=To%20fix%20this%20use%3A%20tf%20%3D%20tempfile.NamedTemporaryFile%20%28delete%3DFalse%29,won%27t%20let%20you%20open%20it%20using%20another%20application.

Alternatively, there is an open PR in LangChain: https://github.com/hwchase17/langchain/pull/5887/files
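
If you would rather not patch the installed package, one possible workaround is to download the PDF yourself and hand PyPDFLoader a local path, so its own temporary-file branch is never taken. This is only a sketch: save_to_temp_pdf and load_remote_pdf are hypothetical helper names, the import path is the one used elsewhere in this thread, and I am assuming the server allows a plain urllib download.

```python
import os
import tempfile
import urllib.request


def save_to_temp_pdf(content: bytes) -> str:
    """Write content to a temporary .pdf file, close it, and return its path."""
    # delete=False so the file survives the end of the with block,
    # which closes the handle.
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(content)
    return tmp.name


def load_remote_pdf(url: str):
    """Download url and load it through PyPDFLoader as a local file."""
    # Imported lazily so save_to_temp_pdf works without langchain installed.
    from langchain.document_loaders import PyPDFLoader

    with urllib.request.urlopen(url) as response:
        path = save_to_temp_pdf(response.read())
    try:
        return PyPDFLoader(path).load_and_split()
    finally:
        os.remove(path)  # clean up the temporary file ourselves
```

Calling load_remote_pdf(datasheet_path) would then take the place of the two loader lines in the question.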

0

You need pypdf installed:

pip install pypdf -q

Write a reusable function to load a PDF:

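def load_doc(file):
    from langchain.document_loaders import PyPDFLoader

    loader = PyPDFLoader(file)
    # Split into chunks only to print them; the return value below is unsplit
    pages = loader.load_and_split()
    print("pages", pages)
    # Return one Document per PDF page
    return loader.load()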

If you work locally, pass the path of the file as the file argument. If you want to load an online PDF, pass the URL instead:

data = load_doc('https://datasheet.octopart.com/CL05B683KO5NNNC-Samsung-Electro-Mechanics-datasheet-136482222.pdf')
# Index 0 is used throughout so a short PDF does not raise IndexError
print(data[0].page_content)
print(data[0].metadata)
print(f'you have {len(data)} pages in your data')
print(f'there are {len(data[0].page_content)} characters in the first page')

-1

You can just convert the argument you are passing to PyPDFLoader into a string, like this:

loader = PyPDFLoader(file_path=str(pdf_path))

1 Comment

What makes you think the file_path isn't already a string? How will this help with the permission error described in the question?
