
I was given help with the splitlines() function, which worked perfectly on string output that wasn't separated by page numbers; see How to Create Spark or Pandas Dataframe from str output in Apache Spark on Databricks

I am now using str_output = result.pages as opposed to str_output = result.content

Now, when I execute

df_data = pd.DataFrame({'ColumnA':str_output.splitlines()})
df_data

I get the following error:

AttributeError: 'list' object has no attribute 'splitlines'

I think it's because of the way I'm using the splitlines function, but I'm not sure.

Any help appreciated

I should show the full code, see below:

import pandas as pd
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

# field_list = ["result.content"]

document_analysis_client = DocumentAnalysisClient(
    endpoint=endpoint, credential=AzureKeyCredential(key)
)

for blob in container.list_blobs():
  blob_url = container_url + "/" + blob.name
  poller = document_analysis_client.begin_analyze_document_from_url(
            "prebuilt-read", blob_url)
  result = poller.result()
  print("Scanning " + blob.name + "...")
  print ("document contains", result.content)

myoutput = result.pages

df_data = pd.DataFrame({'RAWTEXT':myoutput.splitlines()})
df_data

As requested, a sample of the data is as follows:

Scanning 05Jul11 Raet Prelim.pdf... document contains PRELIMINARY REPORT RAET HOLDING B.V. 5 JULY 2011 1 RæT CONTENTS 1 INVESTMENT PROPOSAL ............................................................................................................ 5 1.1 Background to business................................................................................................................ 5 1.2 Process ........................................................................................................................................ 6 1.2.1 Overview .............................................................................................................................. 6 1.2.2 Due Diligence ....................................................................................................................... 7 1.2.3 Banking / Financing .............................................................................................................. 8 1.2.4 Proposed Tactics / Recommendation .................................................................................... 8 1.3 Investment Overview .................................................................................................................... 9 1.3.1 Investment thesis .................................................................................................................. 9 1.3.2 Business Strengths ............................................................................................................... 9 1.3.3 Investment Case Returns .....................................................................................................11 1.4 Key judgment calls ......................................................................................................................12 1.5 Recommendation ........................................................................................................................18 2 MARKET AND BUSINESS

  • The issue is that you're expecting str_output to be a string, but it's actually a list. You probably want a for loop like for page in result.pages: and to use page.splitlines() rather than str_output.splitlines(). Inserting a print(type(str_output)) might also clarify things. Commented Jun 8, 2022 at 14:09
  • Hi Sarah, thanks so much for reaching out. I should point out that my coding skills aren't as advanced as yours. I have updated the question with the full code. If you could show me where I ought to make the amendments, that would be most helpful. Sorry for being lazy, but I need to produce some results quickly for my manager. Commented Jun 8, 2022 at 14:16

1 Answer

Here str_output is a list, while splitlines() is a method of string objects. If you pass str_output directly as the value in the dictionary, you shouldn't face this error.

df_data = pd.DataFrame({'ColumnA':str_output})

If this doesn't help then please put a sample of the data in str_output in the question.
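
If the goal is one row per line of text rather than one row per page object, a loop over the pages may be closer to what you want. The following is only a minimal sketch: it assumes each item in result.pages has a page_number attribute and a lines list whose entries carry their text in content, and it uses stand-in objects so it can run without calling the service. The helper name pages_to_rows and the PAGE column are made up for illustration.

```python
from types import SimpleNamespace

# Stand-ins for the objects in result.pages, so the sketch runs without Azure.
# Assumption: each page has .page_number and .lines, and each line has .content.
pages = [
    SimpleNamespace(page_number=1, lines=[
        SimpleNamespace(content="PRELIMINARY REPORT"),
        SimpleNamespace(content="RAET HOLDING B.V."),
    ]),
    SimpleNamespace(page_number=2, lines=[
        SimpleNamespace(content="CONTENTS"),
    ]),
]


def pages_to_rows(pages):
    """Flatten a list of pages into one dict per text line (hypothetical helper)."""
    rows = []
    for page in pages:
        for line in page.lines:
            rows.append({"PAGE": page.page_number, "RAWTEXT": line.content})
    return rows


rows = pages_to_rows(pages)
```

With the real result, pages_to_rows(result.pages) followed by pd.DataFrame(rows) should then give one row per line, with the page number alongside the text. Note that in the question's code result is only read after the loop, so only the last blob would be processed unless this step is moved inside the loop.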


5 Comments

Zero, I did the following: df_data = pd.DataFrame({'RAWTEXT':myoutput}), and I got the following output:
0 DocumentPage(kind=document, page_number=1, ang...
1 DocumentPage(kind=document, page_number=2, ang...
Any further help much appreciated.
@Patterson Please add this to the question for more clarity.
Zero, I have added some sample output. Just so you know, there are 45 pages in total for this particular document.
Hi, did my sample help? Or make things worse?
