AttributeError: 'list' object has no attribute 'splitlines' when converting code to Pandas Dataframe using function splitlines()

Question

I was given help with the splitlines() function, which worked perfectly on string output that wasn't separated by page numbers; see How to Create Spark or Pandas Dataframe from str output in Apache Spark on Databricks

I am now using str_output = result.pages as opposed to str_output = result.content

Now, when I execute

df_data = pd.DataFrame({'ColumnA':str_output.splitlines()})
df_data

I get the following error:

AttributeError: 'list' object has no attribute 'splitlines'

I think it's because of the way that I'm using the splitlines function, but I'm not sure.

Any help appreciated

I should show the full code, see below:

import pandas as pd
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

# field_list = ["result.content"]

document_analysis_client = DocumentAnalysisClient(
endpoint=endpoint, credential=AzureKeyCredential(key)
)

for blob in container.list_blobs():
  blob_url = container_url + "/" + blob.name
  poller = document_analysis_client.begin_analyze_document_from_url(
            "prebuilt-read", blob_url)
  result = poller.result()
  print("Scanning " + blob.name + "...")
  print ("document contains", result.content)

myoutput = result.pages

df_data = pd.DataFrame({'RAWTEXT':myoutput.splitlines()})
df_data

As requested, a sample of the data is as follows:

Scanning 05Jul11 Raet Prelim.pdf... document contains PRELIMINARY REPORT RAET HOLDING B.V. 5 JULY 2011 1 RæT CONTENTS 1 INVESTMENT PROPOSAL ............................................................................................................ 5 1.1 Background to business................................................................................................................ 5 1.2 Process ........................................................................................................................................ 6 1.2.1 Overview .............................................................................................................................. 6 1.2.2 Due Diligence ....................................................................................................................... 7 1.2.3 Banking / Financing .............................................................................................................. 8 1.2.4 Proposed Tactics / Recommendation .................................................................................... 8 1.3 Investment Overview .................................................................................................................... 9 1.3.1 Investment thesis .................................................................................................................. 9 1.3.2 Business Strengths ............................................................................................................... 9 1.3.3 Investment Case Returns .....................................................................................................11 1.4 Key judgment calls ......................................................................................................................12 1.5 Recommendation ........................................................................................................................18 2 MARKET AND BUSINESS

The issue is that you're expecting str_output to be a string, but it's actually a list. You probably want a for loop like for page in result.pages: and to use page.splitlines() rather than str_output.splitlines(). Inserting a print(type(str_output)) might also clarify things.
– Sarah Messer, Commented Jun 8, 2022 at 14:09
Hi Sarah, thanks so much for reaching out. I should point out that my coding skills aren't as advanced as yours. I have updated the question with the full code. If you could show me where I ought to make the amendments, that would be most helpful. Sorry for being lazy, but I need to produce some results quickly for my manager.
– Patterson, Commented Jun 8, 2022 at 14:16

Zero · Accepted Answer · 2022-06-08 14:17:52Z

Here str_output is a list, while splitlines() is a method of string objects. If you pass str_output directly as the value in the dictionary, without calling splitlines() on it, you shouldn't face this error.

df_data = pd.DataFrame({'ColumnA':str_output})

If this doesn't help then please put a sample of the data in str_output in the question.
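To make the type mismatch concrete, here is a minimal sketch that reproduces the error without calling the Azure service. The values of content and pages are hypothetical stand-ins for result.content (a string) and result.pages (a list); they are not real Form Recognizer output.

```python
# Hypothetical stand-ins: a string like result.content and a list like result.pages.
content = "PRELIMINARY REPORT\nRAET HOLDING B.V.\n5 JULY 2011"
pages = ["page one placeholder", "page two placeholder"]

# Checking the type first shows which methods are available.
print(type(content).__name__)  # str
print(type(pages).__name__)    # list

# str has splitlines(), so this works and returns a list of lines.
lines = content.splitlines()
print(lines)

# list has no splitlines(), so the same call raises AttributeError.
try:
    pages.splitlines()
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)
```

The same print(type(...)) check on the real str_output is a quick way to confirm what the service returned before choosing how to build the DataFrame.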

answered Jun 8, 2022 at 14:17


5 Comments

Patterson Over a year ago

Zero, I did the following df_data = pd.DataFrame({'RAWTEXT':myoutput}) and I got the following output 0 DocumentPage(kind=document, page_number=1, ang... 1 DocumentPage(kind=document, page_number=2, ang...

Patterson Over a year ago

Any further help much appreciated.

Zero Over a year ago

@Patterson Please add this to the question for more clarity.

Patterson Over a year ago

Zero, I have added some sample output. Just so you know, there are 45 pages in total for this particular document.

Patterson Over a year ago

Hi, did my sample help? Or make things worse?
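The DocumentPage(...) values in the comment above suggest that each list element is a page object rather than text. A possible way forward, sketched under assumptions: this assumes each page exposes a page_number and a lines collection whose items carry their text in a content attribute, which should be checked against the installed azure-ai-formrecognizer version. FakePage, FakeLine, and pages_to_rows are hypothetical names used only so the sketch runs without the Azure service.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class FakeLine:
    # Hypothetical stand-in for a line object; assumes text lives in .content.
    content: str


@dataclass
class FakePage:
    # Hypothetical stand-in for a page object; assumes .page_number and .lines.
    page_number: int
    lines: List[FakeLine] = field(default_factory=list)


def pages_to_rows(pages) -> List[Dict[str, Union[int, str]]]:
    """Flatten page objects into one row per text line, tagged with its page."""
    rows = []
    for page in pages:
        for line in page.lines:
            rows.append({"PAGE": page.page_number, "RAWTEXT": line.content})
    return rows


# Illustrative data only, not real service output.
sample_pages = [
    FakePage(1, [FakeLine("PRELIMINARY REPORT"), FakeLine("RAET HOLDING B.V.")]),
    FakePage(2, [FakeLine("CONTENTS")]),
]

rows = pages_to_rows(sample_pages)
print(rows)
```

With the real result, rows = pages_to_rows(result.pages) followed by df_data = pd.DataFrame(rows) would give one DataFrame row per line, with the page number kept in its own column. Note that in the posted code myoutput = result.pages sits outside the for loop, so it only sees the last blob scanned.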
