I want to transform my XML file into a pandas DataFrame. I tried this code:

import pandas as pd
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("C:/Users/user/Desktop/essai/dataXml.xml", "r"),"xml")

d = {}
for tag in soup.RECORDING.find_all(recursive=False):
    
    d[tag.name] = tag.get_text(strip=True)
df = pd.DataFrame([d])
print(df)

This is a portion of my XML data:


<?xml version="1.0" encoding="utf-8"?>
<sentences>
    <sentence>
        <text>We went again and sat at the bar this time, I had 5 pints of guinness and not one buy-back, I ordered a basket of onion rings and there were about 5 in the basket, the rest was filled with crumbs, the chili was not even edible.</text>
        <aspectCategories>
            <aspectCategory category="place" polarity="neutral"/>
            <aspectCategory category="food" polarity="negative"/>
        </aspectCategories>
    </sentence>
</sentences>

I got this error:

for tag in soup.RECORDING.find_all(recursive=False):
AttributeError: 'NoneType' object has no attribute 'find_all'

How can I fix it? Thank you in advance.

Edit: replacing soup.RECORDING.find_all with soup.find_all fixed the error, but I still don't get the output I want.

I want something like this: [image of the expected output, not reproduced here]

  • Why did you do soup.RECORDING.find_all instead of just soup.find_all? Commented Nov 25, 2021 at 19:17
  • I'm just a beginner :( soup.find_all fixed the error, but I still didn't get what I wanted. Commented Nov 25, 2021 at 19:22
  • Will you please add a sample dataframe containing your expected output to the question? I'll help you if so :) Commented Nov 25, 2021 at 19:27
  • Actually, I don't know if a dataframe is the solution; maybe I need to use a dict. What I need is to manage this data using Python. Thank you in advance. Commented Nov 25, 2021 at 19:36
  • OK, it doesn't matter. Just show the output you expect. Commented Nov 25, 2021 at 19:37

2 Answers

Try this code:

d = {
    'text': [],
    'aspect': [],
    'polarity': []
}

# `soup` is the BeautifulSoup object created in the question
for sentence in soup.find_all('sentence'):
    text = sentence.find('text').text
    # one row per aspectCategory, repeating the sentence text
    for ac in sentence.find_all('aspectCategory'):
        d['text'].append(text)
        d['aspect'].append(ac.get('category'))
        d['polarity'].append(ac.get('polarity'))

df = pd.DataFrame(d)

Output:

>>> df
                                                text aspect  polarity
0  We went again and sat at the bar this time, I ...  place   neutral
1  We went again and sat at the bar this time, I ...   food  negative
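
As for the original AttributeError: BeautifulSoup treats attribute access such as soup.RECORDING as shorthand for soup.find("RECORDING"), and find returns None when no such tag exists, so the chained .find_all call fails. Below is a minimal sketch of that behaviour. It assumes a shortened, hypothetical sample document, and it uses the built-in "html.parser" only so the snippet runs without lxml (the question uses the "xml" parser).

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened sample; not the question's actual file.
xml = "<sentences><sentence><text>Good food.</text></sentence></sentences>"

# Assumption: "html.parser" is swapped in here to avoid the lxml dependency.
soup = BeautifulSoup(xml, "html.parser")

missing = soup.RECORDING   # shorthand for soup.find("RECORDING")
present = soup.sentences   # shorthand for soup.find("sentences")

print(missing)             # None, because there is no <RECORDING> tag
print(present.name)        # the tag that was actually found
```

Calling .find_all on `missing` would raise the same AttributeError as in the question, which is why starting from soup.find_all, or from a tag that really exists in the file, avoids it.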

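Consider pandas.read_xml, added in pandas 1.3.0. Because the sentence text and the aspectCategory attributes sit at different levels of the document, call it twice per sentence and combine the two results. The default parser is lxml, but passing parser="etree" uses the built-in ElementTree and avoids the third-party XML package.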

import pandas as pd
import xml.etree.ElementTree as et

xml_file = "C:/Users/user/Desktop/essai/dataXml.xml"
doc = et.parse(xml_file)

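# read_xml returns one row for the sentence and one row per aspectCategory;
# a cross merge pairs the sentence row with every aspectCategory row
# (a default index join would keep only the first aspectCategory).
df_list = [
    pd.read_xml(xml_file, xpath=f".//sentence[{i}]", parser="etree").merge(
        pd.read_xml(
            xml_file,
            xpath=f".//sentence[{i}]/aspectCategories/*",
            parser="etree",
        ),
        how="cross",
    )
    for i, _ in enumerate(doc.iterfind(".//sentence"), start=1)
]

df = pd.concat(df_list, ignore_index=True)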
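
If a plain list of dicts is enough (as the comments under the question suggest it might be), the same rows can be built with the standard library alone. This is a rough sketch, assuming an inline, shortened copy of the XML rather than the real file; the names xml_data and rows are illustrative only.

```python
import xml.etree.ElementTree as ET

# Assumption: a shortened inline sample with the same structure as the question's file.
xml_data = """
<sentences>
    <sentence>
        <text>The chili was not even edible.</text>
        <aspectCategories>
            <aspectCategory category="place" polarity="neutral"/>
            <aspectCategory category="food" polarity="negative"/>
        </aspectCategories>
    </sentence>
</sentences>
"""

root = ET.fromstring(xml_data.strip())

rows = []
for sentence in root.iter("sentence"):
    # findtext returns the default when the <text> child is missing
    text = sentence.findtext("text", default="")
    # one dict per aspectCategory, repeating the sentence text
    for ac in sentence.iter("aspectCategory"):
        rows.append({
            "text": text,
            "aspect": ac.get("category"),
            "polarity": ac.get("polarity"),
        })

for row in rows:
    print(row)
```

Passing the list to pd.DataFrame(rows) should give the same three columns as the other answer. For the real file, ET.parse(xml_file).getroot() would take the place of ET.fromstring.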
