I know there are several answered questions about XML parsing with Python 3, but I can't find answers to two problems I have. I am trying to parse and extract information from a BoardGameGeek XML file, available at the following URL (it's too long to paste here):

https://www.boardgamegeek.com/xmlapi/boardgame/10

1) I am having trouble extracting the primary game name from these two lines:

<name sortindex="1" primary="true">Elfenland</name>
<name sortindex="1">Elfenland (Волшебное Путешествие)</name>

2) I am also having trouble extracting lists of data, such as in this XML:

<poll title="User Suggested Number of Players" totalvotes="96"  name="suggested_numplayers">
    <results numplayers="1">
        <result numvotes="0" value="Best"/>
        <result numvotes="0" value="Recommended"/>
        <result numvotes="58" value="Not Recommended"/>
    </results>
    <results numplayers="2">
        <result numvotes="2" value="Best"/>
        <result numvotes="21" value="Recommended"/>
        <result numvotes="53" value="Not Recommended"/>
    </results>
    <results numplayers="3">
        <result numvotes="10" value="Best"/>
        <result numvotes="46" value="Recommended"/>
        <result numvotes="17" value="Not Recommended"/>
    </results>
    <results numplayers="4">
        <result numvotes="47" value="Best"/>
        <result numvotes="36" value="Recommended"/>
        <result numvotes="1" value="Not Recommended"/>
    </results>
    <results numplayers="5">
        <result numvotes="35" value="Best"/>
        <result numvotes="44" value="Recommended"/>
        <result numvotes="2" value="Not Recommended"/>
    </results>
    <results numplayers="6">
        <result numvotes="23" value="Best"/>
        <result numvotes="48" value="Recommended"/>
        <result numvotes="11" value="Not Recommended"/>
    </results>
    <results numplayers="6+">
        <result numvotes="0" value="Best"/>
        <result numvotes="1" value="Recommended"/>
        <result numvotes="46" value="Not Recommended"/>
    </results>
</poll>

Currently, my code is very simple and looks like this. It only extracts simple, single-value XML elements. Any help on how to extract the more complex information would be great. Thank you.

import urllib.request
from bs4 import BeautifulSoup

url = 'https://www.boardgamegeek.com/xmlapi/boardgame/10'
response = urllib.request.urlopen(url)
data = response.read()       # a `bytes` object
text = data.decode('utf-8')  # a `str`
soup = BeautifulSoup(text, 'xml')
yearpublished = soup.find_all('yearpublished')

1 Answer

For the first part, search for the "name" element that has the "primary" attribute, like this:

from bs4 import BeautifulSoup
import urllib.request  # a bare `import urllib` does not reliably expose urllib.request

url = 'https://www.boardgamegeek.com/xmlapi/boardgame/10'
response = urllib.request.urlopen(url)
data = response.read()       # a `bytes` object
text = data.decode('utf-8')  # a `str`
soup = BeautifulSoup(text, 'xml')
# primary=True matches any <name> tag that has a "primary" attribute
name = soup.find('name', primary=True)

print(name.get_text())

Outputs:

Elfenland
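
If you would rather avoid the extra dependency, the same lookup can be sketched with the standard library's xml.etree.ElementTree instead of BeautifulSoup. This is a minimal sketch, not the approach used above: the inline sample, the `boardgames`/`boardgame` wrapper elements, and the helper name `primary_name` are assumptions made for illustration so that the snippet runs without a network call.

```python
import xml.etree.ElementTree as ET

# Inline sample so the sketch runs offline. The wrapper elements are an
# assumption about the response layout; only the <name> lines come from
# the question.
SAMPLE = """<boardgames>
  <boardgame objectid="10">
    <name sortindex="1" primary="true">Elfenland</name>
    <name sortindex="1">Elfenland (Волшебное Путешествие)</name>
  </boardgame>
</boardgames>"""


def primary_name(xml_text):
    """Return the text of the first <name primary="true"> element, or None."""
    root = ET.fromstring(xml_text)
    # ElementTree's limited XPath supports attribute-value predicates.
    node = root.find(".//name[@primary='true']")
    return node.text if node is not None else None


print(primary_name(SAMPLE))
```

Unlike `primary=True` in BeautifulSoup, which matches any "name" tag carrying the attribute, the predicate here only matches when the attribute value is exactly "true".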

For the second part, loop over the "results" elements and extract the data you want from each one:

text = """
<poll title="User Suggested Number of Players" totalvotes="96"  name="suggested_numplayers">
    <results numplayers="1">
        <result numvotes="0" value="Best"/>
...
        <result numvotes="46" value="Not Recommended"/>
    </results>
</poll>
"""
soup = BeautifulSoup(text,'xml')

# Each <results> element holds the three <result> vote counts for one player count
for results in soup.find_all('results'):
    numplayers = results['numplayers']
    best = results.find('result', {'value': 'Best'})['numvotes']
    recommended = results.find('result', {'value': 'Recommended'})['numvotes']
    not_recommended = results.find('result', {'value': 'Not Recommended'})['numvotes']
    print(numplayers, best, recommended, not_recommended)

Outputs:

1 0 0 58
2 2 21 53
3 10 46 17
4 47 36 1
5 35 44 2
6 23 48 11
6+ 0 1 46

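Or, for a more compact version, collect each attribute into its own list and zip the lists together. Note that this relies on every "results" element containing all three "result" entries; if one is missing, the lists fall out of alignment: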

soup = BeautifulSoup(text,'xml')
numplayers = [tag['numplayers'] for tag in soup.find_all('results')]
best = [tag['numvotes'] for tag in soup.find_all('result', {'value': 'Best'})]
recommended = [tag['numvotes'] for tag in soup.find_all('result', {'value': 'Recommended'})]
not_recommended = [tag['numvotes'] for tag in soup.find_all('result', {'value': 'Not Recommended'})]
print(list(zip(numplayers, best, recommended, not_recommended)))

Outputs:

[('1', '0', '0', '58'), ('2', '2', '21', '53'), ('3', '10', '46', '17'), ('4', '47', '36', '1'), ('5', '35', '44', '2'), ('6', '23', '48', '11'), ('6+', '0', '1', '46')]
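
If you need the votes as numbers, keyed by player count, one option is to build a nested dictionary. The sketch below again uses the standard library's xml.etree.ElementTree rather than BeautifulSoup, and works on a shortened copy of the poll from the question. The names `POLL`, `poll_votes` and `best_count` are hypothetical, chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the poll from the question (two player counts only).
POLL = """<poll title="User Suggested Number of Players" totalvotes="96" name="suggested_numplayers">
    <results numplayers="3">
        <result numvotes="10" value="Best"/>
        <result numvotes="46" value="Recommended"/>
        <result numvotes="17" value="Not Recommended"/>
    </results>
    <results numplayers="4">
        <result numvotes="47" value="Best"/>
        <result numvotes="36" value="Recommended"/>
        <result numvotes="1" value="Not Recommended"/>
    </results>
</poll>"""


def poll_votes(xml_text):
    """Map each numplayers value to a {vote label: integer vote count} dict."""
    root = ET.fromstring(xml_text)
    votes = {}
    for results in root.iter("results"):
        votes[results.get("numplayers")] = {
            r.get("value"): int(r.get("numvotes"))
            for r in results.findall("result")
        }
    return votes


votes = poll_votes(POLL)
print(votes["4"]["Best"])

# Player count with the most "Best" votes in this sample.
best_count = max(votes, key=lambda n: votes[n]["Best"])
print(best_count)
```

Because each vote count is looked up by its "value" label inside its own "results" element, a missing entry cannot shift the other values out of position the way it can with separate zipped lists.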
