I am currently using the code below to count the number of text elements in the XML file.

from bs4 import BeautifulSoup

soup = BeautifulSoup(open('wiki.xml'), 'lxml')

count = 0

for text in soup.find_all('text', recursive=False):
    count += 1

print(count)

I can't post the full XML file because of its size, but here is a short snippet of it...

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <dbname>simplewiki</dbname>
    <base>https://simple.wikipedia.org/wiki/Main_Page</base>
    <generator>MediaWiki 1.30.0-wmf.14</generator>
    <case>first-letter</case>
    <namespaces>
      <namespace key="-2" case="first-letter">Media</namespace>
      <namespace key="-1" case="first-letter">Special</namespace>
      <namespace key="0" case="first-letter" />
      <namespace key="1" case="first-letter">Talk</namespace>
      <namespace key="2" case="first-letter">User</namespace>
      <namespace key="3" case="first-letter">User talk</namespace>
      <namespace key="4" case="first-letter">Wikipedia</namespace>
      <namespace key="5" case="first-letter">Wikipedia talk</namespace>
      <namespace key="6" case="first-letter">File</namespace>
      <namespace key="7" case="first-letter">File talk</namespace>
      <namespace key="8" case="first-letter">MediaWiki</namespace>
      <namespace key="9" case="first-letter">MediaWiki talk</namespace>
      <namespace key="10" case="first-letter">Template</namespace>
      <namespace key="11" case="first-letter">Template talk</namespace>
      <namespace key="12" case="first-letter">Help</namespace>
      <namespace key="13" case="first-letter">Help talk</namespace>
      <namespace key="14" case="first-letter">Category</namespace>
      <namespace key="15" case="first-letter">Category talk</namespace>
      <namespace key="828" case="first-letter">Module</namespace>
      <namespace key="829" case="first-letter">Module talk</namespace>
      <namespace key="2300" case="first-letter">Gadget</namespace>
      <namespace key="2301" case="first-letter">Gadget talk</namespace>
      <namespace key="2302" case="case-sensitive">Gadget definition</namespace>
      <namespace key="2303" case="case-sensitive">Gadget definition talk</namespace>
      <namespace key="2600" case="first-letter">Topic</namespace>
    </namespaces>
  </siteinfo>
  <page>
    <title>April</title>
    <ns>0</ns>
    <id>1</id>
    <revision>
      <id>5753795</id>
      <parentid>5732421</parentid>
      <timestamp>2017-08-11T21:06:32Z</timestamp>
      <contributor>
        <ip>2602:306:3433:C7F0:188F:FDE3:9FBE:D0B0</ip>
      </contributor>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve">{{monththisyear|4}}
'''April''' is the fourth [[month]] of the [[year]], and comes between     [[March]] and [[May]]. It is one of four months to have 30 [[day]]s.

April always begins on the same day of week as [[July]], and additionally, [[January]] in leap years. April always ends on the same day of the week as [[December]].

April's [[flower]]s are the [[Sweet Pea]] and [[Asteraceae|Daisy]]. Its [[birthstone]] is the [[diamond]]. The meaning of the diamond is innocence.

For the final product, I would like to search the page elements by title for a phrase I enter and return the text element of the matching page. If no title matches, it should return the three most similar results. Is this possible, and can anyone help with it? I am flexible about the library used; it doesn't have to be bs4. Thank you.

EDIT:

I've just found out that if I remove recursive=False from the above code it returns 1 rather than 0. No idea why?

EDIT:

I have also tried the code below, but it too returns 0. It also shows what I would like for the final product: everything in a dictionary.

import xml.etree.ElementTree as ET

def get_data():
    tree = ET.parse(open("wiki.xml"))
    root = tree.getroot()
    results = {}
    for title in root.findall('./page/title') and text in root.findall('./page/revision/text'):
        results[title] = text
    return results

r = get_data()
print(len(r))

EDIT:

I have just tried some code on the XML file below...

<vehicles>
  <car name="BMW">
    <model>850 CSI</model>
    <speed>1000</speed>
  </car>
  <car name="Mercedes">
    <model>SL65</model>
    <speed>900</speed>
  </car>
  <car name="Jaguar">
    <model>EV400</model>
    <speed>850</speed>
  </car>
  <car name="Ferrari">
    <model>Enzo</model>
    <speed>2</speed>
  </car>
</vehicles>

This is the code I used...

from bs4 import BeautifulSoup

def get_data():
    soup = BeautifulSoup(open('test.xml'), 'lxml')
    count = 0
    for text in soup.select("vehicles car model"):
        count += 1
    return count

r = get_data()
print(r)

This script returned 4, which is the correct number. However, when I change vehicles car model to page revision text and try it on the wiki.xml file, it does not work and still returns 1. Note: the wiki file contains more text elements than I have time to count myself, so 1 is definitely incorrect.

EDIT:

This is the code I have been trying to use for parsing the file...

def parser(file_name="wiki.xml",save_to="weboffline.csv",url='http://www.mediawiki.org/xml/export-0.10/'):
    doc = tree.parse(file_name)
    titles = []
    texts = []
    for title in doc.findall('.//mediawiki{'+url+'}//page//title'):
        titles.append(title)
    for text in doc.findall('.//mediawiki{'+url+'}//page//revision//text'):
        texts.append(text)
    with open(save_to, mode='w') as file:
        writer = csv.writer(file)
        writer.writerow(['TITLES', 'TEXT'])
        for items in zip(titles,texts):
            writer.writerow(items)

However, the CSV file this produces contains just TITLES,TEXT. Does anyone have a solution?

1
  • In the code snippet where you use ElementTree, you don't take XML namespaces into account. See stackoverflow.com/q/20435500/407651 Commented Aug 1, 2019 at 10:14
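To illustrate the namespace point in the comment above, here is a minimal sketch using a tiny inline stand-in for wiki.xml. The prefix `mw`, the helper name `titles_to_text`, and the sample pages are assumptions made for this example. Note that `ET.parse` on the real dump would load the whole file into memory, so this only shows the lookup syntax.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for wiki.xml; it declares the same default namespace as the dump.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>April</title>
    <revision><text>April is the fourth month.</text></revision>
  </page>
  <page>
    <title>August</title>
    <revision><text>August is the eighth month.</text></revision>
  </page>
</mediawiki>"""

# "mw" is an arbitrary prefix chosen for this sketch; only the URI has to match.
NS = {"mw": "http://www.mediawiki.org/xml/export-0.10/"}


def titles_to_text(root):
    """Map each page title to the text of its revision."""
    results = {}
    for page in root.findall("mw:page", NS):
        title = page.findtext("mw:title", default="", namespaces=NS)
        text = page.findtext("mw:revision/mw:text", default="", namespaces=NS)
        results[title] = text
    return results


root = ET.fromstring(SAMPLE)

# A path without the namespace matches nothing, which explains the count of 0.
unqualified = root.findall("./page/title")
print(len(unqualified))

pages = titles_to_text(root)
print(len(pages))
```

Pairing title and text per page, rather than collecting two separate lists, keeps each title attached to its own text even if a page lacks one of the two elements.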

2 Answers

1
import xml.etree.ElementTree as etree
import codecs
import csv
import time
import os

def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)

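def strip_tag_name(t, elem=None):
    # Drop the "{namespace}" prefix that ElementTree puts in front of tag names.
    # elem is unused; it is kept so existing two-argument calls still work.
    idx = t.rfind("}")
    if idx != -1:
        t = t[idx + 1:]
    return t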

def parseWikiFile():
    PATH_WIKI_XML = ''
    FILENAME_WIKI = 'resources/wiki.xml'
    FILENAME_PAGES = 'resources/weboffline.csv'
    ENCODING = "utf-8"

    pathWikiXML = os.path.join(PATH_WIKI_XML, FILENAME_WIKI)
    pathPages = os.path.join(PATH_WIKI_XML, FILENAME_PAGES)

    pageCount = 0
    totalCount = 0
    title = None
    start_time = time.time()

    with codecs.open(pathPages, "w", ENCODING) as pagesFH:
        pageWriter = csv.writer(pagesFH, quoting=csv.QUOTE_MINIMAL)
        pageWriter.writerow(['title', 'text'])

        for event, elem in etree.iterparse(pathWikiXML, events=('start', 'end')):
            tname = strip_tag_name(elem.tag,elem)

            if event == 'start':
                if tname == 'page':
                    title = ''
                    text = ''
                else:
                    continue
            else:
                if tname == 'title':
                    title = elem.text
                elif tname == 'text':
                    text = elem.text
                elif tname == 'page':
                    totalCount += 1
                    pageCount += 1
                    pageWriter.writerow([title, text])

                    if totalCount > 1 and (totalCount % 100000) == 0:
                        print("{:,}".format(totalCount))

                elem.clear()

    elapsed_time = time.time() - start_time

    print("Total pages: {:,}".format(totalCount))
    print("Text pages: {:,}".format(pageCount))
    print("Elapsed time: {}".format(hms_string(elapsed_time)))

This is code I modified from this website, and it works really well. I think my original mistake was looking for the elements by location rather than by tag name. Anyway, this works.
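For the lookup the question asks for (an exact title match, otherwise the three most similar titles), a rough sketch with the standard library's `difflib` might look like the following. It assumes the titles and text have already been loaded into a dictionary; the names `pages` and `lookup` and the sample data are hypothetical, and the similarity measure is difflib's, not anything specific to the dump.

```python
import difflib


def lookup(pages, phrase, n=3):
    """Return [(title, text)]: the exact match, else up to n similar titles."""
    if phrase in pages:
        return [(phrase, pages[phrase])]
    # cutoff=0.0 keeps weak matches too, so n results come back when available.
    close = difflib.get_close_matches(phrase, list(pages), n=n, cutoff=0.0)
    return [(title, pages[title]) for title in close]


# Hypothetical data standing in for the title/text pairs read from the dump.
pages = {
    "April": "April is the fourth month.",
    "August": "August is the eighth month.",
    "March": "March is the third month.",
    "May": "May is the fifth month.",
}

exact = lookup(pages, "April")
fuzzy = lookup(pages, "Aprl")
print(exact)
print([title for title, _ in fuzzy])
```

Matching is case-sensitive here; lowercasing both the phrase and the titles before comparing would be one way to relax that.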

0

recursive=False only finds direct children of the top-level element. In the example you show, the only children of <mediawiki> are <siteinfo> and <page>; there is no <text> among them, so 0 is correct. With recursion enabled, the search descends into <page> and then <revision>, where it finds a single <text> element, so 1 is correct!

If you want to find the children-of-children (etc) like this, you must use recursive=True (which is implied by omitting the recursive option).
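The same direct-children versus descendants distinction can be seen with the standard library's ElementTree, used here as an analogue rather than bs4 itself. The snippet is a cut-down copy of the structure from the question with the namespace left out for brevity.

```python
import xml.etree.ElementTree as ET

# Cut-down version of the question's structure (namespace omitted for brevity).
SNIPPET = """<mediawiki>
  <siteinfo><sitename>Wikipedia</sitename></siteinfo>
  <page>
    <title>April</title>
    <revision><text>April is the fourth month.</text></revision>
  </page>
</mediawiki>"""

root = ET.fromstring(SNIPPET)

# findall with a bare tag name looks only at direct children of <mediawiki>.
direct = root.findall("text")

# iter walks the whole subtree, so it reaches <page>/<revision>/<text>.
nested = list(root.iter("text"))

print(len(direct), len(nested))
```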

3 Comments

Thank you for the response. However, the full document contains many more <text> elements than just 1, and it still returns 1. Also, since <page> is a child of <mediawiki>, could I get the title and text elements from all page sections and add them to separate lists?
Can you place wiki.xml online somewhere so we can test it please? Or tell us how we can get it in this format?
Download the torrent called simplewiki-20170820-pages-meta-current.xml.bz2 from meta.wikimedia.org/wiki/Data_dump_torrents#English_Wikipedia and then get the zip using a program like utorrent.com. The reason I didn't post it is because the xml file is over 1GB.
