I am parsing large projects with many thousands of XML files for specific elements and attributes. I have managed to print all the elements and attributes I want, but I cannot write them into a CSV table. Ideally, every occurrence of every element/attribute would end up under its respective header. The problem is that I get "NameError: name 'X' is not defined" and I do not know how to restructure the code. Everything seemed to work fine with my variables until I tried to write them to a CSV.

from logging import root
import xml.etree.ElementTree as ET
import csv
import os
path = r'C:\Users\briefe\V'

f = open('jp-elements.csv', 'w', encoding="utf-8")
writer = csv.writer(f)
writer.writerow(["Note", "Supplied", "@Certainty", "@Source"])


    #opening files in folder for project
for filename in os.listdir(path):
        if filename.endswith(".xml"):
            fullpath = os.path.join(path, filename)
        #getting the root of each file as my starting point
        for file in fullpath:
            tree = ET.parse(fullpath)
            root = tree.getroot()
            try:
                for note in root.findall('.//note'):
                    notes = note.attrib, note.text
                for supplied in root.findall(".//supplied"):
                    print(supplied.attrib)
                    for suppliedChild in supplied.findall(".//*"):
                        supplies = suppliedChild.tag, suppliedChild.attrib
                #attribute search
                for responsibility in root.findall(".//*[@resp]"):
                    responsibilities = responsibility.tag, responsibility.attrib, responsibility.text
                for certainty in root.findall(".//*[@cert]"):
                    certainties = certainty.tag, certainty.attrib, certainty.text
                writer.writerow([notes, supplies, responsibilities, certainties])
            finally:
                f.close()

As was kindly advised, here are the results I am trying to save. When printed, they looked like this:

{http://www.tei-c.org/ns/1.0}add {'resp': '#MB', 'status': 'unremarkable'} Nach H gedruckt IV. Abt., V, Anhang Nr.
                     10.
{http://www.tei-c.org/ns/1.0}date {'cert': 'medium', 'when': '1805-04-09'} 9. April 1805

I am trying to save these mixtures of tuples and dictionary items as strings in CSV fields, but I get, for example, "NameError: name 'notes' is not defined".

XML code example:

<?xml version="1.0" encoding="UTF-8"?><TEI xmlns="http://www.tei-c.org/ns/1.0" type="letter" xml:id="V_100">
   <teiHeader>
</teiHeader>
   <text>
      <body>
         <div type="letter">
            <note type="ig">Kopie</note>
            <p>Erlauben Sie mir, in Ihre Ehrenpforte noch einige Zwick<lb xml:id="V_39-7" rendition="#hyphen"/>steinchen einzuschieben. Philemon und Baucis müssen —
                  wenn<note corresp="#V_39-8">
                  <listPerson type="lineReference">
                     <person corresp="#JP-000228">
                        <persName>
                           <name cert="high" type="reg">Baucis</name>
                        </persName>
                     </person>
                     <person corresp="#JP-003214" ana="†">
                        <persName>
                           <name cert="low" type="reg">Philemon</name>
                        </persName>
                     </person>
                  </listPerson>
            <p>
               <hi rendition="#aq">Der Brief ist vielleicht nicht an den Minister Hardenberg
                  gerichtet,<lb/>
            </p>
            <lb/>
         </div>
      </body>
   </text>
</TEI>
  • Please post sample XML for us to help. Remember XML is a different data format than the two dimensions of CSV. You appear to be saving tuples of strings and dictionaries (result of attrib) to every row. Ideally, scalar strings/numbers should be saved to every CSV row. Commented Mar 19, 2022 at 23:53
  • @Parfait I added the results I got from print(). The actual XML might look like this: <choice> <sic>cheesemakers</sic> <corr resp="#editor" cert="high">peacemakers</corr> </choice>: for they shall be called the children of God. It is all TEI conform but I am looking for many different elements and attributes - I just want the key info - tag, attributes and text all to be added as a string to a csv field Commented Mar 20, 2022 at 9:58
  • Why do you want to save nested data within CSV cells, which is what tuples/dicts will cause? There are ways to parse all individual items into separate cells. Commented Mar 20, 2022 at 13:11
  • But I still cannot fully help without the sample XML that a minimal reproducible example requires. Possibly, notes is never assigned in one iteration because its for loop retrieves nothing. The results indicate the XML may have namespaces, which can vary by element. Because of namespaces, always post at least the root of the XML. Your snippet could be anywhere in the document. Commented Mar 20, 2022 at 13:13
  • The most important thing is that I want to see whether elements and attributes occur in the parsed XML files or not. That is why I am not too fussed about separating the data yet: I only care whether an element is there and how many there are. I want to use the same code on different projects that will have slightly different encoding guidelines. That is why I didn't add an XML example at first, but now I did. All of the XML files will have the same TEI namespace because they all have the same root. Commented Mar 20, 2022 at 13:37

1 Answer

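As posted, the XML has a default namespace at the root, which must be accounted for in every named reference to an element such as <note>. Without it, findall('.//note') matches nothing, the loop body never runs, and notes is never assigned, hence the NameError. Therefore, consider this adjustment, where notes will be properly assigned.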

nsmp = "http://www.tei-c.org/ns/1.0"

for note in root.findall(f'.//{{{nsmp}}}note'):
    notes = note.attrib, note.text

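The triple curly braces are needed because braces are also the placeholder syntax of f-strings: the doubled braces {{ and }} emit literal { and }, and the inner {nsmp} interpolates the namespace URI, yielding .//{http://www.tei-c.org/ns/1.0}note. Do note that your code will then raise the same NameError for supplies, because that loop likewise finds nothing and supplies is never assigned.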
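
One way to sidestep these NameErrors is to collect matches into lists that start out empty, so a file with no matching elements still produces a (possibly empty) result. The following is only a minimal sketch under some assumptions: the helper name collect_matches, the prefix tei, and the inline sample are made up for illustration, and it passes a prefix-to-namespace mapping as the second argument of findall instead of building the path with an f-string.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI mapping; the prefix "tei" is an arbitrary local choice.
NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def collect_matches(root):
    """Return a dict of lists; a list stays empty when nothing matches."""
    return {
        "notes": [(n.attrib, n.text) for n in root.findall(".//tei:note", NS)],
        "supplied": [
            (child.tag, child.attrib)
            for s in root.findall(".//tei:supplied", NS)
            for child in s.findall(".//*")
        ],
        # Attribute predicates need no namespace: resp/cert are unqualified.
        "resp": [(e.tag, e.attrib, e.text) for e in root.findall(".//*[@resp]")],
        "cert": [(e.tag, e.attrib, e.text) for e in root.findall(".//*[@cert]")],
    }

# Hypothetical two-element sample, just to exercise the function.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <note type="ig">Kopie</note>
  <name cert="high">Baucis</name>
</TEI>"""
result = collect_matches(ET.fromstring(sample))
print(len(result["notes"]), len(result["supplied"]), len(result["cert"]))
```

Because every key always exists, a row can be written for each file even when, as here, no <supplied> element is present.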


However, given your comments, consider a dynamic solution that does not hard-code any element name but parses all elements and attributes and flattens the output to CSV format. The code below uses nested list/dict comprehensions to parse the XML data and writes it out with csv.DictWriter, which maps dictionaries to the field names of a CSV. It also uses a context manager (the with statement) to open the file, so no explicit close() call is needed.

import csv

# ASSUMES root IS THE PARSED ROOT OF ONE XML FILE,
# E.G. root = ET.parse(fullpath).getroot()
with open('Output.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(
        f, fieldnames=['element_or_attribute', 'text_or_value']
    )

    # MERGES DICTIONARIES OF ELEMENTS AND ATTRIBUTES
    # DICT KEYS REMOVE NAMESPACES AND CHECKS FOR NoneTypes
    # ATTRIBUTES ARE PREFIXED WITH THEIR OWNING ELEMENT NAME
    # split('}')[-1] ALSO WORKS FOR NAMES WITHOUT A NAMESPACE
    xml_dicts = [{
        **{el.tag.split('}')[-1]: (
            el.text.strip() if el.text is not None else el.text
          )},
        **{el.tag.split('}')[-1] + '_' + k.split('}')[-1]: v
           for k, v in el.attrib.items()}
    } for el in root.findall('.//*')]

    # COMBINES ABOVE DICTS INTO FLATTER FORMAT
    csv_dicts = [
        {'element_or_attribute': k, 'text_or_value': v}
        for d in xml_dicts
        for k, v in d.items()
    ]

    writer.writeheader()
    writer.writerows(csv_dicts)

The above processes one XML file into one CSV and should be integrated into your loop over files.
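
A rough sketch of that integration follows, assuming a single combined CSV with an extra file column is acceptable instead of one CSV per XML file. The helper names local_name, flatten and folder_to_csv, as well as the file column, are illustrative assumptions and not part of the code above.

```python
import csv
import os
import xml.etree.ElementTree as ET

def local_name(qname):
    """Strip a leading '{namespace}' from a tag or attribute name, if any."""
    return qname.rsplit("}", 1)[-1]

def flatten(root):
    """Yield (name, value) pairs for each descendant element and its attributes."""
    for el in root.findall(".//*"):
        tag = local_name(el.tag)
        # Missing text becomes an empty string.
        yield tag, (el.text or "").strip()
        for key, value in el.attrib.items():
            yield f"{tag}_{local_name(key)}", value

def folder_to_csv(folder, out_path):
    """Write one row per element/attribute for every .xml file in folder."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["file", "element_or_attribute", "text_or_value"])
        for filename in sorted(os.listdir(folder)):
            if not filename.endswith(".xml"):
                continue
            root = ET.parse(os.path.join(folder, filename)).getroot()
            for name, value in flatten(root):
                writer.writerow([filename, name, value])
```

It could be called as folder_to_csv(path, 'jp-elements.csv') with the path variable from the question. Since the file is opened once outside the loop and closed by the with block, nothing is closed prematurely after the first XML file.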

CSV Output

element_or_attribute text_or_value
teiHeader
text
body
div
div_type letter
note Kopie
note_type ig
p "Erlauben Sie mir, in Ihre Ehrenpforte noch einige Zwick"
lb
lb_id V_39-7
lb_rendition #hyphen
note
note_corresp #V_39-8
listPerson
listPerson_type lineReference
person
person_corresp #JP-000228
persName
name Baucis
name_cert high
name_type reg
person
person_corresp #JP-003214
person_ana
persName
name Philemon
name_cert low
name_type reg
p
hi "Der Brief ist vielleicht nicht an den Minister Hardenberg\n gerichtet,"
hi_rendition #aq
lb
lb

XML Input (corrected for reproducibility)

<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" type="letter" xml:id="V_100">
   <teiHeader></teiHeader>
   <text>
      <body>
         <div type="letter">
            <note type="ig">Kopie</note>
            <p>Erlauben Sie mir, in Ihre Ehrenpforte noch einige Zwick<lb xml:id="V_39-7" rendition="#hyphen"/>steinchen einzuschieben. Philemon und Baucis müssen —
                  wenn<note corresp="#V_39-8"/>
                  <listPerson type="lineReference">
                     <person corresp="#JP-000228">
                        <persName>
                           <name cert="high" type="reg">Baucis</name>
                        </persName>
                     </person>
                     <person corresp="#JP-003214" ana="†">
                        <persName>
                           <name cert="low" type="reg">Philemon</name>
                        </persName>
                     </person>
                  </listPerson>
            </p>
            <p>
               <hi rendition="#aq">Der Brief ist vielleicht nicht an den Minister Hardenberg
                  gerichtet,<lb/></hi>
            </p>
            <lb/>
         </div>
      </body>
   </text>
</TEI>