0

I have thousands of XML files

that has many tags and many properties.

I want to put the data in SQL Server table.

The structure of the table is going to be something like this

DocumentPath varchar(1000)
Tag          varchar(1000)
ID           varchar(1000)
Attribute    varchar(1000)
Value        varchar(1000)

the XML file looks like this

<?xml version="1.0" encoding="UTF-8"?>
<xmi:XMI xmlns:xmi="http://www.omg.org/XMI" xmlns:util="http:///org/" xmi:version="2.0">
    <cas:NULL xmi:id="0"/>
    <tcas:DocumentAnnotation xmi:id="8" sofa="1" begin="0" end="5769" language="x-unspecified"/>
    <structured:DocumentID xmi:id="13" documentID="5"/>
    <structured:DocumentIdPrefix xmi:id="15" documentIdPrefix=""/>
    <structured:DocumentPath xmi:id="35" documentPath="J:\T9.xmi"/>
    <textspan:Segment xmi:id="37" sofa="1" begin="0" end="5769" id="SIMPLE_SEGMENT" preferredText="SIMPLE_SEGMENT"/>
    <textspan:Sentence xmi:id="44" sofa="1" begin="0" end="15" sentenceNumber="0"/>
    <textspan:Sentence xmi:id="50" sofa="1" begin="17" end="32" sentenceNumber="1"/>
    <textspan:Sentence xmi:id="56" sofa="1" begin="37" end="52" sentenceNumber="2"/>
    <syntax:SymbolToken xmi:id="2242" sofa="1" begin="18" end="19" tokenNumber="4"/>
    <syntax:SymbolToken xmi:id="2250" sofa="1" begin="19" end="20" tokenNumber="5"/>
    <syntax:SymbolToken xmi:id="2301" sofa="1" begin="29" end="30" tokenNumber="11"/>
    <syntax:SymbolToken xmi:id="2309" sofa="1" begin="30" end="31" tokenNumber="12"/>
    <syntax:NumToken xmi:id="2258" sofa="1" begin="20" end="24" tokenNumber="6" numType="1"/>
    <syntax:NumToken xmi:id="2275" sofa="1" begin="25" end="26" tokenNumber="8" numType="1"/>
    <syntax:NumToken xmi:id="2292" sofa="1" begin="27" end="29" tokenNumber="10" numType="1"/>
    <syntax:NumToken xmi:id="2381" sofa="1" begin="57" end="61" tokenNumber="20" numType="1"/>
    <textsem:MedicationMention xmi:id="18523" sofa="1" begin="534" end="540" id="0" ontologyConceptArr="18510" typeID="1" discoveryTechnique="1" confidence="0.0" polarity="0" uncertainty="0" conditional="false" generic="false" historyOf="0"/>
    <textsem:MedicationMention xmi:id="16831" sofa="1" begin="543" end="547" id="0" ontologyConceptArr="16818" typeID="1" discoveryTechnique="1" confidence="0.0" polarity="0" uncertainty="0" conditional="false" generic="false" historyOf="0"/>
    <textsem:MedicationMention xmi:id="16788" sofa="1" begin="558" end="569" id="0" ontologyConceptArr="16753 16773 16763" typeID="1" discoveryTechnique="1" confidence="0.0" polarity="0" uncertainty="0" conditional="false" generic="false" historyOf="0"/>
    <textsem:ProcedureMention xmi:id="22518" sofa="1" begin="328" end="343" id="0" ontologyConceptArr="22505" typeID="5" discoveryTechnique="1" confidence="0.0" polarity="0" uncertainty="0" conditional="false" generic="false" historyOf="0"/>
    <textsem:ProcedureMention xmi:id="22682" sofa="1" begin="478" end="490" id="0" ontologyConceptArr="22669" typeID="5" discoveryTechnique="1" confidence="0.0" polarity="0" uncertainty="0" conditional="false" generic="false" historyOf="0"/>
    <textsem:ProcedureMention xmi:id="22228" sofa="1" begin="495" end="510" id="0" ontologyConceptArr="22215" typeID="5" discoveryTechnique="1" confidence="0.0" polarity="0" uncertainty="0" conditional="false" generic="false" historyOf="0"/>
    <textsem:ProcedureMention xmi:id="21938" sofa="1" begin="794" end="809" id="0" ontologyConceptArr="21925" typeID="5" discoveryTechnique="1" confidence="0.0" polarity="0" uncertainty="0" conditional="false" generic="false" historyOf="0"/>
    <textsem:ProcedureMention xmi:id="22868" sofa="1" begin="1057" end="1072" id="0" ontologyConceptArr="22855" typeID="5" discoveryTechnique="1" confidence="0.0" polarity="0" uncertainty="0" conditional="false" generic="false" historyOf="0"/>
    <textsem:AnatomicalSiteMention xmi:id="24333" sofa="1" begin="786" end="793" id="0" ontologyConceptArr="24320" typeID="6" discoveryTechnique="1" confidence="0.0" polarity="0" uncertainty="0" conditional="false" generic="false" historyOf="0"/>
    <textsem:AnatomicalSiteMention xmi:id="25103" sofa="1" begin="842" end="857" id="0" ontologyConceptArr="25090" typeID="6" discoveryTechnique="1" confidence="0.0" polarity="0" uncertainty="0" conditional="false" generic="false" historyOf="0"/>
    <textsem:AnatomicalSiteMention xmi:id="23654" sofa="1" begin="842" end="850" id="0" ontologyConceptArr="23641" typeID="6" discoveryTechnique="1" confidence="0.0" polarity="0" uncertainty="0" conditional="false" generic="false" historyOf="0"/>
    <textsem:EventMention xmi:id="25137" sofa="1" begin="117" end="124" id="0" ontologyConceptArr="25124" typeID="1003" discoveryTechnique="1" confidence="0.0" polarity="0" uncertainty="0" conditional="false" generic="false" historyOf="0"/>
    <textsem:EventMention xmi:id="25201" sofa="1" begin="3448" end="3458" id="0" ontologyConceptArr="25188" typeID="1003" discoveryTechnique="1" confidence="0.0" polarity="0" uncertainty="0" conditional="false" generic="false" historyOf="0"/>
    <textsem:EventMention xmi:id="25169" sofa="1" begin="3796" end="3803" id="0" ontologyConceptArr="25156" typeID="1003" discoveryTechnique="1" confidence="0.0" polarity="0" uncertainty="0" conditional="false" generic="false" historyOf="0"/>
    <cas:Sofa xmi:id="1" sofaNum="1" sofaID="_InitialView" mimeType="text"/>
    <structured:SourceData xmi:id="23" noteTypeCode="ClinicalNote" sourceRevisionNbr="0" sourceRevisionDate="2020-03-02 10:48:30"/>
    <refsem:UmlsConcept xmi:id="18510" codingScheme="SNOMEDCT_US" code="64197008" score="0.0" disambiguated="false" cui="C0037366" tui="T131" preferredText="Smoke"/>
    <refsem:UmlsConcept xmi:id="16818" codingScheme="RXNORM" code="746839" score="0.0" disambiguated="false" cui="C1999262" tui="T122" preferredText="Pack"/>
    <refsem:UmlsConcept xmi:id="16753" codingScheme="SNOMEDCT_US" code="410942007" score="0.0" disambiguated="false" cui="C0013227" tui="T121" preferredText="Pharmaceutical Preparations"/>
    <cas:View sofa="1" members="8 13 15 17 35"/>
</xmi:XMI>

The value of structured:DocumentPath tag and documentPath attibute will be DocumentPath in the table

Tag is in each line in the XML tag e.g. textspan:Sentence

ID is the xmi:id value

Attribute is each Attribute name and Value is attributes value

Therefore this line

<textspan:Sentence xmi:id="44" sofa="1" begin="0" end="15" sentenceNumber="0"/>

will be in the table like this

DocumentPath  Tag                 ID    Attribute         Value        
J:\T9.xmi     textspan:Sentence   44    sofa              1
J:\T9.xmi     textspan:Sentence   44    begin             0
J:\T9.xmi     textspan:Sentence   44    end               15
J:\T9.xmi     textspan:Sentence   44    sentenceNumber    0

I found this website that shows how to read a simple XML file

enter image description here

but my case is not the same as number of attributes in each line is dynamic

4
  • What have you tried so far? Commented Mar 9, 2020 at 5:13
  • The approach is to transform the data to a list and then export it to Sql Server. You need to read every tag and every attribute (so even if it dynamic it will cache it). Some packages that can help here: lxml.etree - for read and extract data from the XML. pyodbc - to move the data to Sql Server. Commented Mar 9, 2020 at 5:36
  • I have done a similar case, I use one python package untangle to parse the XML to python object and then it will easy for you do manipulate the data, then use sqlalchemy to insert data to DB, you can definitely try untangle. FYI: pip install untangle Hope it helps! Commented Mar 9, 2020 at 6:05
  • @panda912 can you share the code? Commented Mar 9, 2020 at 6:07

1 Answer 1

1

Try pip intall untangle to install untangle. One example uses your input:

from untangle import parse
your_xml = """.. omit.."""
your_obj = parse(your_xml)
your_obj.xmi_XMI.textspan_Segment._attributes

{'xmi:id': '37',
 'sofa': '1',
 'begin': '0',
 'end': '5769',
 'id': 'SIMPLE_SEGMENT',
 'preferredText': 'SIMPLE_SEGMENT'}

As you can see all the data can be accessed like objects.

Sign up to request clarification or add additional context in comments.

3 Comments

Thanks, but getting this error when I run your_obj=parse(your_xml) .. the error is "Traceback (most recent call last): File "<ipython-input-10-4bb1d5f72a9c>", line 1, in <module> your_obj = parse(your_xml) File "C:\Users\ASMGX\Anaconda3\lib\site-packages\untangle.py", line 176, in parse if is_string(filename) and (os.path.exists(filename) or is_url(filename)): File "C:\Users\ASMGX\Anaconda3\lib\genericpath.py", line 19, in exists os.stat(path) ValueError: stat: path too long for Windows"
Seem your path is too long? Try put it to another folder, you can just paste your XML like str wrapper with your_xml_str = """xml_content""", or you can also open it like file to read from.
Did not try that on windows, so you got your xml object now?

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.