XML to postgres using python

Question

I have one XML file that has data from the American Registry of Internet numbers which I need to use to do a lookup on the CIDR blocks. The amount of data is around 60GB.

The structure is not so complex (could be mapped to perhaps 4 tables in a relational model)

<asn>
    <ref>https://whois.arin.net/rest/v1/asn/AS0</ref>
    <pocLinks/>
    <endAsNumber>0</endAsNumber>
    <handle>AS0</handle>
    <name>IANA-RSVD-0</name>
    <orgHandle>IANA</orgHandle>
    <comment>
        <line number="0">Reserved - May be used to identify non-routed networks</line>
    </comment>
    <registrationDate>2002-09-13T15:46:16-04:00</registrationDate>
    <startAsNumber>0</startAsNumber>
    <updateDate>2002-09-13T15:46:16-04:00</updateDate>
</asn>



<org>
    <ref>https://whois.arin.net/rest/v1/org/C07182099</ref>
    <city>BEVERLY HILLS</city>
    <iso3166-1>
        <code2>US</code2>
        <code3>USA</code3>
        <name>United States</name>
        <e164>1</e164>
    </iso3166-1>
    <handle>C07182099</handle>
    <customer>Y</customer>
    <name>NICHOLAS R. NIKOLOV</name>
    <pocLinks/>
    <parentOrgHandle>CC-3517</parentOrgHandle>
    <postalCode>90210</postalCode>
    <registrationDate>2018-10-26T03:59:46-04:00</registrationDate>
    <iso3166-2>CA</iso3166-2>
    <streetAddress>
        <line number="0">436 N BEDFORD DR</line>
    </streetAddress>
    <updateDate>2018-10-26T03:59:46.821-04:00</updateDate>
</org>



<net>
    <ref>https://whois.arin.net/rest/v1/net/NET-64-179-77-160-1</ref>
    <endAddress>64.179.77.167</endAddress>
    <handle>NET-64-179-77-160-1</handle>
    <name>CHOICE1SWIPP771609</name>
    <netBlocks>
        <netBlock>
            <cidrLenth>29</cidrLenth>
            <endAddress>064.179.077.167</endAddress>
            <type>S</type>
            <startAddress>064.179.077.160</startAddress>
        </netBlock>
    </netBlocks>
    <pocLinks/>
    <orgHandle>C00477198</orgHandle>
    <parentNetHandle>NET-64-179-0-0-1</parentNetHandle>
    <registrationDate>2003-02-24T12:20:58-05:00</registrationDate>
    <startAddress>64.179.77.160</startAddress>
    <updateDate>2003-02-24T12:20:58-05:00</updateDate>
    <version>4</version>
</net>  



<poc>
    <ref>https://whois.arin.net/rest/v1/poc/FLOYD11-ARIN</ref>
    <city>Alexandria</city>
    <companyName>Aireco8Alexandria</companyName>
    <iso3166-1>
        <code2>US</code2>
        <code3>USA</code3>
        <name>United States</name>
        <e164>1</e164>
    </iso3166-1>
    <handle>FLOYD11-ARIN</handle>
    <isRoleAccount>Y</isRoleAccount>
    <emails>
        <email>[email protected]</email>
    </emails>
    <lastName>Floyd</lastName>
    <phones>
        <phone>
            <number>
                <phoneNumber>+1-703-941-0660</phoneNumber>
                <phoneType>O</phoneType>
                <pocHandle>FLOYD11-ARIN</pocHandle>
            </number>
            <type>
                <description>Office</description>
                <code>O</code>
            </type>
        </phone>
    </phones>
    <postalCode>22312</postalCode>
    <registrationDate>2006-04-04T19:27:40-04:00</registrationDate>
    <iso3166-2>VA</iso3166-2>
    <streetAddress>
        <line number="0">5712-F Gen Washington Dr</line>
    </streetAddress>
    <updateDate>2006-04-04T19:27:40-04:00</updateDate>
</poc>

There are 4 different types of root elements, namely - POC , asn , org and net, and multiple data points for each type

The data needs to be pushed to 4 different tables in an RDS Postgres instance and here is how I intend to do it:

divide XML file based on the root elements
parse each type differently
load to df and do df.to_sql

Is there a faster way to achieve this?

4 different types of root elements ... then this markup is not XML. By definition, XML has one root among other rules. All compliant XML libraries by default should fail to parse such markup. — Parfait
– Parfait, Commented Mar 30, 2022 at 16:01
would segregation of the doc into 4 different docs and then parsing work? — Devarshi Goswami
– Devarshi Goswami, Commented Mar 30, 2022 at 16:28
Sure that will work but check your process in how you acquired or generated such markup. Avoid building XML with string concatenation. — Parfait
– Parfait, Commented Mar 30, 2022 at 21:42

eatmeimadanish · Accepted Answer · 2022-03-30 15:53:25Z

1

A potential solution.

Postgres has a jsonb store that handles unstructured data. You could use the xmltodict module via xmltodict.parse() and convert the XML to a python dict.

Than you can convert this to JSON and dump into the database.
You can query levels/depth in 2 ways:

SELECT data FROM table WHERE data->'asn'->>'name' = 'IANA-RSVD-0'
SELECT data FROM table WHERE data @> '{"asn":{"name":"IANA-RSVD-0"}}'

The latter query will take advantage of the JSONB_PATH_OPS index.

https://www.postgresql.org/docs/11/functions-json.html

answered Mar 30, 2022 at 15:53

eatmeimadanish

3,9231 gold badge16 silver badges21 bronze badges

Sign up to request clarification or add additional context in comments.

Collectives™ on Stack Overflow

XML to postgres using python

1 Answer 1

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related