0

I have one XML file that has data from the American Registry of Internet numbers which I need to use to do a lookup on the CIDR blocks. The amount of data is around 60GB.

The structure is not so complex (could be mapped to perhaps 4 tables in a relational model)

<asn>
    <ref>https://whois.arin.net/rest/v1/asn/AS0</ref>
    <pocLinks/>
    <endAsNumber>0</endAsNumber>
    <handle>AS0</handle>
    <name>IANA-RSVD-0</name>
    <orgHandle>IANA</orgHandle>
    <comment>
        <line number="0">Reserved - May be used to identify non-routed networks</line>
    </comment>
    <registrationDate>2002-09-13T15:46:16-04:00</registrationDate>
    <startAsNumber>0</startAsNumber>
    <updateDate>2002-09-13T15:46:16-04:00</updateDate>
</asn>



<org>
    <ref>https://whois.arin.net/rest/v1/org/C07182099</ref>
    <city>BEVERLY HILLS</city>
    <iso3166-1>
        <code2>US</code2>
        <code3>USA</code3>
        <name>United States</name>
        <e164>1</e164>
    </iso3166-1>
    <handle>C07182099</handle>
    <customer>Y</customer>
    <name>NICHOLAS R. NIKOLOV</name>
    <pocLinks/>
    <parentOrgHandle>CC-3517</parentOrgHandle>
    <postalCode>90210</postalCode>
    <registrationDate>2018-10-26T03:59:46-04:00</registrationDate>
    <iso3166-2>CA</iso3166-2>
    <streetAddress>
        <line number="0">436 N BEDFORD DR</line>
    </streetAddress>
    <updateDate>2018-10-26T03:59:46.821-04:00</updateDate>
</org>



<net>
    <ref>https://whois.arin.net/rest/v1/net/NET-64-179-77-160-1</ref>
    <endAddress>64.179.77.167</endAddress>
    <handle>NET-64-179-77-160-1</handle>
    <name>CHOICE1SWIPP771609</name>
    <netBlocks>
        <netBlock>
            <cidrLenth>29</cidrLenth>
            <endAddress>064.179.077.167</endAddress>
            <type>S</type>
            <startAddress>064.179.077.160</startAddress>
        </netBlock>
    </netBlocks>
    <pocLinks/>
    <orgHandle>C00477198</orgHandle>
    <parentNetHandle>NET-64-179-0-0-1</parentNetHandle>
    <registrationDate>2003-02-24T12:20:58-05:00</registrationDate>
    <startAddress>64.179.77.160</startAddress>
    <updateDate>2003-02-24T12:20:58-05:00</updateDate>
    <version>4</version>
</net>  



<poc>
    <ref>https://whois.arin.net/rest/v1/poc/FLOYD11-ARIN</ref>
    <city>Alexandria</city>
    <companyName>Aireco8Alexandria</companyName>
    <iso3166-1>
        <code2>US</code2>
        <code3>USA</code3>
        <name>United States</name>
        <e164>1</e164>
    </iso3166-1>
    <handle>FLOYD11-ARIN</handle>
    <isRoleAccount>Y</isRoleAccount>
    <emails>
        <email>[email protected]</email>
    </emails>
    <lastName>Floyd</lastName>
    <phones>
        <phone>
            <number>
                <phoneNumber>+1-703-941-0660</phoneNumber>
                <phoneType>O</phoneType>
                <pocHandle>FLOYD11-ARIN</pocHandle>
            </number>
            <type>
                <description>Office</description>
                <code>O</code>
            </type>
        </phone>
    </phones>
    <postalCode>22312</postalCode>
    <registrationDate>2006-04-04T19:27:40-04:00</registrationDate>
    <iso3166-2>VA</iso3166-2>
    <streetAddress>
        <line number="0">5712-F Gen Washington Dr</line>
    </streetAddress>
    <updateDate>2006-04-04T19:27:40-04:00</updateDate>
</poc>

There are 4 different types of root elements, namely - POC , asn , org and net, and multiple data points for each type

The data needs to be pushed to 4 different tables in an RDS Postgres instance and here is how I intend to do it:

  1. divide XML file based on the root elements
  2. parse each type differently
  3. load to df and do df.to_sql

Is there a faster way to achieve this?

3
  • 4 different types of root elements ... then this markup is not XML. By definition, XML has one root among other rules. All compliant XML libraries by default should fail to parse such markup. Commented Mar 30, 2022 at 16:01
  • would segregation of the doc into 4 different docs and then parsing work? Commented Mar 30, 2022 at 16:28
  • Sure that will work but check your process in how you acquired or generated such markup. Avoid building XML with string concatenation. Commented Mar 30, 2022 at 21:42

1 Answer 1

1

A potential solution.

Postgres has a jsonb store that handles unstructured data. You could use the xmltodict module via xmltodict.parse() and convert the XML to a python dict.

Than you can convert this to JSON and dump into the database.
You can query levels/depth in 2 ways:

SELECT data FROM table WHERE data->'asn'->>'name' = 'IANA-RSVD-0'
SELECT data FROM table WHERE data @> '{"asn":{"name":"IANA-RSVD-0"}}'

The latter query will take advantage of the JSONB_PATH_OPS index.

https://www.postgresql.org/docs/11/functions-json.html

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.