
I downloaded the archived Stack Overflow posts data from https://archive.org/details/stackexchange.

However, after extracting the 7z archive, I have a 103 GB XML file of posts.

I tried to load it with Python, but the process failed on the server.

How can I load such a large file into Python to analyze the data?

I tried the code from the following links: how to convert xml file to csv using python script

XML to CSV Python

But Python got stuck at the step of opening the XML file.


1 Answer


Yes, it is challenging; I ran into the same issue.

Instead of loading the whole file into RAM, you can stream it. For XML streaming you can use Python's ElementTree, specifically `xml.etree.ElementTree.iterparse`.

import xml.etree.ElementTree as ET

def process_element(elem):
    # your code goes here
    print(elem.tag, elem.attrib)

file_path = 'demo.xml'

# Create an incremental parser
context = ET.iterparse(file_path, events=('start', 'end'))

# Grab the root element from the first 'start' event.
# elem.clear() on its own is not enough to bound memory, because the
# root element keeps a reference to every child that has been parsed;
# we need to clear the root as we go.
event, root = next(context)

# Iterate over the file
for event, elem in context:
    if event == 'end' and elem is not root:
        process_element(elem)
        root.clear()  # free memory used by already-processed elements
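Applied to the Stack Exchange dump specifically: in Posts.xml each post is a single `<row>` element whose data is stored in attributes, so the streaming loop can filter on that tag. A minimal sketch (the `PostTypeId` attribute name follows the published dump schema, where "1" marks a question; adjust if your archive version differs):

```python
import xml.etree.ElementTree as ET

def count_questions(file_path):
    """Stream Posts.xml and count question posts (PostTypeId == "1")."""
    questions = 0
    for event, elem in ET.iterparse(file_path, events=('end',)):
        if elem.tag == 'row':
            if elem.attrib.get('PostTypeId') == '1':
                questions += 1
            elem.clear()  # discard the element once processed
    return questions
```

The same loop can just as easily write selected attributes out to CSV instead of counting them.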

1 Comment

Thank you! In the end I used the fileinput library to work through it line by line.
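For reference, the line-by-line approach mentioned in this comment can look like the sketch below. It assumes each `<row .../>` sits on its own line, which holds for the dump files, and parses each line individually with ElementTree rather than loading the whole document:

```python
import fileinput
import xml.etree.ElementTree as ET

def iter_rows(file_path):
    """Yield one parsed <row> element per line, without loading the file."""
    with fileinput.input(files=(file_path,)) as f:
        for line in f:
            line = line.strip()
            if line.startswith('<row'):
                # Each self-closing <row .../> line is a complete XML fragment
                yield ET.fromstring(line)
```

This trades XML-level generality for simplicity: it only works because of the one-row-per-line layout of the dump.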
