
I downloaded the archived Stack Overflow posts data from https://archive.org/details/stackexchange.

However, after extracting the 7z archive, I have a 103 GB XML file of posts.

I tried to load it with Python, but the process failed on the server.

How can I load such a large file into Python to analyze the data?

I tried the code from the following links: how to convert xml file to csv using python script

XML to CSV Python

But Python got stuck at the step of opening the XML file.


1 Answer


Yes, it is challenging; I ran into the same issue.

Instead of loading the whole file into RAM, you can stream it. For XML streaming you can use Python's ElementTree, specifically `xml.etree.ElementTree.iterparse`.

import xml.etree.ElementTree as ET

def process_element(elem):
    # your code goes here
    print(elem.tag, elem.attrib)

file_path = 'demo.xml'

# Create an incremental parser
context = ET.iterparse(file_path, events=('start', 'end'))

# Grab the root element from the first 'start' event.
# elem.clear() on its own is not enough to bound memory, because the
# root element keeps a reference to every child that has been parsed;
# we need to clear the root as we go.
event, root = next(context)

# Iterate over the file
for event, elem in context:
    if event == 'end' and elem is not root:
        process_element(elem)
        root.clear()  # free memory used by already-processed elements
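Applied to the Stack Exchange dump specifically: in Posts.xml each post is a single `<row>` element whose data is stored in attributes, so the streaming loop can filter on that tag. A minimal sketch (the `PostTypeId` attribute name follows the published dump schema, where "1" marks a question; adjust if your archive version differs):

```python
import xml.etree.ElementTree as ET

def count_questions(file_path):
    """Stream Posts.xml and count question posts (PostTypeId == "1")."""
    questions = 0
    for event, elem in ET.iterparse(file_path, events=('end',)):
        if elem.tag == 'row':
            if elem.attrib.get('PostTypeId') == '1':
                questions += 1
            elem.clear()  # discard the element once processed
    return questions
```

The same loop can just as easily write selected attributes out to CSV instead of counting them.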

1 Comment

Thank you! In the end I used the fileinput library to work through it line by line.
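For reference, the line-by-line approach mentioned in this comment can look like the sketch below. It assumes each `<row .../>` sits on its own line, which holds for the dump files, and parses each line individually with ElementTree rather than loading the whole document:

```python
import fileinput
import xml.etree.ElementTree as ET

def iter_rows(file_path):
    """Yield one parsed <row> element per line, without loading the file."""
    with fileinput.input(files=(file_path,)) as f:
        for line in f:
            line = line.strip()
            if line.startswith('<row'):
                # Each self-closing <row .../> line is a complete XML fragment
                yield ET.fromstring(line)
```

This trades XML-level generality for simplicity: it only works because of the one-row-per-line layout of the dump.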
