This is a quick script to move the Stackoverflow data from the StackExchange data dump (Sept '14) to a Postgres SQL database.
Schema hints are taken from a post on Meta.StackExchange and from StackExchange Data Explorer.
- Create the database
stackoverflowin your database:CREATE DATABASE stackoverflow;- You can use a custom database name as well. Make sure to explicitly give it while executing the script later.
- Move the following files to the folder from where the program is executed:
Badges.xml,Votes.xml,Posts.xml,Users.xml,Tags.xml.- In some old dumps, the cases in the filenames are different.
- Execute in the current folder (in parallel, if desired):
python load_into_pg.py -t Badgespython load_into_pg.py -t Postspython load_into_pg.py -t Tags(not present in earliest dumps)python load_into_pg.py -t Userspython load_into_pg.py -t Votespython load_into_pg.py -t PostLinkspython load_into_pg.py -t PostHistorypython load_into_pg.py -t Comments
- Finally, after all the initial tables have been created:
psql stackoverflow < ./sql/final_post.sql- If you used a different database name, make sure to use that instead of
stackoverflowwhile executing this step.
- For some additional indexes and tables, you can also execute the the following;
psql stackoverflow < ./sql/optional_post.sql- Again, remember to user the correct database name here, if not
stackoverflow.
You can use the script to download a given stackexchange compressed file from
archive.org and load
all the tables at once, using the -s switch.
You will need the urllib and libarchive modules.
If you give a schema name using the -n switch, all the tables will be moved
to the given schema. This schema will be created in the script.
To load the dba.stackexchange.com project in the dba schema, you would execute:
./load_into_pg.py -s dba -n dba
The paths are not changed in the final scripts sql/final_post.sql and
sql/optional_post.sql. To run them, first set the search_path to your
schema name: SET search_path TO <myschema>;
- It prepares some indexes and views which may not be necessary for your analysis.
- The
Bodyfield inPoststable is NOT populated by default. You have to use--with-post-bodyargument to include it. - The
EmailHashfield inUserstable is NOT populated.
- The
tags.xmlis missing from the data dump. Hence, thePostTagandUserTagQAtables will be empty afterfinal_post.sql. - The
ViewCountinPostsis sometimes equal to anemptyvalue. It is replaced byNULLin those cases.
@madtibo made significant contributions by adding jsonb and Foreign Key support.