103 changes: 60 additions & 43 deletions README.md
# StackOverflow data to postgres

This is a quick script to move the Stackoverflow data from the [StackExchange
data dump (Sept '14)](https://archive.org/details/stackexchange) to a Postgres
SQL database.

Schema hints are taken from [a post on
Meta.StackExchange](http://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede)
and from [StackExchange Data Explorer](http://data.stackexchange.com).

## Quickstart

Install the requirements, create a database, and run the `load_into_pg.py`
script:

``` console
$ pip install -r requirements.txt
...
Successfully installed argparse-1.2.1 libarchive-c-2.9 lxml-4.5.2 psycopg2-binary-2.8.4 six-1.10.0
$ createdb beerSO
$ python load_into_pg.py -s beer -d beerSO
```

This will download compressed files from
[archive.org](https://ia800107.us.archive.org/27/items/stackexchange/) and load
all the tables at once.
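For reference, here is a minimal sketch of how the download URL for a site's dump could be derived from that listing. The `<site>.stackexchange.com.7z` naming pattern is an assumption based on the mirror's file listing, not taken from the script itself:

```python
from urllib.parse import urljoin

# Base listing the script downloads from (see the archive.org link above).
ARCHIVE_BASE = "https://ia800107.us.archive.org/27/items/stackexchange/"

def dump_url(site: str) -> str:
    """Build the URL of the compressed dump for a StackExchange site name.

    The filename pattern is an assumption for illustration.
    """
    return urljoin(ARCHIVE_BASE, f"{site}.stackexchange.com.7z")

print(dump_url("beer"))
```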


## Advanced Usage

You can use a custom database name as well; pass it explicitly (e.g. with the
`-d` switch, as in the quickstart above) when executing the script.

Each table's data is archived in an XML file. The set of available tables
varies across dump history. `load_into_pg.py` knows how to handle the
following tables:

- `Badges`.
- `Posts`.
- `Tags` (not present in earliest dumps).
- `Users`.
- `Votes`.
- `PostLinks`.
- `PostHistory`.
- `Comments`.
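Each of these XML files holds one `row` element per record, with the column values stored as XML attributes. The script parses them with `lxml`; for illustration, here is a minimal sketch using the standard library's `xml.etree.ElementTree`, which has the same basic API for this purpose. The sample data below is made up, but mirrors the dump format:

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for a dump file such as Badges.xml: each record is a
# <row> element whose column values are stored as XML attributes.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<badges>
  <row Id="1" UserId="3" Name="Autobiographer" Date="2010-07-19T19:39:07.320" />
  <row Id="2" UserId="5" Name="Student" Date="2010-07-19T19:39:08.123" />
</badges>
"""

# One dict per record, keyed by column name.
rows = [dict(row.attrib) for row in ET.fromstring(SAMPLE).iter("row")]
print(rows[0]["Name"])  # → Autobiographer
```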

You can manually download the files to the folder from which the program is
executed: `Badges.xml`, `Votes.xml`, `Posts.xml`, `Users.xml`, `Tags.xml`. In
some old dumps, the filename casing differs.

Then load each file with e.g. `python load_into_pg.py -t Badges` (the tables
can be loaded in parallel, if desired).
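The per-table invocations are easy to script; a minimal sketch that only builds the command lines (it does not run the loader, and assumes `load_into_pg.py` sits in the current directory):

```python
# Tables to load, in the order listed above; "Tags" may be absent in the
# earliest dumps.
TABLES = ["Badges", "Posts", "Tags", "Users", "Votes",
          "PostLinks", "PostHistory", "Comments"]

# Build one loader invocation per table; each could be run in parallel.
commands = [["python", "load_into_pg.py", "-t", table] for table in TABLES]
for cmd in commands:
    print(" ".join(cmd))
```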

After all the initial tables have been created:

``` console
$ psql beerSO < ./sql/final_post.sql
```

For some additional indexes and tables, you can also execute the following:

``` console
$ psql beerSO < ./sql/optional_post.sql
```

If you give a schema name using the `-n` switch, all the tables will be moved
into that schema, which the script will create.

To load the _dba.stackexchange.com_ project into the `dba` schema, you would
execute:

``` console
$ ./load_into_pg.py -s dba -n dba
```

The paths are not changed in the final scripts `sql/final_post.sql` and
`sql/optional_post.sql`. To run them, first set the _search_path_ to your
schema name: `SET search_path TO <myschema>;`


## Caveats and TODOs

- It prepares some indexes and views which may not be necessary for your analysis.
## Acknowledgement

[@madtibo](https://github.com/madtibo) made significant contributions by adding `jsonb` and Foreign Key support.
[@bersace](https://github.com/bersace) brought the dependencies and the `README.md` instructions into 2020.
7 changes: 3 additions & 4 deletions requirements.txt
argparse==1.2.1
libarchive-c==2.9
lxml==4.5.2
psycopg2-binary==2.8.4
six==1.10.0