diff --git a/README.md b/README.md
index 36e9b2b..f9026c4 100644
--- a/README.md
+++ b/README.md
@@ -1,58 +1,74 @@
 # StackOverflow data to postgres
-This is a quick script to move the Stackoverflow data from the [StackExchange data dump (Sept '14)](https://archive.org/details/stackexchange) to a Postgres SQL database.
-
-Schema hints are taken from [a post on Meta.StackExchange](http://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede) and from [StackExchange Data Explorer](http://data.stackexchange.com).
-
-## Dependencies
-
- - [`lxml`](http://lxml.de/installation.html)
- - [`psycopg2`](http://initd.org/psycopg/docs/install.html)
- - [`libarchive-c`](https://pypi.org/project/libarchive-c/)
-
-## Usage
-
- - Create the database `stackoverflow` in your database: `CREATE DATABASE stackoverflow;`
-   - You can use a custom database name as well. Make sure to explicitly give
-     it while executing the script later.
- - Move the following files to the folder from where the program is executed:
-   `Badges.xml`, `Votes.xml`, `Posts.xml`, `Users.xml`, `Tags.xml`.
-     - In some old dumps, the cases in the filenames are different.
- - Execute in the current folder (in parallel, if desired):
-   - `python load_into_pg.py -t Badges`
-   - `python load_into_pg.py -t Posts`
-   - `python load_into_pg.py -t Tags` (not present in earliest dumps)
-   - `python load_into_pg.py -t Users`
-   - `python load_into_pg.py -t Votes`
-   - `python load_into_pg.py -t PostLinks`
-   - `python load_into_pg.py -t PostHistory`
-   - `python load_into_pg.py -t Comments`
- - Finally, after all the initial tables have been created:
-   - `psql stackoverflow < ./sql/final_post.sql`
-   - If you used a different database name, make sure to use that instead of
-     `stackoverflow` while executing this step.
- - For some additional indexes and tables, you can also execute the the following;
-   - `psql stackoverflow < ./sql/optional_post.sql`
-   - Again, remember to user the correct database name here, if not `stackoverflow`.
-
-## Loading a complete stackexchange project
-
-You can use the script to download a given stackexchange compressed file from
+This is a quick script to move the Stack Overflow data from the [StackExchange
+data dump (Sept '14)](https://archive.org/details/stackexchange) to a
+PostgreSQL database.
+
+Schema hints are taken from [a post on
+Meta.StackExchange](http://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede)
+and from [StackExchange Data Explorer](http://data.stackexchange.com).
+
+## Quickstart
+
+Install the requirements, create a database, and run `load_into_pg.py`; the
+example below loads the `beer` StackExchange site into a `beerSO` database:
+
+``` console
+$ pip install -r requirements.txt
+...
+Successfully installed argparse-1.2.1 libarchive-c-2.9 lxml-4.5.2 psycopg2-binary-2.8.4 six-1.10.0
+$ createdb beerSO
+$ python load_into_pg.py -s beer -d beerSO
+```
+
+This will download compressed files from
 [archive.org](https://ia800107.us.archive.org/27/items/stackexchange/) and load
-all the tables at once, using the `-s` switch.
+all the tables at once.
+
+
+## Advanced Usage
+
+You can use a custom database name as well; make sure to pass it explicitly
+via the `-d` switch when running the script.
+
+Each table's data is archived in an XML file. The available tables vary across
+dump versions. `load_into_pg.py` knows how to handle the following tables:
 
-You will need the `urllib` and `libarchive` modules.
+- `Badges`.
+- `Posts`.
+- `Tags` (not present in earliest dumps).
+- `Users`.
+- `Votes`.
+- `PostLinks`.
+- `PostHistory`.
+- `Comments`.
+
+You can also download the files manually into the folder from which the
+program is executed: `Badges.xml`, `Votes.xml`, `Posts.xml`, `Users.xml`,
+`Tags.xml`. In some old dumps, the filename casing differs.
+
+Then load each file with, e.g., `python load_into_pg.py -t Badges`.
+
+After all the initial tables have been created, run:
+
+``` console
+$ psql beerSO < ./sql/final_post.sql
+```
+
+For some additional indexes and tables, you can also execute the following:
+
+``` console
+$ psql beerSO < ./sql/optional_post.sql
+```
 
 If you give a schema name using the `-n` switch, all the tables will be moved
 to the given schema. This schema will be created in the script.
 
-To load the _dba.stackexchange.com_ project in the `dba` schema, you would execute:
-`./load_into_pg.py -s dba -n dba`
-
 The paths are not changed in the final scripts `sql/final_post.sql` and
 `sql/optional_post.sql`. To run them, first set the _search_path_ to your
 schema name: `SET search_path TO <schema_name>;`
+
 ## Caveats and TODOs
 
  - It prepares some indexes and views which may not be necessary for your analysis.
@@ -68,3 +84,4 @@ schema name: `SET search_path TO <schema_name>;`
 ## Acknowledgement
 
 [@madtibo](https://github.com/madtibo) made significant contributions by adding `jsonb` and Foreign Key support.
+[@bersace](https://github.com/bersace) brought the dependencies and the `README.md` instructions into 2020.
diff --git a/requirements.txt b/requirements.txt
index e1c997c..10665d2 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,6 +1,5 @@
 argparse==1.2.1
-distribute==0.6.24
-lxml==3.4.1
-psycopg2==2.5.4
-wsgiref==0.1.2
+libarchive-c==2.9
+lxml==4.5.2
+psycopg2-binary==2.8.4
 six==1.10.0
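As a companion to the schema instructions in the patched README, here is a minimal end-to-end sketch. The `beer` site and `beerSO` database come from the Quickstart; the schema name `beer` and the exact flag combination are illustrative assumptions, not something the patch itself shows:

``` console
$ createdb beerSO
$ python load_into_pg.py -s beer -d beerSO -n beer
$ # final_post.sql and optional_post.sql are not schema-aware,
$ # so set search_path before running them from inside psql
$ psql beerSO
beerSO=# SET search_path TO beer;
beerSO=# \i ./sql/final_post.sql
beerSO=# \i ./sql/optional_post.sql
```

This mirrors the README's note that the SQL scripts are left unchanged and expect `search_path` to point at your schema before they run.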