From 43318a61798c66cc6df7655b59b82e3eba559e7c Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=C3=89tienne=20BERSAC?=
Date: Thu, 13 Aug 2020 08:06:45 +0200
Subject: [PATCH 1/8] Use psycopg2-binary

---
 requirements.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/requirements.txt b/requirements.txt
index e1c997c..762acc6 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,6 +1,6 @@
 argparse==1.2.1
 distribute==0.6.24
 lxml==3.4.1
-psycopg2==2.5.4
+psycopg2-binary==2.8.4
 wsgiref==0.1.2
 six==1.10.0

From 082b0fc8d62c403dbdfe31e1467ce0ac99d93ba3 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=C3=89tienne=20BERSAC?=
Date: Thu, 13 Aug 2020 08:08:33 +0200
Subject: [PATCH 2/8] Update lxml to 4.5.2

Allows to use wheel.
---
 requirements.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/requirements.txt b/requirements.txt
index 762acc6..8cbeb7b 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,6 +1,6 @@
 argparse==1.2.1
 distribute==0.6.24
-lxml==3.4.1
+lxml==4.5.2
 psycopg2-binary==2.8.4
 wsgiref==0.1.2
 six==1.10.0

From 5eed26023c606d1d04850aefe1b2431c67b1e5fd Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=C3=89tienne=20BERSAC?=
Date: Thu, 13 Aug 2020 08:17:32 +0200
Subject: [PATCH 3/8] Avoid confusion between libarchive and libarchive-c

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 36e9b2b..46e2e3b 100644
--- a/README.md
+++ b/README.md
@@ -41,7 +41,7 @@ You can use the script to download a given stackexchange compressed file from
 [archive.org](https://ia800107.us.archive.org/27/items/stackexchange/) and load
 all the tables at once, using the `-s` switch.
 
-You will need the `urllib` and `libarchive` modules.
+You will need the `urllib` and `libarchive-c` modules.
 
 If you give a schema name using the `-n` switch, all the tables will be moved
 to the given schema. This schema will be created in the script.

From a9f0e0086a7a0afd1d1ce6b50716296d0fcf4783 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=C3=89tienne=20BERSAC?=
Date: Thu, 13 Aug 2020 08:08:19 +0200
Subject: [PATCH 4/8] Install libarchive-c for downloader

---
 requirements.txt | 1 +
 1 file changed, 1 insertion(+)

diff --git a/requirements.txt b/requirements.txt
index 8cbeb7b..2f01f60 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,5 +1,6 @@
 argparse==1.2.1
 distribute==0.6.24
+libarchive-c==2.9
 lxml==4.5.2
 psycopg2-binary==2.8.4
 wsgiref==0.1.2

From 8b0192bf421616f586a2b6f40228ace9515f590f Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=C3=89tienne=20BERSAC?=
Date: Thu, 13 Aug 2020 08:28:02 +0200
Subject: [PATCH 5/8] Drop distribute

This project is merged with setuptools.
---
 requirements.txt | 1 -
 1 file changed, 1 deletion(-)

diff --git a/requirements.txt b/requirements.txt
index 2f01f60..07fd573 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,5 +1,4 @@
 argparse==1.2.1
-distribute==0.6.24
 libarchive-c==2.9
 lxml==4.5.2
 psycopg2-binary==2.8.4

From 3533c8211edafc93d0b05001d6c6f66730f47cbf Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=C3=89tienne=20BERSAC?=
Date: Thu, 13 Aug 2020 08:55:54 +0200
Subject: [PATCH 6/8] Review README

Document a quickstart setup first and then describe advanced usage for custom tables.
---
 README.md | 102 +++++++++++++++++++++++++++++++-----------------------
 1 file changed, 59 insertions(+), 43 deletions(-)

diff --git a/README.md b/README.md
index 46e2e3b..231d002 100644
--- a/README.md
+++ b/README.md
@@ -1,58 +1,74 @@
 # StackOverflow data to postgres
 
-This is a quick script to move the Stackoverflow data from the [StackExchange data dump (Sept '14)](https://archive.org/details/stackexchange) to a Postgres SQL database.
-
-Schema hints are taken from [a post on Meta.StackExchange](http://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede) and from [StackExchange Data Explorer](http://data.stackexchange.com).
-
-## Dependencies
-
- - [`lxml`](http://lxml.de/installation.html)
- - [`psycopg2`](http://initd.org/psycopg/docs/install.html)
- - [`libarchive-c`](https://pypi.org/project/libarchive-c/)
-
-## Usage
-
- - Create the database `stackoverflow` in your database: `CREATE DATABASE stackoverflow;`
-   - You can use a custom database name as well. Make sure to explicitly give
-     it while executing the script later.
- - Move the following files to the folder from where the program is executed:
-   `Badges.xml`, `Votes.xml`, `Posts.xml`, `Users.xml`, `Tags.xml`.
-   - In some old dumps, the cases in the filenames are different.
- - Execute in the current folder (in parallel, if desired):
-   - `python load_into_pg.py -t Badges`
-   - `python load_into_pg.py -t Posts`
-   - `python load_into_pg.py -t Tags` (not present in earliest dumps)
-   - `python load_into_pg.py -t Users`
-   - `python load_into_pg.py -t Votes`
-   - `python load_into_pg.py -t PostLinks`
-   - `python load_into_pg.py -t PostHistory`
-   - `python load_into_pg.py -t Comments`
- - Finally, after all the initial tables have been created:
-   - `psql stackoverflow < ./sql/final_post.sql`
-   - If you used a different database name, make sure to use that instead of
-     `stackoverflow` while executing this step.
- - For some additional indexes and tables, you can also execute the the following;
-   - `psql stackoverflow < ./sql/optional_post.sql`
-   - Again, remember to user the correct database name here, if not `stackoverflow`.
-
-## Loading a complete stackexchange project
-
-You can use the script to download a given stackexchange compressed file from
+This is a quick script to move the Stackoverflow data from the [StackExchange
+data dump (Sept '14)](https://archive.org/details/stackexchange) to a Postgres
+SQL database.
+
+Schema hints are taken from [a post on
+Meta.StackExchange](http://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede)
+and from [StackExchange Data Explorer](http://data.stackexchange.com).
+
+## Quickstart
+
+Install requirements, create a `stackoverflow` database, and use
+`load_into_pg.py` script:
+
+``` console
+$ pip install -r requirements.txt
+...
+Successfully installed argparse-1.2.1 libarchive-c-2.9 lxml-4.5.2 psycopg2-binary-2.8.4 six-1.10.0 wsgiref-0.1.2
+$ createdb stackoverflow
+$ python load_into_pg.py -s beer
+```
+
+This will download compressed files from
 [archive.org](https://ia800107.us.archive.org/27/items/stackexchange/) and load
-all the tables at once, using the `-s` switch.
+all the tables at once.
+
+
+## Advanced Usage
+
+You can use a custom database name as well. Make sure to explicitly give it
+while executing the script later.
+
+Each table data is archived in an XML file. Available tables varies accross
+history. `load_into_pg.py` knows how to handle the following tables:
 
-You will need the `urllib` and `libarchive-c` modules.
+- `Badges`.
+- `Posts`.
+- `Tags` (not present in earliest dumps).
+- `Users`.
+- `Votes`.
+- `PostLinks`.
+- `PostHistory`.
+- `Comments`.
+
+You can download manually the files to the folder from where the program is
+executed: `Badges.xml`, `Votes.xml`, `Posts.xml`, `Users.xml`, `Tags.xml`. In
+some old dumps, the cases in the filenames are different.
+
+Then load each file with e.g. `python load_into_pg.py -t Badges`.
+
+After all the initial tables have been created:
+
+``` console
+$ psql stackoverflow < ./sql/final_post.sql
+```
+
+For some additional indexes and tables, you can also execute the the following;
+
+``` console
+$ psql stackoverflow < ./sql/optional_post.sql
+```
 
 If you give a schema name using the `-n` switch, all the tables will be moved
 to the given schema. This schema will be created in the script.
 
-To load the _dba.stackexchange.com_ project in the `dba` schema, you would execute:
-`./load_into_pg.py -s dba -n dba`
-
 The paths are not changed in the final scripts `sql/final_post.sql` and
 `sql/optional_post.sql`. To run them, first set the _search_path_ to your
 schema name: `SET search_path TO ;`
+
 
 ## Caveats and TODOs
 
  - It prepares some indexes and views which may not be necessary for your analysis.

From 3468d8de3abeee2f1e6fefa3fcd4964703f471f3 Mon Sep 17 00:00:00 2001
From: Utkarsh Upadhyay <502876+musically-ut@users.noreply.github.com>
Date: Sun, 30 Aug 2020 20:02:50 +0200
Subject: [PATCH 7/8] Change the example to use a different DB name.

Also, removed mention of unnecessary dependency which was installed for Python 2x support.
---
 README.md | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index 231d002..f9026c4 100644
--- a/README.md
+++ b/README.md
@@ -16,9 +16,9 @@ Install requirements, create a `stackoverflow` database, and use
 ``` console
 $ pip install -r requirements.txt
 ...
-Successfully installed argparse-1.2.1 libarchive-c-2.9 lxml-4.5.2 psycopg2-binary-2.8.4 six-1.10.0 wsgiref-0.1.2
-$ createdb stackoverflow
-$ python load_into_pg.py -s beer
+Successfully installed argparse-1.2.1 libarchive-c-2.9 lxml-4.5.2 psycopg2-binary-2.8.4 six-1.10.0
+$ createdb beerSO
+$ python load_into_pg.py -s beer -d beerSO
 ```
 
 This will download compressed files from
@@ -52,13 +52,13 @@ Then load each file with e.g. `python load_into_pg.py -t Badges`.
 After all the initial tables have been created:
 
 ``` console
-$ psql stackoverflow < ./sql/final_post.sql
+$ psql beerSO < ./sql/final_post.sql
 ```
 
 For some additional indexes and tables, you can also execute the the following;
 
 ``` console
-$ psql stackoverflow < ./sql/optional_post.sql
+$ psql beerSO < ./sql/optional_post.sql
 ```
 
 If you give a schema name using the `-n` switch, all the tables will be moved
@@ -84,3 +84,4 @@ schema name: `SET search_path TO ;`
 ## Acknowledgement
 
 [@madtibo](https://github.com/madtibo) made significant contributions by adding `jsonb` and Foreign Key support.
+[@bersace](https://github.com/bersace) brought the dependencies and the `README.md` instructions into 2020.

From ef4cf97afc53e36932cb0554720e721cc31b578f Mon Sep 17 00:00:00 2001
From: Utkarsh Upadhyay <502876+musically-ut@users.noreply.github.com>
Date: Sun, 30 Aug 2020 20:03:22 +0200
Subject: [PATCH 8/8] Update requirements.txt

Remove wsgiref which was required for Python 2 support.
---
 requirements.txt | 1 -
 1 file changed, 1 deletion(-)

diff --git a/requirements.txt b/requirements.txt
index 07fd573..10665d2 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -2,5 +2,4 @@ argparse==1.2.1
 libarchive-c==2.9
 lxml==4.5.2
 psycopg2-binary==2.8.4
-wsgiref==0.1.2
 six==1.10.0
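
The manual flow documented in the README changes of this series (one `-t` load per table, then the two post-load SQL scripts) can be sketched as a small helper. This is only an illustration and not part of the patches: `build_load_commands` and `build_post_commands` are hypothetical names, and combining `-d` with `-t` is an assumption, since the series only shows `-d` used together with `-s`. The helper builds the command lines without running them.

```python
"""Sketch: build the manual per-table load commands described in the README."""

# Table names listed in the README's "Advanced Usage" section.
TABLES = ["Badges", "Posts", "Tags", "Users", "Votes",
          "PostLinks", "PostHistory", "Comments"]


def build_load_commands(db_name, tables=TABLES):
    """Return one argv list per table (hypothetical helper).

    Assumption: `-d` (database name) can be combined with `-t`; the patches
    only show `-d` together with `-s`.
    """
    return [["python", "load_into_pg.py", "-t", table, "-d", db_name]
            for table in tables]


def build_post_commands(db_name):
    """Return shell lines for the post-load SQL scripts: final first, then optional."""
    return [f"psql {db_name} < ./sql/{script}"
            for script in ("final_post.sql", "optional_post.sql")]


if __name__ == "__main__":
    # Print the commands for review instead of executing them.
    for argv in build_load_commands("beerSO"):
        print(" ".join(argv))
    for line in build_post_commands("beerSO"):
        print(line)
```

Printing rather than executing keeps the sketch safe to run without a database; the lines can be passed to a shell once the `Tags` caveat for early dumps has been checked.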