I'm writing an ETL job that keeps an up-to-date list of commits, pull requests, and files from our GitHub repos in our data warehouse. I'm currently storing etags and passing them to the various iterators, but I don't think I'm using them correctly.
I'm also having trouble understanding exactly what object.refresh(conditional=True) does. If I iterate over all commits on a repo and then call commit.refresh(conditional=True) on each one, will I get a 304 exception to handle, so I know the commit hasn't changed and can skip re-writing it to the data warehouse? The same question applies to pull requests. Also, when I call repository.refresh(conditional=True), it seems to ignore new commits in the repo.
If I pass an etag to repo.iter_commits, will it return only the modified commits, or all commits for the repo if there have been any changes at all?
This is the basic workflow I'm using currently:
from github3 import login

gh = login(token='access_token')

# Conditionally refresh each repo; a 304 doesn't count against the rate limit
repos = (repo.refresh(conditional=True) for repo in gh.iter_repos(etag='previous_etag'))

# One commit iterator per repo
commit_iters = (repo.iter_commits(etag='prev_etag') for repo in repos)

for commit_iter in commit_iters:
    for commit in commit_iter:
        commit.refresh(conditional=True)
        # pull various attributes, write to file, etc...
I'm wrapping each iterator in a wrapper class that handles retrieving previous etags, storing new etags after iteration, and checking rate limits.
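For reference, the wrapper is along these lines (a minimal sketch: EtagIterator, _etag_store, and FakeIter are my own hypothetical names, not github3.py API; the only github3.py behavior assumed is that its iterators expose the response's ETag on a .etag attribute after iteration):

```python
class EtagIterator:
    """Feed a stored etag into a github3.py-style iterator, save the new one."""

    _etag_store = {}  # stands in for a real persistent store (db, file, ...)

    def __init__(self, make_iter, key):
        # make_iter: a callable accepting etag=... and returning an iterator,
        # e.g. the repo's iter_commits bound method
        self.key = key
        self.it = make_iter(etag=self._etag_store.get(key))

    def __iter__(self):
        for item in self.it:
            yield item
        # after iteration, remember the etag for the next run
        new_etag = getattr(self.it, "etag", None)
        if new_etag:
            self._etag_store[self.key] = new_etag


class FakeIter:
    """Stand-in for a github3.py iterator, just for demonstration."""

    def __init__(self, etag=None):
        self.requested_etag = etag   # what the wrapper passed in
        self.etag = 'W/"abc123"'     # what the server would send back

    def __iter__(self):
        return iter(["commit-1", "commit-2"])


wrapped = EtagIterator(lambda etag=None: FakeIter(etag=etag), "org/repo:commits")
items = list(wrapped)  # iterates, then stores the new etag under the key
```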
My overarching goal is to pull any commits/pull requests that are new or have changed since my last request. I assume at that point I would want to delete the existing entry from the database and insert the updated one.
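The delete-then-insert step can be collapsed into a single upsert; here's a sketch with sqlite3 (the commits table and its columns are just an illustration of my warehouse side, nothing github3.py prescribes):

```python
import sqlite3

# Hypothetical warehouse table; the commit sha is the natural key.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE commits (sha TEXT PRIMARY KEY, author TEXT, message TEXT)")

def upsert_commit(conn, sha, author, message):
    # INSERT OR REPLACE removes any existing row with this sha and inserts
    # the new one in a single statement, replacing the manual delete+insert.
    conn.execute(
        "INSERT OR REPLACE INTO commits (sha, author, message) VALUES (?, ?, ?)",
        (sha, author, message),
    )

upsert_commit(conn, "abc123", "alice", "initial import")
upsert_commit(conn, "abc123", "alice", "amended message")  # updates in place
```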
What is the proper and most efficient way to achieve this using the github3.py API?
EDIT:
I checked the docs again, and there is a since parameter on iter_commits that will take care of my problem for commits. So I just need to know how to use etags properly to pull updated pull request data.
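Concretely, I mean something like this for the commits side (a sketch only; last_run and upsert_into_warehouse are hypothetical names for my own persistence pieces, not github3.py API):

```python
from datetime import datetime, timedelta

# Hypothetical: timestamp of the previous ETL run, loaded from storage.
last_run = datetime.utcnow() - timedelta(days=1)

# iter_commits accepts since=, so only commits after that point come back
# and no etag juggling is needed on this path:
#
#     for commit in repo.iter_commits(since=last_run):
#         upsert_into_warehouse(commit)   # hypothetical helper
#
# GitHub expects ISO 8601 timestamps; passing a datetime works, but the
# explicit string form would be:
since_param = last_run.strftime("%Y-%m-%dT%H:%M:%SZ")
```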