I'm writing an ETL job that keeps an up-to-date list of commits, pull requests, and files from our GitHub repos in our data warehouse. I'm currently storing etags and passing them to the various iterators, but I don't think I'm using them correctly.
I'm also having trouble understanding exactly what object.refresh(conditional=True) does. If I iterate over all commits on a repo and then call commit.refresh(conditional=True) on each one, will I get a 304 exception to handle, so I know the commit hasn't changed and can skip re-writing it to the data warehouse? The same question applies to pull requests. Also, when I call repository.refresh(conditional=True), it seems to ignore new commits in the repo.
If I pass an etag to repo.iter_commits, will it return only the modified commits, or all commits for the repo if there have been any changes at all?
This is the basic workflow I'm using currently:
from github3 import login

gh = login(token='access_token')

# Conditionally refresh each repo; a 304 doesn't count against the rate limit
repos = (repo.refresh(conditional=True) for repo in gh.iter_repos(etag='previous_etag'))

# One commit iterator per repo
commit_iters = (repo.iter_commits(etag='prev_etag') for repo in repos)

for commit_iter in commit_iters:
    for commit in commit_iter:
        commit.refresh(conditional=True)
        # pull various attributes, write to file, etc...
I'm wrapping each iterator in a wrapper class that handles retrieving previous etags, storing new etags after iteration, and checking rate limits.
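For reference, the wrapper is along these lines (a minimal sketch: EtagIterator, _etag_store, and FakeIter are my own hypothetical names, not github3.py API; the only github3.py behavior assumed is that its iterators expose the response's ETag on a .etag attribute after iteration):

```python
class EtagIterator:
    """Feed a stored etag into a github3.py-style iterator, save the new one."""

    _etag_store = {}  # stands in for a real persistent store (db, file, ...)

    def __init__(self, make_iter, key):
        # make_iter: a callable accepting etag=... and returning an iterator,
        # e.g. the repo's iter_commits bound method
        self.key = key
        self.it = make_iter(etag=self._etag_store.get(key))

    def __iter__(self):
        for item in self.it:
            yield item
        # after iteration, remember the etag for the next run
        new_etag = getattr(self.it, "etag", None)
        if new_etag:
            self._etag_store[self.key] = new_etag


class FakeIter:
    """Stand-in for a github3.py iterator, just for demonstration."""

    def __init__(self, etag=None):
        self.requested_etag = etag   # what the wrapper passed in
        self.etag = 'W/"abc123"'     # what the server would send back

    def __iter__(self):
        return iter(["commit-1", "commit-2"])


wrapped = EtagIterator(lambda etag=None: FakeIter(etag=etag), "org/repo:commits")
items = list(wrapped)  # iterates, then stores the new etag under the key
```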
My overarching goal is to pull any commits/pull requests that are new or have changed since my last request. I assume at that point I would want to delete the existing entry from the database and insert the updated one.
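The delete-then-insert step can be collapsed into a single upsert; here's a sketch with sqlite3 (the commits table and its columns are just an illustration of my warehouse side, nothing github3.py prescribes):

```python
import sqlite3

# Hypothetical warehouse table; the commit sha is the natural key.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE commits (sha TEXT PRIMARY KEY, author TEXT, message TEXT)")

def upsert_commit(conn, sha, author, message):
    # INSERT OR REPLACE removes any existing row with this sha and inserts
    # the new one in a single statement, replacing the manual delete+insert.
    conn.execute(
        "INSERT OR REPLACE INTO commits (sha, author, message) VALUES (?, ?, ?)",
        (sha, author, message),
    )

upsert_commit(conn, "abc123", "alice", "initial import")
upsert_commit(conn, "abc123", "alice", "amended message")  # updates in place
```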
What is the proper and most efficient way to achieve this using the github3.py API?
EDIT:
I checked the docs again, and there is a since parameter on iter_commits that will take care of my problem for commits. So I just need to know how to use etags properly to pull updated pull request data.
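Concretely, I mean something like this for the commits side (a sketch only; last_run and upsert_into_warehouse are hypothetical names for my own persistence pieces, not github3.py API):

```python
from datetime import datetime, timedelta

# Hypothetical: timestamp of the previous ETL run, loaded from storage.
last_run = datetime.utcnow() - timedelta(days=1)

# iter_commits accepts since=, so only commits after that point come back
# and no etag juggling is needed on this path:
#
#     for commit in repo.iter_commits(since=last_run):
#         upsert_into_warehouse(commit)   # hypothetical helper
#
# GitHub expects ISO 8601 timestamps; passing a datetime works, but the
# explicit string form would be:
since_param = last_run.strftime("%Y-%m-%dT%H:%M:%SZ")
```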