I'm writing an ETL job where I keep an updated list of commits, pull requests, and files from our GitHub repos in our data warehouse. I'm currently storing and passing in etags to the various iterators, but I don't think I'm understanding how to do it correctly.

I'm also having trouble understanding exactly what object.refresh(conditional=True) does. If I iterate over all commits on a repo and then call commit.refresh(conditional=True) on each one, will I receive a 304 exception to handle, so I know not to include that commit in the data warehouse because it hasn't changed? The same goes for pull requests. When I call repository.refresh(conditional=True), it seems to ignore new commits in the repos.

If I pass an etag to repo.iter_commits, will it only return modified commits, or does it return all commits for the repo if there have been any changes at all?

This is the basic workflow I'm using currently:

from github3 import login

gh = login(token='access_token')
repos = (repo.refresh(conditional=True) for repo in gh.iter_repos(etag='previous_etag'))

# iter_commits() yields commit objects, so flatten them into a single stream
commits = (commit for repo in repos for commit in repo.iter_commits(etag='prev_etag'))

for commit in commits:
    commit.refresh(conditional=True)
    # pull various attributes, write to file, etc...

I'm wrapping each iterator in a wrapper class that handles retrieving previous etags, storing etags after iteration, and checking rate-limits.

My overarching goal is to pull any commits/pull requests that are new or have changed since my last request. I assume at that point I would want to delete the existing entry from the database and replace it with the new entry.

What is the proper and most efficient way to achieve this using the github3.py API?

EDIT: I checked the docs again, and there is a since parameter that will take care of my problem for commits. So I just need to know how to use etags properly to pull updated pull request data.

1 Answer

ETags work in the following way:

  1. You make a request, consume the resource, and store the ETag

  2. You make a later request that sends the stored ETag value (in the If-None-Match header)

    • If there is a change to the resource, you must consume the entire resource again

    • If there is no change, you will receive a 304 Not Modified response with no body

An ETag only tells you whether the resource changed. It does not let you resume from where you left off, and the API has no good way to do that either.
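To make the flow above concrete, here is a minimal sketch of a conditional request. It does not use github3.py; the function name fetch_if_changed, the transport callable, and fake_transport are all hypothetical, and the fake stands in for a real HTTP call so the example runs offline. The only real protocol details assumed are the If-None-Match request header, the ETag response header, and the 304 status.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical transport shape: (url, request_headers) -> (status, response_headers, body)
Transport = Callable[[str, Dict[str, str]], Tuple[int, Dict[str, str], Optional[list]]]


def fetch_if_changed(transport: Transport, url: str, stored_etag: Optional[str]):
    """Return (changed, etag, body) for one conditional GET."""
    headers = {}
    if stored_etag:
        # Ask the server to skip the body if our copy is still current.
        headers["If-None-Match"] = stored_etag
    status, resp_headers, body = transport(url, headers)
    if status == 304:
        # Nothing changed: keep the old ETag, there is no body to process.
        return False, stored_etag, None
    # Something changed: the whole resource comes back, not just the difference.
    return True, resp_headers.get("ETag"), body


def fake_transport(url: str, headers: Dict[str, str]):
    """Stand-in server (an assumption for the demo) whose resource is at ETag "v2"."""
    current_etag = '"v2"'
    if headers.get("If-None-Match") == current_etag:
        return 304, {"ETag": current_etag}, None
    return 200, {"ETag": current_etag}, [{"sha": "abc"}, {"sha": "def"}]


if __name__ == "__main__":
    url = "https://example.invalid/commits"
    print(fetch_if_changed(fake_transport, url, None))    # first run: full body
    print(fetch_if_changed(fake_transport, url, '"v2"'))  # unchanged: 304
    print(fetch_if_changed(fake_transport, url, '"v1"'))  # stale: full body again
```

Note the last case: a stale ETag gets you the complete list again, which is why ETags alone cannot give you "only the modified commits".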

Honestly, what I think you might want to do is the following:

  1. Consume all present commits on a repository
  2. Register a webhook that subscribes to just the push event
  3. Process the rest of the commits as people push them to GitHub.
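
As a rough sketch of step 3, the receiving side of the webhook could look something like the following. This is not github3.py code; signature_is_valid and commits_from_push are hypothetical helper names, and the row layout is only an assumption about what your warehouse wants. It relies on GitHub signing the request body with HMAC-SHA256 (sent in the X-Hub-Signature-256 header as "sha256=<hex>") and on the push payload carrying a "commits" list and a "repository" object.

```python
import hashlib
import hmac
import json


def signature_is_valid(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Check the X-Hub-Signature-256 header value against the raw request body."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking timing information
    return hmac.compare_digest(expected, signature_header)


def commits_from_push(body: bytes) -> list:
    """Turn a push event payload into flat rows (hypothetical warehouse layout)."""
    payload = json.loads(body)
    repo = payload["repository"]["full_name"]
    return [
        {
            "repo": repo,
            "sha": commit["id"],
            "message": commit["message"],
            "timestamp": commit["timestamp"],
        }
        for commit in payload.get("commits", [])
    ]


if __name__ == "__main__":
    secret = b"webhook-secret"  # placeholder, configured when registering the hook
    body = json.dumps({
        "ref": "refs/heads/master",
        "repository": {"full_name": "octocat/example"},
        "commits": [
            {"id": "abc123", "message": "Fix bug", "timestamp": "2014-01-01T00:00:00Z"},
        ],
    }).encode()
    header = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    if signature_is_valid(secret, body, header):
        for row in commits_from_push(body):
            print(row)
```

You would call these from whatever web framework receives the POST, passing in the raw request body and the signature header.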