4

I asked this question on #git earlier, but as it's reasonably substantial I'll post it here. I want to run a filter-branch on a repo to modify (thousands of) files over hundreds of commits using a Python script. I'm calling the clean.py script with the following command in the repo directory:

git filter-branch -f --tree-filter '(cd ../cleaner/ && python clean.py --path=files/*/*/**)'

clean.py looks like this and will modify all files matching the path (i.e. files/*/*/**):

import argparse
import logging

import yaml

from cleaner import Cleaner

# Optional override for the glob of files to clean
parser = argparse.ArgumentParser()
parser.add_argument("--path", help="path to run cleaner on", type=str)
args = parser.parse_args()

# logging.basicConfig(level=logging.DEBUG)

# config.yml is read from the current working directory
with open("config.yml") as sets:
    config = yaml.safe_load(sets)

# Fall back to the configured pattern when --path is not given
path = args.path
if not path:
    path = config["cleaner"]["general_pattern"]

cleaner = Cleaner(config["cleaner"])

print("Cleaning path: " + str(path))
cleaner.clean(path, True)

After running the command, the following is output to the terminal:

$ python deploy.py --verbose
INFO:root:Checked out master branch
INFO:root:Running command:
'git filter-branch -f --tree-filter '(cd C:/Users/Graeme/Documents/programming/clean-cdn/clean-jsdelivr/ && python clean.py --path=files/*/*/**)' -d "../tmp"' in ../jsdelivr
Rewrite 298ec3a2ca5877a25ebd40aeb815d7b5a5f33a7e (1/1535)
Cleaning path: files/*/*/**

C:\Program Files (x86)\git/libexec/git-core\git-filter-branch: line 343: ../commit: No such file or directory
C:\Program Files (x86)\git/libexec/git-core\git-filter-branch: line 346: ../map/298ec3a2ca5877a25ebd40aeb815d7b5a5f33a7e
: No such file or directory
could not write rewritten commit
rm: cannot remove `/c/Users/Graeme/Documents/programming/clean-cdn/tmp/revs': Permission denied
rm: cannot remove directory `/c/Users/Graeme/Documents/programming/clean-cdn/tmp': Directory not empty

The Python script executes successfully and modifies the files correctly, but filter-branch doesn't finish rewriting the commit. There appears to be a permissions issue, but I haven't been able to get around it by running with elevated privileges. I've tried running the filter-branch on Windows 7, Windows 8, and Ubuntu with git v1.8 and v1.9.
Edit: The script works as-is on CentOS with git 1.7.1.

The goal is to reduce the size of a CDN's repo (nearing 1GB) after the contents of files/*/*/** finish syncing with a database.
The source code of the project
Target repo for the rewrite

3
  • What is the output of git --version? Commented Mar 30, 2014 at 7:08
  • Can you clarify what repo it is you're looking to clean? Is it github.com/jsdelivr/jsdelivr (current pack size ~284MB)? Commented Mar 30, 2014 at 10:05
  • @michas I've tried running this on v1.9.0, v1.8.5 and 1.8.3. Yes, that's the right repo, Roberto. Commented Mar 30, 2014 at 12:25

3 Answers

2
+400

The permissions issue you're encountering is interesting. Are you doing this on a local copy of the repo (i.e. one where you have full access to the filesystem), or on a remote server?

Reading over your Python code, it looks like you're trying to remove every file over a certain size that is not a .INI file. Did I get that right?

If that's the case, can I ask if you've considered The BFG Repo-Cleaner? Obviously, you learn a lot about Git by writing your own code (I know I have), but I think The BFG is probably tailor-made for your needs - and will be faster than any git-filter-branch based approach.

In your case, you might want to run it with a command like:

$ java -jar bfg.jar --strip-blobs-bigger-than 100K my-repo.git

This removes all blobs bigger than 100K that aren't in your latest commit.

I did a quick run with this on the jsdelivr repo and reduced the pack size from 284M to 138M in the cleaned repo. The BFG cleaning step took under 5 seconds, and the subsequent git gc --prune=now --aggressive took just under 2 minutes.
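
As a rough sketch, the two steps above (the BFG run, then git gc) could be driven from a Python script along these lines. The function names, the default jar location, and the repo path are hypothetical placeholders. Only the command-line arguments themselves come from the commands quoted above.

```python
import subprocess


def build_cleanup_commands(repo_dir, size_limit="100K", bfg_jar="bfg.jar"):
    """Return (argv, cwd) pairs for the BFG run and the follow-up git gc.

    repo_dir, size_limit and bfg_jar are illustrative defaults, not values
    mandated by The BFG.
    """
    return [
        # Step 1: strip blobs over the size limit that are not in the latest commit
        (["java", "-jar", bfg_jar, "--strip-blobs-bigger-than", size_limit, repo_dir], None),
        # Step 2: run inside the repo so the stripped objects are actually pruned
        (["git", "gc", "--prune=now", "--aggressive"], repo_dir),
    ]


def run_cleanup(repo_dir, **kwargs):
    """Execute both steps in order, stopping at the first failing command."""
    for argv, cwd in build_cleanup_commands(repo_dir, **kwargs):
        subprocess.check_call(argv, cwd=cwd)
```

Calling run_cleanup("my-repo.git") would then run both commands in sequence. Keeping the command construction separate from execution makes it easy to print or inspect the commands before touching a real repository.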

Full disclosure: I'm the author of the BFG Repo-Cleaner.


15 Comments

Also, our current files aren't sacred - is there any way to have your tool hit all commits up to HEAD?
Re the sacred: --no-blob-protection is your (scary) friend!
Alright, neat - looks promising. Is there any way to specify the ***REMOVED*** text, and does your project support globbed paths?
Ah thanks, context is good! It wouldn't be hard to change the BFG to zero the files (github.com/rtyley/bfg-repo-cleaner/blob/ed21bed/bfg-library/src/… ), but from reading issue 347, I don't think it's essential to the spirit of what you're trying to do - replacement files called 'filename.REMOVED.git-id' would be fine I think. Overall, I'm not sure that /frequent/ history rewrites would be good for the jsdelivr project tho' - would make it rather confusing for people submitting pull-requests?
Regarding byte-size - I've just cut release v1.11.3 of The BFG, with support for filtering files by single-byte filesizes! Will be visible at repo1.maven.org/maven2/com/madgag/bfg within a few hours.
1

You should not cd to another directory as the git-filter-branch script will use relative paths to access the files.
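
If the only reason for the cd is that clean.py opens config.yml relative to the current directory, one way around it might be to resolve that file against the script's own location instead. A minimal sketch, assuming a hypothetical helper named resolve_beside_script that is not part of the original script:

```python
import os


def resolve_beside_script(script_file, filename):
    """Return an absolute path to `filename` in the directory holding `script_file`."""
    script_dir = os.path.dirname(os.path.abspath(script_file))
    return os.path.join(script_dir, filename)


# Hypothetical use inside clean.py, replacing open("config.yml"):
#     with open(resolve_beside_script(__file__, "config.yml")) as sets:
#         config = yaml.safe_load(sets)
```

With something like this in place, the tree filter could invoke the script by its full path without changing directory first. A relative --path pattern would then presumably be interpreted against the directory filter-branch runs the filter in, so that is worth checking before relying on it.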

1 Comment

The script loads some .yml files from its own directory, and filter-branch executes the command in the context of the repo's path. AFAIK there's no way to set a cwd path.
0

Consider using BFG. It is much faster and simpler to use.
