In this tutorial, we'll learn how to reduce the size of a Git repository using git filter-branch. We'll cover identifying large files, safely removing them, and cleaning up to maintain a more efficient and manageable repository.
In the world of coding, Git is like a magical diary, keeping track of all the changes we make in our projects. But sometimes, this diary gets too full, making it heavy and slow to use. This happens when our Git repository, where all our project's history is stored, becomes bloated with too many or too large files. It's like having a backpack full of unnecessary things, making it hard to carry around.
To deal with this, we have a special tool in Git called git filter-branch. Think of it as a magic wand that lets us go back in time and remove things we no longer need from our diary. By using it, we can make our Git repository lighter and more efficient, like cleaning out that overloaded backpack.
Steps to reduce Git repo size using filter-branch
1. Checking Git Repository Size
Assuming you have already cloned your repository, open your command line and navigate to your repository's root directory. Run this simple command to see the size:
$ du -sh .git
723M	.git
This command will display the size of the .git folder, which is where all your repository's history and data are stored. In my case, it was around 723 MB.
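As a complementary check, Git can report its own accounting of loose and packed objects. The sketch below wraps du and git count-objects in one helper; the function name repo_size is hypothetical and not part of the original workflow, and it assumes a Git version recent enough to support rev-parse --absolute-git-dir.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report the size of a repository's Git data two ways.
# Usage: repo_size [path-to-repo]   (defaults to the current directory)
repo_size() {
    local repo="${1:-.}"
    local git_dir
    # Resolve the actual Git directory (works for normal and bare repos).
    git_dir="$(git -C "$repo" rev-parse --absolute-git-dir)" || return 1
    # On-disk size of the Git directory, as reported by du.
    echo "on-disk: $(du -sh "$git_dir" | cut -f1)"
    # Git's own accounting: loose objects (size) and packed objects (size-pack).
    git -C "$repo" count-objects -vH | grep -E '^(size|size-pack):'
}
```

Running repo_size before and after the cleanup gives two comparable snapshots; the size-pack line is usually the one that matters most in a bloated repository.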
2. Take a Backup of Your Git Repo
Before diving into cleaning up our Git repository, it's like preparing for a big cleaning day in our house. We need to make sure all our valuable items are safe. In the Git world, this means creating a backup of our entire repository. This is super important because if something unexpected happens during the cleanup, we don't lose our precious work.
To create a backup, we simply make a complete copy of our repository. Open your command line, navigate to the directory containing your repository, and run:
git clone --mirror [your-repo-url] [backup-repo-name].git
Replace [your-repo-url] with your repository's URL and [backup-repo-name] with a name for your backup repository. This command creates a full mirror of your repository, including all branches and tags.
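Before relying on the backup, it can be worth confirming that it really matches the repository it was cloned from. The following is a minimal sketch, assuming both locations are reachable with git ls-remote; the helper name verify_backup is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: check that a mirror backup has the same refs as the
# repository it was cloned from, and that its object store is intact.
# Usage: verify_backup <your-repo-url> <backup-repo-name>.git
verify_backup() {
    local origin="$1" backup="$2"
    local origin_refs backup_refs
    # ls-remote prints "<object id> <ref name>" for every ref it can see.
    origin_refs="$(git ls-remote "$origin")" || return 1
    backup_refs="$(git ls-remote "$backup")" || return 1
    if [ "$origin_refs" != "$backup_refs" ]; then
        echo "backup refs differ from the source" >&2
        return 1
    fi
    # Check the backup's objects for corruption or missing data.
    git -C "$backup" fsck --no-dangling >/dev/null 2>&1 || {
        echo "backup failed git fsck" >&2
        return 1
    }
    echo "backup OK"
}
```

If someone pushes to the source between the clone and the check, the refs will differ and the helper reports a mismatch, so it is best run right after creating the mirror.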
3. Identifying Large Files in Your Repository
Execute the following set of commands from the terminal inside your git repository:
git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  sed -n 's/^blob //p' |
  sort --numeric-sort --key=2 |
  cut -c 1-12,41- |
  $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest
This magical incantation tells Git to list every file version (blob) in the repository's history along with its size, sorted from smallest to largest, so the biggest files appear at the end of the output.
After running the script, you'll see a list of files. Each line shows a file's unique identifier, its size, and its location in your repository. Now, look through the list and decide which files to remove. Usually, these are very large files that aren't needed in your project's history, like old database backups or large media files. Think carefully about each file - removing it will erase it from your project's history. If you're unsure, it's always safer to check with your team or keep the file.
In my case, I found compiled binaries stored in my Git history that I no longer need, and they seem to take up about 42 MB in each commit:
...
836f2f1e9cc5   12MiB my-repo/CKEY/libnss_ckey.so.2
07803f403369   12MiB my-repo/CKEY/libnss_ckey.so.2
4895a2ac88ac   12MiB my-repo/CKEY/libnss_ckey.so.2
50f5f8313da7   12MiB my-repo/CKEY/libnss_ckey.so.2
221021f8f8cf   12MiB my-repo/CKEY/libnss_ckey.so.2
edb181368bf9   12MiB my-repo/CKEY/libnss_ckey.so.2
62bdefc78547   12MiB my-repo/CKEY/libnss_ckey.so.2
...
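When the same path shows up many times, as in the listing above, it can help to total the bytes stored per path rather than read individual entries. The helper below is a rough sketch of that idea; the name largest_paths is hypothetical, the totals are uncompressed blob sizes (not the space actually used inside packfiles), and each blob is counted once under the first path Git reports for it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: total the uncompressed bytes stored per path across
# all history and print the heaviest paths first.
# Usage: largest_paths [count]   (run inside the repository; default 10)
largest_paths() {
    local count="${1:-10}"
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
        awk '$1 == "blob" {
                 size = $2
                 # Strip the type and size so $0 holds only the path,
                 # which may itself contain spaces.
                 sub(/^blob [0-9]+ /, "")
                 total[$0] += size
             }
             END { for (path in total) printf "%.0f\t%s\n", total[path], path }' |
        sort -rn |
        head -n "$count"
}
```

Each output line is a byte total followed by a tab and the path, which makes it easy to decide which paths are worth removing.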
4. Using git filter-branch to Remove Unwanted Data
Now, to start cleaning, we use the git filter-branch command. This powerful tool allows us to rewrite history by removing files we don't need anymore.
git filter-branch --force --index-filter \
  'git rm --cached --ignore-unmatch PATH-TO-YOUR-FILE' \
  --prune-empty --tag-name-filter cat -- --all
Replace PATH-TO-YOUR-FILE with the path to the file you identified as large and unnecessary. In my case it is my-repo/reloader/manager, so let's update the command:
git filter-branch --force --index-filter \
  'git rm --cached --ignore-unmatch my-repo/reloader/manager' \
  --prune-empty --tag-name-filter cat -- --all
After executing the above command, you should see the specified files being removed from each commit as the history is rewritten:
Rewrite f038a7a47a935fca4f3fa1285a98ff525700d4fc (1704/3820) (52 seconds passed, remaining 64 predicted) rm 'my-repo/CKEY/libnss_ckey.so.2'
Rewrite bededf0e5dabc309906a4ddd39e7eeb00e71d457 (1704/3820) (52 seconds passed, remaining 64 predicted) rm 'my-repo/CKEY/libnss_ckey.so.2'
Rewrite 2d77879248794ab3e99272d502fd3a404a8d8e71 (1704/3820) (52 seconds passed, remaining 64 predicted) rm 'my-repo/CKEY/libnss_ckey.so.2'
Rewrite f7071f137227b55490920174c196e319779a15fc (1704/3820) (52 seconds passed, remaining 64 predicted) rm 'my-repo/CKEY/libnss_ckey.so.2'
Rewrite eb21b67f5ca086685169905bde4e7d91a2670597 (1704/3820) (52 seconds passed, remaining 64 predicted) rm 'my-repo/CKEY/libnss_ckey.so.2'
Rewrite 26cec87fe4c7efe7a1a0d17a7ae9a11b37e7df79 (1704/3820) (52 seconds passed, remaining 64 predicted) rm 'my-repo/CKEY/libnss_ckey.so.2'
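Every run of git filter-branch walks the entire history, so if you have several unwanted paths it is usually quicker to remove them in a single pass. The sketch below shows one way to do that; the helper name purge_paths is hypothetical, -r is added so that directories can be passed as well as files, and FILTER_BRANCH_SQUELCH_WARNING is an environment variable that newer Git versions honor to skip the startup warning (older versions simply ignore it). Run it from the top level of a clean working tree.

```shell
#!/usr/bin/env bash
# Hypothetical helper: remove one or more paths from every commit on every
# branch and tag in a single filter-branch pass.
# Usage: purge_paths PATH [PATH...]   (run from the repository root)
purge_paths() {
    if [ "$#" -eq 0 ]; then
        echo "usage: purge_paths PATH [PATH...]" >&2
        return 1
    fi
    local quoted
    # Shell-quote the paths so spaces and special characters survive being
    # embedded in the --index-filter command string.
    quoted="$(git rev-parse --sq-quote "$@")"
    FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch --force \
        --index-filter "git rm -r --cached --ignore-unmatch --quiet -- $quoted" \
        --prune-empty --tag-name-filter cat -- --all
}
```

For example, purge_paths my-repo/reloader/manager my-repo/CKEY/libnss_ckey.so.2 would drop both paths from the examples in this tutorial in one rewrite.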
Impact on Repository History
Using git filter-branch is like going back in time and erasing something from your past. It completely removes the specified files from your repository's history. This means that past commits that included these files will be rewritten as if the files were never there. It's a powerful tool, but with great power comes great responsibility:
- Rewriting History: This alters the history of your repository. If you're working in a team, this change can cause significant issues for everyone else's local copies. They may need to re-clone the repository or handle complex merges.
- Data Loss: Once a file is removed from history, it's gone for good (unless you have a backup). Make sure the files you're removing are not needed.
- Force Push Required: After cleaning your repository, you'll need to force push these changes to your remote repository. This can disrupt other collaborators working on the project.
5. Cleaning and Finalizing the Changes
After git filter-branch finishes, it keeps the original branches and tags as backup refs under refs/original, so the old data still lingers. To delete these refs:
git for-each-ref --format='delete %(refname)' refs/original | git update-ref --stdin
This command locates and deletes the references to the pre-rewrite history that filter-branch saved, helping prevent any confusion with the rewritten history.
The reflog records when the tips of branches and other references were updated in the repository. To clear these records and further reduce repo size:
git reflog expire --expire=now --all
This step ensures that entries in the reflog that might refer to the old (pre-rewrite) history are removed, further cleaning the repository.
Finally, perform garbage collection with:
git gc --prune=now --aggressive
This command is the final sweep-up, removing objects that are no longer in use and optimizing the repository. The --aggressive flag makes git gc work harder to compress data, and --prune=now deletes unreachable objects immediately instead of waiting for the default grace period.
Sample Output:
Enumerating objects: 28807, done.
Counting objects: 100% (28807/28807), done.
Delta compression using up to 4 threads
Compressing objects: 100% (27748/27748), done.
Writing objects: 100% (28807/28807), done.
Total 28807 (delta 16893), reused 10665 (delta 0), pack-reused 0
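The three cleanup commands above are easy to run out of order, so it can be handy to bundle them. Here is a minimal sketch that does so and prints the object-store size before and after; the helper name cleanup_rewritten_repo is hypothetical, and the sizes are the KiB figures from git count-objects -v rather than the du measurement.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run the post-filter-branch cleanup steps in order and
# report the object-store size (loose + packed, in KiB) before and after.
# Usage: cleanup_rewritten_repo   (run inside the rewritten repository)
cleanup_rewritten_repo() {
    local before after
    # Sum the "size" (loose objects) and "size-pack" (packfiles) fields.
    before="$(git count-objects -v |
        awk '/^size:/ {l = $2} /^size-pack:/ {p = $2} END {print l + p}')"
    # 1. Drop the refs/original/* backups left behind by filter-branch.
    git for-each-ref --format='delete %(refname)' refs/original |
        git update-ref --stdin || return 1
    # 2. Expire every reflog entry so the old commits become unreachable.
    git reflog expire --expire=now --all || return 1
    # 3. Repack and delete the unreachable objects right away.
    git gc --quiet --prune=now --aggressive || return 1
    after="$(git count-objects -v |
        awk '/^size:/ {l = $2} /^size-pack:/ {p = $2} END {print l + p}')"
    echo "objects: ${before} KiB -> ${after} KiB"
}
```

Because step 1 deletes the only in-repository copy of the old history, make sure the mirror backup from step 2 of this tutorial exists before running it.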
Why This Comprehensive Cleanup Is Necessary
- Remove Redundancies: Ensures all layers of historical data that are no longer necessary are removed.
- Optimize Storage: Maximizes the efficiency of the repository, making future operations faster.
- Reclaim Space: The main goal is to see a reduction in the physical size of the .git folder on your drive.
6. Verifying the New Repository Size
After completing these steps, check the size of your .git folder again:
$ du -sh .git
484M	.git
You should see a significant reduction in size compared to the initial measurement. This indicates a successful cleanup, resulting in a leaner, more efficient repository. As you can see, in my case the repository shrank from 723 MB to 484 MB.
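A smaller .git folder is a good sign, but it does not by itself prove that a particular file is gone. As a rough sketch, the check below counts the commits on any branch or tag that still touch a given path; the helper name assert_path_purged is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if no commit reachable from a branch or
# tag still touches the given path.
# Usage: assert_path_purged PATH   (run inside the repository)
assert_path_purged() {
    local path="$1"
    local hits
    # Count commits on all branches and tags whose changes include the path.
    hits="$(git rev-list --count --branches --tags -- "$path")" || return 1
    if [ "$hits" -gt 0 ]; then
        echo "still present in $hits commit(s): $path" >&2
        return 1
    fi
    echo "purged: $path"
}
```

It deliberately looks only at branches and tags, so leftover refs/original entries from git filter-branch do not cause a false alarm.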
7. Pushing Changes and Post-Cleanup Actions
Once you've successfully cleaned your Git repository, it's time to share these changes with your team. This involves pushing your changes to the remote repository, but remember, this isn't just a regular push - it's a force push, because you've rewritten history.
git push origin --force --all
git push origin --force --tags
This will replace the history on the remote with your newly rewritten history. Be cautious, as this can affect others working on the repository.
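For collaborators who would rather not re-clone, one common approach is to fetch the rewritten history and reset the local branch to it. The sketch below assumes the remote is called origin and that the local branch has the same name on the remote; the helper name resync_with_remote and the backup/ branch prefix are hypothetical. Any local commits that exist only on that branch are left behind on the backup branch and would need to be reapplied by hand.

```shell
#!/usr/bin/env bash
# Hypothetical helper for collaborators: adopt the rewritten history of the
# current branch after someone else has force pushed.
# Usage: resync_with_remote [remote]   (defaults to origin)
resync_with_remote() {
    local remote="${1:-origin}"
    local branch
    # Name of the branch that is currently checked out.
    branch="$(git symbolic-ref --short HEAD)" || return 1
    # Download the rewritten history and drop stale remote-tracking refs.
    git fetch --prune "$remote" || return 1
    # Keep the old local tip around in case something still needs rescuing.
    git branch -f "backup/$branch" HEAD || return 1
    # Point the local branch and working tree at the rewritten history.
    git reset --hard "$remote/$branch"
}
```

Note that the backup branch still references the old, large history, so the collaborator's clone will not shrink until that branch is deleted and their own repository is garbage collected.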
Conclusion
Maintaining a manageable size for your Git repository is crucial for efficiency and performance. A bloated repository can slow down operations, making it harder for your team to work effectively. Regular maintenance and monitoring of the repository size ensure that it remains lean and manageable. This includes periodically removing unnecessary large files and optimizing the repository's history.
Using git filter-branch is a powerful method for reducing repository size, but it should be approached with caution due to its potential to rewrite history. Always back up your repository before making significant changes.
For those interested in diving deeper into git filter-branch and its alternatives, the Git documentation offers comprehensive guides and best practices: