How to reduce Git repo size with git filter-branch


Written by - Deepak Prasad

In this tutorial, we'll learn how to shrink or reduce Git repo size using git filter-branch. We'll cover identifying large files, safely removing them, and cleaning up to maintain a more efficient and manageable repository.

In the world of coding, Git is like a magical diary, keeping track of all the changes we make in our projects. But sometimes, this diary gets too full, making it heavy and slow to use. This happens when our Git repository, where all our project's history is stored, becomes bloated with too many or too large files. It's like having a backpack full of unnecessary things, making it hard to carry around.

To deal with this, we have a special tool in Git called git filter-branch. Think of it as a magic wand that lets us go back in time and remove things we no longer need from our diary. By using it, we can make our Git repository lighter and more efficient, like cleaning out that overloaded backpack.

DISCLAIMER:
Using git filter-branch is a bit like performing a delicate magic trick. If not done carefully, it can lead to losing important parts of our project's history. That's why it's super important to create a backup of our entire project before we start. This way, if something goes wrong, we still have everything saved and can try again.

 

Steps to reduce git repo size using filter-branch

1. Checking Git Repository Size

Assuming you have already cloned your repository, open your command line and navigate to your repository's root directory. Run this simple command to see the size:

$ du -sh .git
723M    .git

This command will display the size of the .git folder, which is where all your repository's history and data are stored. In my case it was around 723MB.
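
As a cross-check on du, Git can also report its own object storage. The sketch below wraps git count-objects -v in a small helper. The function name pack_size_kib is only illustrative, and it assumes the path you pass is a Git repository.

```shell
# Print the disk space used by packed objects (in KiB) for the repo at $1.
# pack_size_kib is a hypothetical helper name, not a Git command.
pack_size_kib() {
    git -C "$1" count-objects -v | awk '/^size-pack:/ { print $2 }'
}

# Example: pack_size_kib .
# For a human-readable summary of all fields, run: git count-objects -vH
```

Note that size-pack covers packed objects only. Loose objects are reported separately in the size field, so the two numbers together are what you would compare before and after the cleanup.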

 

2. Take Backup of your Git Repo

Before diving into cleaning up our Git repository, it's like preparing for a big cleaning day in our house. We need to make sure all our valuable items are safe. In the Git world, this means creating a backup of our entire repository. This is super important because if something unexpected happens during the cleanup, we don't lose our precious work.

To create a backup, we simply make a complete copy of our repository. Open your command line, navigate to the directory containing your repository, and run:

git clone --mirror [your-repo-url] [backup-repo-name].git

Replace [your-repo-url] with your repository's URL and [backup-repo-name] with a name for your backup repository. This command creates a full mirror of your repository, including all branches and tags.
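
Before relying on the backup, it is worth a quick sanity check. The following is a minimal sketch, assuming the mirror was created as described above. check_backup is a hypothetical helper name, and the directory in the example is a placeholder for your own backup.

```shell
# Sanity-check a mirror backup at $1: verify object integrity, then print
# how many refs (branches, tags, and so on) it holds.
# check_backup is a hypothetical helper name.
check_backup() {
    # Send fsck output to stderr so stdout carries only the ref count
    git -C "$1" fsck --full >&2 || return 1
    git -C "$1" for-each-ref --format='%(refname)' | wc -l | tr -d ' '
}

# Example (placeholder directory name):
# check_backup my-backup.git
```

If the printed count is lower than the number of branches and tags you expect, the backup is probably incomplete and should be recreated before you rewrite anything.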

 

3. Identifying Large Files in Your Repository

Execute the following set of commands from the terminal inside your git repository:

git rev-list --objects --all |
git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
sed -n 's/^blob //p' |
sort --numeric-sort --key=2 |
cut -c 1-12,41- |
$(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest

This magical incantation tells Git to list every object reachable from any branch or tag, look up each object's type and size, keep only the blobs (file contents), and sort them by size so that the largest files appear at the end of the list.

After running the script, you'll see a list of files. Each line shows the first 12 characters of the blob's object ID, its size, and its path in your repository. A file that changed many times appears once for each stored version. Now, look through the list and decide which files to remove. Usually, these are very large files that aren't needed in your project's history, like old database backups or large media files. Think carefully about each file: removing it will erase it from your project's history. If you're unsure, it's always safer to check with your team or keep the file.

In my case, I found compiled binaries stored in my Git history that I no longer need, and they seem to take up about 42 MB for each commit:

...
836f2f1e9cc5   12MiB my-repo/CKEY/libnss_ckey.so.2
07803f403369   12MiB my-repo/CKEY/libnss_ckey.so.2
4895a2ac88ac   12MiB my-repo/CKEY/libnss_ckey.so.2
50f5f8313da7   12MiB my-repo/CKEY/libnss_ckey.so.2
221021f8f8cf   12MiB my-repo/CKEY/libnss_ckey.so.2
edb181368bf9   12MiB my-repo/CKEY/libnss_ckey.so.2
62bdefc78547   12MiB my-repo/CKEY/libnss_ckey.so.2
...
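
When the same path shows up many times, as it does here, it can help to total the sizes per path. The sketch below is one possible variation on the pipeline above, not part of the original recipe. largest_paths is a hypothetical helper name, and the totals are uncompressed blob sizes, so they overstate what the files cost on disk.

```shell
# Print the $2 paths with the largest total blob size across all history
# in the repository at $1, as "<bytes><TAB><path>".
# largest_paths is a hypothetical helper name.
largest_paths() {
    git -C "$1" rev-list --objects --all |
    git -C "$1" cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk '$1 == "blob" {
             size = $2
             sub(/^blob [0-9]+ /, "")   # what remains in $0 is the path
             total[$0] += size
         }
         END { for (p in total) printf "%.0f\t%s\n", total[p], p }' |
    sort -rn |
    head -n "$2"
}

# Example: largest_paths . 10
```

Unlike the listing above, this one is sorted in descending order, so the heaviest paths come first.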

 

4. Using git filter-branch to Remove Unwanted Data

Now, to start cleaning, we use the git filter-branch command. This powerful tool allows us to rewrite history by removing files we don't need anymore.

git filter-branch --force --index-filter \
'git rm --cached --ignore-unmatch PATH-TO-YOUR-FILE' \
--prune-empty --tag-name-filter cat -- --all

Replace PATH-TO-YOUR-FILE with the path to the file you identified as large and unnecessary. The path is relative to the root of the repository. In my case it is my-repo/CKEY/libnss_ckey.so.2, so let's update the command:

git filter-branch --force --index-filter \
'git rm --cached --ignore-unmatch my-repo/CKEY/libnss_ckey.so.2' \
--prune-empty --tag-name-filter cat -- --all

After executing the above command, you should see the file being removed from each commit that contains it:

Rewrite f038a7a47a935fca4f3fa1285a98ff525700d4fc (1704/3820) (52 seconds passed, remaining 64 predicted)    rm 'my-repo/CKEY/libnss_ckey.so.2'
Rewrite bededf0e5dabc309906a4ddd39e7eeb00e71d457 (1704/3820) (52 seconds passed, remaining 64 predicted)    rm 'my-repo/CKEY/libnss_ckey.so.2'
Rewrite 2d77879248794ab3e99272d502fd3a404a8d8e71 (1704/3820) (52 seconds passed, remaining 64 predicted)    rm 'my-repo/CKEY/libnss_ckey.so.2'
Rewrite f7071f137227b55490920174c196e319779a15fc (1704/3820) (52 seconds passed, remaining 64 predicted)    rm 'my-repo/CKEY/libnss_ckey.so.2'
Rewrite eb21b67f5ca086685169905bde4e7d91a2670597 (1704/3820) (52 seconds passed, remaining 64 predicted)    rm 'my-repo/CKEY/libnss_ckey.so.2'
Rewrite 26cec87fe4c7efe7a1a0d17a7ae9a11b37e7df79 (1704/3820) (52 seconds passed, remaining 64 predicted)    rm 'my-repo/CKEY/libnss_ckey.so.2'
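
If you have several files or whole directories to remove, running filter-branch once per path means rewriting the full history each time. The sketch below passes all paths to a single git rm call instead. purge_paths is a hypothetical wrapper, not a Git command. It assumes the paths contain no single quotes and that the working tree is clean, and the paths in the example are placeholders.

```shell
# Remove several paths from every commit in one filter-branch pass.
# purge_paths is a hypothetical wrapper: $1 is the repository, the
# remaining arguments are paths relative to the repository root.
purge_paths() {
    repo=$1
    shift
    quoted=""
    for p in "$@"; do
        quoted="$quoted '$p'"     # assumes no single quotes in the path
    done
    (
        cd "$repo" || exit 1
        # FILTER_BRANCH_SQUELCH_WARNING skips the start-up warning and
        # delay that newer Git versions add to filter-branch.
        FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch --force \
            --index-filter "git rm -r --cached --ignore-unmatch -- $quoted" \
            --prune-empty --tag-name-filter cat -- --all
    )
}

# Example (placeholder paths):
# purge_paths . build/output.bin vendor/cache
```

The -r option is what lets git rm accept directories as well as single files.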

 

Impact on Repository History

Using git filter-branch is like going back in time and erasing something from your past. It completely removes the specified files from your repository's history. This means that past commits that included these files will be rewritten as if the files were never there. It's a powerful tool, but with great power comes great responsibility:

  1. Rewriting History: This alters the history of your repository. If you're working in a team, this change can cause significant issues for everyone else's local copies. They may need to re-clone the repository or handle complex merges.
  2. Data Loss: Once a file is removed from history, it's gone for good (unless you have a backup). Make sure the files you're removing are not needed.
  3. Force Push Required: After cleaning your repository, you'll need to force push these changes to your remote repository. This can disrupt other collaborators working on the project.
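
For collaborators who would rather not re-clone, one common way to catch up with a rewritten branch is to fetch and hard-reset. The following is a minimal sketch, assuming the remote is named origin. resync_branch is a hypothetical helper name, and the reset throws away any local commits and uncommitted changes on that branch.

```shell
# Bring the clone at $1 in line with the rewritten remote branch $2.
# WARNING: reset --hard discards local commits and uncommitted changes.
# resync_branch is a hypothetical helper; "origin" is an assumed remote name.
resync_branch() {
    repo=$1
    branch=$2
    git -C "$repo" fetch origin &&
    git -C "$repo" checkout "$branch" &&
    git -C "$repo" reset --hard "origin/$branch"
}

# Example: resync_branch . main
```

Anyone with unpushed work should save it first, for example as patches, because merging the old history back in would reintroduce the files you just removed.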

 

5. Cleaning and Finalizing the Changes

After git filter-branch finishes, Git keeps a backup of the original branches and tags under refs/original/, so the old data is still referenced and still takes up space. To delete these backup refs:

git for-each-ref --format='delete %(refname)' refs/original | git update-ref --stdin

This command lists every ref under refs/original and deletes it, so nothing points to the pre-rewrite history any longer and it cannot be confused with the rewritten history.

The reflog records when the tips of branches and other references were updated in the repository. To clear these records and further reduce repo size:

git reflog expire --expire=now --all

This step ensures that entries in the reflog that might refer to the old (pre-rewrite) history are removed, further cleaning the repository.

Finally, perform garbage collection with:

git gc --prune=now --aggressive

This command is the final sweep-up, removing objects that are no longer in use and optimizing the repository. The --aggressive flag makes git gc work harder to compress data, and --prune=now removes unreachable loose objects immediately instead of waiting for the default two-week grace period.

Sample Output:

Enumerating objects: 28807, done.
Counting objects: 100% (28807/28807), done.
Delta compression using up to 4 threads
Compressing objects: 100% (27748/27748), done.
Writing objects: 100% (28807/28807), done.
Total 28807 (delta 16893), reused 10665 (delta 0), pack-reused 0
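
To confirm that the cleanup really dropped a file, you can ask Git whether one of the object IDs from step 3 still exists. This is a small sketch built on git cat-file -e, which exits with a non-zero status when the object is missing. blob_still_present is a hypothetical helper name.

```shell
# Succeeds (exit status 0) if object $2 still exists in the repo at $1.
# blob_still_present is a hypothetical helper name; $2 is an object ID
# such as one printed by the large-file listing in step 3.
blob_still_present() {
    git -C "$1" cat-file -e "$2" 2>/dev/null
}

# Example, using an ID from the step 3 listing:
# blob_still_present . 836f2f1e9cc5 && echo "still stored" || echo "gone"
```

If the object is still reported as stored after garbage collection, something is still referencing the old history, for example a leftover ref under refs/original or a stash.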

 

Why This Comprehensive Cleanup Is Necessary

  • Remove Redundancies: Ensures all layers of historical data that are no longer necessary are removed.
  • Optimize Storage: Maximizes the efficiency of the repository, making future operations faster.
  • Reclaim Space: The main goal is to see a reduction in the physical size of the .git folder on your drive.

 

6. Verifying the New Repository Size

After completing these steps, check the size of your .git folder again:

$ du -sh .git
484M    .git

You should see a significant reduction in size compared to the initial measurement. This indicates a successful cleanup, resulting in a leaner, more efficient repository. As you can see, in my case the repository size dropped from 723MB to 484MB.
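
To put a number on the saving, divide the difference by the original size. The helper below is just a sketch of that arithmetic. reduction_percent is a hypothetical name, and both arguments must be in the same unit.

```shell
# Print the percentage saved going from size $1 to size $2 (same unit).
# reduction_percent is a hypothetical helper name.
reduction_percent() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f%%\n", (before - after) / before * 100 }'
}

reduction_percent 723 484   # the two sizes measured in this tutorial
```

For the sizes measured here this prints 33.1%, so roughly a third of the .git folder was reclaimed.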

 

7. Pushing Changes and Post-Cleanup Actions

Once you've successfully cleaned your Git repository, it's time to share these changes with your team. This involves pushing your changes to the remote repository, but remember, this isn't just a regular push - it's a force push, because you've rewritten history.

git push origin --force --all
git push origin --force --tags

This will replace the history on the remote with your newly rewritten history. Be cautious, as this can affect others working on the repository.
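
After the force push, you may want to confirm that the remote branches now match your rewritten local branches. The following is a rough sketch that compares git ls-remote with the local branch list. remote_matches_local is a hypothetical helper name, and it assumes the remote should contain exactly the same set of branches as your local repository.

```shell
# Succeeds if the branches on remote $2 (default: origin) are exactly the
# local branches of the repo at $1, pointing at the same commits.
# remote_matches_local is a hypothetical helper name.
remote_matches_local() {
    repo=$1
    remote=${2:-origin}
    local_refs=$(git -C "$repo" for-each-ref \
        --format='%(objectname) %(refname)' refs/heads | sort)
    # ls-remote separates the ID and ref name with a tab; normalize it
    remote_refs=$(git -C "$repo" ls-remote --heads "$remote" |
        tr '\t' ' ' | sort)
    [ "$local_refs" = "$remote_refs" ]
}

# Example: remote_matches_local . origin && echo "remote is up to date"
```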

 

Conclusion

Maintaining a manageable size for your Git repository is crucial for efficiency and performance. A bloated repository can slow down operations, making it harder for your team to work effectively. Regular maintenance and monitoring of the repository size ensure that it remains lean and manageable. This includes periodically removing unnecessary large files and optimizing the repository's history.

Using git filter-branch is a powerful method for reducing repository size, but it should be approached with caution due to its potential to rewrite history. Always back up your repository before making significant changes.

For those interested in diving deeper into git filter-branch and its alternatives, the Git documentation offers comprehensive guides and best practices.

 

Deepak Prasad

He is the founder of GoLinuxCloud and brings over a decade of expertise in Linux, Python, Go, Laravel, DevOps, Kubernetes, Git, Shell scripting, OpenShift, AWS, Networking, and Security. With extensive experience, he excels in various domains, from development to DevOps, Networking, and Security, ensuring robust and efficient solutions for diverse projects. You can reach out to him on his LinkedIn profile or join on Facebook page.
