Guts of Git - A deep dive into the internals of the Git version control system
Introduction
I am not going to reiterate what Git is and why it is so popular. There are plenty of articles and videos out there that do a great job of explaining that. I am going to assume that you already know what Git is and how to use it. If you don’t, I recommend going through the official Git documentation and the Pro Git book before reading this article.
The target audience for this article is curious readers who want to wade into the internals of Git and learn how it works under the hood. For instance, we will be creating Git objects with Python code and creating commits with Git plumbing commands that are not meant to be used by end users. If you are not comfortable with Git in the first place, this article is not for you.
How does Git store data?
When you run git init in a directory, Git creates a .git directory in that directory. This .git directory is where Git stores all the data related to the repository. Let’s take a look at the contents of the .git directory.
.git/
├── branches
├── config
├── description
├── HEAD
├── hooks
│ ├── applypatch-msg.sample
│ ├── commit-msg.sample
│ ├── fsmonitor-watchman.sample
│ ├── post-update.sample
│ ├── pre-applypatch.sample
│ ├── pre-commit.sample
│ ├── pre-merge-commit.sample
│ ├── prepare-commit-msg.sample
│ ├── pre-push.sample
│ ├── pre-rebase.sample
│ ├── pre-receive.sample
│ ├── push-to-checkout.sample
│ └── update.sample
├── info
│ └── exclude
├── objects
│ ├── info
│ └── pack
└── refs
├── heads
└── tags
The .git directory contains a bunch of files and directories. Let’s take a look at each of them.
File/Dir | Contents |
---|---|
branches | deprecated directory that once held shorthand URLs for git fetch, git pull, and git push. It does not hold the repository’s branches: each branch lives under refs/heads as a file named after the branch, whose contents are the commit hash of the latest commit in that branch. |
config | file that contains the configuration of the repository. This can be used to override the global configuration. This file also contains the remote repository information. |
description | file that contains the description of the repository. This is used by GitWeb to display the description of the repository. |
HEAD | file that contains the reference to the current branch. We will talk more about what a reference is later in this article. |
hooks | directory that contains the hooks that can be used to trigger custom actions at various stages of the Git workflow. We will not go into the details of hooks in this article. The .sample files are the sample hooks that can be used as a starting point for writing custom hooks. |
info | directory that contains the repository-local exclude file. This file lists patterns for files that Git should ignore, much like .gitignore, but it is not versioned or shared with other clones. Other auxiliary files, such as the list of grafts, can also live here; the list of alternates is kept separately under objects/info. |
objects | directory that contains the actual data of the repository. This is where Git stores all the commits, trees, blobs, and tags. |
refs | directory that contains the references to the commits. We will talk more about what a reference is later in this article. |
How does Git store commits?
git objects
We will use an unconventional approach to committing a file into Git to understand how Git stores commits.
First, let’s create a new file with some content in it.
echo "Hello World" > hello.txt
Now, we will use the git hash-object command to create a blob object from the file.
$ git hash-object -w hello.txt
557db03de997c86a4a028e1ebd3a1ceb225be238
The -w flag tells Git to write the object to the object database. The hash-object command returns the SHA-1 hash of the object that was created. This hash decides where the object is stored: the first two hex characters become a directory name under .git/objects, and the remaining 38 characters become the file name. Let’s take a look at the contents of the object database.
.git/
├── branches
├── config
├── description
├── HEAD
├── hooks
├── info
│ └── exclude
├── objects
│ ├── 55
│ │ └── 7db03de997c86a4a028e1ebd3a1ceb225be238
│ ├── info
│ └── pack
└── refs
├── heads
└── tags
Git stores each object in its own format: a header made of the object type, a space, the size of the contents in bytes, and a NUL byte, followed by the contents themselves. The header and the contents are then compressed together with zlib. The object types are as follows.
Type | Description |
---|---|
blob | A blob object represents the contents of a file. |
tree | A tree object represents the contents of a directory. |
commit | A commit object represents a commit. |
tag | A tag object represents a tag. |
A Python routine to read the contents of the object is shown below.
import zlib

def read_object(sha):
    # Loose objects live at .git/objects/<first 2 hex chars>/<remaining 38>
    with open('.git/objects/' + sha[:2] + '/' + sha[2:], 'rb') as f:
        raw = zlib.decompress(f.read())
    return raw

print(read_object('557db03de997c86a4a028e1ebd3a1ceb225be238')) # use .decode() to print the contents of the object
The output of the above program is shown below.
b'blob 12\x00Hello World\n'
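The raw bytes above can be split back into the pieces described earlier. The following is a minimal sketch, not something Git ships: parse_object is a hypothetical helper name, and it assumes the input has already been decompressed (for example by read_object).

```python
def parse_object(raw: bytes):
    """Split a decompressed Git object into (type, size, body)."""
    # Everything before the first NUL byte is the header "<type> <size>"
    header, _, body = raw.partition(b"\x00")
    obj_type, size_text = header.split(b" ")
    size = int(size_text)
    # The size in the header counts the bytes of the body only
    if size != len(body):
        raise ValueError(f"header says {size} bytes, body has {len(body)}")
    return obj_type.decode(), size, body


print(parse_object(b"blob 12\x00Hello World\n"))
# prints ('blob', 12, b'Hello World\n')
```

The size check is a cheap sanity test: 12 is the length of "Hello World" plus the trailing newline that echo adds.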
We have just created a loose object that is not part of anything tracked by Git. But we can use the git show or git cat-file -p command to view the contents of the object.
$ git cat-file -p 557db03d
Hello World
We can even use a Python routine to create a loose object.
import hashlib
import os
import zlib

def write_object(data, obj_type):
    # Header is "<type> <size>\0", followed by the raw contents
    header = obj_type + ' ' + str(len(data)) + '\x00'
    store = header.encode() + data
    # The hash is computed over the uncompressed header + contents
    sha = hashlib.sha1(store).hexdigest()
    os.makedirs('.git/objects/' + sha[:2], exist_ok=True)
    with open('.git/objects/' + sha[:2] + '/' + sha[2:], 'wb') as out:
        out.write(zlib.compress(store))
    return sha

print(write_object(b'Hello New World\n', 'blob'))
The program will create a loose object in the object database and return the SHA-1 hash of the object.
d9786ef99a397ad94795405041cb9590712053f6
.git/
├── branches
├── config
├── description
├── HEAD
├── hooks
├── info
│ └── exclude
├── objects
│ ├── 55
│ │ └── 7db03de997c86a4a028e1ebd3a1ceb225be238
│ ├── d9
│ │ └── 786ef99a397ad94795405041cb9590712053f6
│ ├── info
│ └── pack
└── refs
├── heads
└── tags
The contents of this new object can be viewed using the git cat-file -p command.
$ git cat-file -p d9786ef9
Hello New World
Though there is a new object, there is no file in our working directory with these contents.
$ ls
hello.txt
As you may have noticed, there is no information about the file name or the directory structure in the object. Git only stores the contents of the file in a blob. The file name and the directory structure are stored in a tree object. Let’s create a tree object and see how it is stored in the object database.
The tree object
Git stores a group of files and directories in a tree object. A tree object is essentially a directory that contains other trees and blobs. Let’s create a tree object that contains the two objects we created earlier.
For this, we first need to stage the two files in the index. We can do this using the git update-index command. Note that --cacheinfo expects the full 40-character hash rather than an abbreviation.
$ git update-index --add --cacheinfo 100644 557db03de997c86a4a028e1ebd3a1ceb225be238 hello.txt
$ git update-index --add --cacheinfo 100644 d9786ef99a397ad94795405041cb9590712053f6 hello2.txt
In this case, you’re specifying a mode of 100644, which means it’s a normal file. Other options are 100755, which means it’s an executable file; and 120000, which specifies a symbolic link. The mode is taken from normal UNIX modes but is much less flexible.
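Because these modes follow the UNIX layout, Python’s stat module can decode them. The sketch below is only an illustration: describe_mode is a hypothetical helper, and the 40000 case (the mode Git uses for subdirectories inside a tree) goes slightly beyond the three modes listed above.

```python
import stat


def describe_mode(mode_text: str) -> str:
    """Describe a Git file mode given as an octal string such as '100644'."""
    mode = int(mode_text, 8)
    if stat.S_ISLNK(mode):
        return "symbolic link"
    if stat.S_ISDIR(mode):
        return "directory (subtree)"
    if stat.S_ISREG(mode):
        # Git only distinguishes executable from non-executable files
        return "executable file" if mode & stat.S_IXUSR else "regular file"
    return "other"


for text in ("100644", "100755", "120000", "40000"):
    print(text, "->", describe_mode(text))
```

The upper octal digits carry the file type and the lower three carry the permission bits, which is why only a handful of combinations ever show up in a repository.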
Now you can see that there is a new index file in the .git directory.
The index file has a Git-internal binary format that is described in the Git documentation. We can use the git ls-files --stage command to view the contents of the index file.
$ git ls-files --stage
100644 557db03de997c86a4a028e1ebd3a1ceb225be238 0 hello.txt
100644 d9786ef99a397ad94795405041cb9590712053f6 0 hello2.txt
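To get a feel for the binary layout of the index, we can read just its 12-byte header: a 4-byte signature (DIRC), a 4-byte version number, and a 4-byte entry count, all big-endian. This is a rough sketch that stops at the header; parse_index_header is a hypothetical name, and the sample bytes are built by hand so the snippet runs outside a repository.

```python
import struct


def parse_index_header(data: bytes):
    """Return (version, entry_count) from the first 12 bytes of an index file."""
    signature, version, entry_count = struct.unpack(">4sII", data[:12])
    if signature != b"DIRC":
        raise ValueError("not a Git index file")
    return version, entry_count


# Hand-built header standing in for a real file: version 2, two entries
sample = struct.pack(">4sII", b"DIRC", 2, 2)
print(parse_index_header(sample))
# prints (2, 2)
```

Inside a repository, the same function can be fed the real file with open('.git/index', 'rb').read(); after the two update-index calls the entry count should be 2.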
Issuing a git status will show that there are two files that are available to commit, even though there is no second file in the working directory.
$ git status
On branch main
No commits yet
Changes to be committed:
(use "git rm --cached <file>..." to unstage)
new file: hello.txt
new file: hello2.txt
Changes not staged for commit:
(use "git add/rm <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in working directory)
deleted: hello2.txt
Since the hello2.txt file is not present on disk, Git sees it as a delete operation. We can use the git restore hello2.txt command to create the hello2.txt file in the working directory with contents from the index and the object we created earlier. But this step is not necessary for us to create the tree object.
When we issue the git write-tree command, Git will create a tree object that contains the two files we added to the index. The tree object will be stored in the object database and the SHA-1 hash of the tree object will be returned.
$ git write-tree
60fdbb80045aca16edfa035e7a4b7b2ce5ebe5aa
We can use the git cat-file command to view the contents of the tree object.
$ git cat-file -t 60fdbb80
tree
$ git cat-file -p 60fdbb80
100644 blob 557db03de997c86a4a028e1ebd3a1ceb225be238 hello.txt
100644 blob d9786ef99a397ad94795405041cb9590712053f6 hello2.txt
For each entry, the tree object contains the file mode, the type of the object (blob or tree), the SHA-1 hash of the object, and the file name.
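On disk, a tree is more compact than the cat-file -p output suggests: each entry is the mode, a space, the name, a NUL byte, and then the 20 raw bytes of the hash. The sketch below builds and parses that layout under a simplifying assumption: it only handles flat trees of files, since entries for subdirectories follow a slightly different sort rule. build_tree and parse_tree are hypothetical helper names.

```python
import hashlib


def build_tree(entries):
    """entries: iterable of (mode, name, hex_sha). Returns the raw tree object."""
    body = b""
    # Entries are sorted by name, compared as bytes
    for mode, name, sha in sorted(entries, key=lambda e: e[1].encode()):
        body += f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(sha)
    return b"tree " + str(len(body)).encode() + b"\x00" + body


def parse_tree(raw: bytes):
    """Inverse of build_tree: return a list of (mode, name, hex_sha)."""
    _, _, body = raw.partition(b"\x00")
    entries = []
    while body:
        meta, _, rest = body.partition(b"\x00")
        mode, _, name = meta.partition(b" ")
        entries.append((mode.decode(), name.decode(), rest[:20].hex()))
        body = rest[20:]
    return entries


raw = build_tree([
    ("100644", "hello2.txt", "d9786ef99a397ad94795405041cb9590712053f6"),
    ("100644", "hello.txt", "557db03de997c86a4a028e1ebd3a1ceb225be238"),
])
print(parse_tree(raw))
print(hashlib.sha1(raw).hexdigest())
```

If the layout is reproduced faithfully, the printed hash should match the one returned by git write-tree. Note that the object type shown by cat-file -p is not stored in the entry; Git derives it from the mode.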
The git read-tree command can be used to read the contents of a tree object into the index. This is useful when you want to check out a commit: git read-tree loads the tree into the index, and git checkout-index then creates the files in the working directory.
$ git read-tree 60fdbb80
$ git checkout-index -a
commit objects
A commit object is a Git object that contains the commit message, the author, the committer and the tree object that represents the contents of the commit. Let’s create a commit object that contains the tree object we created earlier.
$ git commit-tree 60fdbb80 -m "Initial commit"
efb4ebf62f7ec3e9e078f232ef0f00a175140046
And the commit object contents can be viewed using the git cat-file command.
$ git cat-file -p efb4ebf6
tree 60fdbb80045aca16edfa035e7a4b7b2ce5ebe5aa
author vpillai <vysakhpillai@embeddedinn.xyz> 1686972765 -0700
committer vpillai <vysakhpillai@embeddedinn.xyz> 1686972765 -0700
Initial commit
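The commit body shown above is plain text, so it can be assembled by hand in the same way as a blob. The following is a sketch only: build_commit is a hypothetical helper, and it assumes the author and committer are the same identity with the same timestamp, as in the output above.

```python
import hashlib


def build_commit(tree, identity, timestamp, tz, message, parents=()):
    """Assemble the raw bytes of a commit object (header + body)."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines.append(f"author {identity} {timestamp} {tz}")
    lines.append(f"committer {identity} {timestamp} {tz}")
    # A blank line separates the headers from the message
    body = ("\n".join(lines) + "\n\n" + message + "\n").encode()
    return b"commit " + str(len(body)).encode() + b"\x00" + body


raw = build_commit(
    "60fdbb80045aca16edfa035e7a4b7b2ce5ebe5aa",
    "vpillai <vysakhpillai@embeddedinn.xyz>",
    1686972765,
    "-0700",
    "Initial commit",
)
print(raw.partition(b"\x00")[2].decode())
print(hashlib.sha1(raw).hexdigest())
```

Because the timestamp is part of the hashed bytes, running git commit-tree twice with identical input a second apart yields two different commit hashes.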
We can also use the git log command to view the commit history.
$ git log --stat efb4ebf6
Author: vpillai <vysakhpillai@embeddedinn.xyz>
Date: Fri Jun 16 20:32:45 2023 -0700
Initial commit
hello.txt | 1 +
hello2.txt | 1 +
2 files changed, 2 insertions(+)
But git status still reports that there are no commits.
$ git status
On branch main
No commits yet
Changes to be committed:
(use "git rm --cached <file>..." to unstage)
new file: hello.txt
new file: hello2.txt
This is because HEAD still points to the main branch, and main does not yet have a corresponding ref entry. We can use the git update-ref command to create refs/heads/main and point it at the commit object we created earlier.
$ git update-ref refs/heads/main efb4ebf6
This creates a new file in the .git/refs/heads directory that contains the SHA-1 hash of the commit object.
$ cat .git/refs/heads/main
efb4ebf62f7ec3e9e078f232ef0f00a175140046
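An entry is also created in the .git/logs/refs/heads directory. It records the old and new SHA-1 hashes of the ref, along with who made the change and when. A reason message would follow if one had been given (for example with git update-ref -m).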
$ cat .git/logs/refs/heads/main
0000000000000000000000000000000000000000 efb4ebf62f7ec3e9e078f232ef0f00a175140046 vpillai <vysakhpillai@embeddedinn.xyz> 1686973167 -0700
This entry says that main moved from 00 to efb4ebf6. The all-zero value is not the hash of a real object; it is the null object ID, which indicates that the ref did not exist before this update. The git log command will now show the commit we created earlier.
$ git log
commit efb4ebf62f7ec3e9e078f232ef0f00a175140046 (HEAD -> main)
Author: vpillai <vysakhpillai@embeddedinn.xyz>
Date: Fri Jun 16 20:32:45 2023 -0700
Initial commit
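Going back to the reflog entry shown earlier, each line has a fixed layout: the old hash, the new hash, the identity, a timestamp with a time zone, and an optional message after a tab. A small sketch of a parser follows; parse_reflog_line and the returned dictionary keys are names made up for this illustration.

```python
import re

NULL_OID = "0" * 40
REFLOG_RE = re.compile(
    r"^([0-9a-f]{40}) ([0-9a-f]{40}) (.*) <(.*)> (\d+) ([+-]\d{4})$"
)


def parse_reflog_line(line: str) -> dict:
    """Parse one line of a file under .git/logs into its fields."""
    # The message, when present, is separated from the rest by a tab
    head, _, message = line.rstrip("\n").partition("\t")
    match = REFLOG_RE.match(head)
    if match is None:
        raise ValueError("unrecognised reflog line")
    old, new, name, email, timestamp, tz = match.groups()
    return {
        "old": old,
        "new": new,
        "name": name,
        "email": email,
        "timestamp": int(timestamp),
        "tz": tz,
        "message": message,
        # An all-zero old hash means the ref was created by this update
        "created": old == NULL_OID,
    }


entry = parse_reflog_line(
    "0000000000000000000000000000000000000000 "
    "efb4ebf62f7ec3e9e078f232ef0f00a175140046 "
    "vpillai <vysakhpillai@embeddedinn.xyz> 1686973167 -0700"
)
print(entry["new"][:8], entry["created"])
# prints efb4ebf6 True
```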
refs
The refs directory contains the references to the commit objects. The HEAD reference is a symbolic reference that points to the current branch. The HEAD reference is stored in the .git/HEAD file.
$ cat .git/HEAD
ref: refs/heads/main
The refs/heads directory contains the references to the branches. The refs/tags directory contains the references to the tags. The refs/remotes directory contains the references to the remote branches.
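Resolving a reference is therefore mostly a matter of following files. The sketch below mimics that lookup under a loud assumption: it only understands loose ref files and ignores packed refs, which real Git also consults. resolve_ref is a hypothetical helper, and the demo runs against a throwaway directory rather than a real repository.

```python
import os
import tempfile


def resolve_ref(git_dir, name="HEAD", max_depth=5):
    """Follow 'ref: ...' indirections until a hash is found.

    Returns None when the target file does not exist (for example a
    branch with no commits yet).
    """
    for _ in range(max_depth):
        path = os.path.join(git_dir, *name.split("/"))
        if not os.path.isfile(path):
            return None
        with open(path) as f:
            value = f.read().strip()
        if not value.startswith("ref: "):
            return value  # a plain hash: resolution is done
        name = value[len("ref: "):]  # symbolic ref: follow it
    raise ValueError("too many levels of symbolic refs")


with tempfile.TemporaryDirectory() as git_dir:
    os.makedirs(os.path.join(git_dir, "refs", "heads"))
    with open(os.path.join(git_dir, "HEAD"), "w") as f:
        f.write("ref: refs/heads/main\n")
    print(resolve_ref(git_dir))  # None: main has no ref file yet
    with open(os.path.join(git_dir, "refs", "heads", "main"), "w") as f:
        f.write("efb4ebf62f7ec3e9e078f232ef0f00a175140046\n")
    print(resolve_ref(git_dir))
```

The first lookup returning None corresponds to the "No commits yet" state seen earlier: HEAD names a branch, but the branch has no ref file.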
branches
A branch is a reference to a commit object. When a new branch is created, a new file is created in the .git/refs/heads directory that contains the SHA-1 hash of the commit object. Let’s create a new branch called dev and check out the branch using low-level Git commands.
$ git update-ref refs/heads/dev efb4ebf6
A new branch is now created
$ git branch
dev
* main
This creates a new file in the .git/refs/heads directory that contains the SHA-1 hash of the commit object, and a matching reflog file in the .git/logs/refs/heads directory.
.git/
├── branches
├── config
├── description
├── HEAD
├── hooks
├── index
├── info
│ └── exclude
├── logs
│ ├── HEAD
│ └── refs
│ └── heads
│ ├── dev
│ └── main
├── objects
│ ├── 55
│ │ └── 7db03de997c86a4a028e1ebd3a1ceb225be238
│ ├── 60
│ │ └── fdbb80045aca16edfa035e7a4b7b2ce5ebe5aa
│ ├── d9
│ │ └── 786ef99a397ad94795405041cb9590712053f6
│ ├── ef
│ │ └── b4ebf62f7ec3e9e078f232ef0f00a175140046
│ ├── info
│ └── pack
└── refs
├── heads
│ ├── dev
│ └── main
└── tags
At this stage, main and dev are pointing to the same commit object. Since both branches hold the same content, checking out the new branch simply means updating the HEAD reference to point to the new branch. Instead of using git symbolic-ref HEAD refs/heads/dev, we can simply update the contents of the .git/HEAD file to point Git to the new branch.
$ echo "ref: refs/heads/dev" > .git/HEAD
$ git branch
* dev
main
Now, we can create a new commit on the dev branch, still using the low-level Git commands.
#update the file
$ echo "Hello World Uno" > hello.txt
# create a new object for the file
$ git hash-object -w hello.txt
2a323159bea5a5bf98c0ccaef350cd6141f0f3df
$ git update-index --add --cacheinfo 100644 2a323159bea5a5bf98c0ccaef350cd6141f0f3df hello.txt
$ git status
On branch dev
Changes to be committed:
(use "git restore --staged <file>..." to unstage)
modified: hello.txt
# write a new tree from the index
$ git write-tree
e0aefbba82dd2e7653ae6d46f00bbed584fac52f
# commit tree with parent
$ git commit-tree e0aefbba -p efb4ebf6 -m "Commit to dev"
152b7866c2e126eec65bafec327d9d760bef99c7
# make dev point to the new commit
$ git update-ref refs/heads/dev 152b7866
# check status
$ git status
On branch dev
nothing to commit, working tree clean
# check log
$ cat .git/logs/refs/heads/dev
0000000000000000000000000000000000000000 efb4ebf62f7ec3e9e078f232ef0f00a175140046 vpillai <vysakhpillai@embeddedinn.xyz> 1686974500 -0700
efb4ebf62f7ec3e9e078f232ef0f00a175140046 152b7866c2e126eec65bafec327d9d760bef99c7 vpillai <vysakhpillai@embeddedinn.xyz> 1686974696 -0700
Diff
Let’s look at how Git handles incremental changes. For this, we will start with a clean repository.
$ git init
$ echo "Hello World" > hello.txt
$ git add hello.txt
$ git commit -m "Initial commit"
.git/
├── branches
├── COMMIT_EDITMSG
├── config
├── description
├── HEAD
├── hooks
├── index
├── info
│ └── exclude
├── logs
│ ├── HEAD
│ └── refs
│ └── heads
│ └── main
├── objects
│ ├── 3c
│ │ └── 80f66564c9ee69d0987e48a545becc3025deb1
│ ├── 55
│ │ └── 7db03de997c86a4a028e1ebd3a1ceb225be238
│ ├── 97
│ │ └── b49d4c943e3715fe30f141cc6f27a8548cee0e
│ ├── info
│ └── pack
└── refs
├── heads
│ └── main
└── tags
# check the tree contents
$ git cat-file -p 97b49d4c
100644 blob 557db03de997c86a4a028e1ebd3a1ceb225be238 hello.txt
Now, let’s update the file and commit the changes.
$ echo "Hello World Uno" > hello.txt
$ git add hello.txt
$ git commit -m "Update hello.txt"
This created 3 new objects:
Type | Object hash |
---|---|
blob | 2a323159 |
tree | 106ed651 |
commit | 06a73f25 |
The blob contains the entire contents of the updated file (not just the diff). The tree contains the updated blob and the commit contains the updated tree.
$ git cat-file -p 557db03d
Hello World
$ git cat-file -p 2a323159
Hello World Uno
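The same point can be checked without Git at all: each version of the file becomes an independent blob whose hash covers the full contents. This is a small sketch, where blob_store is a hypothetical helper that reproduces the header + contents layout used earlier.

```python
import hashlib
import zlib


def blob_store(content: bytes) -> bytes:
    """Return the uncompressed bytes Git hashes for a blob."""
    return b"blob " + str(len(content)).encode() + b"\x00" + content


old = blob_store(b"Hello World\n")
new = blob_store(b"Hello World Uno\n")

for store in (old, new):
    # Each version is hashed and compressed on its own, as a full snapshot
    print(hashlib.sha1(store).hexdigest(), len(zlib.compress(store)), "bytes compressed")
```

The two printed hashes should match the blob names shown above. Nothing in the second blob refers to the first; any delta compression happens later, when Git packs loose objects into packfiles.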
git log shows the commit history including the two commit objects. They are linked by the parent field in the commit object.
$ git cat-file -p 06a73f25
tree 106ed65198ebfbde9f4e7e8bd6ceb2dd2e5268ce
parent 3c80f66564c9ee69d0987e48a545becc3025deb1
author vpillai <vysakhpillai@embeddedinn.xyz> 1686976022 -0700
committer vpillai <vysakhpillai@embeddedinn.xyz> 1686976022 -0700
Update hello.txt
This commit history can be visualized using standard Git commands.
$ git log --graph --oneline --decorate --all
* 06a73f2 (HEAD -> main) Update hello.txt
* 3c80f66 Initial commit
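Walking the history is just a matter of following parent fields until a commit has none. The sketch below shows the idea with a hypothetical in-memory store standing in for the object database; in a real repository, a function such as read_object from earlier could be passed in instead. The store keys are only labels here, so the abbreviated hash is used for the newer commit, and the identity lines are shortened placeholders.

```python
def parse_commit(raw: bytes):
    """Return (parents, message) from a raw commit object."""
    _, _, body = raw.partition(b"\x00")
    headers, _, message = body.decode().partition("\n\n")
    parents = [
        line.split(" ", 1)[1]
        for line in headers.split("\n")
        if line.startswith("parent ")
    ]
    return parents, message.strip()


def walk_first_parent(read_raw, sha):
    """Yield (sha, message) from sha back to the root, first parent only."""
    while sha:
        parents, message = parse_commit(read_raw(sha))
        yield sha, message
        sha = parents[0] if parents else None


def _fake_commit(text: str) -> bytes:
    # Helper for the demo only: wrap text in a commit header
    body = text.encode()
    return b"commit " + str(len(body)).encode() + b"\x00" + body


ROOT = "3c80f66564c9ee69d0987e48a545becc3025deb1"
FAKE_STORE = {
    ROOT: _fake_commit(
        "tree 97b49d4c943e3715fe30f141cc6f27a8548cee0e\n"
        "author a <a@example.com> 0 +0000\n"
        "committer a <a@example.com> 0 +0000\n\nInitial commit\n"
    ),
    "06a73f25": _fake_commit(
        "tree 106ed65198ebfbde9f4e7e8bd6ceb2dd2e5268ce\n"
        f"parent {ROOT}\n"
        "author a <a@example.com> 0 +0000\n"
        "committer a <a@example.com> 0 +0000\n\nUpdate hello.txt\n"
    ),
}

for sha, message in walk_first_parent(FAKE_STORE.__getitem__, "06a73f25"):
    print(sha[:7], message)
# prints:
# 06a73f2 Update hello.txt
# 3c80f66 Initial commit
```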
Conclusion
This article is a brief introduction to the internals of Git. It is by no means a complete guide. The goal was to understand the basic concepts and to get a feel for how basic Git operations work. The next article will cover the internals of the git add command and how it interacts with the index and the working tree.