Git and libgit2

Working With Git Repos In Hostile Environments

Published 2023-07-13. Last modified 2023-10-06.
Time to read: 6 minutes.

This page is part of the git collection.

When on an expert witness assignment, I often inspect git repositories provided by the opposing party's lawyers. Quite often, those repositories have issues that seriously impede git operation. Although the motivation for a legal team to hamper the work of an opposing expert is readily apparent, often it is technical ignorance and not malice that causes problems.

This page contains my notes on:

  1. Preventing problems for parties that need to share a git repository
  2. Overcoming issues with git repositories provided by the other party, whether deliberately caused or due to honest mistakes

Typical Git Files and Directories

The following files and directories are present after a typical git init followed by a few commits. The 256 subdirectories of the objects/ directory are omitted for clarity. The hooks/ and logs/refs/remotes directories are also omitted, since they are unnecessary when there is no internet access.

Shell
$ find .git \
  -not -path ".git/objects/*" \
  -not -path ".git/hooks*" \
  -not -path ".git/logs/refs/remotes*" | \
  sed -E 's^.git/?^^' | \
  column -c 80
HEAD                            info/exclude
COMMIT_EDITMSG                  objects
ORIG_HEAD                       index
config                          refs
logs                            refs/remotes
logs/HEAD                       refs/remotes/origin
logs/refs                       logs/refs/heads
refs/remotes/origin/master      refs/tags
logs/refs/heads/master          refs/heads
description                     branches
refs/heads/master               info
FETCH_HEAD 
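When a repository arrives damaged, checking which of these entries are present can narrow down the problem before any git command is run. The following is a minimal sketch: missing_git_entries is a hypothetical helper name, and the short list of entries it checks is an assumption about what matters most, not an exhaustive rule.

```shell
# Print the name of each expected entry that is absent from a .git directory.
# missing_git_entries is a hypothetical helper; the list is deliberately short.
missing_git_entries() {
  gitdir="${1:-.git}"
  [ -f "$gitdir/HEAD" ]    || echo "HEAD"     # without HEAD, git does not recognize the repository
  [ -d "$gitdir/objects" ] || echo "objects"  # the object database
  [ -d "$gitdir/refs" ]    || echo "refs"     # branch and tag pointers
  [ -f "$gitdir/index" ]   || echo "index"    # staging area; absent until something is staged
  return 0
}
```

Run it from the top of the working tree, or pass the path of the .git directory as the first argument. No output means all four entries are present.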

Maximizing Good Will

This section is dedicated to preventing problems for parties that need to share a git repository.

Litigation is by nature an adversarial activity. When computer software changes evidence without notice or warning, people tend to blame each other.

A dangling commit is a commit that cannot be reached from any branch, tag, or other reference. One way to make a dangling commit is to commit on a detached HEAD and then switch to a branch without naming the new commit.

Git runs garbage collection periodically, without warning. One of the functions that git garbage collection performs is to delete (prune) dangling commits once they are no longer referenced by the reflog and are older than a grace period. There is no message to alert the user that dangling commits were found or that they were pruned.

This can lead to investigators experiencing files disappearing from a git repo after a period of time, as if they were written in the digital equivalent of disappearing ink. The party that obtained the git repository might accuse the other party of destroying evidence.

To avoid this potentially very damaging accusation, 3 actions should be performed before giving a git repository to another party:

  1. Name any dangling commits that you want to preserve
  2. Verify the integrity of the git object database
  3. Run the garbage collection with extra care

Naming Dangling Commits

If you want to preserve a dangling commit, give it a name. Do this before performing the other two actions, described next.

Giving names to dangling commits prevents the garbage collector from deleting them. You can name a dangling commit by creating an annotated tag. The following example creates an annotated tag called dangle1:

Shell
$ git tag -m 'Named this dangling commit' -a dangle1 283492384928349823
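The hash in the example above has to come from somewhere. The following sketch, run in a scratch repository, shows one way to find dangling commits with git fsck and tag each of them. The tag names dangle1, dangle2, and so on follow the example above; the scratch repository, its file, and its commit messages are assumptions made only so the sketch can run anywhere.

```shell
# Scratch repository, so the sketch never touches real evidence.
repo="$(mktemp -d)"
cd "$repo" || exit 1
git init -q
git config user.name "Example"
git config user.email "example@example.com"
echo one > a.txt
git add a.txt
git commit -q -m "first"
branch="$(git symbolic-ref --short HEAD)"

# Make a commit on a detached HEAD, then walk away from it.
git checkout -q --detach
echo two > a.txt
git commit -q -a -m "made on a detached HEAD"
git checkout -q "$branch"

# --no-reflogs makes fsck ignore the reflog, which would otherwise
# count the abandoned commit as reachable.
n=0
for sha in $(git fsck --no-reflogs 2>/dev/null |
             awk '$1 == "dangling" && $2 == "commit" { print $3 }'); do
  n=$((n + 1))
  git tag -a -m "Named this dangling commit" "dangle$n" "$sha"
done
git tag --list 'dangle*'
```

Once tagged, the commit is reachable from a reference, so neither git fsck nor git gc treats it as dangling any longer.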

Verifying Data Integrity

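The following verifies the integrity of the repository's object database, and reports unreachable and dangling objects. It does not delete anything. The --no-reflogs option makes git fsck ignore the reflog, so commits that are referenced only by a reflog entry are reported as well.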

Shell
$ git fsck --unreachable --dangling --no-reflogs

Extra-Careful Garbage Collection

The following command runs git gc (garbage collection) with extra care and attention. As part of its work, git gc also expires old reflog entries according to the configured expiry times; it does not empty the reflog. For more information please see Configuring Garbage Collection.

git gc removes unreachable (“dangling”) objects, which might be commits, trees (directories), and blobs (files). An object is unreachable if it cannot be reached from any branch, tag, or other reference, including the reflog.

git gc does not normally remove unreachable objects that are younger than two weeks, so we use --prune=now which means “remove unreachable objects that were created before now”.

Shell
$ git gc --aggressive --prune=now

Expiring the Reflog

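We need to expire the reflogs so that objects reachable only from a reflog entry can be removed. Please see Reflog Configuration for more information. We do so with git reflog expire, passing --all to process every reflog and --expire-unreachable=now to drop unreachable entries regardless of their age. Because git gc treats objects referenced by a reflog entry as reachable, run this command before git gc --prune=now if those objects are to be pruned.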

Shell
$ git reflog expire --expire-unreachable=now --all
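The effect of the cleanup can be checked by counting unreachable objects before and after. The following sketch does so in a scratch repository, expiring the reflog before collecting garbage because objects referenced by a reflog entry count as reachable. The helper name count_unreachable, the scratch repository, and its contents are assumptions made for illustration.

```shell
# Scratch repository containing one abandoned commit.
repo="$(mktemp -d)"
cd "$repo" || exit 1
git init -q
git config user.name "Example"
git config user.email "example@example.com"
echo one > a.txt
git add a.txt
git commit -q -m "first"
branch="$(git symbolic-ref --short HEAD)"
git checkout -q --detach
echo two > a.txt
git commit -q -a -m "abandoned"
abandoned="$(git rev-parse HEAD)"
git checkout -q "$branch"

# Count the objects that no branch or tag can reach (hypothetical helper).
count_unreachable() {
  git fsck --unreachable --no-reflogs 2>/dev/null | grep -c '^unreachable'
}

before="$(count_unreachable)"
git reflog expire --expire-unreachable=now --all   # drop reflog entries first
git gc -q --aggressive --prune=now                 # then prune what is left unreferenced
after="$(count_unreachable)"
echo "unreachable objects: $before before, $after after"
```

A count of zero afterwards means the repository holds nothing that a later garbage collection could silently remove.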

Contending With Inspection Problems

This section is dedicated to overcoming issues with git repositories provided by the other party, whether deliberately caused or due to honest mistakes.

The computers that are provided to software experts when visiting the opposition's clean room to inspect their client's software never have internet access. This means that commands that contact a remote server, like git fetch and git clone from a URL, are non-functional. The lack of connectivity restricts options for dealing with issues.

Sanity Check

The git fsck command can be used to verify the integrity of a git repository. It can also identify dangling and unreachable objects.

git-fsck tests SHA-1 and general object sanity, and it does full tracking of the resulting reachability and everything else. It prints out any corruption it finds (missing or bad objects), and if you use the --unreachable flag it will also print out objects that exist, but aren’t reachable from any of the specified head nodes (or the default set, as mentioned above).

--lost-found
Write dangling objects into .git/lost-found/commit/ or .git/lost-found/other/, depending on type. If the object is a blob, the contents are written into the file, rather than its object name.

--root
Report root nodes.

--unreachable
Print out objects that exist but that aren’t reachable from any of the reference nodes.
 – From man git-fsck

The first commit of most git repos is the root node. It is possible for a git repo to have more than one root node; in that case you will have to examine them to determine which was 'first', according to what you might mean by 'first'.

Shell
$ git fsck --lost-found --root --unreachable
root 3fa77c58f85c591f9c6a1b0510228e4aec704697
Checking object directories: 100% (256/256), done. 

Recreate HEAD

If .git/HEAD has been deleted, then git commands give an error, like the following:

Shell
$ git log
fatal: not a git repository (or any of the parent directories): .git 

Recreate HEAD to point to the tip of the master branch like this:

Shell
$ echo "ref: refs/heads/master" > .git/HEAD

If the git project you are working with was created on GitHub recently, HEAD should probably point to the tip of the main branch instead:

Shell
$ echo "ref: refs/heads/main" > .git/HEAD

Now git commands should work, unless other problems are also present.
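Without HEAD, git itself cannot be asked which branches exist, but the branch pointers can still be found on disk. The following sketch looks for a main or master branch, either as a loose file under refs/heads or as an entry in packed-refs, and writes HEAD accordingly. The function name repair_head is hypothetical, trying main before master is an assumption, and the sketch assumes the usual file-based reference storage.

```shell
# Recreate .git/HEAD by looking for a likely default branch on disk.
# repair_head is a hypothetical helper; the order main, master is an assumption.
repair_head() {
  gitdir="${1:-.git}"
  for b in main master; do
    if [ -f "$gitdir/refs/heads/$b" ] ||
       grep -qs " refs/heads/$b\$" "$gitdir/packed-refs"; then
      printf 'ref: refs/heads/%s\n' "$b" > "$gitdir/HEAD"
      echo "HEAD now points to $b"
      return 0
    fi
  done
  echo "no main or master branch found; inspect $gitdir/refs/heads by hand" >&2
  return 1
}
```

Call it as repair_head from the top of the working tree, or pass the path of the .git directory. If neither branch exists, nothing is written and the names under refs/heads need to be examined manually.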

Rebuild Index

If the staging area in .git/index has been deleted, the git status command shows all the files and directories in the project as having been deleted, and also shows those same files as being untracked. Since a file or directory cannot both be deleted and untracked, this contradictory result indicates that .git/index was deleted or is damaged.

Shell
$ rm .git/index

$ git status
On branch master
Your branch is up to date with 'origin/master'.

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        deleted:    .gitignore
        deleted:    .rspec
        deleted:    .rubocop.yml

Untracked files:
  (use "git add <file>..." to include in what will be committed)
        .gitignore
        .rspec
        .rubocop.yml

To rebuild the index without disturbing the working tree, type:

Shell
$ git reset --mixed

$ git status
On branch master
Your branch is up to date with 'origin/master'.
nothing to commit, working tree clean
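The symptom and the repair can be reproduced safely before touching the repository under inspection. The following sketch deletes the index of a scratch repository, records what git status reports, rebuilds the index, and records the status again. The scratch repository and the file a.txt are assumptions made for illustration.

```shell
# Scratch repository with a single committed file.
repo="$(mktemp -d)"
cd "$repo" || exit 1
git init -q
git config user.name "Example"
git config user.email "example@example.com"
echo one > a.txt
git add a.txt
git commit -q -m "first"

rm .git/index
damaged="$(git status --porcelain)"    # a.txt is listed as both deleted and untracked
git reset -q --mixed                   # rebuild the index from HEAD; files on disk are not touched
repaired="$(git status --porcelain)"   # empty when the working tree matches HEAD
printf 'before:\n%s\nafter:\n%s\n' "$damaged" "$repaired"
```

The --porcelain option is used only because its output is short and stable enough to compare; the long form of git status shown above reports the same condition.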

Commits

Obtaining the Hash of the First Commit

The git fsck --root option shown above yields the hash of the first commit, but that value is mixed with other tokens which are a pain to parse. To display the hash of the first commit such that it can be easily stored in a shell variable, use the following incantation:

Shell
$ git log --reverse --format="%h" | head -n 1

Define the shell variable COMMIT0 like this:

Shell
$ COMMIT0="$( git log --reverse --format="%h" | head -n 1 )"
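When a repository has more than one root node, the pipeline above yields only one of them. A possible alternative is git rev-list --max-parents=0, which prints the full hash of every parentless commit reachable from HEAD. The following sketch runs it in a scratch repository; the variable name ROOTS and the scratch repository are assumptions made for illustration.

```shell
# Scratch repository with two commits on one branch.
repo="$(mktemp -d)"
cd "$repo" || exit 1
git init -q
git config user.name "Example"
git config user.email "example@example.com"
echo one > a.txt
git add a.txt
git commit -q -m "first"
first="$(git rev-parse HEAD)"
echo two >> a.txt
git commit -q -a -m "second"

# One full hash per line, one line per commit that has no parent.
ROOTS="$(git rev-list --max-parents=0 HEAD)"
echo "$ROOTS"
```

If more than one line is printed, the repository has several root nodes and each one needs to be examined.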

Display Files in a Commit

The following incantation lists the names of the files changed by a commit. For the root commit, that is every file the commit contains. The --root option allows this to work with the root commit.

Shell
$ git show --format="" --name-only --root $COMMIT0

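The following incantation also displays the files changed by a commit, using the lower-level git diff-tree command. The first line of its output is the commit hash; add --no-commit-id to suppress it.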

Shell
$ git diff-tree -r --name-only --root $COMMIT0
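For any commit other than the root, the files changed by the commit and the files present in the commit are different lists. The following sketch contrasts the two in a scratch repository, using git ls-tree for the complete snapshot. The scratch repository, the file names, and the variable names changed and present are assumptions made for illustration.

```shell
# Scratch repository: the first commit adds a.txt, the second adds b.txt.
repo="$(mktemp -d)"
cd "$repo" || exit 1
git init -q
git config user.name "Example"
git config user.email "example@example.com"
echo one > a.txt
git add a.txt
git commit -q -m "add a.txt"
echo two > b.txt
git add b.txt
git commit -q -m "add b.txt"

# Only the files that the commit touched.
changed="$(git diff-tree --no-commit-id -r --name-only --root HEAD)"
# Every file in the commit's snapshot, whether or not it was touched.
present="$(git ls-tree -r --name-only HEAD)"
printf 'changed:\n%s\npresent:\n%s\n' "$changed" "$present"
```

Here the second commit changed only b.txt, while its snapshot contains both a.txt and b.txt.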

File Information

Display File From a Hash

Display the contents of a file, given the hash of the blob that stores it:

Shell
$ git cat-file -p a997766
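git cat-file -p accepts the hash of any object, so it helps to confirm what kind of object a hash names. The following sketch, run in a scratch repository, obtains the blob hash for a path and then asks git cat-file for the object's type and contents. The scratch repository, its README.md, and the variable name blob are assumptions made for illustration.

```shell
# Scratch repository with one committed file.
repo="$(mktemp -d)"
cd "$repo" || exit 1
git init -q
git config user.name "Example"
git config user.email "example@example.com"
echo hello > README.md
git add README.md
git commit -q -m "first"

blob="$(git rev-parse HEAD:README.md)"   # hash of the blob that holds README.md in HEAD
git cat-file -t "$blob"                  # object type: blob, tree, commit, or tag
git cat-file -p "$blob"                  # contents of the object
```

If the type is commit or tree instead of blob, the hash names a commit or a directory, and the -p output is the commit record or a directory listing instead of file contents.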

Discovering the Commit that Added a File

The hash of the commit that added the first version of a file is easily discovered with the following incantation.

Shell
$ git log --format="%h" --diff-filter=A -- README.md
3fa77c5 

We can save the result in a shell variable called COMMIT1. This variable will be used in the remainder of this document.

Shell
$ COMMIT1="$( git log --format="%h" --diff-filter=A -- README.md )"

$ echo $COMMIT1
3fa77c5 

Display File Version in a Commit

If you know the commit hash, the contents of the file as they existed in that commit can be displayed.

Shell
$ git show $COMMIT1:README.md

Diffs of a File Against the HEAD Version

To compare the version of the file in the commit to the version currently in the working tree, provide the hash of the commit and the name of the file to the git diff command. The -- separator keeps git from mistaking the file name for a revision. Recall that $COMMIT1 refers to the hash of the commit that contains the first version of README.md.

Shell
$ git diff $COMMIT1 -- README.md

Diffs of Any Versions of a File

To compare against another version of the same file, name both commits. The following example uses HEAD~2, the version that existed two commits before the tip of the current branch. Note that the version in HEAD~2 might be identical to the version in $COMMIT1, because there is no guarantee that any commit between them modified this file.

Shell
$ git diff $COMMIT1 HEAD~2 -- README.md

It is often more useful to examine the changes to a file instead. To obtain the hashes of all the commits that modified a file, excluding the commit that added the file to the repository, use --diff-filter=M:

Shell
$ README_MODS="$( git log --format="%h" --diff-filter=M -- README.md )"

$ echo "$README_MODS" # quotes keep each value on a separate line:
841c17a
7e30894
18a09e3
d71002d 

$ echo "$README_MODS" | tac # Reverse the list
d71002d
18a09e3
7e30894
841c17a 

To obtain the hash of the 2nd change to the file, which is the 3rd version of the file:

Shell
$ echo "$README_MODS" | tac | sed '2q;d'
18a09e3 

To compare the 3rd version of the file (which has the hash immediately above) to the 4th version, first do some setup:

Shell
$ README3="$( echo "$README_MODS" | tac | sed '2q;d' )"

$ README4="$( echo "$README_MODS" | tac | sed '3q;d' )"

$ echo $README3 $README4
18a09e3 7e30894

There are two types of incantations that can produce diffs between versions. The first type compares two commits in their entirety. All that is required are the hashes of the two commits that contain the versions of interest; the output covers every file that differs between those commits, not only README.md:

Shell
$ git diff $README3 $README4

The second type of incantation restricts the comparison to a single file. For this incantation, provide the hash of the commit to use as the basis for comparison, the hash of the commit that contains the desired version, and the name of the file. The following compares the first version of README.md with its third version:

Shell
$ git diff $COMMIT1 $README3 -- README.md
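The bookkeeping with tac and sed can be folded into a small function. The following sketch numbers the versions of a file from 1, counting the commit that added the file as version 1, by combining --diff-filter=AM with --reverse. The function name version_commit, the variables V1 and V3, and the scratch repository are assumptions made for illustration; the sketch does not follow renames.

```shell
# Scratch repository with three successive versions of README.md.
repo="$(mktemp -d)"
cd "$repo" || exit 1
git init -q
git config user.name "Example"
git config user.email "example@example.com"
for text in one two three; do
  echo "$text" > README.md
  git add README.md
  git commit -q -m "README says $text"
done

# Print the hash of the commit that produced version N of a file.
# version_commit is a hypothetical helper; version 1 is the commit that added the file.
version_commit() {
  git log --reverse --format="%h" --diff-filter=AM -- "$1" | sed -n "${2}p"
}

V1="$(version_commit README.md 1)"
V3="$(version_commit README.md 3)"
git diff "$V1" "$V3" -- README.md
```

Asking for a version number larger than the number of versions prints nothing, so an empty result should be checked before it is passed to git diff.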
