Published 2023-07-13.
Last modified 2023-10-06.
Time to read: 6 minutes.
When on an expert witness assignment, I often inspect Git repositories provided by the opposing party's lawyers. Quite often, those repositories have issues that seriously impede Git operation. Although the motivation for a legal team to hamper the work of an opposing expert is readily apparent, often it is technical ignorance and not malice that causes problems.
This page contains my notes on:
- Preventing problems for parties that need to share a Git repository
- Overcoming issues with Git repositories provided by the other party, whether deliberately caused or due to honest mistakes
Typical Git Files and Directories
Following are the files and directories present after a typical git init, and checking in a few files.
The 256 subdirectories of the objects/ directory are not shown for clarity.
The hooks/ and logs/refs/remotes directories are also not shown, since they are unnecessary when there is no internet access.
$ find .git \
    -not -path ".git/objects/*" \
    -not -path ".git/hooks*" \
    -not -path ".git/logs/refs/remotes*" | \
  sed -E 's^.git/?^^' | \
  column -c 80
HEAD                            info/exclude
COMMIT_EDITMSG                  objects
ORIG_HEAD                       index
config                          refs
logs                            refs/remotes
logs/HEAD                       refs/remotes/origin
logs/refs                       logs/refs/heads
refs/remotes/origin/master      refs/tags
logs/refs/heads/master          refs/heads
description                     branches
refs/heads/master               info
FETCH_HEAD
Maximizing Good Will
This section is dedicated to preventing problems for parties that need to share a Git repository.
Litigation is by nature an adversarial activity. When computer software changes evidence without notice or warning, people tend to blame each other.
A dangling commit is a commit that cannot be reached from any branch, tag, or other commit. One way to make a dangling commit is to commit on a detached HEAD and then switch back to a branch without naming the new commit.
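The following is a minimal sketch of how a dangling commit comes about. It works in a scratch repository; the file names, commit messages, and the variables branch and lost are only illustrative.

```shell
# Scratch repository; every name used here is hypothetical.
repo="$(mktemp -d)" && cd "$repo" && git init -q .
git config user.name "Example" && git config user.email "example@example.com"
echo one > a.txt && git add a.txt && git commit -q -m "first"
branch="$(git symbolic-ref --short HEAD)"   # remember the branch name

git checkout -q --detach                    # HEAD now points at a commit, not a branch
echo two > b.txt && git add b.txt && git commit -q -m "made on a detached HEAD"
lost="$(git rev-parse HEAD)"                # no branch or tag names this commit
git checkout -q "$branch"                   # leave the commit behind

# The branch history does not contain the commit (the count is 0)...
git log --format=%H | grep -c "$lost"
# ...but fsck reports it as dangling once reflogs are ignored.
LC_ALL=C git fsck --no-reflogs 2>/dev/null | grep "dangling commit"
```

Without --no-reflogs the commit is not reported, because the HEAD reflog still mentions it.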
Git runs garbage collection periodically. This happens without warning. One of the functions that Git garbage collection performs is to delete (prune) dangling commits once they are older than a grace period. There is no message to alert the user that dangling commits were found or that they were pruned.
This can lead to investigators experiencing files disappearing from a Git repo after a period of time, as if they were written in the digital equivalent of disappearing ink. The party that obtained the Git repository might accuse the other party of destroying evidence.
To avoid this potentially very damaging accusation, 3 actions should be performed before giving a Git repository to another party:
- Name any dangling commits that you want to preserve
- Verify the integrity of the Git object database
- Run the garbage collection with extra care
Naming Dangling Commits
If you want to preserve a dangling commit, give it a name. Do this before performing the other two actions, described next.
Giving names to dangling commits prevents the garbage collector from deleting them.
You can name a dangling commit by creating an annotated tag.
The following example creates an annotated tag called dangle1:
$ git tag -m 'Named this dangling commit' -a dangle1 283492384928349823
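When a repository holds several dangling commits, naming them one at a time is tedious. The sketch below tags every dangling commit that git fsck reports. The tag names dangle1, dangle2, and so on simply extend the example above, and the scratch repository exists only so that the loop has something to find.

```shell
# Scratch repository with one dangling commit; all names are hypothetical.
repo="$(mktemp -d)" && cd "$repo" && git init -q .
git config user.name "Example" && git config user.email "example@example.com"
echo one > a.txt && git add a.txt && git commit -q -m "first"
branch="$(git symbolic-ref --short HEAD)"
git checkout -q --detach
echo two > b.txt && git add b.txt && git commit -q -m "soon to dangle"
git checkout -q "$branch"

# Tag every dangling commit as dangle1, dangle2, ...
n=0
for sha in $(LC_ALL=C git fsck --no-reflogs 2>/dev/null |
             awk '$1 == "dangling" && $2 == "commit" { print $3 }'); do
  n=$((n + 1))
  git tag -a "dangle$n" -m "Named this dangling commit" "$sha"
done

git tag --list 'dangle*'      # the names that were created
# Count the dangling commits that remain after tagging.
remaining="$(LC_ALL=C git fsck --no-reflogs 2>/dev/null | grep -c '^dangling commit')"
echo "dangling commits left: $remaining"
```

LC_ALL=C is set so that the word "dangling" is not translated before awk and grep look for it.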
Verifying Data Integrity
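The following verifies the integrity of the repository's object database, and reports unreachable and dangling objects. It does not delete anything. The --no-reflogs option makes git fsck treat commits that are mentioned only in a reflog as unreachable.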
$ git fsck --unreachable --dangling --no-reflogs
Extra-Careful Garbage Collection
The following code runs git gc (garbage collection) with extra care and attention.
Combined with the reflog expiry described in Expiring the Reflog, it also empties the reflog of entries for unreachable commits.
For more information please see Configuring Garbage Collection.
git gc removes unreachable (“dangling”) objects, which might be commits, trees (directories), and blobs (files).
An object is unreachable if it is not part of the history of some branch, tag, or other reference.
git gc does not normally remove unreachable objects that are younger than two weeks, so we use --prune=now which means “remove unreachable objects that were created before now”.
$ git gc --aggressive --prune=now
Expiring the Reflog
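We need to expire the reflogs in order to remove objects that are not reachable from any branch, because Git treats anything that a reflog entry mentions as still reachable.
Please see Reflog Configuration for more information.
We do so by expiring the reflogs of --all references with --expire-unreachable=now.
Run this command before git gc --prune=now; otherwise the reflog entries keep the unreachable objects alive.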
$ git reflog expire --expire-unreachable=now --all
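The interaction between the reflog and the garbage collector is easy to get wrong, so here is a small sketch that can be tried in a scratch repository. It assumes default Git settings. The variables lost, before, and after are only illustrative.

```shell
# Scratch repository with one unnamed commit; all names are hypothetical.
repo="$(mktemp -d)" && cd "$repo" && git init -q .
git config user.name "Example" && git config user.email "example@example.com"
echo one > a.txt && git add a.txt && git commit -q -m "first"
branch="$(git symbolic-ref --short HEAD)"
git checkout -q --detach
echo two > b.txt && git add b.txt && git commit -q -m "unnamed"
lost="$(git rev-parse HEAD)"
git checkout -q "$branch"

# The HEAD reflog still mentions the commit, so this gc keeps it.
git gc -q --prune=now
if git cat-file -e "$lost" 2>/dev/null; then before="present"; else before="gone"; fi

# Expire the reflog first, then collect garbage again.
git reflog expire --expire-unreachable=now --all
git gc -q --prune=now
if git cat-file -e "$lost" 2>/dev/null; then after="present"; else after="gone"; fi

echo "before reflog expiry: $before, after: $after"
```

git cat-file -e exits with a zero status only when the object exists, which makes it a convenient probe.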
Contending With Inspection Problems
This section is dedicated to overcoming issues with Git repositories provided by the other party, whether deliberately caused or due to honest mistakes.
The computers that are provided to software experts when visiting the opposition's clean room to inspect their client's software never have internet access.
This means that commands that contact a remote server, like git fetch and git clone, are non-functional.
The lack of connectivity restricts options for dealing with issues.
Sanity Check
The git fsck command can be used to verify the integrity of a Git repository.
It can also identify dangling and unreachable objects.
From the git-fsck documentation:
git-fsck tests SHA-1 and general object sanity, and it does full tracking of the resulting reachability and everything else. It prints out any corruption it finds (missing or bad objects), and if you use the --unreachable flag it will also print out objects that exist, but aren’t reachable from any of the specified head nodes (or the default set, as mentioned above).
- --lost-found: Write dangling objects into .git/lost-found/commit/ or .git/lost-found/other/, depending on type. If the object is a blob, the contents are written into the file, rather than its object name.
- --root: Report root nodes.
- --unreachable: Print out objects that exist but aren’t reachable from any of the reference nodes.
The first commit of most Git repos is the root node.
It is possible for a Git repository to have more than one root node; in that case you will have to examine them to determine which was 'first', according to what you might mean by 'first'.
$ git fsck --lost-found --root --unreachable
root 3fa77c58f85c591f9c6a1b0510228e4aec704697
Checking object directories: 100% (256/256), done.
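To see what --lost-found does with a dangling blob, here is a minimal sketch in a scratch repository. The blob is created with git hash-object so that nothing references it. The content string and the variable blob are only illustrative.

```shell
# Scratch repository; all names and contents are hypothetical.
repo="$(mktemp -d)" && cd "$repo" && git init -q .
git config user.name "Example" && git config user.email "example@example.com"
echo one > a.txt && git add a.txt && git commit -q -m "first"

# Store a blob that no commit, tree, or index entry refers to.
blob="$(echo "orphaned evidence" | git hash-object -w --stdin)"

# fsck writes the contents of each dangling blob to a file named after its hash.
git fsck --lost-found > /dev/null 2>&1
recovered="$(cat ".git/lost-found/other/$blob")"
echo "$recovered"
```

Dangling commits are handled differently: the file written under .git/lost-found/commit/ contains the object name, not the commit's contents.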
Recreate HEAD
If .git/HEAD has been deleted, then Git commands give an error, like the following:
$ git log
fatal: not a git repository (or any of the parent directories): .git
Recreate HEAD to point to the tip of the master branch like this:
$ echo "ref: refs/heads/master" > .git/HEAD
If the Git project you are working with was created on GitHub recently, HEAD should probably point to the tip of the main branch instead:
$ echo "ref: refs/heads/main" > .git/HEAD
Now Git commands should work, unless other problems are also present.
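Because Git commands do not work while HEAD is missing, the branch name has to be discovered by looking at the files under .git directly. The sketch below assumes that the repository uses the ordinary files-based reference storage and that the branch is called either master or main. The scratch repository is only there to simulate the damage.

```shell
# Scratch repository on a branch named master; all names are hypothetical.
repo="$(mktemp -d)" && cd "$repo" && git init -q .
git symbolic-ref HEAD refs/heads/master
git config user.name "Example" && git config user.email "example@example.com"
echo one > a.txt && git add a.txt && git commit -q -m "first"

rm .git/HEAD                                   # simulate the damage
git log > /dev/null 2>&1 || echo "Git no longer recognizes the repository"

# Use the first candidate branch that exists, either loose or in packed-refs.
for b in master main; do
  if [ -f ".git/refs/heads/$b" ] ||
     grep -q " refs/heads/$b\$" .git/packed-refs 2>/dev/null; then
    echo "ref: refs/heads/$b" > .git/HEAD
    break
  fi
done

cat .git/HEAD
git log --oneline
```

If neither name is found, listing .git/refs/heads and reading .git/packed-refs shows which branch names are available.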
Rebuild Index
If the staging area in .git/index has been deleted, the git status command shows all the files and directories in the project as having been deleted, and also shows those same files as being untracked.
Since a file or directory cannot both be deleted and untracked, this contradictory result indicates that .git/index was deleted or is damaged.
$ rm .git/index
$ git status
On branch master
Your branch is up to date with 'origin/master'.

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        deleted:    .gitignore
        deleted:    .rspec
        deleted:    .rubocop.yml

Untracked files:
  (use "git add <file>..." to include in what will be committed)
        .gitignore
        .rspec
        .rubocop.yml
To rebuild the index without disturbing the worktree, type:
$ git reset --mixed
$ git status
On branch master
Your branch is up to date with 'origin/master'.

nothing to commit, working tree clean
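When evidence is involved, it is worth confirming that rebuilding the index really left the worktree alone. The following sketch does so in a scratch repository by comparing a checksum of a file before and after the reset. The file name and the variables sum_before and sum_after are only illustrative.

```shell
# Scratch repository; all names are hypothetical.
repo="$(mktemp -d)" && cd "$repo" && git init -q .
git config user.name "Example" && git config user.email "example@example.com"
echo one > a.txt && git add a.txt && git commit -q -m "first"

sum_before="$(cksum < a.txt)"
rm .git/index                    # simulate the damage
git status --porcelain           # a.txt appears as deleted and as untracked

git reset -q --mixed             # rebuild the index from HEAD
sum_after="$(cksum < a.txt)"
git status --porcelain           # prints nothing when the index matches again

[ "$sum_before" = "$sum_after" ] && echo "worktree file unchanged"
```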
Commits
Obtaining the Hash of the First Commit
The git fsck --root option shown above yields the hash of the first commit, but that value is mixed with other tokens which are a pain to parse.
To display the hash of the first commit such that it can be easily stored into an environment variable, use the following incantation:
$ git log --reverse --format="%h" | head -n 1
Define the environment variable COMMIT0 like this:
$ COMMIT0="$( git log --reverse --format="%h" | head -n 1 )"
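An alternative that I find handy when a repository might have more than one root is git rev-list --max-parents=0, which prints the full hash of every parentless commit reachable from HEAD. The sketch below tries it in a scratch repository that has a single root. The variable roots is only illustrative.

```shell
# Scratch repository with two commits; all names are hypothetical.
repo="$(mktemp -d)" && cd "$repo" && git init -q .
git config user.name "Example" && git config user.email "example@example.com"
echo one > a.txt && git add a.txt && git commit -q -m "first"
echo two >> a.txt && git commit -q -am "second"

# One full hash per root commit reachable from HEAD.
roots="$(git rev-list --max-parents=0 HEAD)"
echo "$roots"
echo "number of roots: $(echo "$roots" | wc -l | tr -d ' ')"

# With a single root, this is the first commit.
COMMIT0="$(echo "$roots" | tail -n 1)"
```

If more than one hash is printed, each root still has to be examined to decide which one counts as 'first'.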
Display Files in a Commit
The following incantation lists the filenames in a commit.
The --root option allows this to work with the root commit.
$ git show --format="" --name-only --root $COMMIT0
The following incantation displays the files changed by a commit.
$ git diff-tree -r --name-only --root $COMMIT0
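Both incantations above report the files that a commit changed. For a root commit that is every file, but for later commits it is usually a subset. To list every file in a commit's snapshot, git ls-tree can be used instead. The sketch below shows the difference in a scratch repository. The file names and the variables all_files and changed_files are only illustrative.

```shell
# Scratch repository; all names are hypothetical.
repo="$(mktemp -d)" && cd "$repo" && git init -q .
git config user.name "Example" && git config user.email "example@example.com"
echo one > a.txt && git add a.txt && git commit -q -m "first"
echo two > b.txt && git add b.txt && git commit -q -m "second"

# Every file that exists in the snapshot of the second commit.
all_files="$(git ls-tree -r --name-only HEAD)"
# Only the files that the second commit changed.
changed_files="$(git diff-tree -r --name-only --no-commit-id --root HEAD)"

echo "in the commit:";         echo "$all_files"
echo "changed by the commit:"; echo "$changed_files"
```

The --no-commit-id option suppresses the commit hash that git diff-tree otherwise prints on the first line.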
File Information
Display File From a Hash
Display the contents of a file, given its hash:
$ git cat-file -p a997766
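A hash on its own does not say whether it names a file, a directory, or a commit. As a small sketch, git cat-file -t reports the type of an object and git cat-file -s reports its size, which is useful before printing something large. The scratch repository and the variable blob are only illustrative.

```shell
# Scratch repository; all names are hypothetical.
repo="$(mktemp -d)" && cd "$repo" && git init -q .
git config user.name "Example" && git config user.email "example@example.com"
echo "hello" > README.md && git add README.md && git commit -q -m "first"

blob="$(git rev-parse HEAD:README.md)"   # hash of the file's contents in HEAD
git cat-file -t "$blob"                  # type of the object behind the hash
git cat-file -t HEAD                     # a commit has a different type
git cat-file -s "$blob"                  # size of the contents in bytes
git cat-file -p "$blob"                  # the contents themselves
```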
Discovering the Commit that Added a File
The hash of the commit that added the first version of a file is easily discovered with the following incantation.
$ git log --format="%h" --diff-filter=A -- README.md
3fa77c5
We can save the result in an environment variable called COMMIT1.
This environment variable will be used in the remainder of this document.
$ COMMIT1="$( git log --format="%h" --diff-filter=A -- README.md )"
$ echo $COMMIT1
3fa77c5
Display File Version in a Commit
If you know the commit hash, the file's contents as they existed in that commit can be displayed.
$ git show $COMMIT1:README.md
Diffs of a File Against the HEAD Version
To compare the version of the file in the commit to the currently checked out version of the file, provide the hash of the commit and the name of the file to the git diff command.
Recall that $COMMIT1 refers to the hash of the commit that contains the first version of README.md.
$ git diff $COMMIT1 README.md
Diffs of Any Versions of a File
To compare against another version of the same file (for example, the version that existed 2 commits before the tip of the current branch), provide both commits and the name of the file.
Note that the version that existed 2 commits ago might be identical to the version pointed to by $COMMIT1 because there is no guarantee that the intervening commits modified this file.
$ git diff $COMMIT1 HEAD~2 -- README.md
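To find out whether two versions are identical without reading a diff, git diff --quiet can be used: it prints nothing and reports the answer through its exit status. The sketch below builds a scratch repository in which one commit does not touch README.md. The variables early and late are only illustrative.

```shell
# Scratch repository; all names are hypothetical.
repo="$(mktemp -d)" && cd "$repo" && git init -q .
git config user.name "Example" && git config user.email "example@example.com"
echo v1 > README.md && git add README.md && git commit -q -m "add README"
COMMIT1="$(git rev-parse --short HEAD)"
echo x > other.txt && git add other.txt && git commit -q -m "unrelated change"
echo v2 > README.md && git commit -q -am "edit README"

# Exit status 0 means the two versions of the file are identical.
if git diff --quiet "$COMMIT1" HEAD~1 -- README.md; then early="identical"; else early="different"; fi
if git diff --quiet "$COMMIT1" HEAD -- README.md; then late="identical"; else late="different"; fi

echo "COMMIT1 vs HEAD~1: $early"
echo "COMMIT1 vs HEAD:   $late"
```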
It is often more useful to examine the changes to a file instead.
To obtain the hashes of all modifications to a file, excluding the commit that added the file to the repository, use --diff-filter=M:
$ README_MODS="$( git log --format="%h" --diff-filter=M -- README.md )"
$ echo "$README_MODS" # quotes keep each value on a separate line:
841c17a
7e30894
18a09e3
d71002d
$ echo "$README_MODS" | tac # Reverse the list
d71002d
18a09e3
7e30894
841c17a
To obtain the hash of the 2nd change to the file, which is the 3rd version of the file:
$ echo "$README_MODS" | tac | sed '2q;d'
18a09e3
To compare the 3rd version of the file (which has the hash immediately above) to the 4th version, first do some setup:
$ README3="$( echo "$README_MODS" | tac | sed '2q;d' )"
$ README4="$( echo "$README_MODS" | tac | sed '3q;d' )"
$ echo $README3 $README4
18a09e3 7e30894
There are two types of incantations that can produce diffs of a file. The first type of incantation allows comparing two arbitrary versions. For this incantation, provide the hashes of the two commits that contain the versions, and the name of the file. Without the file name, git diff compares every file in the two commits.
$ git diff $README3 $README4 -- README.md
The second type of incantation compares an arbitrary version against the version in HEAD.
For this incantation, provide the hash of the commit that contains the desired version to use as the basis for comparison, HEAD, and the name of the file.
$ git diff $README3 HEAD -- README.md
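When many versions of one file need to be compared, it can be easier to export each version to its own file and use ordinary tools on them. The following is a minimal sketch of that idea in a scratch repository. The output directory and the file naming scheme (README.md.v1, README.md.v2, and so on) are my own assumptions for illustration.

```shell
# Scratch repository with three versions of README.md; all names are hypothetical.
repo="$(mktemp -d)" && cd "$repo" && git init -q .
git config user.name "Example" && git config user.email "example@example.com"
echo v1 > README.md && git add README.md && git commit -q -m "add README"
echo v2 > README.md && git commit -q -am "second version"
echo v3 > README.md && git commit -q -am "third version"

out="$(mktemp -d)"    # directory that receives the exported versions
n=0
# Walk the commits that touched the file, oldest first.
for c in $(git log --reverse --format=%h -- README.md); do
  n=$((n + 1))
  git show "$c:README.md" > "$out/README.md.v$n"
done

ls "$out"
diff "$out/README.md.v1" "$out/README.md.v3"   # any two versions can now be compared
```

Exporting does not touch the worktree or the index, so the repository under inspection is left as it was found.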
References
- Pro Git book by Scott Chacon and Ben Straub.
- Git Cookbook by Dennis Kaarsemaker.