Published 2023-03-11.
Last modified 2023-07-17.
Time to read: 4 minutes.
When working as a software expert witness,
I normally use several programs to analyse git repositories,
including git itself, and enhancement programs such as git-fame.
Sometimes I write specialized programs for analysing certain aspects of git repos.
For simple tasks, bash might be a good choice.
More complex tasks are better implemented in Python (using pygit2), or Ruby (using rugged).
Avoiding GPL Contamination
Git is the reference implementation of git functionality.
Libgit2 and its language wrappers are a simplified approximation of the reference implementation.
The reason libgit2 exists is that git is published under the GPL 2.0-only license.
Libgit2 is licensed under a very permissive license
(GPLv2 with a special Linking Exception).
This means that you can link against the library with any kind of software without making that software
fall under the GPL.
Changes to libgit2 would still be covered under its GPL license.
Additionally, the example code has been released to the public domain
(see the separate license for more information).
Projects that use a dependency with the GPL 2.0-only license are ‘contaminated’ with the terms of the license, and must themselves also be published with the same license. That would be unacceptable for most commercial entities. This article discusses the topic in depth: Open Source Licenses to Avoid - Steps to Prevent the Legal Risk [2023].
Git CLI vs. Programmatic Access
One of the major differences between using the git command-line interface (CLI) and using a libgit2 language binding like rugged is how state is managed.
When using the git CLI, many queries are only possible after changing the public state of the repo.
For example, if you want to look at the contents of a file on a given branch, you would need to make that branch current
before viewing the file in the working tree.
This would affect everyone else who might be accessing that instance of git,
and the current branch would remain set after you viewed the file.
In contrast, when a program uses a libgit2 language binding, any state changes it needs in order to make a query are local to that program. The program only affects publicly visible state when it explicitly saves changes to that state. State changes made for queries need not be published.
Libgit2 Language Bindings
Both pygit2 and rugged are built on the libgit2 API.
Because libgit2 is implemented in the C language,
it is compatible with C++.
Other languages also have libraries that provide bindings for libgit2,
for example Julia, Go, .NET, Node, and Rust.
Java has several libgit2 wrapper libraries, including jagged and Git24J;
they use the Java Native Interface (JNI) to invoke libgit2.
JGit is different: it is a pure-Java implementation of git and does not wrap libgit2.
Java is notorious for poor performance and memory safety issues when invoking C libraries via JNI.
This means that Java wrappers for libgit2 are not widely used.
Java’s Project Panama is still evolving;
perhaps one day a better Java wrapper for libgit2 will emerge.
GitHub, GitLab and Azure DevOps are all built on libgit2.
You can examine the GitLab source code to see for yourself.
Low- to High-Level User Interfaces
Libgit2 exposes git’s low-level interface,
which I discussed in
Low Level Git Commands (‘Plumbing Internals’).
If you are unfamiliar with git’s low-level plumbing,
working with libgit2 and its language bindings will probably be confusing and frustrating.
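As a small taste of what plumbing-level work involves, here is a minimal sketch of how git derives the id of a blob object. It uses only Ruby's standard library, not libgit2 or rugged. The helper name blob_oid is hypothetical, and the sketch assumes a repository that uses git's default SHA-1 object format.

```ruby
require 'digest/sha1'

# Computes the id git assigns to a blob: the SHA-1 of a short header
# ("blob", a space, the size in bytes, then a NUL byte) followed by the
# raw content. Assumes the default SHA-1 object format.
# blob_oid is a hypothetical helper name, not part of any library.
def blob_oid(content)
  data = content.b # work on raw bytes, whatever the string's encoding
  Digest::SHA1.hexdigest("blob #{data.bytesize}\0".b + data)
end

puts blob_oid('') # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 (the empty blob)
```

For the same bytes, this should match what git hash-object --stdin reports in a SHA-1 repository. Trees, commits and tags are addressed the same way, with a different type name in the header.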
Terminology
diff delta patch hunk line
The libgit2 API defines a hierarchy of terms.
Knowing these definitions greatly helps one
understand how to work with libgit2,
and language bindings to that API.
A diff consists of deltas,
which contain patches,
which contain hunks,
which contain lines.
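To make the lower levels of this hierarchy concrete, the following sketch splits the hunk portion of a unified diff for a single file into hunks and lines. It is plain Ruby and does not use rugged. The Struct names, the parse_hunks helper and the sample patch text are hypothetical stand-ins chosen for this illustration, and each line is labelled by its leading character.

```ruby
# Toy model of the hunk and line levels of the hierarchy.
# These Structs are hypothetical stand-ins, not rugged classes.
Line = Struct.new(:origin, :content)
Hunk = Struct.new(:header, :old_start, :old_lines, :new_start, :new_lines, :lines)

# Matches a unified diff hunk header such as "@@ -1,16 +1,8 @@".
HUNK_HEADER = /\A@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@/

# The first character of each hunk line says where the line came from.
ORIGINS = { ' ' => :context, '+' => :addition, '-' => :deletion }.freeze

# Splits the hunk portion of a unified diff for ONE file into Hunk structs.
# Anything before the first hunk header is ignored.
def parse_hunks(patch_text)
  hunks = []
  patch_text.each_line do |raw|
    if (m = HUNK_HEADER.match(raw))
      # A line count that is omitted from the header means 1.
      hunks << Hunk.new(raw, m[1].to_i, (m[2] || 1).to_i,
                        m[3].to_i, (m[4] || 1).to_i, [])
    elsif hunks.any? && ORIGINS.key?(raw[0])
      hunks.last.lines << Line.new(ORIGINS[raw[0]], raw[1..-1])
    end
  end
  hunks
end

# Sample input: one hunk that replaces the second of three lines.
PATCH = [
  '@@ -1,3 +1,3 @@',
  ' first',
  '-second',
  '+2nd',
  ' third'
].join("\n") + "\n"

parse_hunks(PATCH).each do |hunk|
  puts hunk.header
  hunk.lines.each { |line| puts "#{line.origin}: #{line.content}" }
end
```

A real binding hands you these objects directly, so no text parsing is needed; the sketch only shows how a header and its lines relate to each other.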
The terms are defined below, along with the names of the Ruby classes that implement them. I have paraphrased the documentation where appropriate.
- diff (Rugged::Diff) -
  A diff represents the cumulative list of differences between two snapshots of a repository, possibly filtered by a set of file name patterns.
  A diff contains a list of deltas.
- delta (Rugged::Diff::Delta) -
  A delta contains a description of changes to one file or rename operation.
  It might also contain helpful information about the entry if you request it.
  This optional information includes a similarity score and a binary flag.
  A delta contains one or two hashes for a changed file, defining the old_file and new_file characteristics. Although the two sides of the delta are named old_file and new_file, they may actually correspond to entries that represent a file, a symbolic link, a submodule commit id, or a tree if you are tracking type changes or ignored/untracked directories.
  The primary accessors are:
  - new_file (absent if the file was deleted).
  - old_file (absent if the file was just created).
  - status is a symbol, like :modified.
  - binary indicates if this is a binary file.
  - similarity is the similarity score.
  Rename and copy detection, and the similarity score, require a call to git_diff_find_similar(). See the libgit2 documentation for more information.
- patch (Rugged::Patch) -
  A patch contains a list of hunks.
- hunk (Rugged::Diff::Hunk) -
  A hunk contains a list of modified lines in a diff, along with context, resulting from a single change in a file. You can configure the amount of context and other properties of how hunks are generated. Each hunk includes a header that describes where it starts and ends in both the old and new versions in the delta.
  A hunk.header is a String that summarizes other hunk properties, and might look like "@@ -1,16 +1,8 @@\n".
  See the libgit2 documentation for information about hunk properties.
- line (Rugged::Diff::Line) -
  A line is a portion of the data within a hunk. For text files, a line is simply a line of the hunk text; for binary files, a line is a data span.
  The encoding of data in the file being diffed is not known, so line content can only be parsed after first examining the actual file.
  Also, line data will not be NUL-byte terminated, because it just consists of a span of bytes inside a file.
  The line_origin property has type Symbol, and can have the following values:
  - :context - context lines exist in both the old and new versions. The context? method returns true if line_origin has value :context.
  - :addition - lines that only exist in the new version. The addition? method returns true if line_origin has value :addition.
  - :deletion - lines that only exist in the old version. The deletion? method returns true if line_origin has value :deletion.
-