Published 2023-03-11.
Last modified 2023-07-17.
Time to read: 4 minutes.
When working as a software expert witness,
I normally use several programs to analyse git repositories,
including Git itself,
and enhancement programs such as
git-fame.
Sometimes I write specialized programs for analysing certain aspects of git repos.
For simple tasks, bash might be a good choice.
More complex tasks are better implemented in Python
(using pygit2), or
Ruby (using rugged).
Avoiding GPL Contamination
Git is the reference implementation of git functionality.
Libgit2 and its language wrappers are a simplified approximation of the reference implementation.
The reason
libgit2 exists is that
Git is published under the GPL 2.0-only license.
Libgit2 is licensed under a very permissive license
(GPLv2 with a special Linking Exception).
This means that you can link against the library with any kind of software without making that software
fall under the GPL.
Changes to libgit2 would still be covered under its GPL license.
Additionally, the example code has been released to the public domain
(see the separate license for more information).
Projects that use a dependency with the GPL 2.0-only license are ‘contaminated’ with the terms of the license, and must themselves also be published with the same license. That would be unacceptable for most commercial entities. This article discusses the topic in depth: Open Source Licenses to Avoid - Steps to Prevent the Legal Risk [2023].
Git CLI vs. Programmatic Access
One of the major differences between using the Git command-line interface (CLI) and using a libgit2
language binding like rugged is how state is managed.
When using the Git CLI, many queries are typically made after changing the shared state of the repository's working tree.
For example, a common way to look at the contents of a file on a given branch is to check out that branch and then view the file in the working tree. This affects everyone else who might be using that working tree, and the branch remains checked out after you have viewed the file.
In contrast, when a program uses libgit2, any state changes it needs in order to make a query are local to that program. The program only affects publicly visible state when it explicitly saves changes to that state. State changes made for queries need not be published.
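As a minimal sketch of the programmatic approach, the hypothetical helper below reads a file as it exists on a branch without checking anything out. The helper name read_file_on_branch, the branch name 'main' and the path 'README.md' are illustrative assumptions, and the repo argument is assumed to behave like a Rugged::Repository.

```ruby
# Hypothetical helper: returns the content of `path` as stored on `branch_name`,
# without touching HEAD, the index, or the working tree.
# `repo` is assumed to behave like a Rugged::Repository.
def read_file_on_branch(repo, branch_name, path)
  branch = repo.branches[branch_name]
  raise ArgumentError, "no such branch: #{branch_name}" if branch.nil?

  commit = branch.target            # the commit at the tip of the branch
  entry  = commit.tree.path(path)   # tree entry Hash; raises if the path is absent
  repo.lookup(entry[:oid]).content  # the blob's bytes
end

# Usage with Rugged (assumes the rugged gem is installed):
#   require 'rugged'
#   repo = Rugged::Repository.new('.')
#   puts read_file_on_branch(repo, 'main', 'README.md')
```

Nothing in this sketch writes to the repository, so other users of the same working tree should see no change.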
Libgit2 Language Bindings
Both pygit2 and rugged are built on the
libgit2 API.
Because libgit2 is implemented in the C language,
it is compatible with C++.
Other languages also have libraries that provide bindings for libgit2,
for example
Julia,
Go,
.NET,
Node,
and Rust.
Java has libgit2 wrapper libraries, including
jagged
and Git24J;
they use the Java Native Interface (JNI) to invoke libgit2.
JGit, by contrast, is a pure Java implementation of Git and does not wrap libgit2.
Java is notorious for poor performance and memory safety issues when invoking C libraries via JNI.
This means that Java wrappers for libgit2 are not widely used.
Java’s Project Panama is still evolving –
perhaps one day a better Java wrapper for libgit2 will emerge.
GitHub, GitLab and Azure DevOps are all built on libgit2.
You can examine the
GitLab source code
to see for yourself.
Low- to High-Level User Interfaces
libgit2 exposes Git’s low-level interface,
which I discussed in
Low Level Git Commands (‘Plumbing Internals’).
If you are unfamiliar with Git’s low-level plumbing,
working with libgit2 and its language bindings will probably be confusing and frustrating.
Terminology
diff → delta → patch → hunk → line
The libgit2 API defines a hierarchy of terms.
Knowing these definitions makes it much easier to
understand how to work with libgit2
and its language bindings.
A diff consists of deltas,
which contain patches,
which contain hunks,
which contain lines.
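The nesting can be sketched as a simple traversal. The helper below is hypothetical (summarize_hunks is not part of Rugged) and assumes a diff object that behaves like a Rugged::Diff: it responds to each_patch, its patches respond to delta and each_hunk, and its hunks respond to header and each_line.

```ruby
# Walks diff -> patch -> hunk -> line and returns one summary Hash per hunk.
# `diff` is assumed to behave like a Rugged::Diff.
def summarize_hunks(diff)
  summaries = []
  diff.each_patch do |patch|
    delta = patch.delta
    # Either side may be missing (e.g. for added or deleted files), so fall back.
    path = (delta.new_file || delta.old_file)[:path]
    patch.each_hunk do |hunk|
      line_count = 0
      hunk.each_line { |_line| line_count += 1 }
      summaries << { path: path, header: hunk.header.chomp, lines: line_count }
    end
  end
  summaries
end

# Usage with Rugged (assumes the rugged gem is installed):
#   require 'rugged'
#   repo = Rugged::Repository.new('.')
#   head = repo.head.target
#   diff = head.parents.first.diff(head)
#   summarize_hunks(diff).each { |summary| puts summary }
```

The traversal only reads from the objects it is given, so it works the same way on a diff between any two snapshots.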
The terms are defined below, along with the names of the Ruby classes that implement them. I have paraphrased the documentation where appropriate.
- diff (Rugged::Diff)
  A diff represents the cumulative list of differences between two snapshots of a repository, possibly filtered by a set of file name patterns.
  A diff contains a list of deltas.
- delta (Rugged::Diff::Delta)
  A delta contains a description of changes to one file or rename operation.
  It might also contain helpful information about the entry if you request it.
  This optional information includes a similarity score and a binary flag.
  A delta contains one or two hashes for a changed file, defining the old_file and new_file characteristics. Although the two sides of the delta are named old_file and new_file, they may actually correspond to entries that represent a file, a symbolic link, a submodule commit id, or a tree if you are tracking type changes or ignored/untracked directories.
  The primary accessors are:
  - new_file (absent if the file was deleted).
  - old_file (absent if the file was just created).
  - status is a symbol, like :modified. See git_diff_find_similar() in the libgit2 documentation for more information.
  - binary indicates if this is a binary file.
  - similarity score.
- patch (Rugged::Patch)
  A patch contains a list of hunks.
- hunk (Rugged::Diff::Hunk)
  A hunk contains a list of modified lines in a diff, along with context, resulting from a single change in a file. You can configure the amount of context and other properties of how hunks are generated. Each hunk includes a header that describes where it starts and ends in both the old and new versions in the delta.
  A hunk.header is a String that summarizes other hunk properties, and might look like "@@ -1,16 +1,8 @@\n".
  See the libgit2 documentation for information about hunk properties.
- line (Rugged::Diff::Line)
  A line is a portion of the data within a hunk. For text files, a line is simply a line of the hunk text; for binary files, a line is a data span.
  The encoding of data in the file being diffed is not known, so line content can only be parsed after first examining the actual file.
  Also, line data will not be NUL-byte terminated, because it just consists of a span of bytes inside a file.
  The line_origin property has type Symbol, and can have the following values:
  - :context lines exist in both the old and new versions. The context? method returns true if line_origin has value :context.
  - :addition lines only exist in the new version. The addition? method returns true if line_origin has value :addition.
  - :deletion lines only exist in the old version. The deletion? method returns true if line_origin has value :deletion.
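To make the hunk header and line origins concrete, here is a small sketch in plain Ruby. Both helpers (parse_hunk_header and count_by_origin) are hypothetical names introduced for illustration. The first assumes the header follows the usual unified diff form shown above. The second accepts any collection of objects that respond to line_origin, such as the lines of a hunk.

```ruby
# Hypothetical helper: extracts the start line and line count for each side
# of a unified-diff hunk header such as "@@ -1,16 +1,8 @@\n".
# A count may be omitted when it equals 1 (e.g. "@@ -3 +3 @@").
def parse_hunk_header(header)
  match = /\A@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@/.match(header)
  raise ArgumentError, "not a hunk header: #{header.inspect}" unless match

  {
    old_start: match[1].to_i,
    old_lines: (match[2] || 1).to_i,
    new_start: match[3].to_i,
    new_lines: (match[4] || 1).to_i
  }
end

# Hypothetical helper: counts lines per line_origin value for any collection
# of objects that respond to line_origin.
def count_by_origin(lines)
  lines.each_with_object(Hash.new(0)) { |line, counts| counts[line.line_origin] += 1 }
end

# For the example header: old side starts at line 1 and spans 16 lines,
# new side starts at line 1 and spans 8 lines.
p parse_hunk_header("@@ -1,16 +1,8 @@\n")
```

Rugged hunks also expose old_start, old_lines, new_start and new_lines directly, so parsing the header is mainly useful when only the header text is available.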
-