Last modified 2023-03-23.
gitcollection, categorized under Azure, Git, GitHub, Java, Python, Ruby, Software-Expert.
When working as a software expert witness,
I normally use several programs to analyse git repositories,
along with enhancement programs.
Sometimes I write specialized programs for analysing certain aspects of git repos.
For simple tasks,
bash might be a good choice.
More complex tasks are better implemented in Python
Git CLI vs. Programmatic Access
One of the major differences between using the
git command-line interface (CLI) and using a
language binding like
rugged is how state is managed.
When using the
git CLI, many queries are only possible after changing the public state of the repo.
For example, if you want to look at the contents of a file on a given branch, you would need to make that branch current
before viewing the file in the working tree.
This would affect everyone else who might be accessing that instance of the repository,
and the current branch would remain set after you viewed the file.
In contrast, when a program uses a language binding, any state changes needed to make a query are local to that program. The program only affects publicly visible state when it explicitly saves changes to that state. State changes made for queries need not be published.
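As a minimal sketch of a query that leaves the working tree and the current branch untouched, the following assumes the rugged gem is installed. The method name `file_on_branch` is hypothetical; `branches`, `target`, `tree`, `path`, `lookup` and `content` are rugged calls as I understand them.

```ruby
# Sketch only: `repo` is expected to be a Rugged::Repository.
# No checkout is performed, so other users of the repository are unaffected.

# Returns the contents of `path` at the tip of `branch_name`,
# or nil if the branch does not exist or the entry is not a regular file.
# With rugged, a missing path raises Rugged::TreeError.
def file_on_branch(repo, branch_name, path)
  branch = repo.branches[branch_name]
  return nil if branch.nil?

  entry = branch.target.tree.path(path) # Hash that includes :oid and :type
  return nil unless entry[:type] == :blob

  repo.lookup(entry[:oid]).content
end

# Possible usage (the path and branch name are placeholders):
#   require 'rugged'
#   repo = Rugged::Repository.new('.')
#   puts file_on_branch(repo, 'master', 'README.md')
```

The lookup goes straight from the branch to its commit, tree and blob, which is why nothing in the working tree has to change.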
Language bindings such as rugged are built on the libgit2 library.
libgit2 is implemented in the C language,
and it is compatible with C++.
Other languages also have libraries that provide bindings for libgit2.
Java has several libgit2 wrapper libraries;
however, they all use the Java Native Interface (JNI) to invoke libgit2.
Java is notorious for poor performance and memory safety issues when invoking C libraries via JNI.
This means that Java wrappers for
libgit2 are not used much.
Java’s Project Panama is still evolving –
perhaps one day a better Java wrapper for
libgit2 will emerge.
GitHub, GitLab and Azure DevOps are all built on libgit2.
You can examine the
GitLab source code
to see for yourself.
Low- to High-Level User Interfaces
libgit2 works at the level of git’s low-level interface,
which I discussed in
Low Level Git Commands (‘Plumbing Internals’).
If you are unfamiliar with
git’s low-level plumbing,
libgit2 and its language bindings will probably be confusing and frustrating.
diff delta patch hunk line
The libgit2 API defines a hierarchy of terms.
Knowing these definitions greatly helps one
understand how to work with libgit2
and language bindings to that API.
A diff consists of deltas and their patches, a patch consists of hunks, and a hunk consists of lines.
The terms are defined below, along with the names of the Ruby classes that implement them. I have paraphrased the documentation where appropriate.
A diff represents the cumulative list of differences between two snapshots of a repository, possibly filtered by a set of file name patterns.
A diff contains a list of deltas.
A delta contains a description of changes to one file or rename operation.
It might also contain helpful information about the entry if you request it.
This optional information includes a similarity score and a binary flag.
A delta contains one or two hashes for a changed file, defining the
old_file and new_file characteristics. Although the two sides of the delta are named
old_file and new_file, they may actually correspond to entries that represent a file, a symbolic link, a submodule commit id, or a tree if you are tracking type changes or ignored/untracked directories.
The primary accessors are:
new_file (absent if the file was deleted).
old_file (absent if the file was just created).
status is a symbol that describes the kind of change.
similarity is the similarity score, which is zero unless git_diff_find_similar() has been called. See the
libgit2 documentation for more information.
binary indicates if this is a binary file.
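The accessors above can be collected into one row per delta. This is a sketch, assuming the object passed in behaves like a Rugged::Diff, that is, it responds to `each_delta` and yields objects with the accessors listed. The method name `summarize_deltas` is hypothetical.

```ruby
# Sketch only: works with any object that responds to each_delta,
# such as a Rugged::Diff (assumes the rugged gem for real use).
def summarize_deltas(diff)
  summaries = []
  diff.each_delta do |delta|
    summaries << {
      status: delta.status,                                 # a Symbol
      old_path: delta.old_file && delta.old_file[:path],    # "from" side
      new_path: delta.new_file && delta.new_file[:path],    # "to" side
      similarity: delta.similarity,                         # 0 unless similarity detection ran
      binary: delta.binary
    }
  end
  summaries
end

# Possible usage with rugged (not executed here):
#   require 'rugged'
#   repo   = Rugged::Repository.new('.')
#   commit = repo.head.target
#   diff   = commit.parents.first.diff(commit)
#   diff.find_similar!   # rugged's wrapper for git_diff_find_similar()
#   summarize_deltas(diff).each { |row| p row }
```

Returning plain hashes keeps the summary independent of the binding, which is convenient when the results need to be written to a report.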
A patch contains a list of hunks.
A hunk contains a list of modified
lines in a
diff, along with context, resulting from a single change in a file. You can configure the amount of context and other properties of how hunks are generated. Each hunk includes a header that describes where it starts and ends in both the old and new versions in the delta.
The header is a
String that summarizes other
hunk properties, and might look like
"@@ -1,16 +1,8 @@\n".
See the libgit2 documentation for more information about hunk properties.
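To show what a hunk header encodes, here is a small sketch that parses the header string with a regular expression, using only the Ruby standard library. The constant `HUNK_HEADER` and the method `parse_hunk_header` are hypothetical names. The sketch relies on the unified diff convention that an omitted line count means 1.

```ruby
# Matches headers such as "@@ -1,16 +1,8 @@\n".
# The line counts are optional in the unified diff format.
HUNK_HEADER = /\A@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@/

# Returns a Hash describing where the hunk starts and how many lines it
# spans in the old and new versions, or nil if the header does not match.
def parse_hunk_header(header)
  match = HUNK_HEADER.match(header)
  return nil if match.nil?

  {
    old_start: match[1].to_i,
    old_lines: (match[2] || 1).to_i,
    new_start: match[3].to_i,
    new_lines: (match[4] || 1).to_i
  }
end

p parse_hunk_header("@@ -1,16 +1,8 @@\n")
# prints {:old_start=>1, :old_lines=>16, :new_start=>1, :new_lines=>8}
# (newer Ruby versions print the same Hash as {old_start: 1, ...})
```

When using a binding, the same numbers are normally available as hunk properties, so parsing the header is mainly useful for understanding it or for working with diff text produced elsewhere.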
A line is a portion of the data within a hunk. For text files, a
line is simply a line of the hunk text; for binary files, a line is a data span.
The encoding of data in the file being diffed is not known, so
line content can only be parsed after first examining the actual file.
Line data is not NUL-byte terminated, because it just consists of a span of bytes inside a file.
The line_origin property has type
Symbol, and can have the following values:
:context lines exist in both the old and new versions. The
context? method returns true if line_origin is :context.
:addition lines only exist in the new version. The
addition? method returns true if line_origin is :addition.
:deletion lines only exist in the old version. The
deletion? method returns true if line_origin is :deletion.
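The patch, hunk and line levels can be walked with nested iterators. The sketch below tallies lines by their line_origin symbol. It assumes an object that behaves like a rugged patch, responding to `each_hunk`, whose hunks respond to `each_line`. The method name `count_line_origins` is hypothetical.

```ruby
# Sketch only: counts the lines of a patch, grouped by line_origin.
# Works with any object shaped like a rugged patch
# (each_hunk yields hunks, each_line yields lines).
def count_line_origins(patch)
  counts = Hash.new(0)
  patch.each_hunk do |hunk|
    hunk.each_line do |line|
      counts[line.line_origin] += 1
    end
  end
  counts
end

# Possible usage with rugged (not executed here):
#   diff.each_patch do |patch|
#     puts "#{patch.delta.new_file[:path]}: #{count_line_origins(patch)}"
#   end
```

Because the tally is keyed by whatever symbol line_origin returns, the same code also counts any other origin values that the binding reports.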