Git and libgit2

Introduction to libgit2

Published 2023-03-11. Last modified 2023-07-17.
Time to read: 4 minutes.

This page is part of the git collection.

When working as a software expert witness, I normally use several programs to analyse git repositories, including Git itself, and enhancement programs such as git-fame.

Sometimes I write specialized programs for analysing certain aspects of git repos. For simple tasks, bash might be a good choice. More complex tasks are better implemented in Python (using pygit2), or Ruby (using rugged).

Avoiding GPL Contamination

Git is the reference implementation of git functionality. Libgit2 and its language wrappers are a simplified approximation of the reference implementation. The reason libgit2 exists is that Git is published under the GPL 2.0-only license.

Libgit2 is licensed under a very permissive license (GPLv2 with a special Linking Exception). This means that you can link against the library with any kind of software without making that software fall under the GPL. Changes to libgit2 would still be covered under its GPL license. Additionally, the example code has been released to the public domain (see the separate license for more information).
 – From Libgit2 README

Projects that use a dependency with the GPL 2.0-only license are ‘contaminated’ with the terms of the license, and must themselves also be published with the same license. That would be unacceptable for most commercial entities. This article discusses the topic in depth: Open Source Licenses to Avoid - Steps to Prevent the Legal Risk [2023].

Git CLI vs. Programmatic Access

One of the major differences between using the Git command-line interface (CLI) and using a libgit2 language binding like rugged is how state is managed. When using the Git CLI, many queries are usually answered by first changing the public state of the repository.

For example, the familiar way to look at the contents of a file on a given branch is to check out that branch and then view the file in the working tree. (Commands such as git show <branch>:<path> can avoid the checkout, but checkout-based workflows are common.) A checkout affects everyone else who is using that working tree, and the branch remains checked out after you have viewed the file.

In contrast, when a program uses libgit2, any state changes needed to answer a query are local to that program. The program affects publicly visible state only when it explicitly saves changes to that state. State changes made for queries need not be published.

Libgit2 Language Bindings

Both pygit2 and rugged are built on the libgit2 API. Because libgit2 is implemented in the C language, it is compatible with C++. Other languages also have libraries that provide bindings for libgit2, for example Julia, Go, .NET, Node, and Rust.

Java has libgit2 wrapper libraries, including jagged and Git24J, but they use the Java Native Interface (JNI) to invoke libgit2. JGit, which is often mentioned alongside them, is a pure-Java implementation of Git rather than a libgit2 wrapper. Java is notorious for poor performance and memory safety issues when invoking C libraries via JNI, so Java wrappers for libgit2 are not used much. Java’s Project Panama is still evolving – perhaps one day a better Java wrapper for libgit2 will emerge.

GitHub, GitLab and Azure DevOps are all built on libgit2. You can examine the GitLab source code to see for yourself.

Low- to High-Level User Interfaces

libgit2 exposes Git’s low-level interface, which I discussed in Low Level Git Commands (‘Plumbing Internals’). If you are unfamiliar with Git’s low-level plumbing, working with libgit2 and its language bindings will probably be confusing and frustrating.

Terminology

diff
  delta
    patch
      hunk
        line

The Libgit2 API defines a hierarchy of terms. Knowing these definitions greatly helps one understand how to work with libgit2, and language bindings to that API.

A diff consists of deltas, which contain patches, which contain hunks, which contain lines.
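
The nesting can be pictured as plain data structures. The following is a minimal Python sketch that mirrors the hierarchy exactly as described above. The class and field names are hypothetical, chosen for illustration; they are not the Rugged, pygit2, or libgit2 types, and the line markers are simply the prefix characters used in unified diff text.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Line:
    marker: str   # unified-diff prefix: " " (context), "+" or "-"
    content: str


@dataclass
class Hunk:
    header: str
    lines: List[Line] = field(default_factory=list)


@dataclass
class Patch:
    hunks: List[Hunk] = field(default_factory=list)


@dataclass
class Delta:
    old_path: str
    new_path: str
    patches: List[Patch] = field(default_factory=list)


@dataclass
class Diff:
    deltas: List[Delta] = field(default_factory=list)


def iter_lines(diff: Diff) -> Iterator[Line]:
    """Walk diff -> delta -> patch -> hunk -> line, yielding every line."""
    for delta in diff.deltas:
        for patch in delta.patches:
            for hunk in patch.hunks:
                yield from hunk.lines


# A hand-built diff with one delta, one patch, one hunk, and three lines.
example = Diff(deltas=[
    Delta(
        old_path="README.md",
        new_path="README.md",
        patches=[Patch(hunks=[Hunk(
            header="@@ -1,2 +1,2 @@\n",
            lines=[
                Line(" ", "Title\n"),
                Line("-", "old text\n"),
                Line("+", "new text\n"),
            ],
        )])],
    ),
])

print(sum(1 for _ in iter_lines(example)))  # prints 3
```

Real bindings expose richer objects, but the traversal order is the point here: each level is reached only through the level above it.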

The terms are defined below, along with the names of the Ruby classes that implement them. I have paraphrased the documentation where appropriate.

diff (Rugged::Diff)
A diff represents the cumulative list of differences between two snapshots of a repository, possibly filtered by a set of file name patterns.

A diff contains a list of deltas.
delta (Rugged::Diff::Delta)
A delta contains a description of changes to one file or rename operation. It might also contain helpful information about the entry if you request it. This optional information includes a similarity score and a binary flag.

A delta contains one or two hashes for a changed file, defining the old_file and new_file characteristics. Although the two sides of the delta are named old_file and new_file, they may actually correspond to entries that represent a file, a symbolic link, a submodule commit id, or a tree if you are tracking type changes or ignored/untracked directories.

The primary accessors are:
  • new_file (absent if the file was deleted).
  • old_file (absent if the file was just created).
  • status is a symbol, like :modified.
The return values of the following accessors are computationally expensive, so they are only computed if you first call git_diff_find_similar(). See the libgit2 documentation for more information.
  • binary indicates if this is a binary file.
  • similarity score.
patch (Rugged::Patch)
A patch contains a list of hunks.
hunk (Rugged::Diff::Hunk)
A hunk contains a list of modified lines in a diff, along with context, resulting from a single change in a file. You can configure the amount of context and other properties of how hunks are generated. Each hunk includes a header that describes where the hunk starts and ends in both the old and new versions in the delta.

A hunk.header is a String that summarizes other hunk properties, and might look like "@@ -1,16 +1,8 @@\n".

See the libgit2 documentation for information about hunk properties.
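
The header string follows the unified diff convention: the numbers after the minus sign are the starting line and line count in the old version, and the numbers after the plus sign are the same for the new version. The sketch below parses such a header with Python's standard library. The function and field names are my own for illustration and are not part of any binding; a missing count is taken to mean 1, as in unified diff output.

```python
import re
from typing import NamedTuple, Optional

# Matches "@@ -old_start[,old_lines] +new_start[,new_lines] @@"
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


class HunkRange(NamedTuple):
    old_start: int
    old_lines: int
    new_start: int
    new_lines: int


def parse_hunk_header(header: str) -> Optional[HunkRange]:
    """Return the ranges encoded in a hunk header, or None if it does not match."""
    match = HUNK_HEADER.match(header)
    if match is None:
        return None
    old_start, old_lines, new_start, new_lines = match.groups()
    return HunkRange(
        int(old_start),
        int(old_lines) if old_lines is not None else 1,  # omitted count means 1
        int(new_start),
        int(new_lines) if new_lines is not None else 1,
    )


print(parse_hunk_header("@@ -1,16 +1,8 @@\n"))
# prints HunkRange(old_start=1, old_lines=16, new_start=1, new_lines=8)
```

For the example header, the hunk covers 16 lines starting at line 1 of the old version and 8 lines starting at line 1 of the new version. When a binding already exposes these values as hunk properties, prefer those over parsing the string.
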
line (Rugged::Diff::Line)
A line is a portion of the data within a hunk. For text files, a line is simply a line of the hunk text; for binary files, a line is a data span.

The encoding of data in the file being diffed is not known, so line content can only be parsed after first examining the actual file.

Also, line data will not be NUL-byte terminated, because it just consists of a span of bytes inside a file.

The line_origin property has type Symbol, and can have the following values:
  • :context lines exist in both the old and new versions. The context? method returns true if line_origin has the value :context.
  • :addition lines exist only in the new version. The addition? method returns true if line_origin has the value :addition.
  • :deletion lines exist only in the old version. The deletion? method returns true if line_origin has the value :deletion.
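
To make the three origins concrete, here is a small Python sketch that rebuilds the old and new sides of a hunk from its unified diff text. It assumes the usual prefix characters (a space for context, + for lines only in the new version, - for lines only in the old version). The function name is hypothetical, and the sketch works on text rather than on the line objects a binding would return.

```python
from typing import List, Tuple


def split_hunk(hunk_text: str) -> Tuple[List[str], List[str]]:
    """Return (old_lines, new_lines) reconstructed from unified-diff hunk text."""
    old_lines: List[str] = []
    new_lines: List[str] = []
    for raw in hunk_text.splitlines():
        marker, text = raw[:1], raw[1:]
        if marker == " ":      # context: present in both versions
            old_lines.append(text)
            new_lines.append(text)
        elif marker == "+":    # present only in the new version
            new_lines.append(text)
        elif marker == "-":    # present only in the old version
            old_lines.append(text)
        # Anything else (the "@@" header, "\ No newline at end of file",
        # or an empty string) is skipped.
    return old_lines, new_lines


hunk = "@@ -1,3 +1,3 @@\n alpha\n-beta\n+BETA\n gamma\n"
old, new = split_hunk(hunk)
print(old)  # prints ['alpha', 'beta', 'gamma']
print(new)  # prints ['alpha', 'BETA', 'gamma']
```

Context lines land on both sides, which is why the old and new versions share alpha and gamma while differing only in the changed line.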