Published 2025-01-06.
Last modified 2026-02-01.
Time to read: 9 minutes.
git collection.
- Git Large File System Overview
- How Many Copies Of Large Files on Git LFS Clients?
- Git LFS Client Installation
- Git LFS Server URLs
- Git-ls-files, Wildmatch Patterns and Permutation Scripts
- Git LFS Tracking, Migration and Un-Migration
- Git LFS Client Configuration & Commands
- Working With Git LFS
Instructions for typing along are given for Ubuntu and WSL/Ubuntu. If you have a Mac, the compiled Go programs provided on GitHub should install easily, and most of the textual information should be helpful.
Gemini and I had the following discussion.
When a user checks out a large file using Git LFS, how many copies of the large file exist on their computer? Under which circumstances might they have more than one copy?
In a standard Git LFS setup on a workstation, every tracked large file exists at least twice on your local disk.
This is a fundamental part of its design, and it is the primary reason why it is often unsuitable for real-time media workstations with limited SSD space.
1. The Two-Copy Reality
When you perform a git checkout, Git LFS places each large file in two distinct locations:
- The LFS Cache (The “Secret” Copy): Stored inside .git/lfs/objects/. This is the “source of truth” for the specific version you downloaded. It is named by its SHA-256 hash (e.g., ad1b8...).
- The Working Directory (The “Real” Copy): This is the file you actually see and open in your DAW or video editor (e.g., shot_01.mov).
Why does it do this? Git LFS follows Git’s internal philosophy: the
repository’s database (the cache) must remain separate from the user’s
workspace. When you switch branches, Git LFS deletes the file in your working
directory and copies a different version from the .git/lfs cache.
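You can verify both copies on disk yourself; the file name and the size filter below are placeholders for your own assets:
find .git/lfs/objects -type f -size +100M -exec ls -lh {} +   # the hash-named cache copies
ls -lh shot_01.mov                                            # the working-tree copy your editor opens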
2. Circumstances for More (or Fewer) Copies
While two copies is the default, your workstation might end up with more—or slightly fewer—depending on these specific technical scenarios:
A. More than two: Resolving Merge Conflicts
If you have a merge conflict on a 250MB video file, you may temporarily have three or more copies.
- The version from the main branch.
- The version from your feature branch.
- The common ancestor (if you are running a 3-way diff).
- The “resolved” version you are currently writing.
B. More than two: Multiple Branches in Cache
By default, Git LFS does not delete old files from the .git/lfs/objects folder
when you switch branches.
- If you have 10 versions of a 250MB file across your history and you have checked them all out at some point, you will have 10 copies in the cache plus 1 copy in your working directory.
- Fix: You must manually run git lfs prune to delete the versions that aren’t currently checked out.
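Before pruning, it is worth previewing what will be removed and checking how much the cache currently holds:
du -sh .git/lfs/objects     # total size of the local LFS cache
git lfs prune --dry-run     # list what would be deleted, without deleting anything
git lfs prune               # actually remove objects not needed by recent checkouts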
C. Only one copy: Using git lfs dedup (The Exception)
On specific filesystems that support Reflinks (APFS on Mac, ReFS on Windows, or Btrfs/XFS on Linux), you can run a manual command:
git lfs dedup
This command attempts to “de-duplicate” the two copies by telling the OS to point both the cache and the working directory to the same physical blocks on the SSD.
- The Catch: This is not automatic during a standard git checkout. Every time you switch branches, Git LFS will revert to making a full copy, and you would have to run dedup again to reclaim the space.
Summary of the Space Penalty
For your 250MB video files, a standard Git LFS workflow effectively treats them as 500MB of consumed space on your drive.
| Tool | Copies on Disk | Workflow Impact |
|---|---|---|
| Git LFS | 2+ | Significant SSD bloat; slow checkouts due to “smudge” copy time. |
| XVC / DVC | 1 | Uses hardlinks/reflinks automatically; checkout is nearly instant. |
but the copy in .git/objects/ is just a small reference (smudge) file
To be precise about the “two copies” mechanic:
1. The Pointer (Smudge) File
There is a small text-based pointer file (the “smudge” target), which lives in the Git Object Database (.git/objects/).
- Size: ~100 bytes.
- Content: The SHA-256 hash and the file size.
- Role: This is what Git commits, pushes, and pulls.
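For concreteness, git lfs pointer prints the pointer text Git would store for any file; the file name, hash, and size below are illustrative:
git lfs pointer --file=shot_01.mov
# version https://git-lfs.github.com/spec/v1
# oid sha256:ad1b8...
# size 262144000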
2. The Local LFS Cache (The First Full Copy)
When you git pull, Git LFS downloads the actual 250MB binary from the server.
It stores this full file in:
.git/lfs/objects/[hash-prefix]/[full-hash]
- Size: 250MB (The actual binary).
- Role: This is your local “Source of Truth.” Git LFS keeps this here so that if you switch branches and then come back, it doesn’t have to download the 250MB from the internet again.
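This split is also why fetching and materializing can be run as separate steps: fetch only fills the cache over the network, and checkout only copies from the cache into the working tree. The branch name below is illustrative:
git lfs fetch origin main   # network -> .git/lfs/objects (cache only)
git lfs checkout            # cache -> working directory, no network needed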
3. The Working Directory (The Second Full Copy)
When you run git checkout, the Git LFS smudge filter is triggered. It
reads the 100-byte pointer from Git, looks up the 250MB binary in your
.git/lfs/objects/ cache, and copies it into your project folder.
- Size: 250MB.
- Role: This is the file your DAW or video editor actually opens.
When do you have more than 2 copies?
These scenarios increase the duplication:
- Multiple Branches: If you have 5 different edits of the same 250MB video file, and you have checked them out at different times, you will have 5 full copies in .git/lfs/objects/ and 1 copy in your working directory. Total: 1.5 GB for one active file.
- Staging Area (The “Index”): During a git add, Git LFS runs the clean filter. It creates a temporary copy while hashing the file to move it into the LFS storage.
- Cross-Drive Projects: If your .git folder is on a different drive (or partition) than your working files, the OS is forced to perform a full byte-for-byte copy during checkout.
The “Immutable” Optimization You’re Looking For
Because your large files are immutable, you correctly identify that we should only need one copy.
- Git LFS fails here because its default behavior is cp (copy).
- DVC/XVC/Dud succeed here because they use ln (hard link) or reflink.
If you were to stay with Git LFS, you would have to manually run git lfs dedup
after every checkout to turn those two 250MB copies into a single 250MB physical
allocation via reflinks. This is why a tool like XVC is better for you; it
makes that “one copy” behavior the automated default.
Gemini’s suggestion of running dedup after every checkout
was not practical, and the explanation was so over-simplified as to be terrible
advice. Yes, hardlinks are possible, but the machinery coordinated by the Git
client and the Git LFS client extension would need to be replaced. I
eventually ended up doing that for a project; it was a large and complex
setup with many moving parts.
is it true that all git lfs client-side traffic must flow through the git client? how might that be a bottleneck for git lfs signed urls?
Actually, it is not true that all Git LFS traffic flows through the Git client. Understanding this distinction is key to diagnosing the performance issues you’re seeing on your workstation.
1. The “Control Plane” vs. “Data Plane”
Git LFS separates the negotiation from the transfer.
- Control Plane (Git Client): When you run git pull, the Git client talks to the Git server (like GitHub) to exchange those small 100-byte pointer files. Git itself handles the versioning logic.
- Data Plane (LFS Client): Once Git encounters a pointer, it triggers the Git LFS agent (a separate binary). This agent talks directly to the LFS server (often an S3 bucket or Artifactory) using the Batch API.
The data transfer happens directly between the git-lfs binary and the
storage backend. It does not go through the main git process.
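You can see this split on your own repository: git lfs env reports the LFS endpoint the data plane will talk to, separately from the Git remote. The output line below is an abridged, illustrative sketch:
git lfs env | grep Endpoint
# Endpoint=https://git.example.com/myrepo.git/info/lfs (auth=basic)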
2. How Signed URLs Bypass the Server
Modern LFS implementations (like GitHub or GitLab) use Signed URLs specifically to avoid bottlenecks.
- Request: Your LFS client asks the LFS server: “I need the bytes for SHA-256: ad1b8....”
- Redirect: The server doesn’t send the data. It sends back a Temporary Signed URL (e.g., an AWS S3 URL with an authentication token).
- Direct Download: Your LFS client then makes a standard HTTP GET request directly to S3.
The data never touches the Git server. It flows from S3 directly to your
git-lfs process and then to your SSD.
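For reference, here is a hedged sketch of the Batch API exchange that happens behind the scenes; the endpoint, token, OID, and size are placeholders, not real values:
curl -s -X POST \
  -H "Accept: application/vnd.git-lfs+json" \
  -H "Content-Type: application/vnd.git-lfs+json" \
  -H "Authorization: Bearer <token>" \
  -d '{"operation": "download", "transfers": ["basic"], "objects": [{"oid": "ad1b8...", "size": 262144000}]}' \
  https://git.example.com/myrepo.git/info/lfs/objects/batch
# A typical response carries one short-lived signed URL per object, roughly:
# {"objects": [{"oid": "ad1b8...", "size": 262144000,
#   "actions": {"download": {"href": "https://s3.example.com/...signed...", "expires_in": 3600}}}]}
# The LFS client then issues a plain GET against that href; the Git server never touches the bytes.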
3. The Real Bottleneck: The “Smudge” Pipeline
The bottleneck you’re likely experiencing isn’t the network traffic—it’s the OS-level IPC (Inter-Process Communication).
Even though the data doesn’t flow through the Git server, it must flow through a Pipe if you use the default “Smudge” filter:
- The Pipe Bottleneck: Git expects to receive the file contents on STDOUT from the smudge filter to write it to your working directory. For a 250MB video, git-lfs is effectively “piping” those 250MB to git.
- Sequential Processing: By default, Git runs this process one file at a time. If you have 100 video files, Git starts the LFS agent, feeds it a pointer, waits for the 250MB download/pipe, closes it, and repeats. This is a massive serial bottleneck, especially on Windows where process creation is expensive.
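For reference, these are the filter entries a standard git lfs install writes into your global Git config; inspecting them on your own machine shows exactly what sits between git checkout and your working tree:
git config --get-regexp '^filter\.lfs'
# filter.lfs.clean     git-lfs clean -- %f
# filter.lfs.smudge    git-lfs smudge -- %f
# filter.lfs.process   git-lfs filter-process
# filter.lfs.required  true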
4. How to Bypass the Bottleneck
Because you are on a high-performance workstation, you should stop relying on the “automatic” smudge during clone/pull.
The “Systems Programmer” approach to LFS speed:
- Skip Smudging: git lfs install --skip-smudge. This tells Git to only download the 100-byte pointers. Your git pull will be near-instant.
- Bulk Pull: Run git lfs pull manually afterward.
  - Why? Unlike the smudge filter, git lfs pull uses the Batch API to download multiple files in parallel (usually 3–10 at a time). It writes directly to the .git/lfs/objects folder and then to your workspace, bypassing the git pipe entirely.
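A minimal end-to-end version of that routine, using a placeholder repository URL, looks like this:
# Clone pointers only (one-off equivalent of --skip-smudge for this clone)
GIT_LFS_SKIP_SMUDGE=1 git clone https://git.example.com/media-project.git
cd media-project
# Fetch the binaries in parallel, straight into .git/lfs/objects and the working tree
git lfs pull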
Summary for your 250MB files:
The sluggishness you feel is the Git LFS agent being forced to talk through a
single-threaded pipe to the Git process. By skipping the smudge and using xvc
or git lfs pull, you move to a parallel, direct-to-disk architecture.
Would you like me to show you the Git config to increase the
lfs.concurrenttransfers limit so your 250MB downloads actually saturate your
network link?
yes
To increase the speed of your 250MB video downloads and bypass the sequential “smudge” bottleneck, you can tune your Git LFS configuration for high-concurrency parallel transfers.
1. Increase Concurrent Transfers
By default, Git LFS downloads only 3 to 8 files at once (depending on your version). On a high-bandwidth workstation, you can safely increase this to saturate your link.
Run this command to set the limit globally:
git config --global lfs.concurrenttransfers 64
- The Logic: This tells the LFS agent to initiate 64 simultaneous HTTP requests to your storage backend (e.g., S3 via Signed URLs). For 250MB files, this ensures that while one file is finishing, dozens of others are already mid-stream.
- Note: Some servers may rate-limit you if this is set too high (e.g., above 100), but 64 is generally a “sweet spot” for modern systems.
2. The “Fast Path” Workflow
To actually see the benefit of this concurrency, you must avoid the git
checkout smudge filter (which is serial) and use the Batch API via lfs
pull.
The High-Performance Routine:
- Configure LFS to skip the smudge filter:
git config --global filter.lfs.smudge "git-lfs smudge --skip -- %f"
This ensures git pull only downloads the tiny pointers, making your Git
operations instant.
- Pull Data in Parallel:
When you actually need the video files for your DAW:
git lfs pull
Because lfs pull is a standalone command, it uses the concurrenttransfers
setting to download your video assets in parallel directly to your disk.
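If you only need part of the tree for a session, you can narrow the pull with include patterns; the path below is just an example:
git lfs pull --include="video/"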
3. Verification
You can verify your current active settings by running:
git lfs env | grep ConcurrentTransfers
Why this fixes the Signed URL bottleneck:
When using Signed URLs, the LFS server hands you a list of 100+ URLs at once.
- The Default Way: Git LFS downloads URL #1, finishes it, then starts URL #2. Your 10Gbps link sits at 5% utilization.
- The Concurrent Way: Git LFS fires off requests for URLs #1 through #64 simultaneously. Your network card finally has enough work to do, and your 250MB files arrive in “waves” rather than a slow trickle.
Would you like me to show you how to set up a “transfer agent” in Go if you find that the standard Git LFS client is still too slow for your specific workstation needs?
yes
To gain ultimate control over your 250MB transfers and bypass the standard “smudge” bottleneck, you can write a Custom Transfer Agent in Go.
The Git LFS client supports a JSON-based protocol over stdin/stdout.
When Git LFS needs to move data, it spawns your Go binary and sends it specific
commands. This allows you to implement your own logic (e.g., direct S3 multipart
uploads, peer-to-peer syncing, or custom encryption) while Git LFS handles the
metadata.
1. The Protocol Flow
Git LFS communicates with your agent in three stages:
- Initiation: Exchange capabilities (version check).
- Commands: download, upload, or terminate.
- Progress/Results: Your agent reports back as bytes move.
2. Go Implementation Skeleton
A basic Go agent needs to handle the JSON stream. Here is the architectural skeleton:
package main

import (
    "bufio"
    "encoding/json"
    "os"
)

// Message mirrors the JSON messages exchanged with git-lfs over stdin/stdout.
type Message struct {
    Event  string `json:"event"`
    Oid    string `json:"oid,omitempty"`
    Size   int64  `json:"size,omitempty"`
    Path   string `json:"path,omitempty"`
    Action string `json:"action,omitempty"` // simplified; the real download message carries an object here
}

func main() {
    reader := bufio.NewReader(os.Stdin)

    // 1. Initiation: Read the init message and respond with {}
    _, _ = reader.ReadString('\n')
    os.Stdout.WriteString("{}\n")

    // 2. Main Loop: Handle transfer requests until git-lfs says terminate
    for {
        line, err := reader.ReadString('\n')
        if err != nil {
            break
        }
        var msg Message
        json.Unmarshal([]byte(line), &msg)

        if msg.Event == "terminate" {
            break
        }
        if msg.Event == "download" {
            // Perform your high-speed Go download logic here
            // reportProgress(msg.Oid, bytesSoFar)

            // Complete the transfer by telling git-lfs where the file landed
            resp, _ := json.Marshal(Message{Event: "complete", Oid: msg.Oid, Path: "/local/path/to/file"})
            os.Stdout.WriteString(string(resp) + "\n")
        }
    }
}
3. Configuration
Once your Go binary (e.g., my-lfs-agent) is compiled and in your path, you
tell Git LFS to use it:
# Define the agent path
git config lfs.customtransfer.myagent.path "my-lfs-agent"
# Set it as a standalone agent (bypasses the standard LFS API server)
git config lfs.standalonetransferagent myagent
# Optional: Disable concurrency if you want to handle it inside your Go code
git config lfs.customtransfer.myagent.concurrent false
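To confirm the wiring, one option (assuming your git-lfs version honors GIT_TRACE tracing, as recent releases do) is to look for the agent name in the trace output during a pull; myagent matches the name configured above:
GIT_TRACE=1 git lfs pull 2>&1 | grep -i myagent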
Why this is the “Pro” Move for your Workstation
- No Smudge Pipe: By returning the Path in the complete message, Git LFS moves your file directly into the cache. You are no longer streaming 250MB through a standard pipe.
- Internal Concurrency: You can use goroutines and context to manage your own worker pools, ensuring your 10Gbps link is fully utilized without Git managing the threads.
- Signed URL Optimization: If your backend uses Signed URLs, your Go agent can pre-fetch or refresh them dynamically, which the standard LFS client occasionally struggles with during long-running transfers.
Existing implementations to study:
- charmbracelet/git-lfs-transfer: A Go implementation of the SSH-based transfer protocol.
- nicolas-graves/lfs-s3: A specialized Go agent for S3 backends.