Published 2025-01-06.
Last modified 2026-02-01.
Time to read: 9 minutes.
git collection.
- Git Large File System Overview
- How Many Copies Of Large Files on Git LFS Clients?
- Git LFS Client Installation
- Git LFS Server URLs
- Git-ls-files, Wildmatch Patterns and Permutation Scripts
- Git LFS Tracking, Migration and Un-Migration
- Git LFS Client Configuration & Commands
- Working With Git LFS
Instructions for typing along are given for Ubuntu and WSL/Ubuntu. If you have a Mac, the compiled Go programs provided on GitHub should install easily, and most of the textual information should be helpful.
Gemini and I had the following discussion.
When a user checks out a large file using Git LFS, how many copies of the large file exist on their computer? Under which circumstances might they have more than one copy?
In a standard Git LFS setup on a workstation, every tracked large file exists at least twice on your local disk.
This is a fundamental part of its design, and it is the primary reason why it is often unsuitable for real-time media workstations with limited SSD space.
1. The Two-Copy Reality
When you perform a git checkout, Git LFS places each large file in two distinct locations:
- The LFS Cache (The “Secret” Copy): Stored inside .git/lfs/objects/. This is the “source of truth” for the specific version you downloaded. It is named by its SHA-256 hash (e.g., ad1b8...).
- The Working Directory (The “Real” Copy): This is the file you actually see and open in your DAW or video editor (e.g., shot_01.mov).
Why does it do this? Git LFS follows Git’s internal philosophy: the
repository’s database (the cache) must remain separate from the user’s
workspace. When you switch branches, Git LFS deletes the file in your working
directory and copies a different version from the .git/lfs cache.
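You can verify both copies on disk yourself; the file name and the size filter below are placeholders for your own assets:
find .git/lfs/objects -type f -size +100M -exec ls -lh {} +   # the hash-named cache copies
ls -lh shot_01.mov                                            # the working-tree copy your editor opens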
2. Circumstances for More (or Fewer) Copies
While two copies is the default, your workstation might end up with more—or slightly fewer—depending on these specific technical scenarios:
A. More than two: Resolving Merge Conflicts
If you have a merge conflict on a 250MB video file, you may temporarily have three or more copies.
- The version from the main branch.
- The version from your feature branch.
- The common ancestor (if you are running a 3-way diff).
- The “resolved” version you are currently writing.
B. More than two: Multiple Branches in Cache
By default, Git LFS does not delete old files from the .git/lfs/objects folder
when you switch branches.
- If you have 10 versions of a 250MB file across your history and you have checked them all out at some point, you will have 10 copies in the cache plus 1 copy in your working directory.
- Fix: You must manually run git lfs prune to delete the versions that aren’t currently checked out.
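Before pruning, it is worth previewing what will be removed and checking how much the cache currently holds:
du -sh .git/lfs/objects     # total size of the local LFS cache
git lfs prune --dry-run     # list what would be deleted, without deleting anything
git lfs prune               # actually remove objects not needed by recent checkouts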
C. Only one copy: Using git lfs dedup (The Exception)
On specific filesystems that support Reflinks (APFS on Mac, ReFS on Windows, or Btrfs/XFS on Linux), you can run a manual command:
git lfs dedup
This command attempts to “de-duplicate” the two copies by telling the OS to point both the cache and the working directory to the same physical blocks on the SSD.
- The Catch: This is not automatic during a standard git checkout. Every time you switch branches, Git LFS will revert to making a full copy, and you would have to run dedup again to reclaim the space.
Summary of the Space Penalty
For your 250MB video files, a standard Git LFS workflow effectively treats them as 500MB of consumed space on your drive.
| Tool | Copies on Disk | Workflow Impact |
|---|---|---|
| Git LFS | 2+ | Significant SSD bloat; slow checkouts due to “smudge” copy time. |
| XVC / DVC | 1 | Uses hardlinks/reflinks automatically; checkout is nearly instant. |
but the copy in .git/objects/ is just a small reference (smudge) file
To be precise about the “two copies” mechanic:
1. The Pointer (Smudge) File
There is a small text-based pointer file (the “smudge” target), which lives in the Git Object Database (.git/objects/).
- Size: ~100 bytes.
- Content: The SHA-256 hash and the file size.
- Role: This is what Git commits, pushes, and pulls.
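For concreteness, git lfs pointer prints the pointer text Git would store for any file; the file name, hash, and size below are illustrative:
git lfs pointer --file=shot_01.mov
# version https://git-lfs.github.com/spec/v1
# oid sha256:ad1b8...
# size 262144000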
2. The Local LFS Cache (The First Full Copy)
When you git pull, Git LFS downloads the actual 250MB binary from the server.
It stores this full file in:
.git/lfs/objects/[hash-prefix]/[full-hash]
- Size: 250MB (The actual binary).
- Role: This is your local “Source of Truth.” Git LFS keeps this here so that if you switch branches and then come back, it doesn’t have to download the 250MB from the internet again.
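This split is also why fetching and materializing can be run as separate steps: fetch only fills the cache over the network, and checkout only copies from the cache into the working tree. The branch name below is illustrative:
git lfs fetch origin main   # network -> .git/lfs/objects (cache only)
git lfs checkout            # cache -> working directory, no network needed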
3. The Working Directory (The Second Full Copy)
When you run git checkout, the Git LFS smudge filter is triggered. It
reads the 100-byte pointer from Git, looks up the 250MB binary in your
.git/lfs/objects/ cache, and copies it into your project folder.
- Size: 250MB.
- Role: This is the file your DAW or video editor actually opens.
When do you have more than 2 copies?
These scenarios increase the duplication:
- Multiple Branches: If you have 5 different edits of the same 250MB video file, and you have checked them out at different times, you will have 5 full copies in .git/lfs/objects/ and 1 copy in your working directory. Total: 1.5 GB for one active file.
- Staging Area (The “Index”): During a git add, Git LFS runs the clean filter. It creates a temporary copy while hashing the file to move it into the LFS storage.
- Cross-Drive Projects: If your .git folder is on a different drive (or partition) than your working files, the OS is forced to perform a full byte-for-byte copy during checkout.
The “Immutable” Optimization You’re Looking For
Because your large files are immutable, you correctly identify that we should only need one copy.
- Git LFS fails here because its default behavior is cp (copy).
- DVC/XVC/Dud succeed here because they use ln (hard link) or reflink.
If you were to stay with Git LFS, you would have to manually run git lfs dedup
after every checkout to turn those two 250MB copies into a single 250MB physical
allocation via reflinks. This is why a tool like XVC is better for you; it
makes that “one copy” behavior the automated default.
Gemini’s suggestion of running dedup after every checkout
was not practical, and the explanation was so over-simplified as to be terrible
advice. Yes, hardlinks are possible, but the machinery coordinated by the Git
client and the Git LFS client extension would need to be replaced. I
eventually ended up doing that for a project; it was a large and complex
setup with many moving parts.
is it true that all git lfs client-side traffic must flow through the git client? how might that be a bottleneck for git lfs signed urls?
Actually, it is not true that all Git LFS traffic flows through the Git client. Understanding this distinction is key to diagnosing the performance issues you’re seeing on your workstation.
1. The “Control Plane” vs. “Data Plane”
Git LFS separates the negotiation from the transfer.
- Control Plane (Git Client): When you run git pull, the Git client talks to the Git server (like GitHub) to exchange those small 100-byte pointer files. Git itself handles the versioning logic.
- Data Plane (LFS Client): Once Git encounters a pointer, it triggers the Git LFS agent (a separate binary). This agent talks directly to the LFS server (often an S3 bucket or Artifactory) using the Batch API.
The data transfer happens directly between the git-lfs binary and the
storage backend. It does not go through the main git process.
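You can see this split on your own repository: git lfs env reports the LFS endpoint the data plane will talk to, separately from the Git remote. The output line below is an abridged, illustrative sketch:
git lfs env | grep Endpoint
# Endpoint=https://git.example.com/myrepo.git/info/lfs (auth=basic)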
2. How Signed URLs Bypass the Server
Modern LFS implementations (like GitHub or GitLab) use Signed URLs specifically to avoid bottlenecks.
- Request: Your LFS client asks the LFS server: “I need the bytes for SHA-256: ad1b8....”
- Redirect: The server doesn’t send the data. It sends back a Temporary Signed URL (e.g., an AWS S3 URL with an authentication token).
- Direct Download: Your LFS client then makes a standard HTTP GET request directly to S3.
The data never touches the Git server. It flows from S3 directly to your
git-lfs process and then to your SSD.
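For reference, here is a hedged sketch of the Batch API exchange that happens behind the scenes; the endpoint, token, OID, and size are placeholders, not real values:
curl -s -X POST \
  -H "Accept: application/vnd.git-lfs+json" \
  -H "Content-Type: application/vnd.git-lfs+json" \
  -H "Authorization: Bearer <token>" \
  -d '{"operation": "download", "transfers": ["basic"], "objects": [{"oid": "ad1b8...", "size": 262144000}]}' \
  https://git.example.com/myrepo.git/info/lfs/objects/batch
# A typical response carries one short-lived signed URL per object, roughly:
# {"objects": [{"oid": "ad1b8...", "size": 262144000,
#   "actions": {"download": {"href": "https://s3.example.com/...signed...", "expires_in": 3600}}}]}
# The LFS client then issues a plain GET against that href; the Git server never touches the bytes.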
3. The Real Bottleneck: The “Smudge” Pipeline
The bottleneck you’re likely experiencing isn’t the network traffic—it’s the OS-level IPC (Inter-Process Communication).
Even though the data doesn’t flow through the Git server, it must flow through a Pipe if you use the default “Smudge” filter:
- The Pipe Bottleneck: Git expects to receive the file contents on STDOUT from the smudge filter to write it to your working directory. For a 250MB video, git-lfs is effectively “piping” those 250MB to git.
- Sequential Processing: By default, Git runs this process one file at a time. If you have 100 video files, Git starts the LFS agent, feeds it a pointer, waits for the 250MB download/pipe, closes it, and repeats. This is a massive serial bottleneck, especially on Windows where process creation is expensive.
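For reference, these are the filter entries a standard git lfs install writes into your global Git config; inspecting them on your own machine shows exactly what sits between git checkout and your working tree:
git config --get-regexp '^filter\.lfs'
# filter.lfs.clean     git-lfs clean -- %f
# filter.lfs.smudge    git-lfs smudge -- %f
# filter.lfs.process   git-lfs filter-process
# filter.lfs.required  true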
4. How to Bypass the Bottleneck
Because you are on a high-performance workstation, you should stop relying on the “automatic” smudge during clone/pull.
The “Systems Programmer” approach to LFS speed:
- Skip Smudging: git lfs install --skip-smudge. This tells Git to only download the 100-byte pointers. Your git pull will be near-instant.
- Bulk Pull: Run git lfs pull manually afterward.
  - Why? Unlike the smudge filter, git lfs pull uses the Batch API to download multiple files in parallel (usually 3–10 at a time). It writes directly to the .git/lfs/objects folder and then to your workspace, bypassing the git pipe entirely.
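A minimal end-to-end version of that routine, using a placeholder repository URL, looks like this:
# Clone pointers only (one-off equivalent of --skip-smudge for this clone)
GIT_LFS_SKIP_SMUDGE=1 git clone https://git.example.com/media-project.git
cd media-project
# Fetch the binaries in parallel, straight into .git/lfs/objects and the working tree
git lfs pull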
Summary for your 250MB files:
The sluggishness you feel is the Git LFS agent being forced to talk through a
single-threaded pipe to the Git process. By skipping the smudge and using xvc
or git lfs pull, you move to a parallel, direct-to-disk architecture.
Would you like me to show you the Git config to increase the
lfs.concurrenttransfers limit so your 250MB downloads actually saturate your
network link?
yes
To increase the speed of your 250MB video downloads and bypass the sequential “smudge” bottleneck, you can tune your Git LFS configuration for high-concurrency parallel transfers.
1. Increase Concurrent Transfers
By default, Git LFS downloads only 3 to 8 files at once (depending on your version). On a high-bandwidth workstation, you can safely increase this to saturate your link.
Run this command to set the limit globally:
git config --global lfs.concurrenttransfers 64
- The Logic: This tells the LFS agent to initiate 64 simultaneous HTTP requests to your storage backend (e.g., S3 via Signed URLs). For 250MB files, this ensures that while one file is finishing, dozens of others are already mid-stream.
- Note: Some servers may rate-limit you if this is set too high (e.g., above 100), but 64 is generally a “sweet spot” for modern systems.
2. The “Fast Path” Workflow
To actually see the benefit of this concurrency, you must avoid the git
checkout smudge filter (which is serial) and use the Batch API via lfs
pull.
The High-Performance Routine:
- Configure LFS to skip the smudge filter:
git config --global filter.lfs.smudge "git-lfs smudge --skip -- %f"
This ensures git pull only downloads the tiny pointers, making your Git
operations instant.
- Pull Data in Parallel:
When you actually need the video files for your DAW:
git lfs pull
Because lfs pull is a standalone command, it uses the concurrenttransfers
setting to download your video assets in parallel directly to your disk.
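If you only need part of the tree for a session, you can narrow the pull with include patterns; the path below is just an example:
git lfs pull --include="video/"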
3. Verification
You can verify your current active settings by running:
git lfs env | grep ConcurrentTransfers
Why this fixes the Signed URL bottleneck:
When using Signed URLs, the LFS server hands you a list of 100+ URLs at once.
- The Default Way: Git LFS downloads URL #1, finishes it, then starts URL #2. Your 10Gbps link sits at 5% utilization.
- The Concurrent Way: Git LFS fires off requests for URLs #1 through #64 simultaneously. Your network card finally has enough work to do, and your 250MB files arrive in “waves” rather than a slow trickle.
Would you like me to show you how to set up a “transfer agent” in Go if you find that the standard Git LFS client is still too slow for your specific workstation needs?
yes
To gain ultimate control over your 250MB transfers and bypass the standard “smudge” bottleneck, you can write a Custom Transfer Agent in Go.
The Git LFS client supports a JSON-based protocol over stdin/stdout.
When Git LFS needs to move data, it spawns your Go binary and sends it specific
commands. This allows you to implement your own logic (e.g., direct S3 multipart
uploads, peer-to-peer syncing, or custom encryption) while Git LFS handles the
metadata.
1. The Protocol Flow
Git LFS communicates with your agent in three stages:
- Initiation: Exchange capabilities (version check).
- Commands: download, upload, or terminate.
- Progress/Results: Your agent reports back as bytes move.
2. Go Implementation Skeleton
A basic Go agent needs to handle the JSON stream. Here is the architectural skeleton:
package main

import (
    "bufio"
    "encoding/json"
    "os"
)

// Message mirrors the JSON messages exchanged with git-lfs over stdin/stdout.
type Message struct {
    Event  string `json:"event"`
    Oid    string `json:"oid,omitempty"`
    Size   int64  `json:"size,omitempty"`
    Path   string `json:"path,omitempty"`
    Action string `json:"action,omitempty"` // simplified; the real download message carries an object here
}

func main() {
    reader := bufio.NewReader(os.Stdin)

    // 1. Initiation: Read the init message and respond with {}
    _, _ = reader.ReadString('\n')
    os.Stdout.WriteString("{}\n")

    // 2. Main Loop: Handle transfer requests until git-lfs says terminate
    for {
        line, err := reader.ReadString('\n')
        if err != nil {
            break
        }
        var msg Message
        json.Unmarshal([]byte(line), &msg)

        if msg.Event == "terminate" {
            break
        }
        if msg.Event == "download" {
            // Perform your high-speed Go download logic here
            // reportProgress(msg.Oid, bytesSoFar)

            // Complete the transfer by telling git-lfs where the file landed
            resp, _ := json.Marshal(Message{Event: "complete", Oid: msg.Oid, Path: "/local/path/to/file"})
            os.Stdout.WriteString(string(resp) + "\n")
        }
    }
}
3. Configuration
Once your Go binary (e.g., my-lfs-agent) is compiled and in your path, you
tell Git LFS to use it:
# Define the agent path
git config lfs.customtransfer.myagent.path "my-lfs-agent"
# Set it as a standalone agent (bypasses the standard LFS API server)
git config lfs.standalonetransferagent myagent
# Optional: Disable concurrency if you want to handle it inside your Go code
git config lfs.customtransfer.myagent.concurrent false
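To confirm the wiring, one option (assuming your git-lfs version honors GIT_TRACE tracing, as recent releases do) is to look for the agent name in the trace output during a pull; myagent matches the name configured above:
GIT_TRACE=1 git lfs pull 2>&1 | grep -i myagent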
Why this is the “Pro” Move for your Workstation
- No Smudge Pipe: By returning the Path in the complete message, Git LFS moves your file directly into the cache. You are no longer streaming 250MB through a standard pipe.
- Internal Concurrency: You can use goroutines and context to manage your own worker pools, ensuring your 10Gbps link is fully utilized without Git managing the threads.
- Signed URL Optimization: If your backend uses Signed URLs, your Go agent can pre-fetch or refresh them dynamically, which the standard LFS client occasionally struggles with during long-running transfers.
Existing implementations to study:
- charmbracelet/git-lfs-transfer: A Go implementation of the SSH-based transfer protocol.
- nicolas-graves/lfs-s3: A specialized Go agent for S3 backends.