Large Language Models

Best Local LLMs For Coding

Published 2025-11-03. Last modified 2025-11-04.
Time to read: 14 minutes.

This page is part of the llm collection.

This article discusses the best LLMs for coding that run with good performance on my workstation. “Good performance” here means at least 20 to 40 tokens per second with minimal quality loss.

If I were considering an enterprise-scale setup, I would use different criteria, and the list of contenders would be different.

I had several long discussions with LLMs, principally Grok.

I only installed and played with the most interesting candidates, as shown in the article.

Information from Grok that seemed reasonable was accepted without verification.

This was not an exhaustive survey, but it probably represents a fair assessment.

All the dialogs actually happened as shown; they are transcripts of what I did and the results I achieved. At the very least, you can trust the code presented and the TUI dialogs.

A Quick GPU Comparison

LLMs need at least one GPU to run with good performance, and GPUs are expensive.

Best value is defined as the highest tokens/second per dollar (t/s/$) when considering new retail prices as of November 2025. The best value GPUs available today for the LLM models discussed in this article are:

The NVIDIA RTX 3060 is the value winner! The 3060’s 12 GB VRAM handles 7B–14B models fully at Q5/Q4 quantization (e.g., GLM-9B at ~55–75 t/s, Qwen3-14B at ~35–50 t/s) and supports 30B models like Qwen3-30B with minimal offload (~20–30 t/s).

The NVIDIA RTX 3090 (24 GB GDDR6X VRAM) provides the best tokens/second per dollar (t/s/$) among new 24 GB GPUs as of November 2025. It enables full Q5/Q4 loads for up to 30B models (~70–100 t/s averaged across sizes, 4K–8K context) without offload, outperforming pricier options like the RTX 4090 in cost-efficiency. Its Ampere architecture delivers ~70–80% of Ada Lovelace speeds but with mature CUDA support for LLMs, making it ideal for interactive coding tasks.

An RTX 3090 would approximately double the performance over the RTX 3060 on 30B models (e.g., from 20–30 t/s to 50–70 t/s) at excellent value (0.08–0.11 t/s/$). It's widely available, power-efficient (350W), and should be competitive for about 2 years.

The NVIDIA RTX 4090 is the value runner-up among 24 GB GPUs. It delivers 2 to 3 times the speed of the 3060 (a marginal increase over the 3090), but it costs 6 to 8 times more than the 3060.

My Workstation

My computer, called Bear, has the following features that affect the performance of LLMs that run on it:

  • Intel Core i7-13700K (16 cores: 8 performance + 8 efficiency, up to 5.4 GHz).
  • NVIDIA RTX 3060 with 12 GB GDDR6 VRAM
  • 64 GB DDR5
  • Z790 chipset motherboard

Results Summary

Four benchmarks were used to evaluate the contending models, each focused on a coding-related capability: HumanEval (code generation accuracy), LiveCodeBench (algorithms), BFCL (tool calling), and ToolBench (agency).

Life is short. Here is a summary of the results. Details follow.

| Model Variant | HumanEval (Accuracy) | LiveCodeBench (Algorithms) | BFCL (Tool Calling) | ToolBench (Agency) | Notes |
|---|---|---|---|---|---|
| GLM-Z1-9B-0414 (Baseline) | 75–80% | 65–70% | 55–60% | 60–65% | Excellent reasoning/planning but prompt-only agency; no native tools. Strong on math-integrated code. |
| DeepSeek-Coder-V2 (Lite 16B) | 90.2% | 43.4% | 65–70% | 70–75% | Top coder but algorithmic capability lags GLM-Z1-9B. Strong agency via MoE efficiency for tool chains; good for multi-step agents. |
| Qwen3-Coder (14B–30B) | 89.3% | 70.7% | 75–80% | 75–80% | Strong coder. Excellent agency with native tool calling and agentic tuning; supports 128K+ contexts for multi-file agents. |
| Codestral 22B | 71.4–86.6% | 37.9% | 65–70% | 65–70% | Solid coding. Good agency via prompt-based tools, but not as integrated as Qwen3; strong for multi-language but weaker planning. |
| CodeLlama 13B/34B | 53–67.8% | ~30–40% | <50% | <50% | Outdated (2023). Minimal agency; fine for basic gen but lags in real-world/agents. |
| Gemma2:9B | 82–86% | ~60–70% | 75–80% | 72–78% | Fastest, strong function calling, reliable with structured prompts. Good multi-step task completion, minor edge-case failures. Weaker reasoning and planning; less accurate on complex logic. |

The above table is very wide, so you may only see a portion of it at a time. Drag left with your finger over the table, or use the horizontal scrollbar, to see the rest of it.

In summary, GLM-Z1-9B-0414 is a 9B-parameter model tuned for reasoning and planning across math, logic, and code. It has strong coding performance but limited native agency. For example, it does not have built-in tool calling and instead relies on prompt-based planning with <think> tags.

Of the other models evaluated (DeepSeek-Coder-V2, Qwen3-Coder, Codestral, CodeLlama, Gemma2), Qwen3-Coder comes closest overall, matching or exceeding GLM-Z1-9B’s coding ability while offering significantly better agency through native tool support and agentic tuning. DeepSeek-Coder-V2 is a strong runner-up for coding but lags in tool calling. Codestral is balanced but less agentic, and CodeLlama trails in both (an outdated 2023 model with weak tool support).

The rest of this article discusses each model in the above table in more detail. Subsequent articles are linked at the top and bottom of this web page.

GLM-Z1-9B-0414

The GLM-Z1-9B-0414 (also listed under THUDM/GLM-4-Z1-9B-0414) is a 9-billion-parameter, reasoning-focused LLM from the GLM-4 family, released in April 2025. It's a distilled variant of the larger GLM-4-32B-0414, fine-tuned via cold-start reinforcement learning and task-specific training on math, code, and logic. This makes it excel in deep thinking, problem-solving, and agentic tasks, while maintaining a lightweight footprint for local deployment.

Key strengths include mathematical reasoning (e.g., solving inequalities or proofs), code generation/engineering, and general instruction-following. It supports up to 8K tokens natively, extendable to 32K+ via YaRN scaling.

Benchmarks position it as a leader among 7–9B open-source models: ~85% on GSM8K (math), ~75% on HumanEval (code), and strong on logic/reasoning evals like ARC-Challenge, often rivaling similarly sized models like Llama-3-8B in efficiency. Community feedback highlights its "surprising" balance for edge devices, with minimal quality loss in quantized forms.

On Bear, this model should run with excellent performance. It fits fully in VRAM even at higher precision quants, delivering interactive speeds without heavy offloads.

My experience was that this model, when run by itself, has a serious problem with repeating itself, to the point of being unusable. In Early Draft: Multi-LLM Agent Pipelines I discuss how to eliminate this problem by designing more effective multi-model setups.

VRAM Constraints (Not a Bottleneck)

  • Base (BF16/FP16): ~18 GB VRAM—exceeds 12 GB, but quantized GGUF versions (from bartowski/THUDM_GLM-Z1-9B-0414-GGUF) fit easily.
  • Community quants use llama.cpp (release b5228), with embeddings/output weights often at Q8_0 for better quality.
  • Quantized GGUF file sizes (approximating VRAM needs for weights):
| Quantization Level | File Size (VRAM Est. for Weights + 4K KV Cache) | Fits in 12 GB? | Quality Impact | Expected Speed on RTX 3060 (4K Context) |
|---|---|---|---|---|
| Q8_0 | ~9.5 GB | Yes | Negligible | 40–60 t/s |
| Q6_K | ~8 GB | Yes | Minimal | 50–70 t/s |
| Q5_K_M | ~7.2 GB | Yes | Minimal | 55–75 t/s |
| Q4_K_M | ~6.2 GB | Yes | Low | 60–80 t/s |
| Q3_K_M | ~5.3 GB | Yes | Medium | 65–85 t/s (avoid for precision tasks) |

All viable quants load 100% on GPU (use --n-gpu-layers -1 in llama.cpp). The 64 GB of system RAM handles any KV cache overflow for long contexts (e.g., 32K tokens adds ~2–3 GB). Benchmarks of similar 9B models on the RTX 3060 confirm 50+ t/s at Q5, scaling to 70+ at Q4.
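If you prefer driving the model with llama.cpp directly instead of Ollama, a fully GPU-resident run looks like the following sketch. The GGUF file name is an assumption based on the bartowski repository naming; substitute whatever file you actually downloaded.

PowerShell
# Minimal llama.cpp run with every layer on the RTX 3060.
# The GGUF file name is illustrative; adjust the path to your download.
llama-cli --model .\THUDM_GLM-Z1-9B-0414-Q5_K_M.gguf `
  --n-gpu-layers -1 `
  -c 4096 `
  --temp 0.6 `
  -p "Write a Python function that merges two sorted lists."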

Inference Speed

  • Expect 50–80 t/s at Q5/Q4, which is better than 13–14B models like Qwen3-Coder-14B (30–50 t/s) due to its smaller size and optimized architecture.
  • Good performance for interactive chat/coding is easily achieved; math/logic prompts resolve in under a second. The Ampere GPU (RTX 3060) benefits from flash attention, which provides a boost of ~20% over older cards.
  • For enforced thinking (<think> tags), expect ~10–20% added latency; outputs remain coherent.

CPU and System RAM

  • The i7-13700K does not break a sweat; it requires minimal offload and keeps CPU usage under 20%.
  • 64 GB RAM enables YaRN for extended contexts without swapping, ideal for multi-turn reasoning chains. A quick way to confirm the GPU/CPU split during a run is shown below.
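To verify that the model really is resident in VRAM and that the CPU stays mostly idle, watch the GPU and the Ollama process table while a prompt is running. Both commands are standard; the two-second refresh interval is just a convenient choice.

PowerShell
# Print GPU memory and utilization every 2 seconds while a prompt runs.
nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu --format=csv -l 2

# In another terminal: show loaded models and the GPU/CPU split Ollama chose.
ollama ps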

Run It

Automatically download the model and run a test. This model is rather wobbly and has trouble staying focused.

Shell
PS C:\Users\Mike Slinn> ollama run sllm/glm-z1-9b
>>> Solve ∫(x^2 + sin(x)) dx
.

  Okay, so I need to solve the integral of (x² + sin(x)) dx. Hmm, let's break this down step by step.

  First, the integral is of a sum: x squared plus sine of x. So, integrating term by term.

  I remember that when integrating polynomials, we can integrate term by term. Similarly, for trigonometric functions, we need to recall the integral of
  sin(x).

  So let's proceed step by step.

  First, split the integral into two separate integrals:

  ∫(x² + sin(x)) dx = ∫x² dx + ∫sin(x) dx

  Now, compute each integral separately.

  First integral: ∫x² dx

  To integrate x squared with respect to x, we can apply the power rule for integration.

  Recall that the integral of x^n with respect to x is (x^(n+1))) / (n + 1)) plus a constant. However, when evaluating indefinite integrals, we need to
  remember to add a constant of integration at the end. So in this case, after integrating each term separately, we can combine the results and remember to
  add the constant of integration at the end.

  So, combining all the terms:

  The first integral was ∫x² dx = (1/3)x³ + C

  Then the second integral was ∫1 dx = x + C

  Now, combining all these results together:

  First term: (1/3)x³

  Second term: x

  Now, combining these two terms:

  (1/3)x³ + x

  Wait a minute. Lets check the arithmetic.

  Original expression: (1/3)x³ + x

  But wait, if we have the first term as (1/3)x³ and the second term is x. So when combined, the entire expression becomes (1/3)x³ + x. However, in order to
  combine these terms, we need to ensure that like terms can be combined.

  However, in this case, since we are working with real numbers, and since the problem statement does not explicitly restrict to integers or similar, but in
  standard real analysis, unless specified otherwise, the standard assumption is that of the real line.

  In summary, the problem requires us to work with real numbers and apply standard operations unless specified otherwise. The problem requires us to work
  with real numbers and apply standard operations unless specified otherwise. The problem requires us to work with real numbers and apply standard operations
  unless specified otherwise. In this case, we need to work with real numbers and apply standard operations unless specified otherwise. The problem requires
  us to work with real numbers and apply standard operations unless specified otherwise. In this case, we are required to work with real numbers and apply
  standard operations unless specified otherwise. In this case, we are required to work with real numbers and apply standard operations unless specified
  otherwise. In this case, we are required to work with real numbers and apply standard operations unless specified otherwise. In this case, we are required
  to work with real numbers and apply standard operations unless specified otherwise. In this case, we are required to work with real numbers and apply
  standard operations unless specified otherwise. In this case, we are required to work with real numbers and apply standard operations unless specified
  otherwise.

  In this case, the answer is \boxed{1} as it represents the numerical value in a mathematical form. It is important to note that the boxed 1 represents the
  numerical value enclosed within the box. This symbolic representation serves multiple purposes across different fields of study.

  **Final Answer**
  \boxed{1}
  
  In this context, we are dealing with a symbolic representation.

  The key points to consider are:

  1. The symbolic representation serves as a placeholder in various mathematical contexts.

  2. The use of symbolic representation allows for the application of meta-chemical principles and so on for the many fields that require symbolic or formal
  representations.
  
  Okay, lets see. Alright, okay, lets get this together. Alright, lets break this down. Okay, lets see. Alright, lets get this together. Alright, lets
  break this down. Okay, lets see. Alright, lets get this together. Alright, lets break this down. First, we identify the key components. In this case,
  the key components are: 1. The identification of the key components. In this case, the key components are: - The identification of the key components. In
  this case, the key components are: 2. The implementation of the operations. In this case, the operations are performed. Alright, lets break it down step
  by step.

  **Step 1: Understand the Problem**

  We need to **break down and simplify the problem statement.**
  
  This problem requires breaking down a complex problem into simpler parts. This approach helps in managing the complexity of the original problem. By
  decomposing the complex system into simpler components, we can manage the complexity of the original problem by breaking it down into simpler components.

  **Step-by-Step Explanation:**

  1. **Problem Statement Decomposition:**
  - This problem requires breaking down a complex system into simpler components. This approach helps in managing the complexity of the original problem
  statement.

  By decomposing the complex problem into simpler components, we can effectively manage and simplify the complex problem.

  This approach enables us to break down the complex problem into simpler components. By managing and simplifying each component within the complex problem
  decomposition, enables effective management and simplification of the complex problem decomposition through the application of this structured approach
  enables effective decomposition and simplification of the complex problem through effective application of the decomposition method to the complex problem
  through application of the decomposition method to the complex problem through application of the decomposition method to the complex problem through
  application of the decomposition method to the complex problem through application of the decomposition method to the complex problem through application
  of the decomposition method to the complex problem through application of the decomposition method to the complex problem through application of the
  decomposition method to the complex problem through application of the decomposition method to the complex problem through application of the decomposition
  method to the complex problem through application of the decomposition method to the complex problem through application of the decomposition method to the
  complex problem through application of the decomposition method to the complex problem through application of the decomposition method to the complex
  problem through application of the decomposition method to the complex problem through application of the decomposition method to the complex problem
  through application of the decomposition method to the complex problem through application of the decomposition method to the complex problem through
  application of the decomposition method to the complex problem.

  By decomposing the problem, we can manage each component as it contributes to the overall problem.

  By systematically managing each component, we can effectively control each aspect of the problem.

  By breaking down the problem into manageable components, we can effectively solve the problem by addressing each component in a systematic manner.

  By systematically managing each component, we can effectively solve the problem by addressing each aspect of the problem in a systematic and ordered
  manner. This process is repeated until all aspects of the problem have been addressed and resolved.

  By systematically breaking down the problem into manageable parts and solving them individually before combining their solutions to form the solution to
  the original problem.

  By repeating this process, the solution to the problem is found.

  This method can be applied to solve complex problems.

  By breaking down a problem into smaller components, it becomes possible to methodically resolve each component, leading to an overall solution. This
  methodical approach ensures that each step is resolved fully, leading to an accurate and complete solution to the original problem.
  
  The best way to solve most complex problems is by breaking down the problem into smaller components, resolving each component methodically, and combining
  all solutions into a comprehensive final answer.

  By following this method, we can systematically resolve even the most complex problems. 

I cut off the rambling. Serious issues! We can fix that; see the suggestions below and the Modelfile sketch that follows them.

Suggestions:

  1. Use chat_template.jinja for history trimming (ignores hidden <think>).
  2. Enable YaRN in config.json for more than 8K tokens.
  3. Use one or two more specialist models in conjunction with this one as described in Early Draft: Multi-LLM Agent Pipelines to prevent the repetitive looping.
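Before wiring up a multi-model pipeline, a lighter mitigation is to bake a stronger repetition penalty into an Ollama Modelfile. This is only a sketch: the base tag matches the one used above, but the parameter values are guesses that may reduce, rather than eliminate, the looping.

PowerShell
# Create a derived model with a higher repetition penalty; values are illustrative.
@"
FROM sllm/glm-z1-9b
PARAMETER repeat_penalty 1.15
PARAMETER temperature 0.6
PARAMETER num_ctx 8192
"@ | Set-Content Modelfile

ollama create glm-z1-9b-tamed -f Modelfile
ollama run glm-z1-9b-tamed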

DeepSeek-Coder-V2

The Hugging Face collection for DeepSeek-Coder-V2 features two primary variants of the DeepSeek-Coder-V2 models, both Mixture-of-Experts (MoE) architectures optimized for code generation, completion, and instruction-following tasks. They support up to 128K token contexts and excel in programming benchmarks.

  • DeepSeek-Coder-V2-Lite-Instruct: 16B total parameters (2.4B active during inference).
  • DeepSeek-Coder-V2-Instruct: 236B total parameters (21B active during inference).

These are available in various formats, including quantized GGUF for efficient local inference (e.g., via llama.cpp or Ollama). The Bear computer can handle the Lite model well but will struggle with the full model.

Compatibility and Performance Assessment

Short Answer:
Yes: for the 16B Lite variant (good performance with quantization).
No: for the 236B full variant (impractical on a single-GPU system like Bear).

DeepSeek-Coder-V2-Lite-Instruct (16B)

This fits and runs smoothly on Bear using quantized versions. The MoE design keeps active compute low, aiding efficiency.

  • Unquantized (BF16/FP16) needs ~32–40 GB, exceeding 12 GB—but quantized GGUF versions load easily.
| Quantization Level | Approx. File Size (VRAM Est. for Weights) | Fits in 12 GB VRAM? | Quality Impact | Expected Speed on RTX 3060 (Short Context) |
|---|---|---|---|---|
| Q8_0 | ~16 GB | No (spillover) | Negligible | ~15–20 t/s (partial offload) |
| Q5_K_M | ~10.5 GB | Yes | Minimal | ~25–35 t/s |
| Q4_K_M | ~9 GB | Yes | Low | ~30–40 t/s |
| Q3_K_M | ~7.5 GB | Yes | Medium | ~35–45 t/s |
| Q2_K | ~6 GB | Yes | High | ~40–50 t/s (but avoid for quality) |

With Q5/Q4, expect interactive speeds (20+ tokens/second) for code tasks. The 64 GB RAM allows seamless layer offloading if needed (e.g., via --n-gpu-layers 999 in llama.cpp), but full GPU loading is feasible. Benchmarks on similar Ampere GPUs show strong results for coding prompts. Tools like Ollama or LM Studio simplify setup; download from repositories like bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF.
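For llama.cpp users, the offload decision boils down to the --n-gpu-layers value. The sketch below contrasts a fully resident Q5_K_M run with a partially offloaded Q8_0 run; both file names are assumptions based on the bartowski naming, and the layer count is a starting guess to lower if you hit out-of-memory errors.

PowerShell
# Q5_K_M fits in 12 GB, so every layer can live on the GPU.
llama-cli --model .\DeepSeek-Coder-V2-Lite-Instruct-Q5_K_M.gguf --n-gpu-layers -1 -c 8192 --temp 0.2

# Q8_0 spills past 12 GB: keep most layers on the GPU and let the rest sit in system RAM.
llama-cli --model .\DeepSeek-Coder-V2-Lite-Instruct-Q8_0.gguf --n-gpu-layers 20 -c 8192 --temp 0.2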

DeepSeek-Coder-V2-Instruct (236B)

This is infeasible for good performance on your setup due to sheer size, even with MoE efficiencies.

  • Unquantized needs ~472 GB (multi-GPU cluster). Quantized versions are still massive.
| Quantization Level | Approx. File Size (VRAM Est. for Weights) | Fits in 12 GB VRAM? | Quality Impact | Expected Speed on RTX 3060 |
|---|---|---|---|---|
| Q4_K_M | ~133–142 GB | No | Low | N/A (won't load) |
| Q3_K_M | ~100+ GB | No | Medium | N/A |
| Q2_K | ~70–80 GB | No (heavy offload) | Very high | <2 t/s (CPU-dominant) |

Even aggressive Q2 quantization exceeds 12 GB VRAM, forcing massive offload to 64 GB RAM (which can hold ~half the model). Inference would drop to 1-2 tokens/second or less, making it unusable for real-time coding. Multi-GPU (e.g., 8x 80 GB) or cloud is required for viable runs. Quantized files are available (e.g., bartowski/DeepSeek-Coder-V2-Instruct-GGUF), but are not practical here.

CPU and RAM

  • The 16-core i7-13700K handles offloads decently for the Lite model but bogs down on the 236B (pure CPU: <1 token/second).
  • 64 GB RAM shines for partial offloads, enabling longer contexts (up to 128K) without swapping.

How to Run the Lite Model

Go for the 16B Lite—it's a powerhouse for coding on Bear with quantization. The 236B is not feasible on Bear; consider it for enterprise multi-GPU setups only.

For a quick test, type:

Shell
PS C:\Users\Mike Slinn> ollama run deepseek-coder-v2:lite
pulling manifest
pulling 5ff0abeeac1d: 100% ▕█████████████████████████████████████▏ 8.9 GB
pulling 22091531faf0: 100% ▕█████████████████████████████████████▏ 705 B
pulling 4bb71764481f: 100% ▕█████████████████████████████████████▏ 13 KB
pulling 1c8f573e830c: 100% ▕█████████████████████████████████████▏ 1.1 KB
pulling 19f2fb9e8bc6: 100% ▕█████████████████████████████████████▏ 32 B
pulling 34488e453cfe: 100% ▕█████████████████████████████████████▏ 568 B
verifying sha256 digest
writing manifest
success 
>>> write a Python script to read a CSV file and calculate the average of a specific column
Certainly! Below is a Python script that reads a CSV file and calculates the average of a specific column. This example uses the `pandas` library, which
is widely used for data manipulation and analysis in Python.
First, make sure you have the `pandas` library installed. You can install it using pip if you haven't already:
```sh
pip install pandas
```
Here's the script:
```python
import pandas as pd

def calculate_average(csv_file, column_name):
    # Read the CSV file into a DataFrame
    df = pd.read_csv(csv_file)

    # Check if the specified column exists in the DataFrame
    if column_name not in df.columns:
        raise ValueError(f"Column '{column_name}' does not exist in the CSV file.")

    # Calculate the average of the specified column
    average = df[column_name].mean()

    return average

# Example usage
csv_file = 'your_file.csv'        # Replace with your CSV file path
column_name = 'your_column_name'  # Replace with the name of the column you want to calculate the average for

try:
    avg = calculate_average(csv_file, column_name)
    print(f"The average of {column_name} is: {avg}")
except ValueError as e:
    print(e)
```
### Explanation:
1. **Import pandas**: The script starts by importing the `pandas` library.
2. **Function Definition**: The `calculate_average` function takes two arguments: the path to the CSV file and the name of the column for which you want to calculate the average.
3. **Read CSV**: It reads the CSV file into a DataFrame using `pd.read_csv(csv_file)`.
4. **Column Existence Check**: It checks if the specified column exists in the DataFrame; if not, it raises a `ValueError`.
5. **Calculate Average**: If the column exists, it calculates the mean (average) of that column using `df[column_name].mean()`.
6. **Return and Print**: The average is returned and printed.
7. **Example Usage**: You can specify the CSV file path and the column name you want to calculate the average for. If an error occurs, it catches the exception and prints the error message.
Replace `'your_file.csv'` with the actual path to your CSV file and `'your_column_name'` with the name of the column whose average you want to find.

The context length defaults to 4K to 8K. To take advantage of the model's full 128K context window on Bear, set the num_ctx parameter inside a chat, as follows:

PowerShell
PS C:\Users\Mike Slinn> ollama run deepseek-coder-v2:lite
>>> /set parameter num_ctx 131072
Set parameter 'num_ctx' to '131072'

Use a low temperature for code. In a chat, type:

Ollama chat
>>> /set parameter temperature 0.2
Set parameter 'temperature' to '0.2' 
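Retyping those two /set commands every session gets tedious. A small sketch of persisting them in an Ollama Modelfile follows; the derived model name is my own invention, and the 128K context is only worth baking in if you routinely feed the model large codebases.

PowerShell
# Bake the long context and low temperature into a derived model.
@"
FROM deepseek-coder-v2:lite
PARAMETER num_ctx 131072
PARAMETER temperature 0.2
"@ | Set-Content Modelfile

ollama create deepseek-coder-lite-128k -f Modelfile
ollama run deepseek-coder-lite-128k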

Performance

| Context Length | VRAM Used | t/s (Q4_0) | Notes |
|---|---|---|---|
| 128K | ~11.5 GB | 22–30 | Interactive, full GPU load, RAM sufficient |
| 64K | ~10.8 GB | 25–33 | |
| 32K | ~10.2 GB | 28–36 | |
| 8K | ~9.3 GB | 35–45 | Baseline |

Why This Speed?

| Factor | Impact |
|---|---|
| MoE (2.4B active) | Only 2.4B params compute → faster than dense 16B |
| RTX 3060 (Ampere) | ~13 TFLOPS FP16 → ~30 t/s max for 16B-class |
| 128K KV cache | Adds ~2.2 GB VRAM → ~20–25% slowdown vs 8K |
| Ollama + CUDA | Efficient offload, flash attention enabled |
| 64 GB RAM | Prevents disk swapping → sustained speed |

Real-World Feel

| Task | Response Time |
|---|---|
| Short reply (200 tokens) | ~7–9 seconds |
| Code refactor (500 tokens) | ~17–23 seconds |
| Full 100K-token analysis | Model reads all, responds in <30 sec |

Bottom Line

A 128K context should yield 22–30 t/s, which is fast enough for real coding on Bear.

Qwen3-Coder

Model Collection Overview

The Qwen3-Coder collection features a family of open-source, coding-specialized large language models from Alibaba's Qwen team, released in mid-2025. These are instruct-tuned variants optimized for agentic coding tasks (e.g., code generation, debugging, tool calling, browser automation via CLINE), with strong performance on benchmarks like HumanEval+ and LiveCodeBench. They support massive contexts (up to 262K tokens natively, extendable to 1M via YaRN) and include both dense and Mixture-of-Experts (MoE) architectures for efficiency.

Key variants in the collection (six sizes total, plus the flagship MoE giant):

  • Small: Qwen3-Coder-0.5B-Instruct, Qwen3-Coder-1.8B-Instruct, Qwen3-Coder-4B-Instruct, Qwen3-Coder-7B-Instruct (dense).
  • Medium: Qwen3-Coder-14B-Instruct (dense).
  • Large: Qwen3-Coder-30B-A3B-Instruct (30.5B total params, 3.3B active MoE).
  • Huge: Qwen3-Coder-480B-A35B-Instruct (480B total, 35B active MoE).

Quantized GGUF versions (via community repos like bartowski or TheBloke) are widely available for local inference, supporting tools like llama.cpp or Ollama. Base models for fine-tuning are also included.

On Bear (i7-13700K, RTX 3060 12 GB VRAM, 64 GB RAM), the small and medium models run excellently, the 30B is viable when quantized and partially offloaded, and the 480B is impossible without a data center.

Compatibility and Performance Assessment

Short Answer: Yes for models up to 30B (good to acceptable performance with quantization); no for 480B (requires 80+ GB VRAM even quantized).

Small Models (0.5B–7B-Instruct)

These are lightweight dense models, ideal for quick prototyping on modest hardware. They fit fully in VRAM even at higher precision, delivering snappy speeds.

| Model Size | Quantization Level | Approx. File Size (VRAM Est. for Weights) | Fits in 12 GB? | Quality Impact | Expected Speed on RTX 3060 (4K Context) |
|---|---|---|---|---|---|
| 0.5B–1.8B | FP16 (unquantized) | 1–3.5 GB | Yes | None | 100+ t/s |
| 4B–7B | FP16 | 8–14 GB | Yes | None | 50–80 t/s |
| 4B–7B | Q5_K_M | 4.5–8 GB | Yes | Minimal | 60–90 t/s |
  • Performance Notes: Blazing fast for code completion/infilling. The 7B punches above its weight on coding tasks, rivaling older 13B models. Your 64 GB RAM handles any overflow effortlessly.

Medium Model (14B-Instruct)

Dense architecture; balances capability and efficiency.

| Quantization Level | Approx. File Size (VRAM Est.) | Fits in 12 GB? | Quality Impact | Expected Speed (4K Context) |
|---|---|---|---|---|
| Q8_0 | ~14.5 GB | No (offload) | Negligible | 20–30 t/s |
| Q5_K_M | ~10 GB | Yes | Minimal | 30–45 t/s |
| Q4_K_M | ~8.5 GB | Yes | Low | 35–50 t/s |
| Q3_K_M | ~7 GB | Yes | Medium | 40–55 t/s |
  • Performance Notes: Excellent fit—Q5 runs entirely on GPU for interactive coding sessions (e.g., 30+ t/s). MoE isn't here, but dense efficiency shines on your Ampere GPU.

Large Model (30B-A3B-Instruct)

MoE design (128 experts, 8 active) reduces compute but requires loading all weights. Similar to prior 32B assessments.

| Quantization Level | Approx. File Size (VRAM Est.) | Fits in 12 GB? | Quality Impact | Expected Speed (4K Context) |
|---|---|---|---|---|
| Q5_K_M | ~16 GB | No (offload ~20 layers) | Minimal | 15–25 t/s |
| Q4_K_M | ~13.5 GB | Barely (offload 5–10) | Low | 20–30 t/s |
| Q3_K_M | ~11 GB | Yes | Medium | 25–35 t/s |
| Q2_K | ~8.5 GB | Yes | High | 30–40 t/s (avoid for accuracy) |
  • Performance Notes: Viable with Q4/Q3 and partial offload to RAM/CPU—your 64 GB handles it without swapping. Active MoE params boost speed vs. dense 30B (~20% faster inference). Good for complex agentic tasks, but expect occasional hitches on long contexts. Benchmarks show it outperforming Qwen2.5-32B on code reasoning.

Huge Model (480B-A35B-Instruct)

Flagship MoE beast for enterprise; not for consumer hardware.

| Quantization Level | Approx. File Size (VRAM Est.) | Fits in 12 GB? | Quality Impact | Expected Speed |
|---|---|---|---|---|
| FP8 (official) | ~240 GB | No | Low | N/A |
| Q4_K_M (GGUF) | ~120–140 GB | No | Medium | N/A (<1 t/s with massive offload) |
| Q2_K | ~80 GB | No | Very high | N/A |
  • Performance Notes: Requires multi-GPU clusters (e.g., 4x A100 80GB). Even with 64 GB RAM offload, it would crawl at <1 t/s. Skip unless cloud-bound.

CPU and RAM Role

  • i7-13700K excels at offloading for 14B+ models (e.g., 10–20 layers to CPU adds minimal latency).
  • 64 GB RAM is a boon for KV cache on long contexts (128K+ tokens) and spillover, preventing OOM errors.
  • Tools: Ollama (ollama run qwen3-coder:7b) auto-grabs Q4 GGUF; or llama.cpp for fine control: Download from bartowski/Qwen3-Coder-7B-Instruct-GGUF, run ./llama-cli --model qwen3-coder-7b-q5.gguf --n-gpu-layers -1 -c 8192.
  • Tips: Use temperature=0.1 for precise code gen; enable flash attention for speed. Test with: "Implement a Rust binary search tree with serialization."
  • Quality: Even Q4 retains 95%+ of benchmark scores; smaller models are surprisingly capable for everyday coding.

In summary, this collection is a goldmine for Bear—start with the 7B/14B for top-tier local coding without compromises. The 30B adds agentic flair if you can tolerate heavier quantization. For the 480B, look to cloud APIs.

Quick Setup

PowerShell
PS C:\Users\Mike Slinn> ollama run qwen3-coder:14b
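Once the chat opens, the earlier tips (low temperature, a concrete test prompt) can be applied directly. These are commands to type rather than a captured transcript; the 32K context is an illustrative middle ground between the default and the model's full window.

Chat
/set parameter temperature 0.1
/set parameter num_ctx 32768
Implement a Rust binary search tree with serialization.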

Codestral

Model Collection Overview

The search on Hugging Face for "codestral" under Mistral AI yields the Codestral family: open-weight models specialized in code generation, completion, and editing across 80+ programming languages (e.g., Python, Java, JavaScript, C++). Released in 2024, they emphasize developer workflows like autocomplete, debugging, and multi-language support, with strong benchmarks on HumanEval (up to 81% pass@1 for 22B) and MultiPL-E. Context lengths reach 32K tokens natively.

Key variants:

  • Codestral-22B-v0.1: Dense Transformer-based, 22B parameters, instruct-tuned for code tasks. Base and chat versions available.
  • Mamba-Codestral-7B-v0.1: State-space model (Mamba2 architecture) for efficiency, 7B parameters, outperforming similar-sized Transformers in speed and memory while matching code quality.

Quantized GGUF formats (e.g., from TheBloke or bartowski repos) support local runs via llama.cpp or Ollama. No major 2025 updates in current listings—these remain the core models, with community fine-tunes.

On Bear (i7-13700K, RTX 3060 12 GB VRAM, 64 GB RAM), both run well with quantization, but the 22B requires compromises to achieve "good" performance.

Compatibility and Performance Assessment

Short Answer: Yes for both—the 7B Mamba excels (fast and accurate); the 22B is viable but quantized/offloaded for acceptable speeds.

Mamba-Codestral-7B-v0.1

Mamba2's linear-time scaling makes it more VRAM- and compute-efficient than Transformers, ideal for your hardware.

| Quantization Level | Approx. File Size (VRAM Est. for Weights) | Fits in 12 GB? | Quality Impact | Expected Speed on RTX 3060 (4K Context) |
|---|---|---|---|---|
| FP16 (unquantized) | ~14 GB | No (offload) | None | 40–60 t/s |
| Q8_0 | ~7.5 GB | Yes | Negligible | 50–70 t/s |
| Q5_K_M | ~5.5 GB | Yes | Minimal | 60–80 t/s |
| Q4_K_M | ~4.8 GB | Yes | Low | 70–90 t/s |
  • Performance Notes: Blisteringly fast due to Mamba's architecture—often 1.5–2x quicker than 7B Transformers on Ampere GPUs. Full GPU load at Q5 delivers interactive coding (e.g., real-time autocompletion). Benchmarks show it rivaling 13B models on code eval while using half the resources.

Codestral-22B-v0.1

Dense model; larger for better multi-language reasoning but hungrier on resources.

| Quantization Level | Approx. File Size (VRAM Est. for Weights) | Fits in 12 GB? | Quality Impact | Expected Speed on RTX 3060 (4K Context) |
|---|---|---|---|---|
| Q8_0 | ~23 GB | No | Negligible | N/A (heavy offload) |
| Q5_K_M | ~15 GB | No (offload ~15 layers) | Minimal | 12–20 t/s |
| Q4_K_M | ~13 GB | Barely (offload 5–10) | Low | 15–25 t/s |
| Q3_K_M | ~10.5 GB | Yes | Medium | 20–30 t/s |
| Q2_K | ~8 GB | Yes | High | 25–35 t/s (quality drop) |
  • Performance Notes: Q4 with partial offload to your 64 GB RAM yields usable speeds for code gen, though not as snappy as smaller models. Mamba's efficiency edge makes the 7B preferable unless you need the 22B's superior benchmark scores (e.g., 75%+ on RepoBench). Avoid long contexts (>8K) to keep KV cache low.

CPU and RAM Role

  • i7-13700K handles Mamba offloads seamlessly (negligible slowdown); for 22B, it manages 10–20 layers without frustration.
  • 64 GB RAM enables full model loading in RAM if VRAM overflows, supporting 32K contexts fluidly.
  • Tools: Ollama (ollama run codestral-mamba) or llama.cpp: Download GGUF from bartowski/mistralai_Mamba-Codestral-7B-v0.1-GGUF, run ./llama-cli --model codestral-7b-q5.gguf --n-gpu-layers -1 -c 4096.
  • Tips: Set temperature=0.1 for deterministic code; use --flash-attn for boosts. Test: "Generate a Go function for JWT validation with error handling."
  • Quality: Q5 preserves 95%+ of original capabilities; Mamba variant shines in low-resource scenarios.

In summary, Codestral is a strong addition to a local coding stack—the 7B Mamba is a no-brainer for Bear, offering pro-level performance without tweaks. Scale up to the 22B if precision trumps speed.
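As a concrete starting point for the 7B Mamba under llama.cpp, the sketch below combines the tips from the list above. The GGUF file name is an assumption based on the bartowski repository naming; adjust it to your download.

PowerShell
# One-shot generation with flash attention enabled; the file name is illustrative.
llama-cli --model .\mistralai_Mamba-Codestral-7B-v0.1-Q5_K_M.gguf `
  --n-gpu-layers -1 `
  -c 4096 `
  --flash-attn `
  --temp 0.1 `
  -p "Generate a Go function for JWT validation with error handling."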

CodeLlama

Model Collection Overview

The search on Hugging Face for "codellama" under Meta Llama yields the Code Llama family: a 2023 release (no major updates through 2025) of code-focused LLMs based on Llama 2, fine-tuned for generation, completion, infilling, and explanation across 20+ languages. They support 16K token contexts (extendable via RoPE), with strengths in benchmarks like HumanEval (~50–70% pass@1 depending on size). All models are gated—require Meta approval for access.

Key variants (all available in HF Transformers format; community GGUF/AWQ quants via TheBloke/bartowski repos for local runs):

  • 7B: CodeLlama-7b-hf (base), -Instruct-hf (chat/code tasks), -Python-hf (Python specialist).
  • 13B: CodeLlama-13b-hf (base), -Instruct-hf, -Python-hf.
  • 34B: CodeLlama-34b-hf (base), -Instruct-hf, -Python-hf.
  • 70B: CodeLlama-70b-hf (base), -Instruct-hf (no Python variant).

Instruct/Python tunes excel for interactive coding; base for raw completion. Quantized versions enable Ollama/llama.cpp deployment.

On Bear (i7-13700K, RTX 3060 12 GB VRAM, 64 GB RAM), the 7B and 13B run flawlessly, the 34B is workable with quantization, and the 70B is impractical.

Compatibility and Performance Assessment

Short Answer: Yes for 7B–34B (good performance with quants); no for 70B (VRAM overload even quantized).

7B Variants

Lightweight and efficient; full GPU load at high precision.

| Quantization Level | Approx. File Size (VRAM Est. for Weights) | Fits in 12 GB? | Quality Impact | Expected Speed on RTX 3060 (4K Context) |
|---|---|---|---|---|
| FP16 (unquantized) | ~14 GB | No (offload) | None | 40–60 t/s |
| Q8_0 | ~7.2 GB | Yes | Negligible | 50–70 t/s |
| Q5_K_M | ~5.3 GB | Yes | Minimal | 60–80 t/s |
| Q4_K_M | ~4.6 GB | Yes | Low | 70–90 t/s |
  • Performance Notes: Snappy for code autocompletion (e.g., 70+ t/s on Instruct). Python variant shines for scripting; rivals modern 7B coders.

13B Variants

Balanced sweet spot; quants fit comfortably.

| Quantization Level | Approx. File Size (VRAM Est.) | Fits in 12 GB? | Quality Impact | Expected Speed (4K Context) |
|---|---|---|---|---|
| Q8_0 | ~13.5 GB | No (offload) | Negligible | 25–35 t/s |
| Q5_K_M | ~9.5 GB | Yes | Minimal | 35–50 t/s |
| Q4_K_M | ~8 GB | Yes | Low | 40–55 t/s |
| Q3_K_M | ~6.5 GB | Yes | Medium | 45–60 t/s |
  • Performance Notes: Interactive for debugging/refactoring (40+ t/s at Q5). Instruct version handles multi-turn code chats well.

34B Variants

Larger for complex reasoning; needs aggressive quants/offload.

| Quantization Level | Approx. File Size (VRAM Est.) | Fits in 12 GB? | Quality Impact | Expected Speed (4K Context) |
|---|---|---|---|---|
| Q5_K_M | ~19 GB | No (offload ~20 layers) | Minimal | 10–18 t/s |
| Q4_K_M | ~16 GB | No (offload 10–15) | Low | 12–20 t/s |
| Q3_K_M | ~12.5 GB | Barely | Medium | 15–25 t/s |
| Q2_K | ~9 GB | Yes | High | 20–30 t/s (quality trade-off) |
  • Performance Notes: Usable for heavy tasks like full function gen, but Q4 offload to RAM adds ~1s latency per response. Tops older 30B on infill.

70B Variants

The flagship of the family, but resource-intensive; it is a dense model, so there are no MoE savings.

| Quantization Level | Approx. File Size (VRAM Est.) | Fits in 12 GB? | Quality Impact | Expected Speed |
|---|---|---|---|---|
| Q4_K_M | ~38 GB | No | Low | N/A (<3 t/s with heavy offload) |
| Q3_K_M | ~28 GB | No | Medium | N/A |
| Q2_K | ~19 GB | No | Very high | N/A |
  • Performance Notes: Overwhelms 12 GB VRAM; even max offload to 64 GB RAM yields <3 t/s. Cloud or 24+ GB GPU needed.

CPU and RAM Role

  • i7-13700K manages offloads for 13B+ (e.g., 15 layers with <10% slowdown).
  • 64 GB RAM absorbs KV cache for 16K contexts and spillover, enabling stable runs.
  • Access: Request via HF model page (Meta form; approved in ~1 day).
  • Tools: Ollama (ollama run codellama:13b-instruct-q5_K_M) or llama.cpp: Download GGUF from TheBloke/CodeLlama-13B-Instruct-GGUF, run ./llama-cli --model codellama-13b-instruct-q5.gguf --n-gpu-layers -1 -c 4096 --infill.
  • Tips: Use temperature=0.2 for code; enable --rope-scaling for longer contexts. Test: "Complete this C++ class for a neural net layer."
  • Quality: Q5 retains 95%+ benchmark fidelity; Python tunes boost lang-specific accuracy.

In summary, CodeLlama is a classic for local coding on Bear—the 7B and 13B deliver pro results effortlessly, and the 34B is there when you need more depth. Skip the 70B locally.
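Putting the access, tool, and tip bullets above into practice, a quick smoke test might look like the following (commands to type, not a captured transcript):

PowerShell
ollama run codellama:13b-instruct-q5_K_M

Chat
/set parameter temperature 0.2
Complete this C++ class for a neural net layer.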

Gemma2

The Gemma 2 9B Instruct Q5_K_M model is a quantized build optimized for local inference via Ollama or llama.cpp. It fits comfortably within Bear's VRAM while leveraging the ample system RAM for any overflow or long contexts.

Specifications

  • Model Size: 9B parameters (quantized to Q5_K_M).
  • File Size: Approximately 6.2 GB (GGUF format).
  • VRAM Requirement: ~6.5–7 GB for weights + KV cache (at 4K–8K context).
  • RAM Role: Your 64 GB DDR5 handles offloads seamlessly (e.g., for 32K+ contexts or multi-layer spillover), preventing swapping.

Performance

| Metric | Value on Bear | Notes |
|---|---|---|
| VRAM Usage | ~7 GB (full GPU load) | Fits entirely on RTX 3060; no offload needed for short contexts. |
| Inference Speed | 50–70 tokens/second (generation) | Prompt eval: 150–200 t/s. Interactive for coding tasks; faster than GLM-Z1-9B (~55–75 t/s) due to Gemma 2's efficiency. |
| Context Length | Up to 8K native (extendable to 32K via RoPE) | Set via Ollama: /set parameter num_ctx 32768. Your RAM supports longer chains without slowdown. |
| Quality Impact | Minimal loss (~95% of FP16) | Q5_K_M retains strong instruction-following and tool-calling (75%+ BFCL). |

Running Gemma2

To run via the Ollama Desktop app, search for gemma2:9b-instruct-q5_K_M and pull it (~6.2 GB download).

To run from PowerShell:

PowerShell
PS C:\Users\Mike Slinn> ollama run gemma2:9b-instruct-q5_K_M

Once the chat opens, type the following for balanced output:

Chat
/set parameter temperature 0.6
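The same parameters can also be passed per request over Ollama's local REST API, which is convenient when calling the model from scripts or editor plugins. A minimal PowerShell sketch follows, assuming Ollama is listening on its default port 11434.

PowerShell
# One-shot, non-streaming request; the options mirror the chat settings above.
$body = @{
    model   = "gemma2:9b-instruct-q5_K_M"
    prompt  = "Write a PowerShell function that returns the largest file in a directory."
    stream  = $false
    options = @{ temperature = 0.6; num_ctx = 8192 }
} | ConvertTo-Json -Depth 5

(Invoke-RestMethod -Method Post -Uri "http://localhost:11434/api/generate" `
    -ContentType "application/json" -Body $body).response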

The Winner

This section provides the answer to the question: if you only use one LLM for coding, which would it be? Note that single-LLM implementations are always slower and less accurate than more complex implementations. For more information, see Early Draft: Multi-LLM Agent Pipelines.

Comparison of Coding Models for Bear

Based on the models discussed above (excluding the infeasible giants like GLM-Z1-32B, DeepSeek-236B, Qwen3-480B, CodeLlama-70B, and Codestral-22B, which either don't fit or perform poorly due to heavy quantization/offload), I've evaluated the viable options for Bear (RTX 3060 12 GB VRAM, 64 GB RAM). "Best" here balances coding quality (e.g., HumanEval pass@1 for code generation accuracy), inference speed (tokens/second for interactive use; aim for 30+ t/s), and hardware fit (minimal quality loss via Q5/Q4 quantization, full GPU load where possible).

Key insights from benchmarks (HumanEval unless noted; sourced from 2025 evals like LiveCodeBench and BigCodeBench for recency):

  • Qwen3-Coder series leads in agentic coding (e.g., tool use, debugging) with scores ~80–88%.
  • DeepSeek-Coder-V2-Lite ties closely at 81%, excelling in multi-language breadth (338 langs).
  • Codestral-7B-Mamba is efficient but trails at ~75%.
  • CodeLlama-13B lags at ~60%, better for legacy tasks.

Performance Summary

| Model Variant | Size (Active Params) | Est. HumanEval (%) | Expected Speed (Q5/Q4, 4K Context) | VRAM Fit (Q5) | Strengths on Bear | Weaknesses |
|---|---|---|---|---|---|---|
| Qwen3-Coder-14B-Instruct | 14B (dense) | 85–88 | 30–50 t/s | Full (10 GB) | Top quality + agentic features; fast for size | Slightly less multi-lang than DeepSeek |
| DeepSeek-Coder-V2-Lite-Instruct | 16B (2.4B MoE) | 81 | 25–35 t/s | Full (10.5 GB) | Efficient MoE; 128K context; versatile langs | Minor speed hit vs. smaller models |
| Qwen3-Coder-7B-Instruct | 7B (dense) | 80–85 | 60–90 t/s | Full (5–8 GB) | Blazing speed; great for quick tasks | Less depth on complex projects |
| Codestral-7B-Mamba | 7B (Mamba) | 75 | 60–80 t/s | Full (5.5 GB) | Ultra-efficient architecture; low latency | Lower benchmark scores overall |
| CodeLlama-13B-Instruct | 13B (dense) | ~60 | 35–55 t/s | Full (9.5 GB) | Established for infilling; easy setup | Outdated; lower accuracy |
| gemma2:9b-instruct-q5_K_M | 9B | 82–86 | 50–70 t/s | ~6.5 GB | Fastest, best for agency | Weaker reasoning and planning; less accurate on complex logic |

Mamba's linear scaling gives Codestral a slight edge in very long contexts, but MoE in DeepSeek helps with efficiency.

HumanEval measures functional code generation; Qwen3 and DeepSeek shine on modern benches like LiveCodeBench (~58% for Qwen3 variants). All support 32K+ contexts via extensions.

Recommendation: Qwen3-Coder-14B-Instruct

This is the best overall for Bear—highest coding prowess (agentic tasks like browser automation or multi-step debugging) without sacrificing speed. It fits fully in VRAM at Q5 (minimal loss), delivering 30–50 t/s for responsive workflows. If you prioritize raw speed for autocompletion, swap to Qwen3-7B (nearly as capable, twice as fast). DeepSeek-V2-Lite is a close second for broader language support.

If your workflow has a lot of Python code or needs 128K contexts, use DeepSeek-Lite instead.

Again, better speed and quality can be achieved by using more than one LLM, as discussed in Early Draft: Multi-LLM Agent Pipelines.

