Published 2024-01-14.
Time to read: 4 minutes.
I've been playing with large language models (LLMs) online and locally. Local models are not as powerful, or as fast, as OpenAI's models, but you have complete control over them, without censorship.
Ollama is a way to run large language models (LLMs) locally. It is an open source tool built in Go for running and packaging generative machine learning models. Ollama can be used programmatically, via chat, via a web interface, and via its REST interface.
Here are some example open-source models that can be downloaded:
Model | Parameters | Size | Download Command |
---|---|---|---|
Llama 2 | 7B | 3.8GB | ollama run llama2 |
Mistral | 7B | 4.1GB | ollama run mistral |
Dolphin Phi | 2.7B | 1.6GB | ollama run dolphin-phi |
Phi-2 | 2.7B | 1.7GB | ollama run phi |
Neural Chat | 7B | 4.1GB | ollama run neural-chat |
Starling | 7B | 4.1GB | ollama run starling-lm |
Code Llama | 7B | 3.8GB | ollama run codellama |
Llama 2 Uncensored | 7B | 3.8GB | ollama run llama2-uncensored |
Llama 2 13B | 13B | 7.3GB | ollama run llama2:13b |
Llama 2 70B | 70B | 39GB | ollama run llama2:70b |
Orca Mini | 3B | 1.9GB | ollama run orca-mini |
Vicuna | 7B | 3.8GB | ollama run vicuna |
LLaVA | 7B | 4.5GB | ollama run llava |
Each model has unique attributes. Some are designed for describing images, while others are designed for generating music or for other special purposes.
The 70B parameter model really puts a strain on the computer, and takes much longer than other models to yield a result.
Installation
I installed Ollama on WSL like this:
$ curl -s https://ollama.ai/install.sh | sh
>>> Downloading ollama...
################################################################# 100.0%
>>> Installing ollama to /usr/local/bin...
>>> Creating ollama user...
>>> Adding ollama user to render group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> Enabling and starting ollama service...
Created symlink /etc/systemd/system/default.target.wants/ollama.service → /etc/systemd/system/ollama.service.
>>> NVIDIA GPU installed.
Bind: address already in use
The new user experience has a speed bump. Supposedly this was fixed, but I hit it.
$ ollama serve
Couldn't find '/home/mslinn/.ollama/id_ed25519'. Generating new private key.
Your new public key is:

ssh-ed25519 AAAAC3NzaC1lZDI1ABCDEFIDuwe6mUdM4aXFZYxpVYejCO/Kp/cq1HrfC59w7qb

Error: listen tcp 127.0.0.1:11434: bind: address already in use
I do not know why the error message appeared, because I checked and the port was available. Restarting the server solved the problem.
macOS Solution
On macOS, the Ollama app runs in the menu bar and automatically restarts the server if it stops. To stop the server, quit the menu bar app. You can restart the server like this:
$ brew services restart ollama
Linux Solution
On Linux, the Ollama server is added as a system service. You can control it with these commands:
$ sudo systemctl stop ollama
$ sudo systemctl start ollama
$ sudo systemctl restart ollama
Command Line Start
You can start the server from the command line, if it is not already running as a service:
$ ollama serve
2024/01/14 16:25:20 images.go:808: total blobs: 0
2024/01/14 16:25:20 images.go:815: total unused blobs removed: 0
2024/01/14 16:25:20 routes.go:930: Listening on 127.0.0.1:11434 (version 0.1.20)
2024/01/14 16:25:21 shim_ext_server.go:142: Dynamic LLM variants [cuda rocm]
2024/01/14 16:25:21 gpu.go:88: Detecting GPU type
2024/01/14 16:25:21 gpu.go:203: Searching for GPU management library libnvidia-ml.so
2024/01/14 16:25:21 gpu.go:248: Discovered GPU libraries: [/usr/lib/wsl/lib/libnvidia-ml.so.1]
2024/01/14 16:25:21 gpu.go:94: Nvidia GPU detected
2024/01/14 16:25:21 gpu.go:135: CUDA Compute Capability detected: 8.6
Ollama Models
Ollama loads models on demand; a model is only held in memory while queries are active. That means you do not have to restart Ollama after installing a new model or removing an existing one.
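Models can even be pulled through the running server's REST interface. Here is a minimal Ruby sketch, assuming the /api/pull endpoint with name and stream fields as described in the Ollama API documentation; orca-mini is just an example model name taken from the table above:

require 'json'
require 'net/http'

# Ask the running Ollama server to download a model; no restart is needed.
uri = URI('http://localhost:11434/api/pull')
request = Net::HTTP::Post.new(uri, 'Content-Type' => 'application/json')
request.body = { name: 'orca-mini', stream: false }.to_json

# Allow a long read timeout; downloads can take a while.
response = Net::HTTP.start(uri.hostname, uri.port, read_timeout: 3600) do |http|
  http.request(request)
end
puts JSON.parse(response.body)['status'] # "success" once the download completes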
My workstation has 64 GB RAM, a 13th generation Intel i7 and a modest NVIDIA 3060. I decided to try the biggest model to see what might happen. I downloaded the Llama 2 70B model with the following incantation. (Spoiler: An NVIDIA 4090 would have been a better video card for this Ollama model, and it would still be slow.)
$ ollama run llama2:70b
pulling manifest
pulling 68bbe6dc9cf4... 100% ▕████████████████████████████████████▏  38 GB
pulling 8c17c2ebb0ea... 100% ▕████████████████████████████████████▏ 7.0 KB
pulling 7c23fb36d801... 100% ▕████████████████████████████████████▏ 4.8 KB
pulling 2e0493f67d0c... 100% ▕████████████████████████████████████▏   59 B
pulling fa304d675061... 100% ▕████████████████████████████████████▏   91 B
pulling 7c96b46dca6c... 100% ▕████████████████████████████████████▏  558 B
verifying sha256 digest
writing manifest
removing any unused layers
success
>>> Send a message (/? for help)
I played around to learn what commands were available. For more information, see Tutorial: Set Session System Message in Ollama CLI by Ingrid Stevens.
>>> /?
Available Commands:
  /set            Set session variables
  /show           Show model information
  /bye            Exit
  /?, /help       Help for a command
  /? shortcuts    Help for keyboard shortcuts

Use """ to begin a multi-line message.

>>> Send a message (/? for help)
>>> /show
Available Commands:
  /show info         Show details for this model
  /show license      Show model license
  /show modelfile    Show Modelfile for this model
  /show parameters   Show parameters for this model
  /show system       Show system message
  /show template     Show prompt template

>>> /show modelfile
# Modelfile generated by "ollama show"
# To build a new Modelfile based on this one, replace the FROM line with:
# FROM llama2:70b

FROM /usr/share/ollama/.ollama/models/blobs/sha256:68bbe6dc9cf42eb60c9a7f96137fb8d472f752de6ebf53e9942f267f1a1e2577
TEMPLATE """[INST] <<SYS>>{{ .System }}<</SYS>>

{{ .Prompt }} [/INST]
"""
PARAMETER stop "[INST]"
PARAMETER stop "[/INST]"
PARAMETER stop "<<SYS>>"

>>> /show system
No system message was specified for this model.

>>> /show template
[INST] <<SYS>>{{ .System }}<</SYS>>

{{ .Prompt }} [/INST]

>>> /bye
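As the tutorial linked above describes, a session system message can be set from inside the REPL. A sketch, with an arbitrary example message (the confirmation output may vary by version, so it is omitted here):

>>> /set system You are a terse assistant. Answer in one sentence.

After that, /show system should display the new message, and it applies for the rest of the session.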
Prefixes such as USER: and ASSISTANT: are helpful when writing a request for the model to reply to.
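For example, here is a minimal Ruby sketch of building such a prompt (the question is the one used later in this article):

# Build a prompt with role prefixes; the string can then be passed as the
# "prompt" field of the /api/generate request shown further below.
prompt = <<~PROMPT
  USER: Why is there air?
  ASSISTANT:
PROMPT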
By default, Ollama models are stored in these directories:
- Linux: /usr/share/ollama/.ollama/models
- macOS: ~/.ollama/models
The Ollama library has many models available. OllamaHub has more. For applications that may not be safe for work, there is an equivalent uncensored Llama2 70B model that can be downloaded. Do not try to work with this model unless you have a really powerful machine!
$ ollama pull llama2-uncensored:70b
pulling manifest
pulling abca3de387b6... 100% ▕█████████████████████████████████████▏  38 GB
pulling 9224016baa40... 100% ▕█████████████████████████████████████▏ 7.0 KB
pulling 1195ea171610... 100% ▕█████████████████████████████████████▏ 4.8 KB
pulling 28577ba2177f... 100% ▕█████████████████████████████████████▏   55 B
pulling ddaa351c1f3d... 100% ▕█████████████████████████████████████▏   51 B
pulling 9256cd2888b0... 100% ▕█████████████████████████████████████▏  530 B
verifying sha256 digest
writing manifest
removing any unused layers
success
Some additional models that interested me:
- falcon - A large language model built by the Technology Innovation Institute (TII) for use in summarization, text generation, and chat bots.
- samantha-mistral - A companion assistant trained in philosophy, psychology, and personal relationships. Based on Mistral.
- yarn-llama2 - An extension of Llama 2 that supports a context of up to 128k tokens.
I then listed the models on my computer in another console:
$ ollama list
NAME          ID            SIZE    MODIFIED
llama2:70b    e7f6c06ffef4  38 GB   9 minutes ago
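The same list is available over the REST interface. Here is a minimal Ruby sketch, assuming the /api/tags endpoint and its response shape as described in the Ollama API documentation:

require 'json'
require 'net/http'

# List locally installed models, like `ollama list`.
uri = URI('http://localhost:11434/api/tags')
body = Net::HTTP.get(uri)
JSON.parse(body)['models'].each do |model|
  puts "#{model['name']}  #{model['size']} bytes"
end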
Running Queries
Ollama queries can be run in many ways. I used curl, jq and fold to write my first query from a bash prompt. The -s option for curl prevents the progress meter from cluttering up the screen, and the jq filter removes everything from the response except the desired text. The fold command wraps the text response to a width of 72 characters.
$ curl -s http://localhost:11434/api/generate -d '{
    "model": "llama2:70b",
    "prompt": "Why is there air?",
    "stream": false
  }' | jq -r .response | fold -w 72 -s
Air, or more specifically oxygen, is essential for life as we know it.
It exists because of the delicate balance of chemical reactions in
Earth's atmosphere, which has allowed complex organisms like ourselves
to evolve.

But if you're asking about air in a broader sense, it serves many
functions: it helps maintain a stable climate, protects living things
from harmful solar radiation, and provides buoyancy for various forms
of life, such as fish or birds.
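The same query can be issued from Ruby without curl, jq or fold. A minimal sketch using only the standard library; the generous read_timeout is an arbitrary choice for the slow 70B model:

require 'json'
require 'net/http'

uri = URI('http://localhost:11434/api/generate')
request = Net::HTTP::Post.new(uri, 'Content-Type' => 'application/json')
request.body = {
  model:  'llama2:70b',
  prompt: 'Why is there air?',
  stream: false,
}.to_json

# The 70B model can take a long time to answer, so allow a long read timeout.
response = Net::HTTP.start(uri.hostname, uri.port, read_timeout: 600) do |http|
  http.request(request)
end
puts JSON.parse(response.body)['response']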
Describing Images
I wrote this Ruby method to describe images:
# Requires base64 and a Ruby Ollama client gem (assumed here to be ollama-ai,
# inferred from the client API used below).
require 'base64'
require 'ollama-ai'

def describe_image(image_filename)
  @client = Ollama.new(
    credentials: { address: @address },
    options: {
      server_sent_events: true,
      temperature:        @temperature,
      connection:         { request: { timeout: @timeout, read_timeout: @timeout } },
    }
  )
  result = @client.generate(
    {
      model:  @model,
      prompt: 'Please describe this image.',
      images: [Base64.strict_encode64(File.read(image_filename))],
    }
  )
  puts result.map { |x| x['response'] }.join
end
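A hedged usage sketch; the instance variable values and the image filename are placeholders:

@address     = 'http://localhost:11434'
@model       = 'llama2'    # placeholder; any locally installed model name
@temperature = 0
@timeout     = 600
describe_image 'photo.jpg' # placeholder filename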
The results were ridiculous: an example of the famous hallucinations that LLMs entertain their audiences with. As the public becomes enculturated with these hallucinations, we may come to prefer them over human comedians. Certainly there will be a lot of material for the human comedians to fight back with. For example, when describing the photo of me at the top of this page:
Another attempt, with an equally ridiculous result:
The llava model is supposed to be good at describing images, so I installed it and tried again, with excellent results:
$ ollama pull llava:13b
You can try the latest LLaVA model online.
Documentation
- CLI Reference
- Ollama API. There are lots of controls for various models.
- Ollama Web UI. An Apple M1 Air works great.
- Crafting Conversations with Ollama-WebUI: Your Server, Your Rules