Introduction
When Apple announced the M3 chip in the new MacBook Pro at their “Scary Fast” event in October, the first question a lot of us were asking was, “How fast can LLMs run locally on the M3 Max?” There has already been plenty of performance testing of the M2 Ultra in the Mac Studio, which is essentially two M2 Max chips fused together. Its large unified memory made it capable of running a lot of LLM workloads, but for the first time we can get that much RAM (96GB to 128GB) in a MacBook, letting us take those Mac Studio workloads on the road (or show them off in a coffee shop with the new Space Black color scheme).
In this blog post, we will focus on the performance of running LLMs locally and compare the tokens per second for each of the different models. We will be using the default models pulled from Ollama, not custom fine-tuned models or anything imported from PyTorch, although Ollama supports those as well.
M3 Max LLM Testing Hardware
For this test, we are using the 14″ MacBook Pro with the upgraded M3 Max chip and the maximum RAM configuration.
Component | Specification |
---|---|
CPU | 16 cores (12 performance and 4 efficiency) |
GPU | 40 cores |
RAM | 128 GB |
Storage | 1 TB |
M3 Max Battery Settings
With the different power settings available and the Mac's ability to shift between performance states, we are using the High Power energy mode found under Battery settings to make sure no throttling skews the results.
Local LLMs with Ollama
Before we can start exploring the performance of Ollama on the M3 Max chip, we need to set it up. The process is simple: download the Ollama application from the official website, follow the installation instructions, and you are ready to go.
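Once installed, a quick check from the terminal confirms the CLI is available, and models can be pulled ahead of time so the initial download does not slow down your first prompt (standard Ollama CLI commands; the version output will vary by release):
% ollama --version
% ollama pull mistral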
LLM Model Selection
Ollama out of the box allows you to run a blend of censored and uncensored models. To determine the tokens per second on the M3 Max chip, we will run each of the eight models listed on the Ollama GitHub page individually; they are summarized in the table below, with a quick way to check what is already downloaded right after it.
Model | Parameters | Size | Download |
---|---|---|---|
Mistral | 7B | 4.1GB | ollama run mistral |
Llama 2 | 7B | 3.8GB | ollama run llama2 |
Code Llama | 7B | 3.8GB | ollama run codellama |
Llama 2 Uncensored | 7B | 3.8GB | ollama run llama2-uncensored |
Llama 2 13B | 13B | 7.3GB | ollama run llama2:13b |
Llama 2 70B | 70B | 39GB | ollama run llama2:70b |
Orca Mini | 3B | 1.9GB | ollama run orca-mini |
Vicuna | 7B | 3.8GB | ollama run vicuna |
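Once you have pulled a few of these, the standard Ollama list command shows which models are available locally and how much disk space each one takes:
% ollama list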
Running Mistral on M3 Max
Mistral is a 7B parameter model that is about 4.1 GB on disk. Run it locally via Ollama with the command:
% ollama run mistral
Mistral M3 Max Performance
Prompt eval rate comes in at 103 tokens/s. The eval rate of the response comes in at 65 tokens/s.
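These numbers come from Ollama's built-in timing statistics. To reproduce the measurements for any of the models below, add the --verbose flag to the run command, which prints the prompt eval rate and eval rate (along with total and load durations) after each response:
% ollama run mistral --verbose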
Running Llama 2 on M3 Max
Llama 2 is a 7B parameter model that is about 3.8 GB on disk. Run it locally via Ollama with the command:
% ollama run llama2
Llama 2 M3 Max Performance
Prompt eval rate comes in at 124 tokens/s. The eval rate of the response comes in at 64 tokens/s.
Running Code Llama on M3 Max
Code Llama is a 7B parameter model tuned to generate software code and is about 3.8 GB on disk. Run it locally via Ollama with the command:
% ollama run codellama
Code Llama M3 Max Performance
Prompt eval rate comes in at 140 tokens/s. The eval rate of the response comes in at 61 tokens/s.
Running Llama 2 Uncensored on M3 Max
Llama 2 Uncensored is a 7B parameter model that is about 3.8 GB on disk. Run it locally via Ollama with the command:
% ollama run llama2-uncensored
Llama 2 Uncensored M3 Max Performance
Prompt eval rate comes in at 192 tokens/s. The eval rate of the response comes in at 64 tokens/s.
Running Llama 2 13B on M3 Max
Llama 2 13B is a larger version of Llama 2 and is about 7.3 GB on disk. Run it locally via Ollama with the command:
% ollama run llama2:13b
Llama 2 13B M3 Max Performance
Prompt eval rate comes in at 17 tokens/s. The eval rate of the response comes in at 39 tokens/s.
Running Llama 2 70B on M3 Max
Llama 2 70B is the largest Llama 2 model and is about 39 GB on disk. Run it locally via Ollama with the command:
% ollama run llama2:70b
Llama 2 70B M3 Max Performance
Prompt eval rate comes in at 19 tokens/s. The eval rate of the response comes in at 8.5 tokens/s.
Running Orca Mini on M3 Max
Orca Mini is a 3B parameter model that is about 1.9 GB on disk. Run it locally via Ollama with the command:
% ollama run orca-mini
Orca Mini M3 Max Performance
Prompt eval rate comes in at 298 tokens/s. The eval rate of the response comes in at 109 tokens/s.
Running Vicuna on M3 Max
Vicuna is a 7B parameter model that is about 3.8 GB on disk. Run it locally via Ollama with the command:
% ollama run vicuna
Vicuna M3 Max Performance
Prompt eval rate comes in at 204 tokens/s. The eval rate of the response comes in at 67 tokens/s.
Summary of running LLMs locally on M3 Max
The power of the M3 Max chip brings a lot of desktop compute to the laptop in a portable package. For smaller models like Orca Mini, which use only a small amount of RAM, we see blazing fast tokens per second. With the quality of fine-tuned models like Mistral, and the solid performance of Llama 2 and Code Llama, you can easily run your own models locally at high speed without risking the privacy of your data.
Model | Eval rate (tokens/second) |
---|---|
Mistral | 65 |
Llama 2 | 64 |
Code Llama | 61 |
Llama 2 Uncensored | 64 |
Llama 2 13B | 39 |
Llama 2 70B | 8.5 |
Orca Mini | 109 |
Vicuna | 67 |
While steps such as fine-tuning and training larger models are still not as efficient on consumer laptops, the power of the M3 Max chip in a 14″ platform brings a lot of capability to developers who are building their own LLM projects and want to save costs compared to hosted models and APIs.
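For those projects, it helps that Ollama also serves a local REST API (on port 11434 by default), so an application can use the same models shown above without any data leaving the laptop. As a minimal sketch, a single curl call generates a completion; with streaming disabled, the JSON response includes eval_count and eval_duration (in nanoseconds) fields, which let you compute your own tokens-per-second numbers:
% curl http://localhost:11434/api/generate -d '{"model": "mistral", "prompt": "Why is the sky blue?", "stream": false}'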
Stay tuned for more content around building LLM powered applications and SaaS products! Stay in the loop and subscribe to our newsletter!