The AI industry is experiencing a shift toward making large language models (LLMs) smaller and more efficient, enabling users to run them on local machines without the need for powerful servers. This tutorial will guide you through running local LLMs with Cortex, highlighting the unique features and ease of use that make AI accessible to anyone with standard hardware.
Note: Cortex is currently under active development, which may lead to bugs or some features not functioning properly. You can report any issues through GitHub or Discord.
What is Cortex?
Cortex is a local AI API platform designed to run and customize Large Language Models (LLMs) easily and efficiently. It features a straightforward command-line interface (CLI) inspired by Ollama and is built entirely in C++. Installer packages are available for Windows, macOS, and Linux.
Users can select models from Hugging Face or use Cortex’s built-in models, which are stored in universal file formats for enhanced compatibility. The best part about using Cortex is its support for swappable engines, starting with llama.cpp, with plans to add ONNX Runtime and TensorRT-LLM in the future. Additionally, you get a functional server with a dashboard to view API commands and test them.
Getting Started with Cortex
Download and install Cortex by going to the official website https://cortex.so/.
After that, launch the terminal or PowerShell and type the following command to download the Llama 3.2 3B instruct model.
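$ cortex pull llama3.2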
It will prompt you to select one of several quantization versions of the model; just choose the default option, llama3.2:3b-gguf-q4-km. Depending on your internet speed, the download may take a while.
Available to download:
1. llama3.2:3b-gguf-q2-k
2. llama3.2:3b-gguf-q3-kl
3. llama3.2:3b-gguf-q3-km
4. llama3.2:3b-gguf-q3-ks
5. llama3.2:3b-gguf-q4-km (default)
6. llama3.2:3b-gguf-q4-ks
7. llama3.2:3b-gguf-q5-km
8. llama3.2:3b-gguf-q5-ks
9. llama3.2:3b-gguf-q6-k
10. llama3.2:3b-gguf-q8-0
Select a model (1-10): 5
Selected: llama3.2:3b-gguf-q4-km
Validating download items, please wait..
Start downloading: model.gguf
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 1104 100 1104 0 0 3255 0 --:--:-- --:--:-- --:--:-- 3266
100 1925M 100 1925M 0 0 2397k 0 0:13:42 0:13:42 --:--:-- 2430k
Start downloading: model.yml
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 1522 100 1522 0 0 4042 0 --:--:-- --:--:-- --:--:-- 4058
Model llama3.2:3b-gguf-q4-km downloaded successfully!
You can also pull a model from the Hugging Face Hub; just type the model’s repository ID after the `pull` command, and it will automatically download it.
$ cortex pull kingabzpro/Gemma-2-9b-it-chat-doctor
After the download is complete, type the `run` command with the model name to start the server.
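For the model we pulled earlier, that looks like this:
$ cortex run llama3.2:3b-gguf-q4-km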
Once everything is set up, you can start asking Llama 3.2 questions.
As shown below, the model responded accurately to the question about the tallest building in the world.
Starting server ...
Host: 127.0.0.1 Port: 39281
Server started
API Documentation available at: http://127.0.0.1:39281
In order to exit, type `exit()`
> What is the tallest building in the world?
The Burj Khalifa, located in Dubai, United Arab Emirates, is currently the tallest building in the world, standing at a height of 828 meters (2,722 feet) with 163 floors.
We even asked follow-up questions, and it understood the context. It’s simple and fast.
> tell me more about it.
The Burj Khalifa!
Here are some fascinating facts about the Burj Khalifa:
1. **Design and Construction**: The Burj Khalifa was designed by the American architectural firm Skidmore, Owings & Merrill, with Adrian Smith and John G. Buffman as the lead architects.
2. **Height and Floors**: The Burj Khalifa stands at a height of 828 meters (2,722 feet) with 163 ..........
When the server is running, you can also access the model via the Python API, cURL commands, or client libraries in other languages.
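For example, here is a minimal cURL sketch. It assumes the server exposes the OpenAI-compatible /v1/chat/completions endpoint listed in the API documentation page, using the model name we downloaded earlier:

$ curl http://127.0.0.1:39281/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama3.2:3b-gguf-q4-km",
        "messages": [
          {"role": "user", "content": "What is the tallest building in the world?"}
        ]
      }'

If the request succeeds, the response comes back as a standard chat-completion JSON object, so any OpenAI-compatible client should work by pointing its base URL at http://127.0.0.1:39281.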
Type the URL http://127.0.0.1:39281 in your browser and start exploring what you can do with your server.
If you want to see which models are running in the background and how much memory they are consuming, type the `ps` command.
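$ cortex ps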
+------------------------+-----------+-----------+---------+------------------------+
| Model | Engine | RAM | VRAM | Up time |
+------------------------+-----------+-----------+---------+------------------------+
| llama3.2:3b-gguf-q4-km | llama-cpp | 308.23 MB | 1.87 GB | 22 minutes, 31 seconds |
+------------------------+-----------+-----------+---------+------------------------+
Conclusion
Cortex is a new platform with significant potential to transform how we use LLMs both locally and in the cloud. Its robust server capabilities provide a wide range of features that make accessing and managing models both intuitive and powerful. Similar to Ollama, Cortex allows users to test their models directly in the terminal, simplifying the process and enhancing the user experience.
In this tutorial, we have learned about Cortex, how to install it, and how to download and use Llama 3.2 locally in the terminal. I highly recommend trying it out locally and sharing your experience.
Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master’s degree in technology management and a bachelor’s degree in telecommunication engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.