
Deployment of an LLM with local RAG Ollama and PrivateGPT

Olivier · 10 min read
Image generated by ChatGPT 4o

POC to obtain your private and free AI with Ollama and PrivateGPT

This article follows my first two articles on similar deployments: first with LM Studio as an inference server (LLM runtime), with its very restrictive license, and then with Jan, whose AGPL v3 license is one of the most ethical.

Here we will create this POC with Ollama, which is graciously distributed under the Open Source MIT license. All the information is here: https://ollama.com/
Ollama has no graphical interface of its own; it is driven from the command line (CLI), although it is entirely possible to add a web or more comprehensive GUI thanks to the many community projects that offer one. The goal here is to stay minimalist and get the job done with as few components as possible.

I assume you have followed my first article, at least for the theoretical explanations and the basic tools necessary for building this POC (Proof of Concept). Otherwise, I strongly encourage you to do so if you do not already have the required skills in this area!

Technologies required for this POC

  • Apple MacBook Pro M2 MAX platform with 32 GB of RAM.
  • Homebrew as a package manager for the macOS operating system.
  • iTerm2 as the terminal.
  • Python 3.11 with Pyenv to manage multiple versions of Python without risking instability on the MacBook.
  • Poetry as the package and dependency manager for the Python language.
  • Make, a cross-platform tool to automate the compilation and installation of software.
  • Ollama, a framework coded mostly in Go (83%) to run LLM models locally. Here we are under the MIT license, so it’s Open Source with a lot of freedom, including the ability to integrate the code into proprietary code. We love it for its simplicity and the encouragement to innovate that this license provides, but we like it less for the risk of having our code plundered by grave robbers, oops, Git repo looters.
  • PrivateGPT, the second major component of our POC, along with Ollama, will be our local RAG and our graphical interface in web mode. It provides us with a development framework in generative AI that is private, secure, customizable, and easy to use.

Knowledge required for this POC

In terms of knowledge, there are no prerequisites other than knowing how to use the command line in a terminal and an editor like Vim or Nano within that terminal, or even VS Code (Visual Studio Code), which is free and very useful if you do a little or a lot of development. VS Code is a source code editor developed by Microsoft, designed to be lightweight yet powerful while offering a comprehensive development experience. Oh yes, some knowledge of Git, but again the bare minimum: knowing how to clone a repository, and I’ll provide the commands if it’s new to you (see just below).
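For reference, and assuming the PrivateGPT repository is still hosted at the same address as when I wrote my first article, cloning it looks like this:

git clone https://github.com/zylon-ai/private-gpt.git
cd private-gpt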

Let’s go!

As mentioned above, I assume you know how to manage the basic tools necessary for the operation of PrivateGPT (Homebrew, Python, Pyenv, Poetry…) or that you have consulted my first article to deploy them.

Deployment of Components

If you have already deployed LM Studio or Jan, PrivateGPT, and HuggingFace_Hub by following my previous articles, then I suggest you create a new branch in your Git repository to run your tests for Ollama. This is what I will present here.

Creating a New Git Branch for PrivateGPT, Dedicated to Ollama

Navigate to your development directory /private-gpt

Ensure you are on your main branch ("main"); your terminal should display the following:

private-gpt git:(main)

Otherwise, switch to your main branch with the command:

git checkout main

Create a new branch suitable for this project with the Ollama framework. In my case, I named it “ollama-local-embeddings” to maintain the logic of my previous two articles.

git checkout -b ollama-local-embeddings

You are now in your new working branch; your terminal indicates:

private-gpt git:(ollama-local-embeddings)
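If your shell prompt does not display the branch name, you can also confirm which branch you are on with a plain Git command (available in any recent version of Git):

git branch --show-current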

Take this opportunity to update Poetry if you have not done so recently. At the time of writing this article, I am using version 1.8.3:

poetry self update

Verify the update by checking the Poetry version:

poetry --version
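With the version I used, the output looks like this (yours may differ slightly):

Poetry (version 1.8.3)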

Installation of Ollama on macOS

Official Ollama website: https://ollama.com

Download the archive for macOS, unzip it, and then move the Ollama application into your Applications folder.

Note that you can also install Ollama with Homebrew, in this case, the command is as follows:

brew install --cask ollama
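As an aside, Homebrew also provides a plain formula (without the desktop application) that can be run as a background service; if you prefer that route, the commands should look like this:

brew install ollama
brew services start ollama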

At the time of writing this documentation, Ollama is at version 0.1.46, released on June 20 on its GitHub repository: https://github.com/ollama/ollama

If everything is well installed, just like with LM Studio or Jan from my previous articles, you will already be able to play with a local generative AI on your Mac.

First steps with Ollama and essential commands:

We will check that everything works and download at least the “Llama 3” LLM model to compare with what we have already done together.
Everything that follows takes place in your preferred terminal, in my case I use iTerm2.

Check the installed version:

ollama --version 

You should get something like:

ollama version is 0.1.46

To list your already downloaded or created LLM models in Ollama:

ollama list

If you have just installed Ollama and have not loaded anything, your list will be empty.

Download an LLM model; in our case, we choose the Llama3 model:

ollama pull llama3

Launch an LLM model:

ollama run llama3 

Note that if you haven’t already downloaded it, it will be downloaded first and then started, and depending on your bandwidth, the download may take a long time!

After a few seconds related to starting your Ollama instance with the loaded LLM model, you will get a prompt similar to this:

>>> Send a message (/? for help)

You can test!

Testing Ollama with the Llama3 model
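For a quick smoke test, you can type any question at the prompt, for example something like:

>>> Explain in two sentences what Retrieval-Augmented Generation (RAG) is.

The answer is generated entirely on your Mac; nothing leaves your machine.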

To delete a model and free up space:

ollama rm modelname

For the in-prompt help, it’s simple:

/?

To switch between your models:

ollama run mistral

And to exit the command prompt:

/bye

Let’s now start the Ollama server:

You can start Ollama with the following command:

ollama serve

If, like me, you installed it from the archive available on the website, your Ollama instance is probably already running, and you will get the following message:
Error: listen tcp 127.0.0.1:11434: bind: address already in use
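If you really need to free the port (for example to restart the server by hand), you can first check which process is listening on it; on macOS the following command does the job:

lsof -nP -iTCP:11434 -sTCP:LISTEN

You can then quit the Ollama menu bar application (or stop the corresponding process) before running ollama serve again.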

Let’s easily test if the server is working. To do this, open a web browser and type this:

http://localhost:11434/

You should get an “Ollama is running” message at the top left of your browser window like this:

Testing the Ollama server operation
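If you prefer to stay in the terminal, you can run the same check with curl, and even ask the API to list the locally available models (both endpoints ship with Ollama):

curl http://localhost:11434/
curl http://localhost:11434/api/tags

The first call returns the text "Ollama is running", the second a JSON document describing your local models.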

Installation of dependencies for the operation of PrivateGPT with Ollama:

Let’s now install the Poetry dependencies necessary for the proper operation of PrivateGPT with Ollama. Note that the following command will remove the dependencies of our previous POCs with LM Studio and Jan. If you want to keep all three instances operational, then adapt your command to include all the dependencies (see the combined example after the list of arguments below).

poetry install --extras "ui llms-ollama embeddings-ollama vector-stores-qdrant"

You should see the installation of all the necessary packages for the project running in your terminal.

Some useful explanations of the arguments between the double quotes in this command line:

  • ui to use the Gradio UI
  • llms-ollama to connect to Ollama which, as a reminder, is a local Open Source LLM server under the MIT license.
  • embeddings-ollama to be able to use a model in Ollama locally for vectorizing our documents. These models are used to convert words, phrases, or entire documents into number vectors. These vectors capture semantic aspects of the text, allowing algorithms to work with text using mathematical operations. I will try to write an article later on about what vectors are.
  • vector-stores-qdrant for storing the created vectors in the Qdrant format.
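If, as mentioned above, you want to keep the LM Studio/Jan setup working alongside Ollama, the idea is simply to list the extras of both POCs in a single install command. Assuming your previous setup used the llms-openai-like and embeddings-huggingface extras (check the install command from my earlier articles), the combined command would look like this:

poetry install --extras "ui llms-ollama embeddings-ollama llms-openai-like embeddings-huggingface vector-stores-qdrant"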

Configuration of PrivateGPT

We will now modify the configuration file suitable for our POC, namely the settings-ollama.yaml file, which you should find at the root of your private-gpt directory.

To edit the file, in your terminal, I recommend using the nano editor:

nano settings-ollama.yaml

Here is the configuration used for this POC:

server:
  env_name: ${APP_ENV:ollama}

llm:
  mode: ollama
  max_new_tokens: 512
  context_window: 3900
  temperature: 0.1     #The temperature of the model. Increasing the temperature will make the model answer more creatively. A value of 0.1 would be more factual. (Default: 0.1)

embedding:
  mode: ollama

ollama:
  llm_model: llama3
  embedding_model: nomic-embed-text
  api_base: http://localhost:11434
  embedding_api_base: http://localhost:11434  # change if your embedding model runs on another ollama
  keep_alive: 5m
  tfs_z: 1.0              # Tail free sampling is used to reduce the impact of less probable tokens from the output. A higher value (e.g., 2.0) will reduce the impact more, while a value of 1.0 disables this setting.
  top_k: 40               # Reduces the probability of generating nonsense. A higher value (e.g. 100) will give more diverse answers, while a lower value (e.g. 10) will be more conservative. (Default: 40)
  top_p: 0.9              # Works together with top-k. A higher value (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text. (Default: 0.9)
  repeat_last_n: 64       # Sets how far back for the model to look back to prevent repetition. (Default: 64, 0 = disabled, -1 = num_ctx)
  repeat_penalty: 1.2     # Sets how strongly to penalize repetitions. A higher value (e.g., 1.5) will penalize repetitions more strongly, while a lower value (e.g., 0.9) will be more lenient. (Default: 1.1)
  request_timeout: 120.0  # Time elapsed until ollama times out the request. Default is 120s. Format is float.

vectorstore:
  database: qdrant

qdrant:
  path: local_data/private_gpt/qdrant

You will notice that in this test I am no longer using the embedding model BAAI/bge-small-en-v1.5 that I used in my previous two tests, but rather the nomic-embed-text model. The reason is very simple: Ollama provides an embedding (ingestion) engine usable by PrivateGPT, something PrivateGPT did not yet offer for LM Studio and Jan, but the BAAI/bge-small-en-v1.5 model is not available in Ollama’s model library.

We will see in our tests if this changes the quality of the answers provided.

You can of course change the LLM and text-embedding models, test other values for the temperature, or adjust the maximum number of tokens the LLM may generate (see the example below).
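For example, to try Mistral instead of Llama 3 (a purely illustrative change), you would pull the model and point the configuration at it:

ollama pull mistral

and then, in settings-ollama.yaml:

ollama:
  llm_model: mistral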

Loading the embedding model in Ollama:

Earlier we downloaded the Llama3 LLM, but since Ollama will also handle ingestion, digesting our documents and vectorizing them for PrivateGPT, we need to download the model declared in the configuration, namely nomic-embed-text, with the appropriate command:

ollama pull nomic-embed-text
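If you are curious, you can check that the embedding model responds, and see the size of the vectors it produces, by calling Ollama’s embeddings endpoint directly (the small Python one-liner simply counts the returned values):

curl -s http://localhost:11434/api/embeddings -d '{"model": "nomic-embed-text", "prompt": "Hello PrivateGPT"}' | python3 -c 'import json,sys; print(len(json.load(sys.stdin)["embedding"]))'

For nomic-embed-text you should get 768, which matters for the IMPORTANT NOTICE below.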

If you now list your models, you should have at least the following:

Result of the "ollama list" command after installing 2 models for this POC:

IMPORTANT NOTICE: If you have already followed one of my previous articles about my POCs on LM Studio and Jan, you likely have a Qdrant vector database that is set up for the BAAI/bge-small-en-v1.5 ingestion model we used. The vector dimensions are not the same from one model to another (384 for BAAI/bge-small-en-v1.5 versus 768 for nomic-embed-text)! To clear the Qdrant database, use this command in PrivateGPT:

make wipe

Which should return a result similar to this:

private-gpt git:(ollama-local-embeddings) ✗ make wipe
poetry run python scripts/utils.py wipe
17:46:59.739 [INFO ] private_gpt.settings.settings_loader - Starting application with profiles=['default', 'ollama']
- Deleted /Users/netspirit/DEV/private-gpt/local_data/private_gpt/docstore.json
- Deleted /Users/netspirit/DEV/private-gpt/local_data/private_gpt/index_store.json
Collection dropped successfully.

Starting PrivateGPT

This is the big moment: if everything has gone well so far, there is no reason it shouldn’t work. Suspense…

Still in your private-gpt directory, start PrivateGPT from the command line with make. Note that from now on you will always need to use this series of commands to start your PrivateGPT instance:

export PGPT_PROFILES=ollama 
make run

You can check that the Ollama profile is correctly set for your PrivateGPT instance with the command:

echo $PGPT_PROFILES

Which should return:

ollama
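If you prefer not to export the variable in your shell, you can also set the profile just for the duration of the command; both approaches are equivalent:

PGPT_PROFILES=ollama make run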

Using the PrivateGPT interface and testing the RAG

Connect using your preferred web browser to the address http://localhost:8001

You should see this:

Gradio UI web interface for PrivateGPT, here loaded with 3 PDF files of my course notes.

Except that you should not have any files present in the “Ingested Files” column. If you find your files from previous tests here, I invite you to delete them and then rerun the command to clear the contents of your Qdrant vector database:

make wipe

For my tests, I loaded into the RAG my course notes dedicated to leadership and mental health improvement: 3 PDF documents totaling approximately 30,000 words, just as I did for LM Studio and Jan.

If, while loading your own files, you get an error similar to:

ValueError: could not broadcast input array from shape (768,) into shape (384,)

Then reread the IMPORTANT NOTICE in the chapter “Loading the embedding model in Ollama”: your Qdrant index still contains 384-dimension vectors from BAAI/bge-small-en-v1.5, while nomic-embed-text produces 768-dimension vectors :-)

With Meta’s Llama 3 model, without any fine-tuning, the results are even more impressive than with my previous POCs under LM Studio and Jan, and the response speed remains comparable to ChatGPT 4o.

A demo using the context of my course notes on management

Have fun comparing queries with the context of your documents (Query Files button activated) versus queries without the context (LLM Chat button activated).

In the following example, after clearing the history, I used the same query but without the context provided by my course notes.

Demo without using the context of my “course notes” files, note the relevance of the LLM Llama3 model and the response speed

A final test, this time with the “Search Files” mode enabled, asking the same question. Here we can measure the almost instantaneous speed of the response and its relevance (validated against my course notes).

Note that my course notes are in French, so the excerpts chosen by the AI to respond to me will be in French.

Demo in “Search files” mode with the same question

Of course, just like with LM Studio and Jan, or ChatGPT, Claude, Gemini, and others, the most important thing after the technical components is you and your ability to challenge the AI, but that’s another topic ;-)

For the future

A bit of buzz…

It turns out that for several months, I have been working with three colleagues on the development of a local, private, highly secure, minimalist, and Open Source AI solution (at least MIT licensed) for organizations that want to better exploit their data while remaining sovereign in the sense of “owners of their data and the code used”. We will officially launch this project in the coming weeks. Stay tuned… Aude, Cyril, Antoine, and Olivier.
