Retrieval-Augmented Generation (RAG) is an advanced framework in natural language processing that significantly enhances the capabilities of chatbots and other conversational AI systems. It merges two critical components, retrieval and generation, to deliver more accurate, contextually relevant, and informative responses.
In this blog post we will build a RAG chatbot that uses the Mistral 7B model, served by Ollama, as the LLM and Upstash Vector as the retriever. Both Mistral 7B on Ollama and the RAG chatbot will be running on Fly.io.
For the chat UI we will use the Vercel AI SDK, a library for building AI-powered streaming text and chat UIs.
Create Upstash Vector Database
Upstash Vector is a serverless vector database designed for working with vector embeddings. We will store the embeddings generated from messages in Upstash Vector.
Let's log in to the Upstash Console and click "Create Index". Creating an index is pretty simple: since the embeddings our embedding model produces have 768 dimensions, we just need to set the index dimensions to 768 to make it compatible.
When we click "Next", we will need to select a payment plan. We can go with the free plan for this demo.
Deploy Ollama on Fly.io
Now, let's get the LLM ready. We will deploy it on fly.io using Ollama.
We first need to create a fly.io account. Then we should set up fly.io on our local machine. To do that, we install flyctl, the command-line interface for deploying projects to the fly.io platform, by following the flyctl installation guide.
After installing flyctl, we should create a fly.toml file, which will be used to deploy Ollama to fly.io. Let's create this file with the content below.
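A minimal configuration along the following lines should work; the app name, region, and volume name below are placeholders to adjust, and the image is the official Ollama Docker image.

# fly.toml - minimal sketch for running Ollama on fly.io
app = "ollama-noah2"            # placeholder app name; fly.io derives the URL from it
primary_region = "ord"          # pick a region close to you

[build]
  image = "ollama/ollama"       # official Ollama Docker image

[http_service]
  internal_port = 11434         # Ollama listens on 11434 by default
  auto_stop_machines = true
  auto_start_machines = true

[mounts]
  source = "models"             # fly.io volume to persist pulled models
  destination = "/root/.ollama"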
To deploy Ollama to fly.io using this file, open a terminal, move into the directory containing the file, and run the following command.
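fly launch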
This command will open the fly.io website to complete the configuration of the deployment. The important setting for us is the VM size: it should be larger than performance-2x to be able to run the LLM.
After completing the deployment, we can check whether Ollama is running by opening the endpoint fly.io gave us, which is https://ollama-noah2.fly.dev for this demo. It should return the string "Ollama is running".
Now we will add the Mistral 7B model and the Nomic embedding model to Ollama. We will use Mistral 7B as the text-generating LLM and the Nomic model to extract embeddings. To add these models to the Ollama instance running on fly.io, run the following commands from the terminal.
curl -X POST https://ollama-noah2.fly.dev/api/pull -d '{ "model": "nomic-embed-text" }'
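The Mistral 7B model can be pulled with the same kind of request; mistral is Ollama's tag for it.

curl -X POST https://ollama-noah2.fly.dev/api/pull -d '{ "model": "mistral" }'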
These commands pull the models onto the fly.io machine so Ollama can serve them.
We can quickly test the models by running a couple of commands in the terminal.
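For example, a generation request and an embeddings request against Ollama's standard /api/generate and /api/embeddings endpoints could look like this (the prompt text is arbitrary):

curl -X POST https://ollama-noah2.fly.dev/api/generate -d '{ "model": "mistral", "prompt": "What is a RAG chatbot?", "stream": false }'
curl -X POST https://ollama-noah2.fly.dev/api/embeddings -d '{ "model": "nomic-embed-text", "prompt": "What is a RAG chatbot?" }'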
The first command should return a text response generated by the Mistral 7B model, and the second should return the embeddings of the prompted text.
Create Next.js Application
We have our LLM ready. Now we can start working on our chatbot. Let's create a Next.js app first.
npx create-next-app@latest mistral-chat-app
Now we will install the dependencies needed for our chatbot and its UI.
pnpm install @upstash/vector langchain @langchain/community ai zod react-toastify
We should also create a .env file in the root directory of the Next.js project to store the Upstash Vector credentials and the Ollama fly.io base endpoint.
To get the Upstash Vector credentials, go back to the Upstash Vector Console, open the index we created, and copy the .env configuration under the Connect section.
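The copied snippet should look roughly like this; these are Upstash's standard variable names, and the actual values come from your own index.

UPSTASH_VECTOR_REST_URL="https://<your-index>.upstash.io"
UPSTASH_VECTOR_REST_TOKEN="<your-token>"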
Paste these values into the .env file of the project, and append the following fly.io base URL as well.
OLLAMA_BASE_URL="https://ollama-noah2.fly.dev"
Our barebones Next.js app is now ready.
Implement Chatbot API
For the chatbot, we should first create a POST endpoint. Its input is the user's message, and its output is the response generated by the LLM running on Ollama.
First, we will create route.ts under app/api/chat. This file will expose the POST endpoint at the /api/chat path.
In this file, we define the question and answer prompts. With these prompts, we give the LLM the context and the question and expect it to answer the question using that context. We ask the LLM to act as an expert on Retrieval-Augmented Generation (RAG) chatbots and to give information about RAG chatbots.
After defining the prompts, we set up ChatOllama, OllamaEmbeddings, and UpstashVectorStore using @langchain/community. These are all implemented in the LangChain community package and help us build our chatbot API with very little effort.
We also implement this POST handler as a chain that streams the LLM response, so the UI is not blocked while waiting for the whole response.
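A minimal sketch of this route is shown below. It is an approximation rather than the exact code: the prompt wording, the choice to store each incoming question as a document, and the chain composition are illustrative, while the imports come from the standard LangChain, @langchain/community, @upstash/vector, and Vercel AI SDK packages.

// app/api/chat/route.ts - minimal sketch of the RAG chat endpoint
import { StreamingTextResponse } from "ai";
import { ChatOllama } from "@langchain/community/chat_models/ollama";
import { OllamaEmbeddings } from "@langchain/community/embeddings/ollama";
import { UpstashVectorStore } from "@langchain/community/vectorstores/upstash";
import { Index } from "@upstash/vector";
import { Document } from "@langchain/core/documents";
import { PromptTemplate } from "@langchain/core/prompts";
import { RunnableSequence, RunnablePassthrough } from "@langchain/core/runnables";
import { BytesOutputParser } from "@langchain/core/output_parsers";
import { formatDocumentsAsString } from "langchain/util/document";

export async function POST(req: Request) {
  // useChat sends { messages: [...] }; the last entry is the user's question.
  const { messages } = await req.json();
  const question = messages[messages.length - 1].content;

  // Mistral 7B and the Nomic embedding model, both served by Ollama on fly.io.
  const model = new ChatOllama({
    baseUrl: process.env.OLLAMA_BASE_URL,
    model: "mistral",
  });
  const embeddings = new OllamaEmbeddings({
    baseUrl: process.env.OLLAMA_BASE_URL,
    model: "nomic-embed-text",
  });

  // Upstash Vector index used to store message embeddings and retrieve context.
  const index = new Index({
    url: process.env.UPSTASH_VECTOR_REST_URL!,
    token: process.env.UPSTASH_VECTOR_REST_TOKEN!,
  });
  const vectorStore = new UpstashVectorStore(embeddings, { index });
  const retriever = vectorStore.asRetriever();

  // Store the incoming message so it becomes retrievable context for later turns.
  await vectorStore.addDocuments([new Document({ pageContent: question })]);

  // Question-answer prompt: the LLM acts as a RAG expert answering from context.
  const prompt = PromptTemplate.fromTemplate(
    `You are an expert on Retrieval-Augmented Generation (RAG) chatbots.
Use the following context to answer the question.

Context: {context}
Question: {question}
Answer:`
  );

  // Chain: retrieve context -> fill the prompt -> call Mistral -> stream bytes back.
  const chain = RunnableSequence.from([
    {
      context: retriever.pipe(formatDocumentsAsString),
      question: new RunnablePassthrough(),
    },
    prompt,
    model,
    new BytesOutputParser(),
  ]);

  const stream = await chain.stream(question);
  return new StreamingTextResponse(stream);
}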
To test the POST API quickly, we can send a request from the terminal after starting the app.
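For example, a request with the messages array that our route expects (assuming the dev server runs on port 3001, as in this demo):

curl -X POST http://localhost:3001/api/chat \
  -H "Content-Type: application/json" \
  -d '{ "messages": [{ "role": "user", "content": "What is a RAG chatbot?" }] }'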
The output should be a stream of objects displayed in JSON format on the terminal.
Implement Chatbot UI
We have the POST endpoint ready. Now we need the UI for our chatbot. For the UI, we will use the useChat hook from the Vercel AI SDK to display the messages generated by the LLM.
Let's open the app/page.tsx file to build our chat window. The following code implements a very basic chatbot UI for this demo.
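A sketch of such a page is below. The layout, class names, and the ChatMessageBubble import path are illustrative (assuming the default @/ path alias from create-next-app); useChat posts to our /api/chat route.

"use client";

// app/page.tsx - minimal chat window sketch
import { useChat } from "ai/react";
import { ChatMessageBubble } from "@/components/ChatMessageBubble";

export default function Home() {
  // useChat manages the input state, the message list, and the streaming
  // POST request to /api/chat for us.
  const { messages, input, handleInputChange, handleSubmit } = useChat({
    api: "/api/chat",
  });

  return (
    <main className="mx-auto flex h-screen max-w-2xl flex-col p-4">
      <div className="flex-1 overflow-y-auto">
        {messages.map((message) => (
          <ChatMessageBubble key={message.id} message={message} />
        ))}
      </div>
      <form onSubmit={handleSubmit} className="mt-4 flex gap-2">
        <input
          className="flex-1 rounded border p-2"
          value={input}
          onChange={handleInputChange}
          placeholder="Ask something about RAG chatbots..."
        />
        <button type="submit" className="rounded border px-4">
          Send
        </button>
      </form>
    </main>
  );
}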
The Vercel AI SDK makes streaming in chat apps very easy.
The file above is the chatbot's ChatWindow, which displays messages in message bubbles. Now we need the chat message bubble component itself. The following TypeScript file implements a very basic message bubble.
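A minimal sketch, assuming the component lives at components/ChatMessageBubble.tsx and receives a Message object from the Vercel AI SDK; the styling is illustrative.

// components/ChatMessageBubble.tsx - minimal message bubble sketch
import type { Message } from "ai";

export function ChatMessageBubble({ message }: { message: Message }) {
  const isUser = message.role === "user";
  return (
    <div
      className={`my-2 w-fit max-w-[80%] rounded-lg p-3 ${
        isUser ? "ml-auto bg-blue-100" : "mr-auto bg-gray-100"
      }`}
    >
      <span className="text-xs text-gray-500">{isUser ? "You" : "Bot"}</span>
      <p>{message.content}</p>
    </div>
  );
}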
Our chatbot application is ready! We can test it by opening localhost:3001 in any browser.
Deploy the Chatbot to Fly.io
Finally, we will deploy the chatbot to fly.io, as we did for Ollama.
The flyctl CLI can detect that the project is a Next.js app, so we only need to run the same command we used to deploy Ollama.
Let's open the terminal again, move into the root directory of the project and run the following command.
fly launch
Again, this command will open the fly.io website for further configuration of the deployment. We can use the default machine size for this demo project.
After the deployment completes, we can reach the RAG chatbot at the URL given by fly.io. For this demo project, it is https://mistral-chat-app.fly.dev.
Conclusion
At the end of this blog post, we have two apps running on fly.io.
The first one is Ollama, which runs the Mistral 7B LLM to generate responses to questions in a given context and the Nomic embedding model to extract embeddings from text input.
The second one is the RAG chatbot, a Next.js application built with the Vercel AI SDK and LangChain. This application calls Ollama running on fly.io to generate text with the Mistral 7B model, and it uses Upstash Vector to store embeddings and retrieve them from the vector index.
The project implemented in this blog post is just a proof of concept. It has a very basic UI and runs with minimal resources. It could be improved considerably with a better UI and more resources to make the app perform better.