
Talk to your documents using a simple RAG pipeline, which we will build with the help of a locally-hosted Ollama model, pure Java, and Langchain4j! 📚 🧠 🤖
All of this will run on our local computer: no API keys or subscriptions needed — just pure experimenting with AI integration in a very simple Java application. I'll also try to keep things beginner-friendly!
What is RAG?
Now, let's see what Wikipedia has to say about RAG:
Retrieval-augmented generation (RAG) is a technique that enables large language models (LLMs) to retrieve and incorporate new information. With RAG, LLMs do not respond to user queries until they refer to a specified set of documents. These documents supplement information from the LLM's pre-existing training data. This allows LLMs to use domain-specific and/or updated information that is not available in the training data.
I think we can simplify this in the spirit of this tutorial, to keep things beginner-friendly and specific to our goal here:
RAG lets language models use information from your own documents to answer your questions.
So, without RAG, the models rely only on what they were trained on, but with RAG, they can use additional external or private information from documents we define.
RAG combines two main tools: an embedding model and a chat model. The embedding model turns text into vectors called embeddings. These are really just numerical representations of the text — think of a list or an array of numbers (e.g. [0.12, -1.4, 3.7, …]). The embeddings are stored in an embedding store (this can be an in-memory structure or a vector database). When the user asks a question, the system can quickly find the relevant information from the embedding store and pass it to the chat model, which reads it and uses it to generate an informed response.

Now that we have some understanding of what's going on, let's create a more detailed plan:
- Load the document we want to use for our RAG Pipeline
- Split the document into chunks
- Initialize the Embedding Model
- Create the embeddings
- Store the embeddings into an Embedding Store
- Create a content retriever, using the embedding model and the embedding store
- Initialize the (Ollama) Chat Model
- Create the final AI Assistant, using the chat model and the content retriever
Prerequisites
Install Ollama on our machine
Just go to their website: https://ollama.com/download.
Download the appropriate file for your OS and install it like you would install any other application. This installs the ollama command line tool, and you will be able to run ollama models locally with just a simple command like:
ollama run llama3.2:1b
which will run one of the smallest models Ollama currently offers (~1.3 GB). It's not a very good model, but it's small and will run fast locally, which is perfect for our use case. You can later experiment with all the free models Ollama provides in its model library.
This will be the Chat Model, answering our questions.
So after you run the previous command in your terminal, you should see something like this: the model is downloaded, started, and a prompt opens where you can chat with the LLM.

But running the model also exposes an HTTP API on localhost (port 11434 by default), which we can interact with. This will happen under the hood via the langchain4j-ollama module we'll add in a moment.
Next, grab nomic-embed-text:latest, which will be our Embedding Model. Embedding models can't be chatted with interactively, so this time we pull the model instead of running it:
ollama pull nomic-embed-text:latest
We now have both the Embedding Model (nomic-embed-text:latest) and the Chat Model (llama3.2:1b) available locally on our machine!
Set up Java Project
Create a Java Gradle project and add the following two dependencies to the build.gradle file:
dependencies {
    ...
    implementation 'dev.langchain4j:langchain4j:1.0.1'
    implementation 'dev.langchain4j:langchain4j-ollama:1.0.1-beta6'
    ...
}
Choose documents to talk to
I asked ChatGPT to create some descriptive text about a fictional animal that we can use as a Document and ask questions about it. That's what it came up with:
The Fluffnook (Pellucia memorialis)
The Fluffnook is a small, rabbit-sized mammal native to the misty highlands of the fictional Veridian Mountains. Standing approximately 12 inches tall and weighing 3–4 pounds, this creature is covered in dense, iridescent fur that shifts from deep purple to silver depending on the angle of light. The Fluffnook's most distinctive feature is its oversized, luminescent ears that can rotate 180 degrees and glow with a soft blue light during twilight hours. These nocturnal creatures are herbivorous, feeding primarily on moonbell flowers and crystalline moss that grows on volcanic rocks. Fluffnooks are known for their exceptional memory — they can recall the exact location of food sources for up to seven years and have been observed creating intricate mental maps of their territory spanning over 50 square miles. They live in small family groups of 4–6 individuals and communicate through a series of melodic chirps that can be heard up to two miles away on clear nights. The average lifespan of a Fluffnook is 25–30 years, and they reproduce only once every three years, typically giving birth to twins during the spring equinox.
I liked it — we'll ask our model questions about the Fluffnook!
I also asked it to generate some questions that we can ask right away. I found those three to be enough for us to check if the model really uses this information:
- How much does a typical Fluffnook weigh?
- Where do Fluffnooks live?
- What do Fluffnooks eat?
Now let's start with the fun part!
Implementation
Create a Java class with the good old public static void main(String[] args) method. (Can't wait for Java 25, btw…)
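For reference, here are roughly the imports the snippets below rely on; they are taken from langchain4j 1.0.x, so exact package names may differ slightly in other versions:
import static dev.langchain4j.data.document.loader.FileSystemDocumentLoader.loadDocument;

import dev.langchain4j.data.document.parser.TextDocumentParser;
import dev.langchain4j.data.document.splitter.DocumentSplitters;
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.chat.ChatModel;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.model.ollama.OllamaChatModel;
import dev.langchain4j.model.ollama.OllamaEmbeddingModel;
import dev.langchain4j.rag.content.retriever.ContentRetriever;
import dev.langchain4j.rag.content.retriever.EmbeddingStoreContentRetriever;
import dev.langchain4j.service.AiServices;
import dev.langchain4j.service.SystemMessage;
import dev.langchain4j.service.UserMessage;
import dev.langchain4j.store.embedding.EmbeddingStore;
import dev.langchain4j.store.embedding.inmemory.InMemoryEmbeddingStore;

import java.util.List;
import java.util.Scanner;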
Now, let's follow the steps from our plan:
1. Load the document we want to use for our RAG Pipeline
var document = loadDocument( getResourcePath( "FluffnookKnowledgeBase.txt" ), new TextDocumentParser() );
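The getResourcePath helper isn't part of langchain4j; it's just a small utility that resolves a file from src/main/resources to a filesystem path. A minimal sketch (assuming the knowledge-base file lives in the resources folder, the class is called Main, and you also import java.nio.file.Path, java.net.URISyntaxException and java.util.Objects) could look like this:
static Path getResourcePath( String fileName )
{
    // Hypothetical helper: look up the file on the classpath and turn its URL into a Path.
    try {
        var url = Objects.requireNonNull(
                Main.class.getClassLoader().getResource( fileName ),
                "Resource not found: " + fileName );
        return Path.of( url.toURI() );
    } catch ( URISyntaxException e ) {
        throw new RuntimeException( e );
    }
}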
2. Split the document into chunks
var splitter = DocumentSplitters.recursive( 400, 50 );
var segments = splitter.split( document );
This step is very important, because how we split the document can have a big effect on the results we get from the model. You may need to experiment with different values to find the best configuration for your use case.
The first argument is the chunk size and the second one is the chunk overlap (both in characters). You can check how our text gets split up with ChunkViz, which is a really great tool for visualizing how texts get split:

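If you prefer to stay in the terminal, you can also just print the resulting segments and eyeball them:
// Optional: print each chunk and its length to see how the document was split.
for (var segment : segments) {
    System.out.println("--- " + segment.text().length() + " chars ---");
    System.out.println(segment.text());
}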
3. Initialize the Embedding Model
var embeddingModel = initEmbeddingModel();
Here is the helper function (maybe put it in a Utils class or something, to keep things tidy):
static EmbeddingModel initEmbeddingModel()
{
    return OllamaEmbeddingModel.builder()
            .baseUrl("http://localhost:11434")
            .modelName("nomic-embed-text:latest")
            .build();
}
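A quick way to verify the embedding model is reachable is to embed a single sentence and print the vector size (for nomic-embed-text this should be 768):
// Optional sanity check: embed one sentence and inspect the resulting vector.
var testEmbedding = embeddingModel.embed("Hello, Fluffnook!").content();
System.out.println("Embedding dimension: " + testEmbedding.dimension());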
4. Create the embeddings
List<Embedding> embeddings = embeddingModel.embedAll( segments ).content();
5. Store the embeddings in an Embedding Store
I’ll use an InMemoryEmbeddingStore here for simplicity, but using a vector database is also very straightforward. I might write a separate article about this.
EmbeddingStore<TextSegment> embeddingStore = new InMemoryEmbeddingStore<>();
embeddingStore.addAll( embeddings, segments );
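If you're curious, you can already query the store by hand; this is essentially what the content retriever in the next step will do for us automatically (EmbeddingSearchRequest and its result types live in dev.langchain4j.store.embedding):
// Manually search the store for segments similar to a sample question.
var questionEmbedding = embeddingModel.embed("Where do Fluffnooks live?").content();
var searchResult = embeddingStore.search( EmbeddingSearchRequest.builder()
        .queryEmbedding( questionEmbedding )
        .maxResults( 3 )
        .build() );
searchResult.matches().forEach( match ->
        System.out.println( match.score() + " -> " + match.embedded().text() ) );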
6. Create a content retriever, using the embedding model and the embedding store
var contentRetriever = initContentRetriever( embeddingStore, embeddingModel );
Here is the helper function:
static ContentRetriever initContentRetriever( EmbeddingStore<TextSegment> embeddingStore, EmbeddingModel embeddingModel )
{
    return EmbeddingStoreContentRetriever.builder()
            .embeddingStore( embeddingStore )
            .embeddingModel( embeddingModel )
            .maxResults(3)
            .minScore(0.5)
            .build();
}
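Before wiring the retriever into an assistant, you can try it out on its own (Query comes from dev.langchain4j.rag.query, Content from dev.langchain4j.rag.content):
// Optional: see which segments the retriever returns for a sample question.
var retrieved = contentRetriever.retrieve( Query.from("What do Fluffnooks eat?") );
for (var content : retrieved) {
    System.out.println( content.textSegment().text() );
}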
7. Initialize the (Ollama) Chat Model
var chatModel = initOllamaChatModel();
And the corresponding helper function:
static ChatModel initOllamaChatModel()
{
    return OllamaChatModel.builder()
            .baseUrl("http://localhost:11434")
            .modelName("llama3.2:1b")
            .logRequests(true)
            .build();
}
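To make sure the chat model itself responds before we add any retrieval, you can send it a plain message (this completely bypasses RAG):
// Quick check: talk to the model directly, without any retrieved context.
System.out.println( chatModel.chat("Introduce yourself in one sentence.") );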
8. Create the final AI Assistant, using the chat model and the content retriever
var assistant = AiServices.builder(RagDocumentAssistant.class)
        .chatModel(chatModel)
        .contentRetriever(contentRetriever)
        .build();
Here we pass an interface that we define ourselves. We can even give it a system prompt via the @SystemMessage annotation, to give our model some extra hints. Here is the code:
public interface RagDocumentAssistant {
    @SystemMessage("You are a helpful assistant that answers questions about a fictional animal called Fluffnook.")
    String answer(@UserMessage String question);
}
And that’s really all we need to start chatting with our model about the Fluffnooks!
Let’s finish this off with the code for handling the user input and actually asking the model:
try (var scanner = new Scanner(System.in)) {
    while (true) {
        System.out.print("\nEnter your question (or 'exit' to quit): ");
        var question = scanner.nextLine();
        if ("exit".equalsIgnoreCase(question)) {
            break;
        }
        System.out.println("Thinking...");
        var answer = assistant.answer(question);
        System.out.println("Answer: " + answer);
    }
}
System.out.println("Goodbye!");
Now let’s run the program and ask some questions!

Looks good! We got all the answers we needed for the specific information we provided in the document.
Conclusion
Langchain4j makes RAG pretty simple! This is a very basic example, of course; real-world applications usually rely on more advanced RAG strategies, but we managed to get good results even with our simple approach.
Thanks for reading!