Retrieval Augmented Generation in Practice: Building Search for Connected Notes
Chunking, reranking, and vector search implementation with LLMs in Zettelgarden
I wrote a few weeks ago about my experience with RAG (retrieval augmented generation). Since then I have been experimenting more with it in Zettelgarden and have seen some interesting results.
Before I get into that: I've soft launched Zettelgarden, and you can sign up and try it out at https://zettelgarden.com. There is a little landing page that doesn't quite express how cool it is or mention any of the RAG features; I will need to improve that. Take a look at my previous article for an introduction to what Zettelgarden and RAG are.
Retrieval Augmented Generation (RAG)
I think RAG is conceptually simple. The point is to use LLMs to retrieve information, to build a long-term memory into software that has no memory. The main steps of RAG are:
LLMs use some magic to create special numbers that represent each piece of text (“embeddings”)
When you want to search, your written query gets turned into more special numbers, which are then compared to what is already in the database, returning the top results. (“vector search”)
That's basically it. Putting aside the black box of how embeddings work, most RAG implementations share that basic workflow at their core.
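To make that concrete, here is a minimal sketch of that core loop in Go. None of this is Zettelgarden's actual code: storedEmbedding and the query vector argument are stand-ins for whatever your database layer and embedding model return, and the comparison is plain cosine similarity.

// storedEmbedding is a stand-in for a chunk of text and its precomputed
// embedding as it might sit in the database.
type storedEmbedding struct {
    CardPK    int
    Chunk     string
    Embedding []float64
}

// cosineSimilarity compares two embedding vectors; higher means more similar.
func cosineSimilarity(a, b []float64) float64 {
    var dot, normA, normB float64
    for i := range a {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

// vectorSearch scores every stored chunk against the query embedding and
// returns the top k matches. queryVec must come from the same embedding
// model that produced the stored embeddings.
func vectorSearch(queryVec []float64, store []storedEmbedding, k int) []storedEmbedding {
    sorted := make([]storedEmbedding, len(store))
    copy(sorted, store)
    sort.Slice(sorted, func(i, j int) bool {
        return cosineSimilarity(queryVec, sorted[i].Embedding) >
            cosineSimilarity(queryVec, sorted[j].Embedding)
    })
    if k > len(sorted) {
        k = len(sorted)
    }
    return sorted[:k]
}

A real implementation would usually push this comparison down into the database (for example with a vector extension) rather than loading every embedding into memory, but the idea is the same.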
I am building RAG into Zettelgarden both to improve how search works and to build into the system a method of generating new ideas. Zettelkastens work by encouraging new ideas to spring up from serendipity, but also by linking cards together and putting visual reminders on the cards. My bet is that LLMs can make a big impact on suggesting potential and interesting connections. In the worst case, LLMs are giving me the ability to generate incredible search results without a lot of code.
There are a number of different ways you can tweak the way RAG works:
You can improve the way embedding works by chunking differently.
You can improve the search results by reranking them after the vector search.
You can also preprocess the query and actually query the database in a different way than the user input.
Chunking
I mentioned in my last article the struggles I was having with chunking. I naively thought you could dump in an entire podcast transcript, calculate embeddings for the whole text, then search based on that. The challenge is that while embedding is a black box, it's not actually magic. I have been using embedding vectors with 1024 floating-point values, so there are only 1024 numbers available to represent the text. The LLM can pack significantly more meaning into those 1024 numbers when the input text is 100 characters than when it is 100,000.
There are different approaches to chunking. I've taken a simple but less optimal approach of chunking based on sentences: each period marks a new chunk. These chunks are then fed into the LLM to create embeddings. The code that does this is actually not all that interesting. It's essentially: split the string by periods, feed each chunk into the LLM, store the results in the database. The wrapper function looks like this:
func (s *Handler) ChunkEmbedCard(userID, cardPK int) error {
    // Split the card's text into per-sentence chunks.
    chunks, err := s.GetCardChunks(userID, cardPK)
    if err != nil {
        log.Printf("error in chunking %v", err)
        return err
    }
    // Generate one embedding per chunk.
    embeddings, err := llms.GenerateEmbeddingsFromCard(s.DB, chunks)
    if err != nil {
        log.Printf("error generating embeddings: %v", err)
        return err
    }
    log.Printf("chunks %v - count %v", len(chunks), len(embeddings))
    // Persist the embeddings alongside the card for later vector search.
    llms.StoreEmbeddings(s.DB, userID, cardPK, embeddings)
    return nil
}
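For completeness, the splitting step itself could look something like the sketch below. This is an illustration, not the actual GetCardChunks; the function name is made up.

// splitIntoSentenceChunks is a sketch of per-sentence chunking: split the
// card body on periods, trim whitespace, and drop empty pieces.
func splitIntoSentenceChunks(body string) []string {
    var chunks []string
    for _, sentence := range strings.Split(body, ".") {
        sentence = strings.TrimSpace(sentence)
        if sentence != "" {
            chunks = append(chunks, sentence)
        }
    }
    return chunks
}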
Reranking
The results returned with improved chunking are much better, but still not perfect. The next technique I have been experimenting with is reranking. Reranking is simpler than it sounds: given the vector search results, ask an LLM to choose which ones most closely match the query. This is conceptually simple as well. In the following function I ask the LLM to provide a series of floating-point numbers rating how relevant it thinks each result is.
func RerankResults(c *openai.Client, query string, input []models.CardChunk) ([]float64, error) {
    log.Printf("start")

    // Build a short numbered summary of each vector search result.
    summaries := make([]string, len(input))
    for i, result := range input {
        summaries[i] = fmt.Sprintf("%d - %s - %s",
            i+1,
            result.Title,
            result.Chunk)
    }

    prompt := fmt.Sprintf(`Given the search query "%s", rate the relevance of each document on a scale of 0-10.
Consider how well each document matches the query's intent and content.
Only respond with numbers separated by commas, like: 8.5,7.2,6.8
Documents to rate:
%s`, query, strings.Join(summaries, "\n"))

    resp, err := c.CreateChatCompletion(
        context.Background(),
        openai.ChatCompletionRequest{
            Model: "meta-llama/llama-3.2-3b-instruct:free",
            Messages: []openai.ChatCompletionMessage{
                {
                    Role:    "system",
                    Content: "You are a search result scoring assistant. Only respond with comma-separated numbers.",
                },
                {
                    Role:    "user",
                    Content: prompt,
                },
            },
        },
    )
    // Check the error before touching the response.
    if err != nil {
        return []float64{}, err
    }

    // Parse the comma-separated scores out of the model's reply.
    scores := parseScores(resp.Choices[0].Message.Content)
    return scores, nil
}
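The scores on their own don't do anything until you use them to reorder the results, which is just a sort. The sketch below shows the idea; rankedChunk is a simplified stand-in for the result type Zettelgarden actually returns.

// rankedChunk pairs a vector search result with its rerank score.
type rankedChunk struct {
    Chunk models.CardChunk
    Score float64
}

// applyRerank sorts the vector search results by the scores returned from
// RerankResults, highest first.
func applyRerank(input []models.CardChunk, scores []float64) []rankedChunk {
    ranked := make([]rankedChunk, 0, len(input))
    for i, chunk := range input {
        // Guard against the LLM returning fewer scores than results.
        if i < len(scores) {
            ranked = append(ranked, rankedChunk{Chunk: chunk, Score: scores[i]})
        }
    }
    sort.Slice(ranked, func(i, j int) bool {
        return ranked[i].Score > ranked[j].Score
    })
    return ranked
}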
Here are some example results of what I have been able to achieve. The three screenshots below show three different queries and the different sets of results they return. I should caution that the results are limited to what is already in my database of notes.
I think these results are pretty good, but still not quite there. You can see the results of the chunking (each result includes an italicized chunk of text) and of the reranking (I print out the ranking, maybe just for now).
These results are reasonably related to the search queries (I’m not sure how involved Gil Amelio was in the iPhone, but this is probably due to a lack of data in the database).
Problems
My fairly simple reranking prompt tends to rate results as either highly related or highly unrelated. Looking at the screenshots above, there are a lot of results ranked quite highly that I don't think are really that related. For example, in the "who invented the integrated circuit" query, the first result is rated as most relevant, while the second result is the one that actually answers the question.
Part of the problem is the way Zettelgarden uses references. Here is the actual card. Note that it's not actually about integrated circuits, but references another card that is about them. This card is only marginally related to the query, but because it helpfully lists the title of the referenced card, the LLM thinks it is highly relevant.
I think the value of Zettelgarden will come from solving this problem. How can we use references to inform ranking instead of naively relying on text?
The main tradeoff with vector search is disk space and processing time. Calculating the vectors takes a long time, although network overhead is the real bottleneck: sending each chunk to the LLM requires a separate request, and those requests currently happen sequentially. This is relatively straightforward to solve; I just haven't done it yet. Chunking and embedding only really happen when a card is created or edited, so the cost is mostly frontloaded.
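The obvious fix would be to fan the embedding requests out over a bounded number of goroutines instead of sending them one at a time. A sketch of what that might look like; embedChunk is a hypothetical stand-in for whatever wraps a single embedding request, and none of this exists in Zettelgarden yet.

// embedChunksConcurrently is a sketch of parallelizing the embedding calls.
// maxInFlight bounds how many requests run at once.
func embedChunksConcurrently(chunks []string, maxInFlight int) ([][]float64, error) {
    embeddings := make([][]float64, len(chunks))
    sem := make(chan struct{}, maxInFlight)
    var wg sync.WaitGroup
    var mu sync.Mutex
    var firstErr error

    for i, chunk := range chunks {
        wg.Add(1)
        go func(i int, chunk string) {
            defer wg.Done()
            sem <- struct{}{}        // acquire a slot
            defer func() { <-sem }() // release it

            // embedChunk is a hypothetical wrapper around one embedding request.
            vec, err := embedChunk(chunk)
            if err != nil {
                mu.Lock()
                if firstErr == nil {
                    firstErr = err
                }
                mu.Unlock()
                return
            }
            embeddings[i] = vec // each goroutine writes only to its own index
        }(i, chunk)
    }
    wg.Wait()
    return embeddings, firstErr
}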
In terms of disk space, embeddings have exploded the amount of storage required. Before I added this to Zettelgarden, the database (with around 8000 cards) was maybe 20mb. Now, with a similar number of cards, the database is 400mb. That is a 20x increase in storage requirements. 400mb is still a small amount in absolute terms, but it works out to roughly 0.05mb per card.
Embeddings are also dependent on the embedding model. Embeddings from different models are completely different and impossible to compare, so switching models is a big deal: you need to recalculate all of the embeddings. That incurs a significant cost; the last time I tried it, the recalculation took an hour and a half.
I recalculated all of the embeddings when I changed how chunking worked. I don't think this is something I would need to do every time the chunking changes, but I had switched from a naive approach to a per-sentence approach, which greatly increased the number of chunks (and therefore the number of embeddings). If I only tweak the algorithm, I can still rely on the existing embeddings, even if they are slightly less optimal.
I'm just fooling around and have no users of Zettelgarden except for myself, so this isn't a big deal. But because embeddings from different models can't be compared, search results are meaningless until the recalculation completes, which means this isn't something that is easy to do in production.
Some LLMs are very slow for reranking. I've settled on llama3.2 for now, running off of openrouter.ai, because it's fast, but there is still a noticeable delay while getting results. This wasn't an issue with plain keyword search.
Potential For Future Improvements
I have not tried experimenting with this yet, but I think there are improvements to be made in preprocessing queries. For a query like "integrated circuits and iPhones", I wonder if there would be a benefit in using an LLM to break the query up into "integrated circuits" and "iPhones", running a vector search for each, and then reranking the combined results. There might also be a benefit in having an LLM expand the query to include related terms.
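If I do experiment with this, the preprocessing step could look a lot like the reranking call: hand the raw query to a small model and ask for sub-queries, one per line. A rough sketch, with a placeholder prompt and the same model I use for reranking; I haven't built or tuned this.

// DecomposeQuery is a sketch of query preprocessing: ask a small model to
// split the user's query into sub-queries, one per line, so each can be
// vector searched separately before reranking the combined results.
func DecomposeQuery(c *openai.Client, query string) ([]string, error) {
    prompt := fmt.Sprintf(`Split the search query "%s" into separate sub-queries, one per line.
If the query is already a single topic, return it unchanged.`, query)

    resp, err := c.CreateChatCompletion(
        context.Background(),
        openai.ChatCompletionRequest{
            Model: "meta-llama/llama-3.2-3b-instruct:free",
            Messages: []openai.ChatCompletionMessage{
                {Role: "user", Content: prompt},
            },
        },
    )
    if err != nil {
        return nil, err
    }

    // Split the reply into one sub-query per non-empty line.
    var subQueries []string
    for _, line := range strings.Split(resp.Choices[0].Message.Content, "\n") {
        line = strings.TrimSpace(line)
        if line != "" {
            subQueries = append(subQueries, line)
        }
    }
    return subQueries, nil
}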
The real magic with Zettelgarden will come from making use of the existing "graph structure". By this I mean that Zettelgarden already understands the relationships between cards (some cards have children, parents, children of children, and so on).
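One rough idea is to let those relationships adjust the rerank scores directly: if a result is linked to another card that also scored well, nudge it up. The sketch below assumes a CardPK field on the result and a precomputed map of card links; both are placeholders rather than Zettelgarden's real data structures.

// boostLinkedResults is a rough sketch of using the card graph to inform
// ranking: any result directly linked to another result gets a small boost.
// linkedCards is a placeholder for however card-to-card references
// (parents, children, explicit links) would be loaded from the database.
func boostLinkedResults(ranked []rankedChunk, linkedCards map[int][]int) []rankedChunk {
    const boost = 1.5 // arbitrary bump for being linked to another hit

    // Record which cards appear anywhere in the result set.
    inResults := make(map[int]bool)
    for _, r := range ranked {
        inResults[r.Chunk.CardPK] = true
    }

    // Nudge up any result that is directly linked to another result.
    for i, r := range ranked {
        for _, linked := range linkedCards[r.Chunk.CardPK] {
            if inResults[linked] {
                ranked[i].Score += boost
            }
        }
    }

    sort.Slice(ranked, func(i, j int) bool {
        return ranked[i].Score > ranked[j].Score
    })
    return ranked
}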
As an aside, I looked into how LightRAG works and was a little disappointed. It talks about having a graph structure, but what it really does is precompute which 'entities' it thinks are related to each other and then feed that into the LLM alongside the query. I'm disappointed because that just amounts to a really long prompt; there isn't a lot of magic going on.
Conclusions
RAG is proving to be a powerful addition to Zettelgarden, though it's still early days. The combination of vector search and reranking is already producing interesting results, even if they're not perfect yet. What excites me most is the potential to leverage Zettelgarden's existing graph structure to create something more sophisticated than basic RAG implementations.
The technical challenges - from chunking to storage requirements to embedding model dependencies - are significant but not insurmountable. What's become clear through this experimentation is that the real value won't come from just implementing RAG as others have done, but from finding ways to make it work specifically for a Zettelkasten system.
The goal isn't just to find relevant content, but to surface meaningful connections that can spark new ideas.
If you're interested in trying it out yourself, head over to https://zettelgarden.com. I'd love to hear about your experiences and ideas for how we can make this tool even better.