Using Vector Databases For Content Lookup

What are vector databases, which database provider is my favorite, and how do I use it to power my chatbot?

I wanted to share my process for building out the chatbot I've deployed at https://chat.leoguinan.ai, so this post is the first in a series I'm doing on its architecture.

In this post, I wanted to share some thoughts on vector databases.

In case you aren't familiar with vector databases, they're great for storing embeddings, which are the foundation of Large Language Models. Basically, any data can be converted into a high-dimensional vector that compresses all sorts of information about that data into a list of numbers. OpenAI's embedding model, for example, represents a piece of text as a 1536-dimension vector, and once text lives in that space, all sorts of really cool things become possible.
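
To make that concrete, here's a minimal sketch of turning text into an embedding with OpenAI's API. The model name and input text are just illustrative, not the exact setup behind my chatbot.

```python
# Minimal sketch: turn a piece of text into an embedding vector.
# Assumes the `openai` package and an OPENAI_API_KEY in the environment;
# text-embedding-ada-002 is illustrative and returns 1536 dimensions.
from openai import OpenAI

client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-ada-002",
    input="Vector databases make semantic search over your content easy.",
)

embedding = response.data[0].embedding  # a plain list of 1536 floats
print(len(embedding))  # 1536
```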

But my favorite use case is search. It becomes super simple to search by semantic meaning instead of just doing full-text matching. And because similarity boils down to simple vector math (like cosine distance), it's super fast to search across a large embedding space and find the closest matches.
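
If you're curious what that math looks like, here's a rough sketch of cosine similarity between two embeddings, one common way to score how close they are. A vector database runs this kind of comparison (usually an approximate-nearest-neighbor version of it) across the whole collection for you.

```python
# Rough sketch: cosine similarity between two embedding vectors.
# Scores closer to 1.0 mean the underlying texts are more similar in meaning.
import numpy as np

def cosine_similarity(a, b):
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```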

I talked about this more in this video:

This is how I'm searching across my content library with my chatbot. I've turned my newsletter, blog posts, podcasts, and YouTube videos into embeddings; the chatbot then runs searches across them, finds the best matches, and displays that content.

A vector database holds those embeddings so I can easily search them, and my personal favorite is Chroma.
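
Here's a minimal sketch of what using Chroma looks like. The collection name and documents are placeholders rather than my real content, and this leans on Chroma's built-in embedding function instead of OpenAI's.

```python
# Minimal sketch of storing and searching content with Chroma.
# Assumes `pip install chromadb`; names and documents are placeholders.
import chromadb

client = chromadb.Client()  # in-memory client, fine for a quick test

collection = client.get_or_create_collection(name="blog_posts")

# With no embeddings supplied, Chroma embeds the documents itself
# using its default embedding function.
collection.add(
    ids=["post-1", "post-2"],
    documents=[
        "How I built a chatbot on top of my content library.",
        "Why I prefer open-source vector databases.",
    ],
)

# Search by meaning, not keywords: the query is embedded and compared
# against the stored vectors.
results = collection.query(query_texts=["building a chatbot"], n_results=2)
print(results["documents"][0])
```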

I like Chroma for a few main reasons:

  1. It's open-source. When given the option, I tend to pick the open-source product over the closed-source.
  2. It can be run in-memory, persisted to disk, or run in client/server mode (see the sketch after this list). That may sound like more options than anyone needs, but with how fast the AI ecosystem changes, I've found myself re-running the same processes over and over. These options let me quickly test something in memory or on my local filesystem, make sure I've got things right, and only then re-run the embeddings process against my production server.
  3. I've chatted with the team, and I like where they are going with things. They see the world in similar ways to me, and I really like some of the research they've been putting out. Here's a really interesting paper they recently released on chunking strategies for retrieval.
  4. They aren't Pinecone. (Pinecone really annoyed me: they offered me $100 in serverless credits, so I signed up and started loading things into their serverless indexes, and then the bills showed that only $2 of credits had been applied because the credits covered just one specific piece of the fees. Really left a bad taste in my mouth.)
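
Here's the sketch I mentioned in point 2: the same code can point at an in-memory instance, a local directory, or a remote server just by swapping the client. The path, host, and port below are illustrative, not my actual setup.

```python
# Three ways to run Chroma, depending on the stage of the workflow.
import chromadb

# 1. In-memory: throwaway experiments, nothing persisted.
quick_test = chromadb.Client()

# 2. Persisted to disk: local runs that survive restarts.
local = chromadb.PersistentClient(path="./chroma_data")

# 3. Client/server: a production Chroma server (host/port are placeholders).
production = chromadb.HttpClient(host="localhost", port=8000)
```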

Typically, I like to split the content types into separate collections. So I've got individual collections for my Substack, my YouTube channel, and this blog. I'm also planning to add another collection for external sources, such as podcast interviews I've done, but that's a future enhancement.

That way, I can find the best matches across different sources and compile the results by taking the top matches that clear a given threshold from each collection, then ranking them together. This keeps variety in the content types and gives users a choice based on their preferences, not just on whichever format I happen to have made the most of.
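
Here's a rough sketch of that flow: query one collection per source, keep the matches that clear a distance threshold, and rank everything together. The collection names and threshold value are illustrative, and note that Chroma returns distances, so smaller means closer.

```python
# Sketch: search every content-type collection, keep good-enough matches,
# and rank them together across sources. Names and threshold are illustrative.
import chromadb

client = chromadb.PersistentClient(path="./chroma_data")

SOURCES = ["substack", "youtube", "blog"]
THRESHOLD = 0.35  # hypothetical cutoff for a "good enough" match

def search_all_sources(query: str, per_source: int = 3):
    matches = []
    for source in SOURCES:
        collection = client.get_or_create_collection(name=source)
        results = collection.query(query_texts=[query], n_results=per_source)
        for doc, dist in zip(results["documents"][0], results["distances"][0]):
            if dist <= THRESHOLD:
                matches.append({"source": source, "document": doc, "distance": dist})
    # Smaller distance = closer match; ranking across all sources means the
    # best content wins regardless of format.
    return sorted(matches, key=lambda m: m["distance"])
```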