Using Azure Database for PostgreSQL as a vector store

When we build LLM applications, there is always a recurring question: “What vector store will we use?”. In Azure, there are several native solutions. Some of them were discussed in previous posts.

  • Azure Cognitive Search: supports vector search but also hybrid search with semantic reranking as discussed here
  • Azure Redis Cache Enterprise: check my blog post
  • Azure Cosmos DB for MongoDB Core: see Microsoft Learn

In addition to the above, you can of course host your own vector database in a container such as Qdrant or others. You can install these on any service that supports containers such as App Service, Container Instances, Container Apps, or Kubernetes.

Using PostgreSQL

If you are familiar with Azure Database for PostgreSQL flexible servers, you can use it as a vector store, as long as you install the vector extension. This extension can be enabled in all compute tiers. I installed PostgreSQL and set the compute tier to Burstable with size Standard_B1ms (1 vCore, 2GB). This is great for testing and will cost around 20 euros per month with 32GB of storage. For production use, the monthly cost will start from about 150 euros at the lowest General Purpose tier.

PostrgreSQL flexible server with lowest compute tier

After deployment, you need to enable the vector extension. In Server Parameters, search for azure.extensions and select VECTOR from the list. Then click Save.

VECTOR extension added

When done, grab the connection details from the Connect pane:

Connection details

In pgAdmin, register the server with the above details. Connect to the server and create a database. Ensure you configure the firewall settings in Azure to allow your IP address to connect to the server.

Database creation in pgAdmin

Inside the database, go to Extensions and add the vector extension:

vector extension added to the database

Note: in the code below, we will use LangChain. LangChain will try to enable the vector extension if it is not enabled

Note: if you do not want to install pgAdmin, you can create the database from the Azure Portal or use the Azure CLI.

Working with the vector store

Although you can create and query tables that contain vectors with plain SQL, we will use LangChain as a higher-level library that takes care of many of the details for us.

Take a look at the following Python code that creates a few embeddings (vectors) and then uses RAG (retrieval augmented generation) to answer a question with OpenAI’s text-davinci-003 model.

Note: the code is on Github as well

import os
import getpass

# read from .env file
from dotenv import load_dotenv
load_dotenv()

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import PGVector
from langchain.document_loaders import TextLoader
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI


loader = TextLoader("./state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()


pgpassword = os.getenv("PGPASSWORD", "")
if not pgpassword:
    pgpassword = getpass.getpass("Enter pgpassword: ")

CONNECTION_STRING = f"postgresql+psycopg2://pgadmin:{pgpassword}@pg-vec-geba.postgres.database.azure.com:5432/pgvector"

COLLECTION_NAME = "state_of_the_union_test"

# if you run this code more than once, you will duplicated vectors
# no upserts
db = PGVector.from_documents(
    embedding=embeddings,
    documents=docs,
    collection_name=COLLECTION_NAME,
    connection_string=CONNECTION_STRING
)

retriever = db.as_retriever()

query = "What did the president say about Ketanji Brown Jackson"

# LLM will default to text-davinci-003 because we are using a completion endpoint
# versus a chat endpoint
qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", retriever=retriever)

answer = qa.run(query)

print(answer)

The code above requires a .env file with the following content:

OPENAI_API_KEY=OPENAI_API_KEY
PGPASSWORD=PASSWORD_TO_POSTGRES

You will also need the State of the Union text file from here.

Before running the code, install the following packages:

pip install pgvector
pip install openai
pip install psycopg2-binary
pip install tiktoken

The code does the following:

  • Import required libraries: important here is the PGVector import
  • Loads the text file and splits it into chunks: chunking strategy is not too important here; just use chunks of 1000 characters
  • Create an instance of type OpenAIEmbeddings, later used to create a vector per chunk for storage in PostgreSQL; it will also be used when doing queries with a retrieval QA chain (see below); uses text-embedding-ada-002 embedding model
  • Construct the connection string for later use and set a collection name: collections are a way to store vectors together; the collections you create are kept in a table and each vector references the collection
  • Create an instance of PGVector with PGVector.from_documents: this will create/use tables to hold the collection(s) and vectors for you; all chunks will be vectorized and stored in a table; we will take a look at those tables in a moment; in a real application, you would reference existing tables and another process would create/update the vectors
  • Create a retriever (qa) from the PGVector instance for use in a retrieval QA chain
  • Run a query and print the answer: the qa.run (query) line does the n-nearest neighbor vector search in PostgreSQL (via the retriever), creates a meta-prompt with the relevant context, and returns the OpenAI model response in one step

In the PostgreSQL database, the above code creates two tables:

Tables created by LangChain to store the vectors

The collection table contains the collections you create from code. Each collection has a unique ID. The embedding table contains the vectors. Each vector has a unique ID and belongs to a collection. The fields of the embedding table are:

  • uuid: unique ID of the vector
  • collection_id: collection ID referencing the collection table
  • embedding: a field of type vector that stores the embedding (1536 dimensions)
  • document: the chunk of text that was vectorized
  • cmetadata: a JSON field with a link to the source file
  • custom_id: an id that is unique for each run

Note that when you run the sample Python code multiple times, you will have duplicated content. In a real application, you should avoid that. The process that creates and stores the vectors will typically be separate from the process that queries them.

⚠️ Important: Today, LangChain cannot search over all vectors in all collections. You always need to specify the collection to search. If you do need to search over all vectors, you can use SQL statements instead.

The search has the following properties:

  • Distance strategy: cosine similarity; the pgvector extension also supports L2 distance and inner product; the code above uses the text-embedding-ada-002 embeddings model by default; with that model, you should use cosine similarity; LangChain uses cosine similarity as the default for PGVector so that’s a match! 👏
  • Exact nearest neighbor search: although this provides perfect recall, it can get slow when there are many vectors because the entire table is scanned; the extension supports the creation of indexes to perform an approximate nearest neighbor search using IVFFLat or HNSW; see pgvector on GitHub for more details and also this article from Crunchy Data.

Note: most other vector databases use HNSW as the index type (e.g., Azure Cognitive Search, Qdrant, …); unlike IVFFLat you can create this index without having any vectors in your database table; at the time of writing (end of September 2023), the version of the vector extension on Azure was 0.4.1 and did not support HNSW; HNSW requires version 0.5.0 or higher

Conclusion

Azure Database for PostgreSQL with the vector extension is an interesting alternative to other vector database solutions in Azure. This is especially the case when PostgreSQL is your database of choice! In this post, we have shown how LangChain supports it with a simple example. If you do not use LangChain or other libraries, you can simply use SQL statements to create and search indexes as documented here.

The drawback of using PostgreSQL is that you need to know a bit more about exact and approximate nearest neighbor searches and the different index mechanisms. That’s actually a good thing if you want to create production applications with good performance. For a simple POC with not a lot of data, you can skip all of this and perform exact searches.

Besides the free tier of Azure Cognitive Search, the configuration above is the service with the lowest cost for POCs that need vector search. On top of that, the cheapest PostgreSQL option has more storage than Cognitive Search’s free tier (32GB vs. 50MB). Adding more storage is easy and relatively cheap as well. Give it a go and tell me what you think!

Querying Postgres with GraphQL

I wanted a quick and easy way to build an API that retrieves the ten latest events from a stream of data sent to a TimescaleDB hypertable. Since such a table can be queried by any means supported by Postgres, I decided to use Postgraphile, which automatically provides a GraphQL server for a database.

If you have Node.js installed, just run the following command:

npm install -g postgraphile

Then run the following command to start the GraphQL server:

postgraphile -c "postgres://USER@SERVER:PASSWORD@SERVER.postgres.database.azure.com/DATABASE?ssl=1" --simple-collections only --enhance-graphiql

Indeed, I am using Azure Database for PostgreSQL. Replace the strings in UPPERCASE with your values. I used simple-collections only to, eh, only use simple collections which makes it, well, simpler. 👏👏👏

Note: the maintainer of Postgraphile provided a link to what simple-collections actually does; take a look there for a more thorough explanation 😉

The result of the above command looks like the screenshot below:

GraphQL Server started

You can now navigate to http://localhost:5000/graphiql to try some GraphQL queries in an interactive environment:

GraphiQL, enhanced with the –enhance-graphiql flag when we started the server

In the Explorer to the left, you can easily click the query together. In this case, that is easy to do since I only want to query a single table an obtain the last ten events for a single device. The resulting query looks like so:

{
allConditionsList(condition: {device: "pg-1"}, orderBy: TIME_DESC, first: 10) {
time
device
temperature
}
}

allConditionsList gets created by the GraphQL server by looking at the tables of the database. Indeed, my database contains a conditions table with time, device, temperature and humidity columns.

To finish off, let’s try to obtain the data with a regular POST method to http://localhost:5000/graphql. This is the command to use:

curl -X POST -H “Content-Type: application/json” -d ‘{“query”:”{\n allConditionsList(condition: {device: \”pg-1\”}, orderBy: TIME_DESC, first: 10) {\n time\n device\n temperature\n }\n}\n”,”variables”:null}’ http://localhost:5000/graphql

Ugly but it works. To be honest, there is some noise in the above command because of the \n escapes. They are the result of me grabbing the body from the network traffic sent by GraphiQL:

Yes, lazy me grabbing the request payload from GraphiQL and not cleaning it up 😉

There is much, much, much more you can do with GraphQL in general and PostGraphile in particular but this was all I needed for now. Hopefully this can help you if you have to throw something together quickly. In a production setting, there is of course much more to think about: hosting the API (preferably in a container), authentication, authorization, performance, etc…

Further improvements to the IoT Hub to TimescaleDB Azure Function

In the post Improving an Azure Function that writes IoT Hub data to TimescaleDB, we added some improvements to an Azure Function that uses the Event Hub trigger to write messages from IoT Hub to TimescaleDB:

  • use of the Event Hub enqueuedTime timestamp instead of NOW() in the INSERT statement (yes, I know, using NOW() did not make sense 😉)
  • make the code idempotent to handle duplicates (basically do nothing when a unique constraint is violated)

In general, I prefer to use application time (time at the event publisher) versus the time the message was enqueued. If you don’t have that timestamp, enqueuedTime is the next best thing.

How can we optimize the function even further? Read on about the cardinality setting!

Event Hub trigger cardinality setting

Our JavaScript Azure Function has its settings in function.json. For reference, here is its content:

{
"bindings": [
{
"type": "eventHubTrigger",
"name": "IoTHubMessages",
"direction": "in",
"eventHubName": "hub-pg",
"connection": "EH",
"cardinality": "one",
"consumerGroup": "pg"
}
]
}

Clearly, the function uses the eventHubTrigger for an Event Hub called hub-pg. In connection, EH refers to an Application Setting which contains the connections string to the Event Hub. Yes, I excel at naming stuff! The Event Hub has defined a consumer group called pg that we are using in this function.

The cardinality setting is currently set to “one”, which means that the function can only process one message at a time. As a best practice, you should use a cardinality of “many” in order to process batches of messages. A setting of “many” is the default.

To make the required change, modify function.json and set cardinality to “many”. You will also have to modify the Azure Function to process a batch of messages versus only one:

Processing batches of messages

With cardinality set to many, the IoTHubMessages parameter of the function is now an array. To retrieve the enqueuedTime from the messages, grab it from the enqueuedTimeUtcArray array using the index of the current message. Notice I also switched to JavaScript template literals to make the query a bit more readable.

The number of messages in a batch is controlled by maxBatchSize in host.json. By default, it is set to 64. Another setting,prefetchCount, determines how many messages are retrieved and cached before being sent to your function. When you change maxBatchSize, it is recommended to set prefetchCount to twice the maxBatchSize setting. For instance:

{
"version": "2.0",
"extensions": {
"eventHubs": {
"batchCheckpointFrequency": 1,
"eventProcessorOptions": {
"maxBatchSize": 128,
"prefetchCount": 256
}
}
}
}

It’s great to have these options but how should you set them? As always, the answer is in this book:

Afbeeldingsresultaat voor it depends joke

A great resource to get a feel for what these settings do is this article. It also comes with a Power BI report that allows you to set the parameters to see the results of load tests.

Conclusion

In this post, we used the function.json cardinality setting of “many” to process a batch of messages per function call. By default, Azure Functions will use batches of 64 messages without prefetching. With the host.json settings of maxBatchSize and prefetchCount, that can be changed to better handle your scenario.