Finding images with text and image queries with the help of GPT-4 Vision

With the gpt-4-vision-preview model available at OpenAI, it was time to build something with it. I decided to use it as part of a quick solution that can search for images with text, or by providing a similar image.

We will do the following:

  • Describe a collection of images. To generate the description, GPT-4 Vision is used
  • Create a text embedding of the description with the text-embedding-ada-002 model
  • Create an image embedding using the vectorizeImage API, part of Azure AI Computer Vision
  • Save the description and both embeddings to an Azure AI Search index
  • Search for images with either text or a similar image

The end result should be that when I search for desert plant, I get an image of a cactus or similar plant. When I provide a picture of an apple, I should get an apple or other fruit as a result. It’s basically Google image and reverse image search.

Let’s see how it works and if it is easy to do. The code is here: https://github.com/gbaeke/vision. The main code is in a Jupyter notebook in the image_index folder.

A word on vectors and vectorization

When we want to search for images using text or find similar images, we use a technique that involves turning both text and images into a form that a computer can understand easily. This is done by creating vectors. Think of vectors as a list of numbers that describe the important parts of a text or an image.

For text, we use a tool called ‘text-embedding-ada-002’ which changes the words and sentences into a vector. This vector is like a unique fingerprint of the text. For images, we use something like Azure’s multi-modal embedding API, which does the same thing but for pictures. It turns the image into a vector that represents what’s in the picture.

After we have these vectors, we store them in a place where they can be searched. We will use Azure AI Search. When you search, the system looks for the closest matching vectors – it’s like finding the most similar fingerprint, whether it’s from text or an image. This helps the computer to give you the right image when you search with words or find images that are similar to the one you have.

Getting a description from an image

Although Azure has Computer Vision APIs to describe images, GPT-4 with vision can do the same thing. It is more flexible and easier to use because you have the ability to ask for what you want with a prompt.

To provide an image to the model, you can provide a URL or the base64 encoding of an image file. The code below uses the latter approach:

def describe_image(image_file: str) -> str:
    with open(f'{image_file}', 'rb') as f:
        image_base64 = base64.b64encode(f.read()).decode('utf-8')
        print(image_base64[:100] + '...')

    print(f"Describing {image_file}...")

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[
            {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the image in detail"},
                {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{image_base64}",
                },
                },
            ],
            }
        ],
        max_tokens=500,  # default max tokens is low so set higher
    )

    return response.choices[0].message.content

As usual, the OpenAI API is very easy to use. Above, we open and read the image and base64-encode it. The base64 encoded file is provided in the url field. A simple prompt is all you need to get the description. Let’s look at the result for the picture below:

The generated description is below:

The image displays a single, healthy-looking cactus planted in a terracotta-colored pot against a pale pink background. The cactus is elongated, predominantly green with some modest blue hues, and has evenly spaced spines covering its surface. The spines are white or light yellow, quite long, and arranged in rows along the cactus’s ridges. The pot has a classic, cylindrical shape with a slight lip at the top and appears to be a typical pot for houseplants. The overall scene is minimalistic, with a focus on the cactus and the pot due to the plain background, which provides a soft contrast to the vibrant colors of the plant and its container.

Description generated by GPT-4 Vision

Embedding of the image

To create an embedding for the image, I decided to use Azure’s multi-modal embedding API. Take a look at the code below:

def get_image_vector(image_path: str) -> list:
    # Define the URL, headers, and data
    url = "https://AI_ACCOUNT.cognitiveservices.azure.com//computervision/retrieval:vectorizeImage?api-version=2023-02-01-preview&modelVersion=latest"
    headers = {
        "Content-Type": "application/octet-stream",
        "Ocp-Apim-Subscription-Key": os.getenv("AZURE_AI_KEY")
    }

    with open(image_path, 'rb') as image_file:
        # Read the contents of the image file
        image_data = image_file.read()

    print(f"Getting vector for {image_path}...")

    # Send a POST request
    response = requests.post(url, headers=headers, data=image_data)

    # return the vector
    return response.json().get('vector')

The code uses an environment variable to get the key to an Azure AI Services multi-service endpoint. Check the README.md in the repository for a sample .env file.

The API generates a vector with 1024 dimensions. We will need that number when we create the Azure AI Search index.

Note that this API can accept a url or the raw image data (not base64-encoded). Above, we provide the raw image data and set the Content-Type properly.

Generating the data to index

In the next step, we will get all .jpg files from a folder and do the following:

  • create the description
  • create the image vector
  • create the text vector of the description

Check the code below for the details:

# get all *.jpg files in the images folder
image_files = [file for file in os.listdir('./images') if file.endswith('.jpg')]

# describe each image and store filename and description in a list of dicts
descriptions = []
for image_file in image_files:
    try:
        description = describe_image(f"./images/{image_file}")
        image_vector = get_image_vector(f"./images/{image_file}")
        text_vector = get_text_vector(description)
        
        descriptions.append({
            'id': image_file.split('.')[0], # remove file extension
            'fileName': image_file,
            'imageDescription': description,
            'imageVector': image_vector,
            'textVector': text_vector
        })
    except Exception as e:
        print(f"Error describing {image_file}: {e}")

# print the descriptions but only show first 5 numbers in vector
for description in descriptions:
    print(f"{description['fileName']}: {description['imageDescription'][:50]}... {description['imageVector'][:5]}... {description['textVector'][:5]}...")

The important part is the descriptions list, which is a list of JSON objects with fields that match the fields in the Azure AI Search index we will build in the next step.

The text vector is calculated with the get_text_vector function. It uses OpenAI’s text-embedding-ada-002 model.

Building the index

The code below uses the Azure AI Search Python SDK to build and populate the index in code. You.will need an AZURE_AI_SEARCH_KEY environment variable to authenticate to your Azure AI Search instance.

def blog_index(name: str):
    from azure.search.documents.indexes.models import (
        SearchIndex,
        SearchField,
        SearchFieldDataType,
        SimpleField,
        SearchableField,
        VectorSearch,
        VectorSearchProfile,
        HnswAlgorithmConfiguration,
    )

    fields = [
        SimpleField(name="Id", type=SearchFieldDataType.String, key=True), # key
        SearchableField(name="fileName", type=SearchFieldDataType.String),
        SearchableField(name="imageDescription", type=SearchFieldDataType.String),
        SearchField(
            name="imageVector",
            type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
            searchable=True,
            vector_search_dimensions=1024,
            vector_search_profile_name="vector_config"
        ),
        SearchField(
            name="textVector",
            type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
            searchable=True,
            vector_search_dimensions=1536,
            vector_search_profile_name="vector_config"
        ),

    ]

    vector_search = VectorSearch(
        profiles=[VectorSearchProfile(name="vector_config", algorithm_configuration_name="algo_config")],
        algorithms=[HnswAlgorithmConfiguration(name="algo_config")],
    )
    return SearchIndex(name=name, fields=fields, vector_search=vector_search)

#  create the index
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.models import VectorizedQuery

service_endpoint = "https://YOUR_SEARCH_INSTANCE.search.windows.net"
index_name = "image-index"
key = os.getenv("AZURE_AI_SEARCH_KEY")

index_client = SearchIndexClient(service_endpoint, AzureKeyCredential(key))
index = blog_index(index_name)

# create the index
try:
    index_client.create_index(index)
    print("Index created")
except Exception as e:
    print("Index probably already exists", e)

The code above creates an index with some string fields and two vector fields:

  • imageVector: 1024 dimensions (as defined by the Azure AI Computer Vision image embedder)
  • textVector: 1536 dimensions (as defined by the OpenAI embedding model)

Although not specified in the code, the index will use cosine similarity to perform similarity searches. It’s the default. It will return approximate nearest neighbour (ANN) results unless you create a search client that uses exhaustive search. An exhaustive search searches the entire vector space. The queries near the end of this post use the exhaustive setting.

When the index is created, we can upload documents:

# now upload the documents
try:
    search_client = SearchClient(service_endpoint, index_name, AzureKeyCredential(key))
    search_client.upload_documents(descriptions)
    print("Documents uploaded successfully")
except Exception as e:
    print("Error uploading documents", e)

The upload_documents method uploads the documents in the descriptions Python list to the search index. The upload is actually an upsert. You can run this code multiple times without creating duplicate documents in the index.

Search images with text

To search an image with a text description, a vector query on the textVector is used. The function below takes a text query string as input, vectorizes the query, and performs a similarity search returning the first nearest neighbour. The function displays the description and the image in the notebook:

# now search based on text
def single_vector_search(query: str):
    vector_query = VectorizedQuery(vector=get_text_vector(query), k_nearest_neighbors=1, fields="textVector", exhaustive=True)

    results = search_client.search(
        vector_queries=[vector_query],
        select=["fileName", "imageDescription"],
        
    )

    for result in results:
        print(result['fileName'], result["imageDescription"], sep=": ")

        # show the image
        from IPython.display import Image
        display(Image(f"./images/{result['fileName']}"))
    
single_vector_search("desert plant")

The code searches for an image based on the query desert plant. It returns the picture of the cactus shown earlier. Note that if you search for something there is no image for, like blue car, you will still get a result because we always return a nearest neighbor. Even if your nearest neighbor lives 100km away, it’s still your nearest neighbor. 😀

Return similar images

Since our index contains an image vector, we can search for images similar to a vector of a reference image. The function below takes an image file path as input, calculates the vector for that image, and performs a nearest neighbor search. The function displays the description and image of each document returned. In this case, the code returns two similar documents:

def image_search(image_file: str):

    vector_query = VectorizedQuery(vector=get_image_vector(image_file), k_nearest_neighbors=2, fields="imageVector", exhaustive=True)

    results = search_client.search(
        vector_queries=[vector_query],
        select=["fileName", "imageDescription"],
    )

    for result in results:
        print(result['fileName'], result["imageDescription"], sep=": ")

        # show the image
        from IPython.display import Image
        display(Image(f"./images/{result['fileName']}"))
 
# get vector of another image and find closest match
image_search('rotten-apple.jpg')
image_search('flower.jpeg')

At the bottom, the function is called with filenames of pictures that contain a rotten apple and a flower. The result of the first query is a picture of the apple and banana. The result of the second query is the cactus and the rose. You can debate whether the cactus should be in the results. Some cacti have flowers but some don’t. 😀

Conclusion

The GPT-4 Vision API, like most OpenAI APIs, is very easy to use. In this post, we used it to generate image descriptions to build a simple query engine that can search for images via text or a reference image. Together with their text embedding API and Microsoft’s multi-modal embedding API to create an image embedding, it is relatively straightforward to build these type of systems.

As usual, this is a tutorial with quick sample code to illustrate the basic principle. If you need help building these systems in production, simply reach out. Help is around the corner! 😉

Using Integrated Vectorization in Azure AI Search

The vector search capability of Azure AI Search became generally available mid November 2023. With that release, the developer is responsible for creating embeddings and storing them in a vector field in the index.

However, Microsoft also released integrated vectorization in preview. Integrated vectorization is useful in two ways:

  • You can define a vectorizer in the index schema. It can be used to automatically convert a query to a vector. This is useful in the Search Explorer in the portal but can also be used programmatically.
  • You can use an embedding skill for your indexer that automatically vectorizes index fields for you.

First, let’s look at defining a vectorizer in the index definition and using it in the portal for search.

Vector search in the portal

Below is a screenshot of an index with a title and a titleVector field. The index stores information about movies:

Index with a vector field

The integrated vectorizer is defined in the Vector profiles section:

Vector profile

When you add the profile, you configure the algorithm and vectorizer. The vectorizer simply points to an embedding model in Azure OpenAI. For example:

Vectorizer

Note: it’s recommended to use managed identity

Now, from JSON View in Search Explorer, you can perform a vector search. If you see a search field at the top, you can remove that. It’s for full-text search.

Vector search in the portal

Above, the query commencement is converted to a vector by the integrated vectorizer. The vector search comes up with Inception as the first match. I am not sure if you would want to search for movies this way but it proves the point. 😛

Using an embedding skill during indexing

Suppose you have several JSON documents about movies. Below is one example:

{
    "title": "Inception",
    "year": 2010,
    "director": "Christopher Nolan",
    "genre": ["Action", "Adventure", "Sci-Fi"],
    "starring": ["Leonardo DiCaprio", "Joseph Gordon-Levitt", "Ellen Page"],
    "imdb_rating": 8.8
  }

When you have a bunch of these files in Azure Blob Storage, you can use the Import Data wizard to construct an index from these files.

Import Data Wizard

This wizard, at the time of writing, does not create vectors for you. There is another wizard, Import and vectorize data, but it will treat the JSON as any document and store it in a content field. A vector is created from the content field.

We will stick to the first wizard. It will do several things:

  • create a data source to access the JSON documents in an Azure Storage Account container
  • infer the schema from the JSON files
  • propose an index definition that you can alter
  • create an indexer that indexes the documents on the schedule that you set
  • add skills like entity extraction; select a simple skill here like translation so you are sure there will be a skillset that the indexer will use

In the step to customize the index definition, ensure you make fields searchable and retrievable as needed. In addition, define a vector field. In my case, I created a titleVector field:

titleVector

When the wizard is finished, the indexer will run and populate the index. Of course, the titleVector field will be empty because there is no process in place that calculates the vectors during indexing.

Let’s fix that. In Skillsets, go the the skillset created by the wizard and click it.

Skillset created by the wizard

Replace the Skillset JSON definition with the content below and change resourceUri, apiKey and deploymentId as needed. You can also add the embedding skill to the existing array of skills if you want to keep them.

{
  "@odata.context": "https://acs-geba.search.windows.net/$metadata#skillsets/$entity",
  "@odata.etag": "\"0x8DBF01523E9A94D\"",
  "name": "azureblob-skillset",
  "description": "Skillset created from the portal. skillsetName: azureblob-skillset; contentField: title; enrichmentGranularity: document; knowledgeStoreStorageAccount: ;",
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
      "name": "embed",
      "description": null,
      "context": "/document",
      "resourceUri": "https://OPENAI_INSTANCE.openai.azure.com",
      "apiKey": "AZURE_OPENAI_KEY",
      "deploymentId": "EMBEDDING_MODEL",
      "inputs": [
        {
          "name": "text",
          "source": "/document/title"
        }
      ],
      "outputs": [
        {
          "name": "embedding",
          "targetName": "titleVector"
        }
      ],
      "authIdentity": null
    }
  ],
  "cognitiveServices": null,
  "knowledgeStore": null,
  "indexProjections": null,
  "encryptionKey": null
}

Above, we want to embed the title field in our document and create a vector for it. The context is set to /document which means that this skill is executed for each document once.

Now save the skillset. This skill on its own will create the vectors but will not save them in the index. You need to update the indexer to write the vector to a field.

Let’s navigate to the indexer:

Indexer

Click the indexer and go to the Indexer Definition (JSON) tab. Ensure you have an outputFieldMappings section like below:

{
  "@odata.context": "https://acs-geba.search.windows.net/$metadata#indexers/$entity",
  "@odata.etag": "\"0x8DBF01561D9E97F\"",
  "name": "movies-indexer",
  "description": "",
  "dataSourceName": "movies",
  "skillsetName": "azureblob-skillset",
  "targetIndexName": "movies-index",
  "disabled": null,
  "schedule": null,
  "parameters": {
    "batchSize": null,
    "maxFailedItems": 0,
    "maxFailedItemsPerBatch": 0,
    "base64EncodeKeys": null,
    "configuration": {
      "dataToExtract": "contentAndMetadata",
      "parsingMode": "json"
    }
  },
  "fieldMappings": [
    {
      "sourceFieldName": "metadata_storage_path",
      "targetFieldName": "metadata_storage_path",
      "mappingFunction": {
        "name": "base64Encode",
        "parameters": null
      }
    }
  ],
  "outputFieldMappings": [
    {
      "sourceFieldName": "/document/titleVector",
      "targetFieldName": "titleVector"
    }
  ],
  "cache": null,
  "encryptionKey": null
}

Above, we map the titleVector enrichment (think of it as something temporary during indexing) to the real titleVector field in the index.

Reset and run the indexer

Reset the indexer so it will index all documents again:

Resetting the indexer

Next, click the Run button to start the indexing process. When it finishes, do a search with Search Explorer and check that there are vectors in the titleVector field. It’s an array of 1536 floating point numbers.

Conclusion

Integrated vectorization is a welcome extra feature in Azure AI Search. Using it in searches is very easy, especially in the portal.

Using the embedding skill is a bit harder, because you need to work with skillset and indexer definitions in JSON and you have to know exactly what you have to add. But once you get it right, the indexer does all the vectorization work for you.

Improvements in Azure OpenAI Add your data

In a previous post, I talked about the Add your data feature in the Azure OpenAI Chat playground. Recently, there have been some updates to this feature, including vector search. Let’s take a look at the updated experience and focus on vector search.

Starting point

I have some PDF documents in a storage account container. They are PDFs containing job descriptions for a select group of companies. You can use .txt, .md, .html, Microsoft Word files, Microsoft PowerPoint files, or PDFs.

PDFs in a storage account

At the storage account level, CORS settings should be GET from all origins (*). This can also be set from the Add your data wizard in the OpenAI Playground.

CORS settings

In addition to the storage account, you need an OpenAI resource deployed to a region of choice. I have chosen France Central which provides access to gpt-4 and the text-embedding-ada-002 embedding model (a text embedding model is required for vector search). Ensure those models are deployed. For example:

Deployed models in France Central

Running the wizard

In OpenAI Chat Playground, you will find the Add your data (preview) tab. Use the + Add a data source button to start.

There are several sources to start from. Because I already have my files in a storage account, I will select Azure Blob Storage as the source and select the name of the storage account and the container with my files. You can also upload files or use an existing index in Cognitive Search. Whatever the option you choose, you will always end up with a Cognitive Search index that serves relevant content to the chat.

Data source options

As you can see from the above screenshot, in addition to the storage account, you have to select an Azure Cognitive Search instance. It will not be created for you. If you do not have such an instance, either click the link under the Select Azure Cognitive Search resource dropdown or create one yourself and use the refresh icon. I already have such a resource called acs-geba. Use the Basic pricing tier as a minimum. This gives you a vector quota of 1GB.

After selecting the Azure Cognitive Search resource, enter an index name. The documents in the storage account will be added to this index so we can search via this index later. The index will be created for you. I will use oai as the index name and also set a schedule to Hourly to update the index automatically. The schedule can be adjusted afterward in Azure Cognitive Search.

We now have the following in the wizard:

Add your data, data source config

You can now add vector search. This is in addition to keyword and semantic search. To use vector search, you need to specify an embedding model. If you do not have text-embedding-ada-002 deployed in your region, you will not be able to turn on vector search. This feature requires at least the Basic or higher SKU.

Turning on vector search (still in the first page of the wizard)

Above, I called my deployment of the text-embedding-ada-002 model embedding but you can use any name you like. It’s just a deployment name.

Now we can press Next, to be presented with the Data management page:

Data management page

You can find more information about those options here. In most cases, using Vector search alone is sufficient but it depends a bit on your dataset and use-case. I will just use Vector search. When we use Redis, Qdrant, Pinecone, or other vector stores, we also use vectorized search exclusively, which works very well.

After clicking Next, review what will happen and click Save and Close. The data will be added to the index:

Data is being added

Asking questions about the data

When the data has been added to Azure Cognitive Search, you can start asking questions. If you want to limit the chat to only your data, ensure that Limit Responses to your data content is checked.

Ready to go

In the Chat Platground, I selected gpt-4 and asked the following question: “Who are MBarQ and do they need AI translators?”. The answer is as follows:

Asking a question

This answer comes from one of the PDFs containing the job description.

Behind the scenes

For the above interaction to work, the question “Who are MBarQ and do they need AI translators?” is vectorized using the selected embedding model. Let’s call this the query vector. The selected embedding model creates a vector with 1536 dimensions that represent the text within a vector space. The nice thing here is that the embedding of the query is created automatically as part of the extended Azure OpenAI API.

The vectors for your documents are stored in an index that ends with the word chunks. Here’s my index and its defined fields. This is all the result of the wizard. No changes have been made to Azure Cognitive Search manually.

Index used for vector search and its fields

As you can see, there is a field for the contentVector which also notes the number of dimensions. The embedding model we used just happens to output 1536 numbers. Other embedding models use a different number of dimensions. Next to the contentVector, the content field contains the actual text that the vector was created from. That text will later be injected, behind the scenes, in the gpt-4 prompt. But we first have to find these pieces of text!

With the query vector in hand, Add your data searches for pieces of text with vectors that are close to the query vector. Cognitive Search uses cosine similarity to do that but there is no real need to know that. Note that we only use vector search in this scenario. When you do hybrid and/or semantic searches, the query process is different. Also, note that the index with vectors works on chunks of text coming from your documents. This chunking happens transparently in the background when the indexer runs.

Once the top N (usually n=5 but can be adjusted in code) vectors that are closest to the query vector are found, we also have the pieces of text closest to the query (from the content field). The original pieces of text that the vectors were calculated from get added to the query and sent to gpt-4. The prompt sent to gpt-4 could be something like the one below (just an example):

Who are MBarQ and do they need AI translators?

Only answer based on the context below after ---

---

First piece of text (no vectors here, just plain text!!!)

Second piece of text

...

Based on this prompt and hopefully relevant context below the — mark, the model can answer the original question.

Note that the Add your data experience also returns references. In the UI, you can click these to see the source text:

References

Deployment

From the Playground, you can deploy the chat experience to a web app or a Power Virtual Agent bot:

Deployment

At this moment (September 2023) the Power Virtual Agent deployment does not work if your default environment is not in the United States. When you click A new Power Virtual Agent bot…, you should quickly copy the URL and replace the environment ID with another one that is in the United States. Navigate to the modified URL to create the bot.

Deploying to a web app is a bit more straightforward because that is just a web app in Azure. No Power Platform madness here… 😀

Note that if you enable chat history, CosmosDB is used. Here’s the app with chat history visible at the right, similar to chat history in ChatGPT or Bing Chat. This app uses Azure Active Directory (Microsoft Entra ID) for authentication.

Chat in web app

Conclusion

The main addition to Add your data surely is vector search. That capability was already a part of Cognitive Search but the Add your data feature did not use it. When you do use it, a lot of stuff happens in the background automatically:

  • An index that supports vectors is created; if selected the index is automatically updated based on the contents in the storage account container
  • Documents are chunked and vectors are created for each of these chunks based on the selected embeddings model
  • There is no need to vectorize the user’s query yourself, performing a nearest neighbor search and stuffing the gpt prompt with content; everything is handled by the underlying API

It will be interesting to see how it evolves further.

Enhancing Semantic Search with a Streamlit UI

In a previous blog post, we discussed two Python programs, upload_vectors.py and search_vectors.py. These programs were used to create and search vectors, respectively. The upload_vectors.py script created vectors from chunks of a larger text and stored them in Pinecone, while the search_vectors.py script enabled semantic search on the text. In this blog post, we will discuss how to create a user interface (UI) for these two programs using Streamlit.

🚀 I kickstarted the Streamlit app by handing over the text-based version to ChatGPT and asking it to work its magic ✨💻. Yes, it was that easy! Afterwards, I made several manual changes to make it look the way I wanted.

Pinecone, Vectors, Embeddings, and Semantic Search: What’s all that about?

Pinecone is a vector database service that allows for easy storage and retrieval of high-dimensional vectors. It is optimized for similarity search, which makes it a perfect fit for tasks like semantic search. Our script stores vectors in Pinecone by parsing an RSS feed, chunking the blog posts, and creating the vectors with OpenAI’s embedding APIs.

Vectors are mathematical representations of data in the form of an array of numbers. In our case, we use vectors to represent chunks of text retrieved from blog posts. These vectors are generated using a process called embedding, which is a way of representing complex data, like text, in a lower-dimensional space while preserving the essential information.

Semantic search is a type of search that goes beyond keyword matching to understand the meaning and context of the query. By using vector embeddings, we can compare the similarity between queries and stored texts to find the most relevant results. Pinecone does that search for us and simply returns a number of matching chunks (pieces of text).

What is Streamlit?

Streamlit is a Python library that makes it easy to create custom web apps for machine learning and data science projects. You can build interactive UIs with minimal code, allowing you to focus on the core logic of your application.

Here’s an example of creating an extremely simple Streamlit app:

import streamlit as st

st.title('Hello, Streamlit!')
st.write('This is a simple Streamlit app.')

This code would generate a web app with a title and a text output. You can also create more complex UIs with user input, like sliders, text inputs, and buttons.

Creating a Streamlit UI for Semantic Search

Now let’s examine the provided code for creating a Streamlit UI for the search_vectors.py program. The code can be broken down into the following sections:

  1. Import necessary libraries and check environment variables.
  2. Set up the tokenizer and define the tiktoken_len function.
  3. Create the UI elements, including the title, text input, dropdown, sliders, and buttons.
  4. Define the main search functionality that is triggered when the user clicks the “Search” button.

Here is the full code:

import os
import pinecone
import openai
import tiktoken
import streamlit as st

# check environment variables
if os.getenv('PINECONE_API_KEY') is None:
    st.error("PINECONE_API_KEY not set. Please set this environment variable and restart the app.")
if os.getenv('PINECONE_ENVIRONMENT') is None:
    st.error("PINECONE_ENVIRONMENT not set. Please set this environment variable and restart the app.")
if os.getenv('OPENAI_API_KEY') is None:
    st.error("OPENAI_API_KEY not set. Please set this environment variable and restart the app.")

# use cl100k_base tokenizer for gpt-3.5-turbo and gpt-4
tokenizer = tiktoken.get_encoding('cl100k_base')


def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

# create a title for the app
st.title("Search blog feed 🔎")

# create a text input for the user query
your_query = st.text_input("What would you like to know?")
model = st.selectbox("Model", ["gpt-3.5-turbo", "gpt-4"])

with st.expander("Options"):

    max_chunks = 5
    if model == "gpt-4":
        max_chunks = 15

    max_reply_tokens = 1250
    if model == "gpt-4":
        max_reply_tokens = 2000

    col1, col2 = st.columns(2)

    # model dropdown
    with col1:
        chunks = st.slider("Number of chunks", 1, max_chunks, 5)
        temperature = st.slider("Temperature", 0.0, 1.0, 0.0)

    with col2:
        reply_tokens = st.slider("Reply tokens", 750, max_reply_tokens, 750)
    

# create a submit button
if st.button("Search"):
    # get the Pinecone API key and environment
    pinecone_api = os.getenv('PINECONE_API_KEY')
    pinecone_env = os.getenv('PINECONE_ENVIRONMENT')

    pinecone.init(api_key=pinecone_api, environment=pinecone_env)

    # set index
    index = pinecone.Index('blog-index')


    # vectorize your query with openai
    try:
        query_vector = openai.Embedding.create(
            input=your_query,
            model="text-embedding-ada-002"
        )["data"][0]["embedding"]
    except Exception as e:
        st.error(f"Error calling OpenAI Embedding API: {e}")
        st.stop()

    # search for the most similar vector in Pinecone
    search_response = index.query(
        top_k=chunks,
        vector=query_vector,
        include_metadata=True)

    # create a list of urls from search_response['matches']['metadata']['url']
    urls = [item["metadata"]['url'] for item in search_response['matches']]

    # make urls unique
    urls = list(set(urls))

    # create a list of texts from search_response['matches']['metadata']['text']
    chunk_texts = [item["metadata"]['text'] for item in search_response['matches']]

    # combine texts into one string to insert in prompt
    all_chunks = "\n".join(chunk_texts)

    # show urls of the chunks
    with st.expander("URLs", expanded=True):
        for url in urls:
            st.markdown(f"* {url}")
    

    with st.expander("Chunks"):
        for i, t in enumerate(chunk_texts):
            # remove newlines from chunk
            tokens = tiktoken_len(t)
            t = t.replace("\n", " ")
            st.write("Chunk ", i, "(Tokens: ", tokens, ") - ", t[:50] + "...")
    with st.spinner("Summarizing..."):
        try:
            prompt = f"""Answer the following query based on the context below ---: {your_query}
                                                        Do not answer beyond this context!
                                                        ---
                                                        {all_chunks}"""


            # openai chatgpt with article as context
            # chat api is cheaper than gpt: 0.002 / 1000 tokens
            response = openai.ChatCompletion.create(
                model=model,
                messages=[
                    { "role": "system", "content":  "You are a truthful assistant!" },
                    { "role": "user", "content": prompt }
                ],
                temperature=temperature,
                max_tokens=max_reply_tokens
            )

            st.markdown("### Answer:")
            st.write(response.choices[0]['message']['content'])

            with st.expander("More information"):
                st.write("Query: ", your_query)
                st.write("Full Response: ", response)

            with st.expander("Full Prompt"):
                st.write(prompt)

            st.balloons()
        except Exception as e:
            st.error(f"Error with OpenAI Completion: {e}")

A closer look

The code first imports the necessary libraries and checks if the required environment variables are set, displaying an error message if they are not. The libraries you need are in requirements.txt on GitHub. You can install them with:

pip3 install -r requirements.txt

ℹ️ I recommend using a Python virtual environment when you install these dependencies; see poetry (just one example)

The tiktoken_len function calculates the token length of a given text using the tokenizer. This is used to display the tokens of each chunk of text we set to the ChatCompletion API. Depending on the model, 4096 or 8192 tokens are supported.

The UI is built using Streamlit functions, such as st.title, st.text_input, st.selectbox, and st.columns. These functions create various UI elements that the user can interact with to input their query and set search parameters. If you look at the code, you will see how easy it is to add those elements.

With the UI elements, you can set:

  • the number of text chunks to return from Pinecone and to forward to the ChatCompletion API (using st.slider)
  • the number of tokens to reply with (using st.slider)
  • the model: gpt-3.5-turbo or gpt-4 (ensure you have access to the gpt-4 API)
  • the temperature (using st-slider)

The options are shown in two columns with st.columns.

The main search functionality is triggered when the user clicks the “Search” button. The code then vectorizes the query, searches for the most similar vectors in Pinecone, and displays the URLs and chunks found. Finally, the selected model is used to generate an answer based on the chunks found and the user’s query. Often, gpt-4 will provide the best answer. It seems to be able to better understand all the chunks of text thrown at it.

Running the code

To run the code you need the following:

  • A Pinecode API key and environment
  • An OpenAI API key

It is easiest to run the code with Docker. If you have it installed, run the following command:

docker run -p 8501:8501 -e OPENAI_API_KEY="YOURKEY" \
    -e PINECONE_API_KEY="YOURKEY" \
    -e PINECONE_ENVIRONMENT="YOURENV" gbaeke/blogsearch

The gbaeke/blogsearch image is available on Docker Hub. You can also build your own with the Dockerfile provided on GitHub.

After running the image, go to http://localhost:8501 and first use the Upload page to create your Pinecode index and store vectors in it. You can use my blog’s feed or any other feed. You can experiment with the chunk size and chunk overlap.

Upload to Pinecone

You can add multiple RSS feeds one-by-one as long as you turn off Recreate index before each new upload. After you have populated the index, use the Search page to start searching:

Searching

Above, we ask what we can do with Pinecone and let gpt-4 do the answering. The similarity search will search for 5 similar items and return them. We show the original URLs these results come from. In the Chunks section, you can see the original chunks because they are also in Pinecone as metadata. After the answer, you can find the full JSON returned by the ChatCompletion API and the full prompt we sent to that API.

Conclusion

In this blog post, we showed you how to create a Streamlit UI for the search_vectors.py script we talked about in a previous post. Streamlit allows you to easily build interactive web applications for your machine learning and data science projects. We also created a UI to upload posts to Pinecone. The full program allows you to add as much data as you want and query that data with semantic search, summarized and synthesized by the GPT model of choice. Give it a try and let me know what you think.

Enhancing Blog Post Search with Chunk-based Embeddings and Pinecone

In this blog post, we’ll show you a different approach to searching through a large database of blog posts. The previous approach involved creating a single embedding for the entire article and storing it in a vector database. The new approach is much more effective, and in this post, we’ll explain why and how to implement it.

The new approach involves the following steps:

  1. Chunk the article into pieces of about 400 tokens using LangChain
  2. Create an embedding for each chunk
  3. Store each embedding, along with its metadata such as the URL and the original text, in Pinecone
  4. Store the original text in Pinecone, but not indexed
  5. To search the blog posts, find the 5 best matching chunks and add them to the ChatCompletion prompt

We’ll explain each step in more detail below, but first, let’s start with a brief overview of the previous approach.

The previous approach used OpenAI’s embeddings API to vectorize the blog post articles and Pinecone, a vector database, to store and query the vectors. The article was vectorized as a whole, and the resulting vector was stored in Pinecone. To search the blog posts, cosine similarity was used to find the closest matching article, and the contents of the article were retrieved using the Python requests library and the BeautifulSoup library. Finally, a prompt was created for the ChatCompletion API, including the retrieved article.

The problem with this approach was that the entire article was vectorized as one piece. This meant that if the article was long, the vector might not represent the article accurately, as it would be too general. Moreover, if the article was too long, the ChatCompletion API call might fail because too many tokens were used.

The new approach solves these problems by chunking the article into smaller pieces, creating an embedding for each chunk, and storing each embedding in Pinecone. This way, we have a much more accurate representation of the article, as each chunk represents a smaller, more specific part of the article. And because each chunk is smaller, there is less risk of using too many tokens in the ChatCompletion API call.

To implement the new approach, we’ll use LangChain to chunk the article into pieces of about 400 tokens. LangChain is a library aimed at assisting in the development of applications that use LLMs, or large language models.

Next, we’ll create an embedding for each chunk using OpenAI’s embeddings API. As before, we will use the text-embedding-ada-002 model. And once we have the embeddings, we’ll store each one, along with its metadata, in Pinecone. The key for each embedding will be a hash of the URL, combined with the chunk number.

The original text will also be stored in Pinecone, but not indexed, so that it can be retrieved later. With this approach, we do not need to retrieve a blog article from the web. Instead, we just get the text from Pinecone directly.

To search the blog posts, we’ll use cosine similarity to find the 5 best-matching chunks. The 5 best matching chunks will be added to the ChatCompletion prompt, allowing us to ask questions based on the article’s contents.

Uploading the embeddings

The code to upload the embeddings is shown below. You will need to set the following environment variables:

export OPENAI_API_KEY=your_openai_api_key
export PINECONE_API_KEY=your_pinecone_api_key
export PINECONE_ENVIRONMENT=your_pinecone_environment
import feedparser
import os
import pinecone
import openai
import requests
from bs4 import BeautifulSoup
from retrying import retry
from langchain.text_splitter import RecursiveCharacterTextSplitter
import tiktoken
import hashlib

# use cl100k_base tokenizer for gpt-3.5-turbo and gpt-4
tokenizer = tiktoken.get_encoding('cl100k_base')

# create the length function used by the RecursiveCharacterTextSplitter
def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

@retry(wait_exponential_multiplier=1000, wait_exponential_max=10000)
def create_embedding(article):
    # vectorize with OpenAI text-emebdding-ada-002
    embedding = openai.Embedding.create(
        input=article,
        model="text-embedding-ada-002"
    )

    return embedding["data"][0]["embedding"]

# OpenAI API key
openai.api_key = os.getenv('OPENAI_API_KEY')

# get the Pinecone API key and environment
pinecone_api = os.getenv('PINECONE_API_KEY')
pinecone_env = os.getenv('PINECONE_ENVIRONMENT')

pinecone.init(api_key=pinecone_api, environment=pinecone_env)

if "blog-index" not in pinecone.list_indexes():
    print("Index does not exist. Creating...")
    pinecone.create_index("blog-index", 1536, metadata_config= {"indexed": ["url", "chunk-id"]})
else:
    print("Index already exists. Deleting...")
    pinecone.delete_index("blog-index")
    print("Creating new index...")
    pinecone.create_index("blog-index", 1536, metadata_config= {"indexed": ["url", "chunk-id"]})

# set index; must exist
index = pinecone.Index('blog-index')

# URL of the RSS feed to parse
url = 'https://atomic-temporary-16150886.wpcomstaging.com/feed/'

# Parse the RSS feed with feedparser
print("Parsing RSS feed: ", url)
feed = feedparser.parse(url)

# get number of entries in feed
entries = len(feed.entries)
print("Number of entries: ", entries)

# create recursive text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=20,  # number of tokens overlap between chunks
    length_function=tiktoken_len,
    separators=['\n\n', '\n', ' ', '']
)

pinecone_vectors = []
for i, entry in enumerate(feed.entries[:50]):
    # report progress
    print("Create embeddings for entry ", i, " of ", entries, " (", entry.link, ")")

    r = requests.get(entry.link)
    soup = BeautifulSoup(r.text, 'html.parser')
    article = soup.find('div', {'class': 'entry-content'}).text

    # create chunks
    chunks = text_splitter.split_text(article)

    # create md5 hash of entry.link
    url = entry.link
    url_hash = hashlib.md5(url.encode("utf-8"))
    url_hash = url_hash.hexdigest()
        
    # create embeddings for each chunk
    for j, chunk in enumerate(chunks):
        print("\tCreating embedding for chunk ", j, " of ", len(chunks))
        vector = create_embedding(chunk)

        # concatenate hash and j
        hash_j = url_hash + str(j)

        # add vector to pinecone_vectors list
        print("\tAdding vector to pinecone_vectors list for chunk ", j, " of ", len(chunks))
        pinecone_vectors.append((hash_j, vector, {"url": entry.link, "chunk-id": j, "text": chunk}))

        # upsert every 100 vectors
        if len(pinecone_vectors) % 100 == 0:
            print("Upserting batch of 100 vectors...")
            upsert_response = index.upsert(vectors=pinecone_vectors)
            pinecone_vectors = []

# if there are any vectors left, upsert them
if len(pinecone_vectors) > 0:
    print("Upserting remaining vectors...")
    upsert_response = index.upsert(vectors=pinecone_vectors)
    pinecone_vectors = []

print("Vector upload complete.")

Searching for blog posts

The code below is used to search blog posts:

import os
import pinecone
import openai
import tiktoken

# use cl100k_base tokenizer for gpt-3.5-turbo and gpt-4
tokenizer = tiktoken.get_encoding('cl100k_base')


def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

# get the Pinecone API key and environment
pinecone_api = os.getenv('PINECONE_API_KEY')
pinecone_env = os.getenv('PINECONE_ENVIRONMENT')

pinecone.init(api_key=pinecone_api, environment=pinecone_env)

# set index
index = pinecone.Index('blog-index')

while True:
    # set query
    your_query = input("\nWhat would you like to know? ")
    
    # vectorize your query with openai
    try:
        query_vector = openai.Embedding.create(
            input=your_query,
            model="text-embedding-ada-002"
        )["data"][0]["embedding"]
    except Exception as e:
        print("Error calling OpenAI Embedding API: ", e)
        continue

    # search for the most similar vector in Pinecone
    search_response = index.query(
        top_k=5,
        vector=query_vector,
        include_metadata=True)

    # create a list of urls from search_response['matches']['metadata']['url']
    urls = [item["metadata"]['url'] for item in search_response['matches']]

    # make urls unique
    urls = list(set(urls))

    # create a list of texts from search_response['matches']['metadata']['text']
    chunks = [item["metadata"]['text'] for item in search_response['matches']]

    # combine texts into one string to insert in prompt
    all_chunks = "\n".join(chunks)

    # print urls of the chunks
    print("URLs:\n\n", urls)

    # print the text number and first 50 characters of each text
    print("\nChunks:\n")
    for i, t in enumerate(chunks):
        print(f"\nChunk {i}: {t[:50]}...")

    try:
        # openai chatgpt with article as context
        # chat api is cheaper than gpt: 0.002 / 1000 tokens
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[
                { "role": "system", "content":  "You are a thruthful assistant!" },
                { "role": "user", "content": f"""Answer the following query based on the context below ---: {your_query}
                                                    Do not answer beyond this context!
                                                    ---
                                                    {all_chunks}""" }
            ],
            temperature=0,
            max_tokens=750
        )

        print(f"\n{response.choices[0]['message']['content']}")
    except Exception as e:
        print(f"Error with OpenAI Completion: {e}")

In Action

Below, we ask if Redis supports storing vectors and what version of Redis we need in Azure. The Pinecone vector search found 5 chunks, all from the same blog post (there is only one URL). The five chunks are combined and sent to ChatGPT, together with the original question. The response from the ChatCompletion API is clear!

Example question and response

Conclusion

In conclusion, the “chunked” approach to searching through a database of blog posts is much more effective and solves many of the problems associated with the previous approach. We hope you found this post helpful, and we encourage you to try out the new approach in your own projects!

Pinecone and OpenAI magic: A guide to finding your long lost blog posts with vectorized search and ChatGPT

Searching through a large database of blog posts can be a daunting task, especially if there are thousands of articles. However, using vectorized search and cosine similarity, you can quickly query your blog posts and retrieve the most relevant content.

In this blog post, we’ll show you how to query a list of blog posts (from this blog) using a combination of vectorized search with cosine similarity and OpenAI ChatCompletions. We’ll be using OpenAI’s embeddings API to vectorize the blog post articles and Pinecone, a vector database, to store and query the vectors. We’ll also show you how to retrieve the contents of the article, create a prompt using the ChatCompletion API, and return the result to a web page.

ℹ️ Sample code is on GitHub: https://github.com/gbaeke/gpt-vectors

ℹ️ If you want an introduction to embeddings and cosine similarity, watch the video on YouTube by Part Time Larry.

Setting Up Pinecone

Before we can start querying our blog posts, we need to set up Pinecone. Pinecone is a vector database that makes it easy to store and query high-dimensional data. It’s perfect for our use case since we’ll be working with high-dimensional vectors.

ℹ️ Using a vector database is not strictly required. The GitHub repo contains app.py, which uses scikit-learn to create the vectors and perform a cosine similarity search. Many other approaches are possible. Pinecone just makes storing and querying the vectors super easy.

ℹ️ If you want more information about Pinecone and the concept of a vector database, watch this introduction video.

First, we’ll need to create an account with Pinecone and get the API key and environment name. In the Pinecone UI, you will find these as shown below. There will be a Show Key and Copy Key button in the Actions section next to the key.

Key and environment for Pinecone

Once we have an API key and the environment, we can use the Pinecone Python library to create and use indexes. Install the Pinecone library with pip install pinecone-client.

Although you can create a Pinecone index from code, we will create the index in the Pinecone portal. Go to Indexes and select Create Index. Create the index using cosine as metric and 1536 dimensions:

blog-index in Pinecone

The embedding model we will use to create the vectors, text-embedding-ada-002, outputs vectors with 1536 dimensions. For more info see OpenAI’s blog post of December 15, 2022.

To use the Pinecode index from code, look at the snippet below:

import pinecone

pinecone_api = "<your_api_key>"
pinecone_env = "<your_environment>"

pinecone.init(api_key=pinecone_api, environment=pinecone_env)

index = pinecone.Index('blog-index')

We create an instance of the Index class with the name “blog-index” and store this in index. This index will be used to store our blog post vectors or to perform searches on.

Vectorizing Blog Posts with OpenAI’s Embeddings API

Next, we’ll need to vectorize our blog post articles. We’ll be using OpenAI’s embeddings API to do this. The embeddings API takes a piece of text and returns a high-dimensional vector representation of that text. Here’s an example of how to do that for one article or string:

import openai

openai.api_key = "<your_api_key>"

article = "Some text from a blog post"

vector = openai.Embedding.create(
    input=article,
    model="text-embedding-ada-002"
)["data"][0]["embedding"]

We create a vector representation of our blog post article by calling the Embedding class’s create method. We pass in the article text as input and the text-embedding-ada-002 model, which is a pre-trained language model that can generate high-quality embeddings.

Storing Vectors in Pinecone

Once we have the vector representations of our blog post articles, we can store them in Pinecone. Instead of storing vector per vector, we can use upsert to store a list of vectors. The code below uses the feed of this blog to grab the URLs for 50 posts, every post is vectorized and the vector is added to a Python list of tuples, as expected by the upsert method. The list is then added to Pinecone at once. The tuple that Pinecone expects is:

(id, vector, metadata dictionary)

e.g. (0, vector for post 1, {"url": url to post 1}

Here is the code that uploads the first 50 posts of baeke.info to Pinecone. You need to set the Pinecone key and environment and the OpenAI key as environment variables. The code uses feedparser to grab the blog feed, and BeatifulSoup to parse the retrieved HTML. The code serves as an example only. It is not very robust when it comes to error checking etc…

import feedparser
import os
import pinecone
import numpy as np
import openai
import requests
from bs4 import BeautifulSoup

# OpenAI API key
openai.api_key = os.getenv('OPENAI_API_KEY')

# get the Pinecone API key and environment
pinecone_api = os.getenv('PINECONE_API_KEY')
pinecone_env = os.getenv('PINECONE_ENVIRONMENT')

pinecone.init(api_key=pinecone_api, environment=pinecone_env)

# set index; must exist
index = pinecone.Index('blog-index')

# URL of the RSS feed to parse
url = 'https://atomic-temporary-16150886.wpcomstaging.com/feed/'

# Parse the RSS feed with feedparser
feed = feedparser.parse(url)

# get number of entries in feed
entries = len(feed.entries)
print("Number of entries: ", entries)

post_texts = []
pinecone_vectors = []
for i, entry in enumerate(feed.entries[:50]):
    # report progress
    print("Processing entry ", i, " of ", entries)

    r = requests.get(entry.link)
    soup = BeautifulSoup(r.text, 'html.parser')
    article = soup.find('div', {'class': 'entry-content'}).text

    # vectorize with OpenAI text-emebdding-ada-002
    embedding = openai.Embedding.create(
        input=article,
        model="text-embedding-ada-002"
    )

    # print the embedding (length = 1536)
    vector = embedding["data"][0]["embedding"]

    # append tuple to pinecone_vectors list
    pinecone_vectors.append((str(i), vector, {"url": entry.link}))

# all vectors can be upserted to pinecode in one go
upsert_response = index.upsert(vectors=pinecone_vectors)

print("Vector upload complete.")

Querying Vectors with Pinecone

Now that we have stored our blog post vectors in Pinecone, we can start querying them. We’ll use cosine similarity to find the closest matching blog post. Here is some code that does just that:

query_vector = <vector representation of query>  # vector created with OpenAI as well

search_response = index.query(
    top_k=5,
    vector=query_vector,
    include_metadata=True
)

url = get_highest_score_url(search_response['matches'])

def get_highest_score_url(items):
    highest_score_item = max(items, key=lambda item: item["score"])

    if highest_score_item["score"] > 0.8:
        return highest_score_item["metadata"]['url']
    else:
        return ""

We create a vector representation of our query (you don’t see that here but it’s the same code used to vectorize the blog posts) and pass it to the query method of the Pinecone Index class. We set top_k=5 to retrieve the top 5 matching blog posts. We also set include_metadata=True to include the metadata associated with each vector in our response. That way, we also have the URL of the top 5 matching posts.

The query method returns a dictionary that contains a matches key. The matches value is a list of dictionaries, with each dictionary representing a matching blog post. The score key in each dictionary represents the cosine similarity score between the query vector and the blog post vector. We use the get_highest_score_url function to find the blog post with the highest cosine similarity score.

The function contains some code to only return the highest scoring URL if the score is > 0.8. It’s of course up to you to accept lower matching results. There is a potential for the vector query to deliver an article that’s not highly relevant which results in an irrelevant context for the OpenAI ChatCompletion API call we will do later.

Retrieving the Contents of the Blog Post

Once we have the URL of the closest matching blog post, we can retrieve the contents of the article using the Python requests library and the BeautifulSoup library.

import requests
from bs4 import BeautifulSoup

r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

article = soup.find('div', {'class': 'entry-content'}).text

We send a GET request to the URL of the closest matching blog post and retrieve the HTML content. We use the BeautifulSoup library to parse the HTML and extract the contents of the <div> element with the class “entry-content”.

Creating a Prompt for the ChatCompletion API

Now that we have the contents of the blog post, we can create a prompt for the ChatCompletion API. The crucial part here is that our OpenAI query should include the blog post we just retrieved!

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        { "role": "system", "content": "You are a polite assistant" },
        { "role": "user", "content": "Based on the article below, answer the following question: " + your_query +
            "\nAnswer as follows:" +
            "\nHere is the answer directly from the article:" +
            "\nHere is the answer from other sources:" +
             "\n---\n" + article }
           
    ],
    temperature=0,
    max_tokens=200
)

response_text=f"\n{response.choices[0]['message']['content']}"

We use the ChatCompletion API with the gpt-3.5-turbo model to ask our question. This is the same as using ChatGPT on the web with that model. At this point in time, the GPT-4 model was not available yet.

Instead of one prompt, we send a number of dictionaries in a messages list. The first item in the list sets the system message. The second item is the actual user question. We ask to answer the question based on the blog post we stored in the article variable and we provide some instructions on how to answer. We add the contents of the article to our query.

If the article is long, you run the risk of using too many tokens. If that happens, the ChatCompletion call will fail. You can use the tiktoken library to count the tokens and prevent the call to happen in the first place. Or you can catch the exception and tell the user. In the above code, there is no error handling. We only include the core code that’s required.

Returning the Result to a Web Page

If you are running the search code in an HTTP handler as the result of the user typing a query in a web page, you can return the result to the caller:

return jsonify({
    'url': url,
    'response': response_text
})

The full example, including an HTML page and Flask code can be found on GitHub.

The result could look like this:

Query results in the closest URL using vectorized search and ChatGPT answering the question based on the contents the URL points at

Conclusion

Using vectorized search and cosine similarity, we can quickly query a database of blog posts and retrieve the most relevant post. By combining OpenAI’s embeddings API, Pinecone, and the ChatCompletion API, we can create a powerful tool for searching and retrieving blog post content using natural language.

Note that there are some potential issues as well. The code we show is merely a starting point:

  • Limitations of cosine similarity: it does not take into account all properties of the vectors, which can lead to misleading results
  • Prompt engineering: the prompt we use works but there might be prompts that just work better. Experimentation with different prompts is crucial!
  • Embeddings: OpenAI embeddings are trained on a large corpus of text, which may not be representative of the domain-specific language in the posts
  • Performance might not be sufficient if the size of the database grows large. For my blog, that’s not really an issue. 😀