Finally, it's here! Firestore's Full Text Search (Vector Search)

Has the moment finally arrived? The ability to perform full-text searches directly within Firestore—something that previously required the use of third-party services—is now a reality!

Assumption
Overview of Steps
Creating a Google Cloud Function
Configuration of Google’s Vertex AI
Calculation of Vectors for Vector Search
Creation of Index for Vector Search
Execution Code for Vector Search in Firestore
Conclusion

Assumption

This article introduces the implementation of full-text search using vector search in Firestore. Note that the feature discussed here is currently in preview, so the code provided may stop working once the feature is officially released. For the technical specifics of vector search, refer to Google’s official documentation.
Official documentation: https://firebase.google.com/docs/firestore/vector-search

The function and Vertex AI used in this process will incur costs. Ensure you check the pricing and understand the billing implications before proceeding.
Vertex AI pricing: https://cloud.google.com/vertex-ai/generative-ai/pricing?hl=en

Overview of Steps

A separate article will cover vector search itself in detail; this guide focuses on using
Google’s Vertex AI to calculate the vectors (embeddings) that vector search requires.
At the time of writing, vector searches can only be run from Python or JavaScript (Node.js),
so a Google Cloud Function is used to execute them.

The necessary preparations include:

Creating a Google Cloud Function
Configuring Vertex AI
Acquiring Vectors
Creating an Index
Executing Vector Search in Firestore

Each of these steps is elaborated upon below.

Creating a Google Cloud Function

The environment for the Cloud Functions created for this tutorial is as follows:

2nd gen
HTTPS trigger
Python 3.12

It is crucial to set the following runtime environment variable:

  name: GOOGLE_CLOUD_PROJECT
  value: <project ID> (Note that this is not the project name)

We are using the 2nd generation (2nd gen) because it creates Cloud Run upon GCF creation, which is necessary for linking with Vertex AI.
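
Both functions in this article read the project ID from this variable. The following is a minimal sketch of checking it up front so that a missing value produces a clear message; `require_project_id` is a hypothetical helper name for illustration, not part of any Google library.

```python
import os


def require_project_id(env=None) -> str:
    """Return the project ID from GOOGLE_CLOUD_PROJECT or raise a clear error.

    Hypothetical helper for illustration only.
    """
    # Allow a plain dict to be passed in so the check is easy to try locally.
    env = os.environ if env is None else env
    project_id = env.get("GOOGLE_CLOUD_PROJECT", "").strip()
    if not project_id:
        raise RuntimeError(
            "GOOGLE_CLOUD_PROJECT is not set; add it as a runtime "
            "environment variable with the project ID (not the project name)."
        )
    return project_id
```

In the function code, `MY_PROJECT_ID = require_project_id()` could then replace the plain `os.environ.get(...)` lookup.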

Configuration of Google’s Vertex AI

Link Vertex AI to Cloud Run of the Function.
(Reference official: https://cloud.google.com/run/docs/integrate/vertex-ai?authuser=3&hl=en)

Here are the steps for your reference:

Click the link under “Powered by Cloud Run” on the right side and move to Cloud Run.
Click the “Integrations” tab.
Click “Add integration.”
Click “Vertex AI – Generative AI,” set an arbitrary name, and click “submit.”
Note that the name must follow certain rules, or it will result in an error. If you have no specific preference, the default value is fine.
Approve if additional permissions are requested.

Calculation of Vectors for Vector Search

This time, since I am operating with owner permissions, no additional permissions have been set. However, during app development, it may be necessary to assign permissions for Vertex AI and Firestore to the account executing the Function.

Below are the dependencies (requirements.txt) followed by the code (main.py) for calculating vectors using Vertex AI and storing data in Firestore. (Reference official: https://firebase.google.com/docs/firestore/vector-search)

functions-framework==3.*
google-cloud-firestore
google-cloud-aiplatform

import functions_framework
import os
# firestore
from google.cloud import firestore
from google.cloud.firestore_v1.vector import Vector
# Vertex AI
import vertexai
from vertexai.language_models import TextEmbeddingModel


# Project ID (obtained from the environment)
MY_PROJECT_ID = os.environ.get("GOOGLE_CLOUD_PROJECT")

# Calculate the embedding vector for the given text
def text_embedding(text: str) -> Vector:

    # Set your own location
    vertexai.init(project=MY_PROJECT_ID, location="asia-northeast1")

    # Embedding model used for vector calculation: "textembedding-gecko@003"
    model = TextEmbeddingModel.from_pretrained("textembedding-gecko@003")
    # A single input text yields a single embedding
    embeddings = model.get_embeddings([text])

    # Wrap the raw float list in Firestore's Vector type
    return Vector(embeddings[0].values)


# Main process
# (The function name is arbitrary)
@functions_framework.http
def hello_http(request):

    # Retrieve the article summary (description) from the request
    request_json = request.get_json(silent=True)
    request_args = request.args

    if request_json and 'description' in request_json:
        description = request_json['description']
    elif request_args and 'description' in request_args:
        description = request_args['description']
    else:
        description = 'World'

    # Initialize Firestore client
    firestore_client = firestore.Client(project=MY_PROJECT_ID)
    # Reference to a collection (the collection name is arbitrary)
    # Firestore creates the collection on the first write if it does not exist
    collection = firestore_client.collection("article_collection")

    # Calculate embedding
    embedding_vector = text_embedding(description)

    # Prepare the document to add to Firestore
    doc = {
        "description": description,
        "embedding_field": embedding_vector
    }
    # Add the document
    collection.add(doc)

    return 'OK!'
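
The parameter lookup in the function above (JSON body first, then query string, then a default) is repeated in the search function later in this article. A minimal sketch of factoring it out, assuming a hypothetical helper name `extract_param`:

```python
def extract_param(json_body, query_args, key, default="World"):
    """Return `key` from the JSON body, else the query string, else `default`.

    Hypothetical helper for illustration; mirrors the if/elif/else
    chain used in the function above.
    """
    if json_body and key in json_body:
        return json_body[key]
    if query_args and key in query_args:
        return query_args[key]
    return default
```

Inside the function it would be called as `extract_param(request.get_json(silent=True), request.args, "description")`.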

For simplicity, I executed the CLI test command from the terminal to verify its operation.

curl -m 70 -X POST https://asia-northeast1-python-tool-001.cloudfunctions.net/vector_chenge \
  -H "Authorization: bearer $(gcloud auth print-identity-token)" \
  -H "Content-Type: application/json" \
  -d '{ "description": "<Any string>"}'
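
The same request can also be built from Python. This is a minimal sketch using only the standard library; the URL and token below are placeholders, `build_request` is a hypothetical helper name, and the identity token is assumed to be the output of `gcloud auth print-identity-token`.

```python
import json
import urllib.request


def build_request(url: str, identity_token: str, description: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that the curl command issues."""
    body = json.dumps({"description": description}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": f"Bearer {identity_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values: substitute your own function URL and identity token.
req = build_request(
    "https://example.com/your-function",
    "YOUR_IDENTITY_TOKEN",
    "<Any string>",
)
# To actually send it (70-second timeout, like curl -m 70):
# urllib.request.urlopen(req, timeout=70)
```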

If the execution is successful, a new document containing the description and embedding_field fields is stored in article_collection.

Creation of Index for Vector Search

Vector search requires a vector index on the field that stores the embeddings. This time, I ran the following gcloud command to create one.
(Reference official: https://firebase.google.com/docs/firestore/vector-search)

gcloud alpha firestore indexes composite create \
  --collection-group=article_collection \
  --query-scope=COLLECTION \
  --field-config field-path=embedding_field,vector-config='{"dimension":"768", "flat": "{}"}' \
  --database=<DatabaseID>

collection-group: The name of the collection for which to create the index
query-scope: The scope of the index. COLLECTION indexes only the named collection, while COLLECTION_GROUP covers every collection with that ID, wherever it sits in the database.
field-path: The name of the field where vectors are stored
vector-config: dimension sets the number of dimensions of the stored vectors (768 this time); flat selects the flat index type
database: Specify the ID of the target database. This specification is not necessary if it’s the default.
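
To show how these parameters fit together, here is a minimal sketch that assembles the same command as an argument list in Python. It only builds and prints the command and does not run it; `build_index_command` is a hypothetical helper name, and the flags are copied from the command above rather than from a broader reading of the gcloud reference.

```python
import json
import shlex


def build_index_command(collection: str, field: str, dimension: int, database: str = "") -> list:
    """Assemble the argument list for the index-creation command shown above."""
    # Same JSON shape as the vector-config value used in this article.
    vector_config = json.dumps({"dimension": str(dimension), "flat": "{}"})
    args = [
        "gcloud", "alpha", "firestore", "indexes", "composite", "create",
        f"--collection-group={collection}",
        "--query-scope=COLLECTION",
        "--field-config",
        f"field-path={field},vector-config={vector_config}",
    ]
    # The database flag is only needed for a non-default database.
    if database:
        args.append(f"--database={database}")
    return args


command = build_index_command("article_collection", "embedding_field", 768)
# shlex.join quotes the JSON so the line can be pasted into a shell.
print(shlex.join(command))
```

Keeping the dimension as a single argument makes it harder for the index and the embedding model to drift apart.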

After the command completes, the new index appears in Firestore.

Execution Code for Vector Search in Firestore

Now that the data preparation is complete, let’s proceed with the actual search.

For this demonstration, I prepared summaries of my blog articles as search target data. Due to length, some parts are omitted here.

The search target data is as follows:

No.	Title
1	What to do when freezed.dart is not created When designing an immutable class using freezed, you may get a terminal “…
2	What is Flutter’s pubspec.yaml? What it means and how to write it! YAML stands for YAML Ain’t Markup Language, a concise representation of data…
3	What is MVVM, which we often hear about in app development? MVVM (Model-View-ViewModel) is the combination of an app’s logic and UI (user interface)…
4	What is Flutter? An overview of Flutter What is Flutter? Flutter” is “useful for developing mobile apps…
5	What is Riverpod, Flutter’s most major state management introduction! StatefulWidget, which we introduced before, is one of the functions that performs state management…

The executed code is as follows:

import functions_framework
import os
# firestore
from google.cloud import firestore
from google.cloud.firestore_v1.vector import Vector
from google.cloud.firestore_v1.base_vector_query import DistanceMeasure
# Vertex AI
import vertexai
from vertexai.language_models import TextEmbeddingModel


# Project ID (obtained from the environment)
MY_PROJECT_ID = os.environ.get("GOOGLE_CLOUD_PROJECT")

# Calculate the embedding vector for the given text
def text_embedding(text: str) -> Vector:

    # Set your own location
    vertexai.init(project=MY_PROJECT_ID, location="asia-northeast1")

    # Embedding model used for vector calculation: "textembedding-gecko@003"
    model = TextEmbeddingModel.from_pretrained("textembedding-gecko@003")
    # A single input text yields a single embedding
    embeddings = model.get_embeddings([text])

    # Wrap the raw float list in Firestore's Vector type
    return Vector(embeddings[0].values)


# Main process
# (The function name is arbitrary)
@functions_framework.http
def hello_http(request):

    # Retrieve the target text from the request
    request_json = request.get_json(silent=True)
    request_args = request.args

    if request_json and 'target' in request_json:
        target = request_json['target']
    elif request_args and 'target' in request_args:
        target = request_args['target']
    else:
        target = 'World'

    # Initialize Firestore client
    firestore_client = firestore.Client(project=MY_PROJECT_ID)
    # Reference to the collection
    collection = firestore_client.collection("article_collection")

    # Calculate embedding
    embedding_vector = text_embedding(target)

    # Conduct vector search
    docs = collection.find_nearest(
        vector_field="embedding_field",
        query_vector=embedding_vector,
        distance_measure=DistanceMeasure.COSINE,
        limit=3
    ).get()

    # For output in table format (here, string format)
    output = "Description \n"
    output += "-" * 50 + "\n"

    # Output the contents of documents obtained from vector search
    for doc in docs:
        doc_data = doc.to_dict()
        description = doc_data.get("description", "No description")
        # Add the first 100 characters of the description to the string
        output += f"{description[:100]} \n"
    
    return output

Similarly, I executed the CLI test command from the terminal to verify its operation. Let’s try searching for “About Riverpod”!

curl -m 70 -X POST https://asia-northeast1-python-tool-001.cloudfunctions.net/vector_search \
-H "Authorization: bearer $(gcloud auth print-identity-token)" \
-H "Content-Type: application/json" \
-d '{
  "target": "About Riverpod"
}'

Execution Results

Description 
--------------------------------------------------
What is Riverpod, Flutter's most major state management introduction! StatefulWidget, which we introduced before, is one of the functions that performs state management…
What is Flutter? An overview of Flutter What is Flutter? Flutter" is "useful for developing mobile apps…
What is MVVM, which we often hear about in app development? MVVM (Model-View-ViewModel) is the combination of an app's logic and UI (user interface)…

The Riverpod article ranked first, as expected! With such a small dataset, it’s hard to tell whether the second and third results are genuinely related to the query.
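
Part of the reason is that the query asks for the three nearest documents, so three results come back no matter how far away they are. The following is a rough sketch of ranking by cosine distance in plain Python. It illustrates the idea behind DistanceMeasure.COSINE, not how Firestore implements it, and the 3-dimensional vectors are invented toy values rather than real embeddings.

```python
import math


def cosine_distance(a, b):
    """Cosine distance: 1 - cosine similarity (0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest(query, docs, limit=3):
    """Return the names of the `limit` documents closest to `query`."""
    ranked = sorted(docs, key=lambda name: cosine_distance(query, docs[name]))
    return ranked[:limit]


# Toy vectors invented for illustration (real embeddings here have 768 dimensions).
toy_docs = {
    "riverpod": [0.9, 0.1, 0.0],
    "flutter": [0.6, 0.4, 0.1],
    "mvvm": [0.3, 0.3, 0.8],
    "pubspec": [0.0, 0.9, 0.2],
}
print(nearest([1.0, 0.0, 0.0], toy_docs))
```

With these toy values, the third result is already fairly distant from the query but is still returned, because only the count is limited.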

Conclusion

Testing with a larger dataset might give a better picture of search accuracy. As for data size, at 4 bytes per float a 768-dimension vector comes to approximately 3KB. Given that the document limit is 1MB, this may feel slightly large.
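
The size estimate can be checked with a few lines of arithmetic. This sketch takes the 4 bytes per float figure from the text above as an assumption and treats the 1MB limit as 1 MiB.

```python
BYTES_PER_FLOAT = 4            # assumption taken from the text above
DIMENSIONS = 768               # vector dimension used in this article
DOC_LIMIT_BYTES = 1024 * 1024  # the 1MB document limit, treated as 1 MiB

vector_bytes = BYTES_PER_FLOAT * DIMENSIONS
share = vector_bytes / DOC_LIMIT_BYTES

print(f"{vector_bytes} bytes = {vector_bytes / 1024:.0f} KiB")
print(f"{share:.2%} of the document limit")
```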

The ability to perform full-text search in Firestore, even in its preview version, is certainly good news. Looking forward to future developments!