It's finally here! Full-text search (vector search) in Firestore

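Until now, full-text search in Firestore was not possible without a third-party service. It can finally be done inside Firestore itself!

The search is implemented as a vector search, and this post walks through the steps to run it.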

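Prerequisites
Overview of the steps
Creating the Google Cloud Function
Setting up Google's Vertex AI
Computing the vectors used for vector search
Creating the index for vector search
Code for running a vector search in Firestore
Results
Closing thoughts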

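Prerequisites

The feature introduced here is a preview release.
The code below may stop working once the feature reaches general availability.
Reference) Search with vector embeddings: https://firebase.google.com/docs/firestore/vector-search

The Cloud Function and Vertex AI used here incur usage charges.
Check the pricing before you start.
Reference) Vertex AI: https://cloud.google.com/vertex-ai/generative-ai/pricing?hl=ko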

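Overview of the steps

I plan to write a separate article about vector search itself.
In this post, the vectors used for the search are computed with Google's 'Vertex AI'.
Also, vector search can currently be run only from Python or JavaScript (Node.js),
so the search is executed from a Google Cloud Function.

Running a search therefore requires the following preparation.

Creating the Google Cloud Function
Setting up Vertex AI
Obtaining the vectors
Creating the index
Code for running a vector search in Firestore

Each step is explained in detail below.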

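Creating the Google Cloud Function

The Cloud Functions environment I created is as follows.

2nd gen
HTTPS trigger
Python 3.12

I also set the following runtime environment variable.

Name: GOOGLE_CLOUD_PROJECT
Value: <project ID> (note that this is the project ID, not the project name)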
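
Because every later step depends on the GOOGLE_CLOUD_PROJECT variable described above, it can help to fail early when it is missing. The following is a minimal sketch, not part of the original function; the helper name get_project_id is hypothetical.

```python
import os


def get_project_id(env=os.environ) -> str:
    """Return the project ID from GOOGLE_CLOUD_PROJECT, or raise if it is missing."""
    project_id = env.get("GOOGLE_CLOUD_PROJECT", "").strip()
    if not project_id:
        # Failing here gives a clearer message than a later client error.
        raise RuntimeError("GOOGLE_CLOUD_PROJECT is not set; set it to the project ID")
    return project_id
```

The env parameter defaults to the real environment and only exists so the helper can be exercised with a plain dict.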

Vertex AI has to be connected to Cloud Run, so I use 2nd gen,
where a Cloud Run service is created along with the function.

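Setting up Google's Vertex AI

Connect Vertex AI to the function's Cloud Run service.
(Official reference: https://cloud.google.com/run/docs/integrate/vertex-ai?authuser=3&hl=ko)

For reference, the steps are also listed here.

Click the link under 'Powered by Cloud Run' on the right to go to Cloud Run
Click the 'Integrations' tab
Click 'Add integration'
Click 'Vertex AI - Generative AI', enter any name, and click 'submit'
Note: the name has to follow certain naming rules, otherwise an error occurs.
If you have no particular preference, the default value is fine.
Approve any request to add permissions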

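Computing the vectors used for vector search

This time I ran everything as the project owner, so I did not add any permissions.
When developing an app, however, the account that runs the function must be granted
permissions for Vertex AI and Firestore.

Below are the dependencies (requirements.txt) and the code that computes a vector with Vertex AI and stores the data in Firestore.
(Official reference: https://firebase.google.com/docs/firestore/vector-search)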

functions-framework==3.*
google-cloud-firestore
google-cloud-aiplatform

import functions_framework
import os
# Firestore
from google.cloud import firestore
from google.cloud.firestore_v1.vector import Vector
# Vertex AI
import vertexai
from vertexai.language_models import TextEmbeddingModel

# Project ID (read from the environment variable)
MY_PROJECT_ID = os.environ.get("GOOGLE_CLOUD_PROJECT")

# Compute the embedding vector for the given string
def text_embedding(text: str) -> Vector:

    # Set location to your own region
    vertexai.init(project=MY_PROJECT_ID, location="asia-northeast1")

    # 'textembedding-gecko@003' was the latest embedding model at the time of writing, so it is used here
    model = TextEmbeddingModel.from_pretrained("textembedding-gecko@003")
    # A single input text yields a single embedding
    embeddings = model.get_embeddings([text])
    vector = embeddings[0].values

    return Vector(vector)

# Main handler
# (the function name is arbitrary)
@functions_framework.http
def hello_http(request):

    # Read the article summary (description) from the request
    request_json = request.get_json(silent=True)
    request_args = request.args

    if request_json and 'description' in request_json:
        description = request_json['description']
    elif request_args and 'description' in request_args:
        description = request_args['description']
    else:
        description = 'World'

    # Initialize the Firestore client
    firestore_client = firestore.Client(project=MY_PROJECT_ID)
    # Collection reference (any name works; Firestore creates the collection when the first document is added)
    collection = firestore_client.collection("article_collection")

    # Compute the embedding
    embedding_vector = text_embedding(description)

    # Prepare the document to add to Firestore
    doc = {
        "description": description,
        "embedding_field": embedding_vector
    }
    # Add the document
    collection.add(doc)

    return 'OK!'
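
The handler above looks for its parameter in the JSON body first, then in the query string, and finally falls back to a default. That lookup can be pulled out into a small helper, sketched here under the assumption that both sources behave like mappings; the name get_param is hypothetical and is not part of functions_framework.

```python
from typing import Mapping, Optional


def get_param(json_body: Optional[Mapping], query_args: Optional[Mapping],
              key: str, default: str) -> str:
    """Look up key in the JSON body first, then the query string, then fall back."""
    if json_body and key in json_body:
        return json_body[key]
    if query_args and key in query_args:
        return query_args[key]
    return default
```

Inside the handler it would be called as get_param(request.get_json(silent=True), request.args, "description", "World").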

To keep things simple, I verified the function from the terminal with the CLI test command.

curl -m 70 -X POST https://asia-northeast1-python-tool-001.cloudfunctions.net/vector_chenge \
  -H "Authorization: bearer $(gcloud auth print-identity-token)" \
  -H "Content-Type: application/json" \
  -d '{ "description": "<any string>"}'

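If the call succeeds, the document is stored in Firestore with its description and its embedding_field vector.

Creating the index for vector search

Vector search requires a vector index.
I created the index by running the following command from the command line.
(Official reference: https://firebase.google.com/docs/firestore/vector-search)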

gcloud alpha firestore indexes composite create \
  --collection-group=article_collection \
  --query-scope=COLLECTION \
  --field-config field-path=embedding_field,vector-config='{"dimension":"768", "flat": "{}"}' \
  --database=<database ID>

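collection-group: the name of the collection to index
query-scope: the scope the index applies to. COLLECTION covers a single collection;
COLLECTION_GROUP covers every collection that shares the same collection ID.
field-path: the name of the field that stores the vector
vector-config: dimension sets the number of vector dimensions (768 here, the output size of textembedding-gecko@003)
database: the ID of the target database. It can be omitted for the default database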
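
The dimension given in vector-config has to agree with the length of the embeddings stored in embedding_field. A minimal sketch of a check that could run before a document is written is shown below; the helper name check_dimension and the constant INDEX_DIMENSION are hypothetical, and the value 768 is taken from the command above.

```python
INDEX_DIMENSION = 768  # must match "dimension" in the index's vector-config


def check_dimension(values, expected: int = INDEX_DIMENSION) -> list:
    """Return the values as a list of floats, or raise if the length differs from the index."""
    values = [float(v) for v in values]
    if len(values) != expected:
        raise ValueError(
            f"embedding has {len(values)} dimensions, index expects {expected}"
        )
    return values
```

Calling it on embedding.values before wrapping the result in Vector would surface a model or index mismatch at write time instead of at query time.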

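Once the command completes, the index is created in Firestore.

Code for running a vector search in Firestore

With the data in place, it is time to run an actual search.

As search target data, I prepared summaries of my own blog articles.
Showing all of them would be long, so only some are listed.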

No.	Title
1	What to do when freezed.dart is not created When designing an immutable class using freezed, you may get a terminal “…
2	What is Flutter’s pubspec.yaml? What it means and how to write it! YAML stands for YAML Ain’t Markup Language, a concise representation of data…
3	What is MVVM, which we often hear about in app development? MVVM (Model-View-ViewModel) is the combination of an app’s logic and UI (user interface)…
4	What is Flutter? An overview of Flutter What is Flutter? Flutter” is “useful for developing mobile apps…
5	What is Riverpod, Flutter’s most major state management introduction! StatefulWidget, which we introduced before, is one of the functions that performs state management…

The code I ran is as follows.

import functions_framework
import os
# Firestore
from google.cloud import firestore
from google.cloud.firestore_v1.vector import Vector
from google.cloud.firestore_v1.base_vector_query import DistanceMeasure
# Vertex AI
import vertexai
from vertexai.language_models import TextEmbeddingModel

# Project ID (read from the environment variable)
MY_PROJECT_ID = os.environ.get("GOOGLE_CLOUD_PROJECT")

# Compute the embedding vector for the given string
def text_embedding(text: str) -> Vector:

    # Set location to your own region
    vertexai.init(project=MY_PROJECT_ID, location="asia-northeast1")

    # 'textembedding-gecko@003' was the latest embedding model at the time of writing, so it is used here
    model = TextEmbeddingModel.from_pretrained("textembedding-gecko@003")
    # A single input text yields a single embedding
    embeddings = model.get_embeddings([text])
    vector = embeddings[0].values

    return Vector(vector)

# Main handler
# (the function name is arbitrary)
@functions_framework.http
def hello_http(request):

    # Read the search query (target) from the request
    request_json = request.get_json(silent=True)
    request_args = request.args

    if request_json and 'target' in request_json:
        target = request_json['target']
    elif request_args and 'target' in request_args:
        target = request_args['target']
    else:
        target = 'World'

    # Initialize the Firestore client
    firestore_client = firestore.Client(project=MY_PROJECT_ID)
    # Collection reference
    collection = firestore_client.collection("article_collection")

    # Compute the embedding of the query
    embedding_vector = text_embedding(target)

    # Run the vector search
    docs = collection.find_nearest(
        vector_field="embedding_field",
        query_vector=embedding_vector,
        distance_measure=DistanceMeasure.COSINE,
        limit=3
    ).get()

    # Output buffer for a table-like (here plain string) result
    output = "Description \n"
    output += "-" * 50 + "\n"

    # Output the contents of the documents returned by the vector search
    for doc in docs:
        doc_data = doc.to_dict()
        description = doc_data.get("description", "No description")
        # Append the first 100 characters of the description
        output += f"{description[:100]} \n"

    return output
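
To make the COSINE distance measure and the limit parameter more concrete, here is a conceptual sketch in plain Python of a brute-force nearest-neighbor scan. It is only an illustration of the idea and does not claim to be how Firestore implements find_nearest; the three-dimensional vectors and descriptions are made-up toy data.

```python
import math


def cosine_distance(a, b) -> float:
    """Cosine distance = 1 - cosine similarity; smaller means more similar.

    Assumes both vectors are non-zero.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest(query, docs, limit=3):
    """Brute-force scan: rank every document by distance to the query, keep the closest."""
    ranked = sorted(docs, key=lambda d: cosine_distance(query, d["embedding_field"]))
    return ranked[:limit]


# Toy data: real embeddings from textembedding-gecko@003 have 768 dimensions.
docs = [
    {"description": "riverpod", "embedding_field": [0.9, 0.1, 0.0]},
    {"description": "flutter", "embedding_field": [0.6, 0.4, 0.1]},
    {"description": "yaml", "embedding_field": [0.0, 0.2, 0.9]},
]
top = nearest([1.0, 0.0, 0.0], docs, limit=2)
print([d["description"] for d in top])  # ['riverpod', 'flutter']
```

Because cosine distance depends only on the angle between vectors, a document is ranked high when its embedding points in nearly the same direction as the query embedding, regardless of vector length.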

Here too I verified the function from the terminal with the CLI test command.
Let's run a search for 'About Riverpod'!

curl -m 70 -X POST https://asia-northeast1-python-tool-001.cloudfunctions.net/vector_search \
  -H "Authorization: bearer $(gcloud auth print-identity-token)" \
  -H "Content-Type: application/json" \
  -d '{
  "target": "About Riverpod"
}'
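
The same request can also be prepared from Python with only the standard library. The sketch below builds the POST request without sending it; the function name build_search_request is hypothetical, and the identity token is assumed to have been obtained separately (for example with gcloud auth print-identity-token).

```python
import json
import urllib.request


def build_search_request(url: str, identity_token: str, target: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that the curl command above issues."""
    body = json.dumps({"target": target}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {identity_token}",
            "Content-Type": "application/json",
        },
    )
```

Passing the returned object to urllib.request.urlopen(req, timeout=70) would send it, with the timeout mirroring the -m 70 option of curl.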

Results

Description 
--------------------------------------------------
What is Riverpod, Flutter's most major state management introduction! StatefulWidget, which we introduced before, is one of the functions that performs state management…
What is Flutter? An overview of Flutter What is Flutter? Flutter" is "useful for developing mobile apps…
What is MVVM, which we often hear about in app development? MVVM (Model-View-ViewModel) is the combination of an app's logic and UI (user interface)…

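Closing thoughts

Testing with a somewhat larger data set would give a better picture of the search accuracy.
As for the size of the vector data, assuming 4 bytes per float, a 768-dimensional vector takes 768 × 4 = 3,072 bytes, or about 3 KB.
Given that a document is limited to 1 MB, that may feel a little large.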
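
The size estimate above can be reproduced in a few lines. The 4 bytes per value is the assumption made in the text, not a statement about how Firestore stores vectors internally, and the document limit is taken as 1 MiB (1,048,576 bytes).

```python
DIMENSIONS = 768
BYTES_PER_FLOAT = 4          # assumption taken from the text above
DOC_LIMIT_BYTES = 1_048_576  # 1 MiB maximum document size

vector_bytes = DIMENSIONS * BYTES_PER_FLOAT
share = vector_bytes / DOC_LIMIT_BYTES
print(f"{vector_bytes} bytes, {share:.2%} of the document limit")
# 3072 bytes, 0.29% of the document limit
```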

Even though it is still a preview release, full-text search becoming possible in Firestore is good news.
I am looking forward to seeing where it goes from here.