Terraform Serverless Vector Ingestion

Transform your documents and ingest them into Qdrant using

Arch Diagram

Quickstart

Prerequisites:

  • awscli installed.
  • terraform installed.
  • docker installed.
  • AWS access with sufficient permissions. If you have a profile configured and wish to make it the default, add AWS_PROFILE=... to the .env.
  • A Qdrant cluster deployed. If you don't have one, create one for free on their website. Keep the URL and API key at hand.
  • A JinaAI API key. If you don't have one, grab one for free from their webpage. Keep the API key at hand.
  1. Create an ECR repo:
make aws/ecr/login && \
make aws/ecr/create TARGET=qdrant-ingestion
  1. Build the Docker image:
make docker/build TARGET=qdrant-ingestion
  1. Push the Docker image to the ECR repo:
make docker/push TARGET=qdrant-ingestion
  1. Create the following secure SSM Parameters:
aws ssm put-parameter \
    --name "/vectorized/qdrant/ecr/image" \
    --value "<ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/qdrant-ingestion:latest" \
    --type "SecureString" \
    --description "ECR image for Qdrant ingestor in lambda"

aws ssm put-parameter \
    --name "/vectorized/qdrant/url" \
    --value "<QDRANT_URL>" \
    --type "SecureString" \
    --description "Qdrant URL"

aws ssm put-parameter \
    --name "/vectorized/qdrant/apikey" \
    --value "<QDRANT_API_KEY>" \
    --type "SecureString" \
    --description "Qdrant API Key"

aws ssm put-parameter \
    --name "/vectorized/jina/apikey" \
    --value "<JINA_API_KEY>" \
    --type "SecureString" \
    --description "Jina AI API Key"
  1. Initialize the terraform stack. Follow one of the two options below.

A. Store TFState locally.
    make tf/init TARGET=app ENV=sandbox

B. Store TFState in S3 and lock it in DynamoDB.

a. Create a bucket in S3 to store the state:

aws s3 mb s3://<your-tf-state-bucket>
b. [Optional] Create a DynamoDB table (on demand) to store the lock:
aws dynamodb create-table \
    --table-name <your-tf-lock-dbtable> \
    --attribute-definitions \
        AttributeName=LockID,AttributeType=S \
    --key-schema \
        AttributeName=LockID,KeyType=HASH \
    --billing-mode PAY_PER_REQUEST
c. Create the file `./environments/sandbox/app/backend.conf` and write:
bucket="<your-tf-state-bucket>"
key="<path/to>/terraform.tfstate"
region="<your-bucket-region>"
dynamodb_table="<your-tf-lock-dbtable>"
d. Uncomment lines 8-15 of `deployments/app/versions.tf`, then initialize:
    make tf/init TARGET=app ENV=sandbox
  1. Create one more secret to store a UUID namespace. More about this in the section below.
aws ssm put-parameter \
    --name "/vector-ingestion/qdrant/namespace" \
    --value "$(uuidgen)" \
    --type "SecureString" \
    --description "Namespace UUID4 to ensure key consistency and avoid duplication in Qdrant"
  1. Create a terraform.tfvars file in environments/sandbox/app/, using template.terraform.tfvars as a reference, and fill in your values.

  2. Deploy:

make tf/deploy TARGET=app ENV=sandbox
  1. Test your deployment. Drop a file in your S3 bucket. After a moment, check your Qdrant database: you should find the vectorized document's chunks there.

  2. [Optional] Clean up.

aws s3 rm s3://<documents-bucket>/ --recursive
make tf/destroy TARGET=app ENV=sandbox
aws ecr delete-repository --repository-name <value> --force
aws dynamodb delete-table --table-name <your-tf-lock-dbtable>
aws s3 rm s3://<your-tf-state-bucket>/ --recursive
aws s3 rb s3://<your-tf-state-bucket>
aws ssm delete-parameter --name "/vectorized/qdrant/ecr/image"
aws ssm delete-parameter --name "/vectorized/qdrant/url"
aws ssm delete-parameter --name "/vectorized/qdrant/apikey"
aws ssm delete-parameter --name "/vectorized/jina/apikey"
aws ssm delete-parameter --name "/vector-ingestion/qdrant/namespace"
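
Before deploying, it can help to confirm that every SSM parameter from the steps above exists. The following is a minimal sketch, not part of the repository: `missing_parameters` and `REQUIRED_PARAMETERS` are hypothetical names, and the parameter paths are copied from the `put-parameter` commands in this Quickstart. It relies only on the `get_parameters` call of the SSM client, which reports unknown names under `InvalidParameters` and accepts at most 10 names per request.

```python
# Parameter names copied from the put-parameter commands in the Quickstart.
REQUIRED_PARAMETERS = [
    "/vectorized/qdrant/ecr/image",
    "/vectorized/qdrant/url",
    "/vectorized/qdrant/apikey",
    "/vectorized/jina/apikey",
    "/vector-ingestion/qdrant/namespace",
]


def missing_parameters(ssm_client, names=REQUIRED_PARAMETERS):
    """Return, sorted, the parameter names that SSM reports as not found.

    `ssm_client` is any object exposing the SSM `get_parameters` call,
    for example the client returned by boto3.client("ssm").
    """
    response = ssm_client.get_parameters(Names=list(names), WithDecryption=True)
    # Names that do not exist come back under "InvalidParameters".
    return sorted(response.get("InvalidParameters", []))
```

Assuming boto3 is installed and AWS credentials are configured, `missing_parameters(boto3.client("ssm"))` should return an empty list once all the parameters have been created.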

Note: Namespace

A document may occasionally be processed twice. If chunk IDs were generated without any control, the same chunk produced by two different lambdas would get two different IDs and end up duplicated in Qdrant.

To prevent this, we use a namespace.

With a fixed namespace, two identical document chunks always get the same UUID.

E.g.:

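from uuid import uuid4, uuid5

text = "hello world"

# The namespace is generated once and then reused for every chunk.
namespace = uuid4()
# uuid5 is deterministic: same namespace + same text -> same ID.
id1 = uuid5(namespace, text)
id2 = uuid5(namespace, text)

assert id1 == id2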
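
In a deployment the namespace is not generated on every run; it is a stored string, such as the value saved in SSM during the Quickstart. The sketch below illustrates that case under stated assumptions: `chunk_id` is a hypothetical helper, not the repository's actual lambda code, and both namespace values are made up for the example.

```python
from uuid import UUID, uuid5


def chunk_id(namespace: str, chunk_text: str) -> str:
    """Derive a deterministic ID for a chunk from a namespace UUID string."""
    return str(uuid5(UUID(namespace), chunk_text))


# Hypothetical values; a real deployment would read the namespace from SSM.
NAMESPACE = "12345678-1234-5678-1234-567812345678"
OTHER_NAMESPACE = "87654321-4321-8765-4321-876543218765"

# Two independent runs over the same chunk yield the same ID.
first_run = chunk_id(NAMESPACE, "hello world")
second_run = chunk_id(NAMESPACE, "hello world")
assert first_run == second_run

# A different chunk, or a different namespace, yields a different ID.
assert chunk_id(NAMESPACE, "hello world!") != first_run
assert chunk_id(OTHER_NAMESPACE, "hello world") != first_run
```

Because the ID depends only on the namespace and the chunk text, writing the same chunk twice under the same ID targets a single point instead of creating a duplicate. Changing the namespace after ingestion changes every ID, so it should stay fixed for the lifetime of a collection.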

TODO:

  • Test clean up logic.
  • Improve docs for lambda logic.
  • Remove layer creation code, which is not currently being used.