Terraform Serverless Vector Ingestion

Transform your documents and ingest them into Qdrant using

Quickstart

Prerequisites:

awscli installed.
terraform installed.
docker installed.
Access to AWS with the required permissions. If you have a profile configured and wish to make it the default, add AWS_PROFILE=... to the .env.
A Qdrant cluster deployed. If you don't have one, create one for free on their website. Keep the URL and API key at hand.
A JinaAI API key. If you don't have one, grab one for free from their webpage. Keep the API key close.

Create an ECR repo:

make aws/ecr/login && \
make aws/ecr/create TARGET=qdrant-ingestion

Build the Docker image:

make docker/build TARGET=qdrant-ingestion

Push the Docker image to the ECR repo

make docker/push TARGET=qdrant-ingestion

Create the following secure SSM Parameters:

aws ssm put-parameter \
    --name "/vectorized/qdrant/ecr/image" \
    --value <ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/qdrant-ingestion:latest \
    --type "SecureString" \
    --description "ECR image for Qdrant ingestor in lambda"

aws ssm put-parameter \
    --name "/vectorized/qdrant/url" \
    --value "<QDRANT_URL>" \
    --type "SecureString" \
    --description "Qdrant URL"

aws ssm put-parameter \
    --name "/vectorized/qdrant/apikey" \
    --value "<QDRANT_API_KEY>" \
    --type "SecureString" \
    --description "Qdrant API Key"

aws ssm put-parameter \
    --name "/vectorized/jina/apikey" \
    --value "<JINA_API_KEY>" \
    --type "SecureString" \
    --description "Jina AI API Key"
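
For reference, the parameters above could be read back from Python roughly as follows. This is a minimal sketch, not the project's Lambda code: `load_settings` and `PARAMETER_NAMES` are hypothetical names, and the client passed in is assumed to be a `boto3.client("ssm")`. `WithDecryption=True` is needed because the parameters are stored as `SecureString`.

```python
# Sketch only: load_settings and PARAMETER_NAMES are hypothetical helpers.
# Parameter names are copied from the put-parameter commands above.
PARAMETER_NAMES = {
    "image": "/vectorized/qdrant/ecr/image",
    "qdrant_url": "/vectorized/qdrant/url",
    "qdrant_api_key": "/vectorized/qdrant/apikey",
    "jina_api_key": "/vectorized/jina/apikey",
}


def load_settings(ssm_client, names=PARAMETER_NAMES):
    """Return {short_name: decrypted value} for each SSM parameter."""
    settings = {}
    for short_name, parameter_name in names.items():
        # SecureString values come back encrypted unless WithDecryption is set.
        response = ssm_client.get_parameter(
            Name=parameter_name, WithDecryption=True
        )
        settings[short_name] = response["Parameter"]["Value"]
    return settings
```

With boto3 installed and credentials configured, `load_settings(boto3.client("ssm"))` should return a dictionary with the four values. Passing the client in as an argument keeps the helper easy to exercise with a stub.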

Initialize the terraform stack. Follow one of the two options below.

A. Store TFState locally.

    make tf/init TARGET=app ENV=sandbox

B. Store TFState in S3 and lock in DynamoDB.

a. Create a bucket in S3 to store the state

aws s3 mb s3://<your-tf-state-bucket>

b. [Optional] Create a dynamodb table (on demand) to store the lock

aws dynamodb create-table \
    --table-name <your-tf-lock-dbtable> \
    --attribute-definitions \
        AttributeName=LockID,AttributeType=S \
    --key-schema \
        AttributeName=LockID,KeyType=HASH \
    --billing-mode PAY_PER_REQUEST

c. Create the file `./environments/sandbox/app/backend.conf` and write:

bucket="<your-tf-state-bucket>"
key="<path/to>/terraform.tfstate"
region="<your-bucket-region>"
dynamodb_table="<your-tf-lock-dbtable>"

d. Uncomment lines 8-15 of `deployments/app/versions.tf`, then initialize the stack:

    make tf/init TARGET=app ENV=sandbox

Create one more secret to store a UUID namespace. More about this in the section below.

aws ssm put-parameter \
    --name "/vector-ingestion/qdrant/namespace" \
    --value "$(uuidgen)" \
    --type "SecureString" \
    --description "Namespace UUID4 to ensure key consistency and avoid duplication in Qdrant"

Create a terraform.tfvars file in environments/sandbox/app/ using template.terraform.tfvars as a reference, and fill it in with your values.
Deploy:

make tf/deploy TARGET=app ENV=sandbox

Test your deployment. Drop a file in your S3 bucket. After a moment, check your Qdrant database; you should find the vectorized document's chunks there.
[Optional] Clean up.
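
One way to check the result without opening the Qdrant dashboard is to ask the REST API how many points a collection holds. The sketch below is an illustration under assumptions: the collection name is hypothetical (use whatever your deployment writes to), `build_count_request` and `count_points` are made-up helper names, and the URL is expected to include the port if your cluster needs one.

```python
import json
import urllib.request


def build_count_request(qdrant_url, api_key, collection):
    """Build a POST request for Qdrant's points/count REST endpoint."""
    url = f"{qdrant_url.rstrip('/')}/collections/{collection}/points/count"
    body = json.dumps({"exact": True}).encode("utf-8")
    headers = {"api-key": api_key, "Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def count_points(qdrant_url, api_key, collection):
    """Send the request and return the number of points in the collection."""
    request = build_count_request(qdrant_url, api_key, collection)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)["result"]["count"]
```

Calling `count_points("<QDRANT_URL>", "<QDRANT_API_KEY>", "<collection>")` before and after dropping a file should show the count growing by the number of chunks ingested.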

aws s3 rm s3://<documents-bucket>/ --recursive
make tf/destroy TARGET=app ENV=sandbox
aws ecr delete-repository --repository-name <value> --force
aws dynamodb delete-table --table-name <your-tf-lock-dbtable>
aws s3 rm s3://<your-tf-state-bucket> --recursive
aws s3 rb s3://<your-tf-state-bucket>
aws ssm delete-parameter --name "/vectorized/qdrant/ecr/image"
aws ssm delete-parameter --name "/vectorized/qdrant/url"
aws ssm delete-parameter --name "/vectorized/qdrant/apikey"
aws ssm delete-parameter --name "/vectorized/jina/apikey"
aws ssm delete-parameter --name "/vector-ingestion/qdrant/namespace"

Note: Namespace

There is a chance that a document is sometimes processed twice. If there is no control over how the ID is created, the same chunk produced by two different lambdas ends up with two different IDs.

To avoid that, we use a namespace. With a fixed namespace, two identical document chunks always get the same UUID.

Example:

from uuid import uuid4, uuid5

text = "hello world"

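# Generate the namespace once and reuse it; a new uuid4() per run would change every ID.
namespace = uuid4()
# uuid5 is deterministic: the same namespace and text always give the same UUID.
id1 = uuid5(namespace, text)
id2 = uuid5(namespace, text)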

assert id1 == id2
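
To make the effect on duplicated processing more concrete, here is a slightly larger sketch. It is not the project's Lambda code: `chunk_id`, `split_into_chunks`, the fixed-size splitting, and both namespace values are assumptions made for illustration. In the deployment, the namespace would come from the SSM parameter created in the Quickstart, so every invocation shares the same value.

```python
from uuid import UUID, uuid5

# Hypothetical fixed namespace, standing in for the value stored in SSM.
NAMESPACE = UUID("12345678-1234-4678-9234-567812345678")


def chunk_id(namespace, chunk_text):
    """Derive a deterministic point ID from the namespace and the chunk text."""
    return str(uuid5(namespace, chunk_text))


def split_into_chunks(text, size):
    """Naive fixed-size splitter, used here only for illustration."""
    return [text[i:i + size] for i in range(0, len(text), size)]


document = "hello world, hello qdrant"

# Two independent "invocations" processing the same document.
first_run = [chunk_id(NAMESPACE, c) for c in split_into_chunks(document, 10)]
second_run = [chunk_id(NAMESPACE, c) for c in split_into_chunks(document, 10)]
assert first_run == second_run  # identical IDs on both runs

# A different namespace yields different IDs for the same text.
other_namespace = UUID("87654321-4321-4876-a321-876543218765")
assert chunk_id(other_namespace, "hello") != chunk_id(NAMESPACE, "hello")
```

Because both runs produce the same IDs, upserting the points a second time overwrites the existing ones in Qdrant instead of creating duplicates.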

TODO:

Test clean up logic.
Improve docs for lambda logic.
Remove layer creation code, which is not currently being used.
