This repo gives you two ways to get a working Box metadata extraction workflow:
- Vibecode your own app using the included agents.md in Cursor (with copy/paste prompts), or
- Run the included reference implementation in src/ (a complete, working example generated using the same agents.md).
Both paths produce the same outcome: extract structured fields from Box files via Box AI Extract Structured and write them back as Box metadata.
- Extracts structured metadata from a single file or all files in a folder (i.e., using a metadata template)
- Prints a normalized dict of extracted fields
- Optionally writes extracted fields back to each file as Box metadata
- Uses CCG (Client Credentials Grant) enterprise authentication
- Uses the Box Generated Python SDK (box-sdk-gen)
In many organizations, important business information lives inside unstructured documents — invoices, contracts, statements, onboarding packets, and internal reports. While Box metadata is powerful for organizing content, enabling search, and driving automation, that metadata is often applied manually or through brittle, document-specific scripts.
This workflow shows how to use Box AI to extract structured data from real documents and write it back to Box as enterprise metadata. By grounding extraction in a shared metadata schema, teams can standardize how information is captured and applied across individual files or entire folders.
Common real-world use cases include:
- Invoices: extracting totals, dates, vendors, and payment terms
- Contracts: capturing effective dates, renewal terms, and counterparties
- Operations workflows: tagging files to trigger downstream automation
- Content governance: enforcing consistent metadata across teams and folders
The key idea is that Box content itself can become an input to downstream systems and processes. By extracting structured fields from documents and applying them as metadata, teams can improve discoverability, enable automation, and power business workflows.
Vibecoding combined with Box metadata extraction creates a fast feedback loop — moving from a shared schema, to a working application, to real metadata written back to Box, turning unstructured files into actionable data.
This repo is intended for developers building content-driven workflows on Box, as well as platform teams looking to standardize metadata extraction at scale.
You will:
- Start with agents.md as the specification/guardrails
- Use the prompts below in Cursor Chat to generate the code
- Run your generated CLI locally
→ Go to: Vibecoding in Cursor
You will:
- Set up a Python environment
- Configure .env
- Run the CLI that already exists in this repo
→ Go to: Run the reference implementation
Create a new folder (or a new git repo), then copy agents.md into the root of that folder.
Your folder should start like:
your-project/
agents.md
Copy/paste exactly:
Using the instructions in agents.md, scaffold a minimal Python project for this demo.
Create:
- a requirements.txt
- a .env.example
- a basic repo structure under src/
- placeholder files (no full logic yet)
Keep it CLI-first and minimal.
(end)
Expected output:
- requirements.txt
- .env.example
- src/ with placeholder modules (box_client.py, extract.py, metadata.py, cli.py)
Copy/paste exactly:
Implement the single-file happy path end-to-end:
- Finish get_box_client() in src/box_client.py using CCG + python-dotenv
- Implement extract_structured(file_id, template_key, scope, model) in src/extract.py using client.ai.create_ai_extract_structured(...)
- Implement write_metadata(file_id, template_key, metadata_dict) in src/metadata.py using file metadata create (and handle “already exists” gracefully)
- Wire src/cli.py run --file-id ... to call extract → normalize → write-back, with clear step-labeled prints
Follow the guardrails in agents.md strictly (box_sdk_gen, AiItemBase, metadata template, normalization stripping d_ keys). Keep it minimal.
(end)
At this point you should be able to run:
python -m src.cli run --file-id <BOX_FILE_ID> --dry-run
Copy/paste exactly:
Using the rules in agents.md (including the “Optional: Folder (batch) processing” section), extend the existing CLI app to support processing a Box folder.
Requirements:
- Preserve the existing single-file flow exactly.
- Add an optional --folder-id flag to the run command.
- Require exactly one of --file-id or --folder-id.
- List folder items via client.folders.get_folder_items(...).
- Only process items where item.type == "file".
- For each file:
- call extract_structured(...)
- print the normalized dict
- if not --dry-run, call write_metadata(...)
- Errors on one file must not stop the batch.
- Print progress like: [2/10] Processing file <id>.
- Keep it minimal and reuse existing functions.
(end)
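The flag requirements in the prompt above ("exactly one of --file-id or --folder-id") map naturally onto argparse's mutually exclusive groups. The following is a minimal sketch, not the code in src/cli.py: the function name build_parser and the overall parser layout are assumptions made for illustration, and the code your prompt generates may use a different CLI library.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for `run` that needs exactly one target flag (hypothetical layout)."""
    parser = argparse.ArgumentParser(prog="src.cli")
    subcommands = parser.add_subparsers(dest="command", required=True)

    run = subcommands.add_parser("run", help="Extract metadata from a file or a folder")

    # required=True plus mutual exclusion means: one of the two flags, never both, never neither.
    target = run.add_mutually_exclusive_group(required=True)
    target.add_argument("--file-id", help="ID of a single Box file")
    target.add_argument("--folder-id", help="ID of a Box folder to process in batch")

    run.add_argument("--dry-run", action="store_true", help="Print fields without writing metadata")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

With this layout argparse rejects a command line that passes both flags, or neither, before any Box call is made.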
After Prompt 3, both commands should work:
python -m src.cli run --file-id <BOX_FILE_ID> --dry-run
python -m src.cli run --folder-id <BOX_FOLDER_ID> --dry-run
If you just want a working example (or want to compare your vibecoded output), use the existing code already in src/.
- Python 3.10+
- A Box application configured for Client Credentials Grant (CCG) enterprise auth
- A Box enterprise metadata template (e.g., invoice) whose field keys match extracted output
Dependencies:
- box-sdk-gen
- python-dotenv
From the repo root:
python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
pip install -r requirements.txt
cp .env.example .env
Edit .env and set your real values:
- BOX_CLIENT_ID
- BOX_CLIENT_SECRET
- BOX_ENTERPRISE_ID
- BOX_METADATA_TEMPLATE_KEY
- BOX_METADATA_SCOPE (optional; defaults to enterprise_{BOX_ENTERPRISE_ID})
- BOX_AI_MODEL (optional)
⚠️ Never commit real credentials. Treat .env as secret.
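To make the optional and defaulted settings concrete, here is a minimal sketch of how these values could be read and validated. It is not the code in src/: the function name load_settings is hypothetical, treating BOX_METADATA_TEMPLATE_KEY as required is an assumption based on the list above, and the sketch reads os.environ directly on the assumption that .env has already been loaded (for example by python-dotenv).

```python
import os
from typing import Mapping, Optional

# Assumed to be required because the list above does not mark them optional.
REQUIRED = (
    "BOX_CLIENT_ID",
    "BOX_CLIENT_SECRET",
    "BOX_ENTERPRISE_ID",
    "BOX_METADATA_TEMPLATE_KEY",
)


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return validated settings, filling in the documented default scope."""
    env = os.environ if env is None else env

    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError("Missing required settings: " + ", ".join(missing))

    settings = {name: env[name] for name in REQUIRED}
    # Scope falls back to enterprise_{BOX_ENTERPRISE_ID} when unset or empty.
    settings["BOX_METADATA_SCOPE"] = (
        env.get("BOX_METADATA_SCOPE") or f"enterprise_{env['BOX_ENTERPRISE_ID']}"
    )
    # None here means "no model specified".
    settings["BOX_AI_MODEL"] = env.get("BOX_AI_MODEL") or None
    return settings
```

Failing fast on missing values keeps credential problems separate from Box API errors that would otherwise surface later in the run.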
Dry run (recommended first):
python -m src.cli run --file-id <BOX_FILE_ID> --dry-run
Write-back (i.e., tag file with metadata):
python -m src.cli run --file-id <BOX_FILE_ID>
Dry run:
python -m src.cli run --folder-id <BOX_FOLDER_ID> --dry-run
Write-back (i.e., tag files with metadata):
python -m src.cli run --folder-id <BOX_FOLDER_ID>
Behavior:
- Only file items are processed
- Each file is handled independently
- Errors on one file do not stop the batch
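The three behaviors above can be sketched as a loop that filters to file items and wraps each file in its own try/except. This is an illustration, not the reference implementation: process_folder is a hypothetical name, folder items are modeled as plain dicts rather than SDK objects, and the extract and write callables stand in for the project's extract_structured and write_metadata functions.

```python
from typing import Callable, Iterable


def process_folder(
    items: Iterable[dict],
    extract: Callable[[str], dict],
    write: Callable[[str, dict], None],
    dry_run: bool = True,
) -> dict:
    """Process each file item independently and report which ones succeeded or failed."""
    # Only file items are processed; subfolders and web links are skipped.
    files = [item for item in items if item.get("type") == "file"]
    succeeded, failed = [], []

    for index, item in enumerate(files, start=1):
        file_id = item["id"]
        print(f"[{index}/{len(files)}] Processing file {file_id}")
        try:
            fields = extract(file_id)
            print(fields)
            if not dry_run:
                write(file_id, fields)
            succeeded.append(file_id)
        except Exception as exc:
            # A failure is recorded and the loop moves on to the next file.
            print(f"  Skipped {file_id}: {exc}")
            failed.append(file_id)

    return {"succeeded": succeeded, "failed": failed}
```

Returning the two lists makes it easy to print a summary at the end of a batch or to retry only the failed files.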
For write-back to succeed:
- The template key must exist in your Box enterprise
- Extracted keys must match the template’s field keys
If keys differ, adjust the template or add a mapping step.
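A mapping step can be as small as a rename table applied before write-back. The sketch below is one possible shape under stated assumptions: map_keys, key_map, and allowed_keys are hypothetical names, and the field names in the usage example are invented for illustration rather than taken from any real template.

```python
def map_keys(extracted: dict, key_map: dict, allowed_keys: set) -> dict:
    """Rename extracted keys to template field keys and drop anything the template lacks."""
    mapped = {}
    for key, value in extracted.items():
        # Keys without an entry in key_map are kept under their original name.
        target = key_map.get(key, key)
        if target in allowed_keys:
            mapped[target] = value
    return mapped


if __name__ == "__main__":
    # Invented example values: the extractor returned "invoice_total",
    # while the (hypothetical) template field key is "total".
    extracted = {"invoice_total": "120.50", "vendor": "Acme", "notes": "n/a"}
    result = map_keys(extracted, {"invoice_total": "total"}, {"total", "vendor"})
    print(result)
```

Dropping keys that are not in the template is a design choice in this sketch; logging them instead may be preferable while you are still tuning the template.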
- Confirm your venv is activated
- Reinstall: pip install -r requirements.txt
- Upgrade: pip install --upgrade box-sdk-gen
- Ensure generated imports match agents.md (do not “guess” alternate module paths)
- Confirm template key and scope
- Ensure fields are not marked required unless they are always present
- Start with files that do not already have the template applied
MIT (see LICENSE)