panosAthDBX/synthetic_data

Credit Risk on Databricks

This repository packages a governed credit-risk quick start for Databricks. It complements the specification in Agents.MD with runnable code, Databricks Asset Bundles, and documentation that teams can copy, customise, and promote into their own regulated environments.

The implementation covers the core lifecycle:

  • Ingestion of UK Open Banking feeds into Unity Catalog bronze/silver layers
  • Feature engineering with strict point-in-time joins and Feature Serving registration
  • Synthetic data generation governed by policy prompts, priors, and privacy checks
  • Model training, evaluation, and serving using Mosaic AI Model Serving and automatic feature lookup
  • Monitoring and incident response across data quality, serving SLOs, and fairness metrics
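
To make the point-in-time requirement concrete, here is a minimal, framework-free sketch of the idea: each labelled event may only see feature values observed at or before its own timestamp. The function name, the tuple layout, and the sample data are illustrative assumptions and are not taken from the repository, which performs these joins on Delta tables.

```python
from bisect import bisect_right
from datetime import datetime


def point_in_time_join(labels, features):
    """Attach to each label the latest feature value seen at or before its timestamp.

    labels:   iterable of (entity_id, event_time)
    features: iterable of (entity_id, feature_time, value)
    """
    # Group feature rows per entity, ordered by feature_time.
    by_entity = {}
    for entity_id, feature_time, value in sorted(features, key=lambda r: (r[0], r[1])):
        times, values = by_entity.setdefault(entity_id, ([], []))
        times.append(feature_time)
        values.append(value)

    joined = []
    for entity_id, event_time in labels:
        times, values = by_entity.get(entity_id, ([], []))
        # bisect_right counts the feature rows with feature_time <= event_time.
        idx = bisect_right(times, event_time)
        joined.append((entity_id, event_time, values[idx - 1] if idx else None))
    return joined


# Hypothetical sample data: a balance feature per account.
features = [
    ("acc1", datetime(2024, 1, 1), 100.0),
    ("acc1", datetime(2024, 1, 10), 250.0),
    ("acc2", datetime(2024, 1, 5), 40.0),
]
labels = [
    ("acc1", datetime(2024, 1, 5)),
    ("acc1", datetime(2024, 1, 10)),
    ("acc2", datetime(2024, 1, 1)),
]
joined = point_in_time_join(labels, features)
```

With this data the three labels resolve to 100.0, 250.0, and None respectively: the last label predates every feature row for its account, so nothing leaks in from the future. If "strict" should mean strictly before the event time, swapping `bisect_right` for `bisect_left` gives that behaviour.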

ℹ️ Open source quick start: The repository is intentionally verbose. Each configuration knob and operational step is documented so that downstream teams can see what must be adapted for their bank, credit bureau, or fintech deployment.

Table of contents

  1. Repository layout
  2. Prerequisites
  3. Configuring the platform
  4. Local setup and tests
  5. Publishing agents & feature specs
  6. Deploying with Databricks Asset Bundles
  7. Running the workflow
  8. Customising the solution
  9. Reference documentation

Repository layout

.
├── docs/                  # How-to guides (deployment, operations, governance, Mosaic plan)
├── infra/                 # Databricks Asset Bundle definitions and UC policy scripts
├── risk_platform/
│   ├── agents/            # Prompt policy, memo, synthetic, feature, modeling, serving, monitoring agents
│   ├── config/            # Typed configuration (Unity Catalog, MLflow, Lakeflow, operations)
│   ├── governance/        # UC policy helpers
│   ├── ingestion/         # Data contracts and Lakeflow declarative pipeline builders
│   ├── models/            # Training and evaluation utilities
│   ├── ops/               # Incident notifier (webhook)
│   ├── serving/           # Feature Serving registrar helpers
│   ├── synthetic/         # LLM wrappers, priors, agents, specs for synthetic data
│   ├── orchestration/     # Orchestrator entry points / CLI
│   └── ...
└── tests/                 # Pytest suites covering contracts, features, agents, orchestration

Secrets, PATs, and production configuration are never committed. Use Databricks secret scopes and environment variables when deploying.

Architecture overview

flowchart LR
    subgraph Ingestion
        A[Open Banking APIs] --> B[Lakeflow Bronze]
        B --> C[Lakeflow Silver]
    end
    subgraph Governance
        C --> D[Unity Catalog Policies]
        D --> E[Synthetic Agent]
    end
    E --> F[Feature Agent]
    F --> G[Feature Serving]
    G --> H[Mosaic AI Model Serving]
    H --> I[Credit Decisions]
    subgraph Monitoring
        C --> J[Lakehouse Monitoring]
        H --> J
        J --> K[Monitoring Agent]
        K --> L[Incident Webhook]
    end
    E --> M[MLflow Runs]
    F --> M
    H --> M

Data and agent flow

sequenceDiagram
    participant OB as Open Banking Source
    participant LF as Lakeflow Pipelines
    participant UC as Unity Catalog
    participant AG as Mosaic Agent Plan
    participant FS as Feature Serving
    participant MS as Mosaic Serving Endpoint
    participant MON as Monitoring & Alerts

    OB->>LF: Deliver accounts/balances/transactions
    LF->>UC: Write bronze & silver tables
    AG->>UC: PromptPolicyAgent validates prompt
    AG->>UC: SyntheticDataAgent produces synthetic tables
    AG->>UC: FeatureAgent writes features & registers FeatureSpecs
    AG->>FS: FeatureAgent binds FeatureSpecs to endpoint
    AG->>UC: ModelingAgent trains model
    AG->>UC: EvaluationAgent validates metrics & fairness
    AG->>MS: ServingAgent deploys candidate model
    MS->>MON: Emit latency/availability metrics
    UC->>MON: Lakehouse Monitoring tables
    MON->>AG: MonitoringAgent evaluates thresholds
    MON->>Incident: Webhook notification (Slack/PagerDuty)

Prerequisites

Before running the quick start, ensure the following are available in your Databricks workspace:

  • Python 3.10+ runtimes for dev, stage, and prod clusters. The codebase uses dataclass(slots=True) and other 3.10 features.
  • Unity Catalog with dedicated catalogs/schemas (default names: risk_dev, risk_stage, risk_prod).
  • Lakehouse Federation / Lakeflow for Open Banking ingestion.
  • Mosaic AI services:
    • LLM endpoints for planning, synthetic generation, memo normalisation, and prompt policy
    • Llama Guard safety endpoint
    • Mosaic AI Model Serving endpoint for risk_prod.credit.score_endpoint
  • Feature Serving workspace entitlement and permissions to register FeatureSpecs.
  • Secret scopes containing, at minimum, a Databricks PAT (secrets/risk-platform/token), external aggregator credentials, and webhook credentials for incident notifications.

Configuring the platform

Settings live in risk_platform/config/settings.py and default to the values described in Agents.MD. Override them using environment variables (prefix RISK_, nested with __). Examples:

export RISK_UNITY_CATALOG__CATALOG=risk_stage
export RISK_UNITY_CATALOG__SCHEMAS__CREDIT=credit
export RISK_OPERATIONS__INCIDENT_WEBHOOK_URL=https://hooks.slack.com/services/... 
export RISK_ORCHESTRATOR__AGENTS='["prompt_policy_agent","synthetic_data_agent","feature_agent",...]'
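
The prefix-and-delimiter convention above can be illustrated with a small stdlib-only sketch. It shows one plausible way that variables such as RISK_UNITY_CATALOG__CATALOG resolve into nested settings. The function name and the JSON-decoding rule are assumptions made for illustration; the actual loader in risk_platform/config/settings.py may behave differently.

```python
import json


def nested_settings(environ, prefix="RISK_", delimiter="__"):
    """Fold prefixed environment variables into a nested dict (illustrative only)."""
    settings = {}
    for key, raw in environ.items():
        if not key.startswith(prefix):
            continue  # ignore unrelated variables such as PATH
        path = key[len(prefix):].lower().split(delimiter)
        try:
            value = json.loads(raw)  # lists and numbers arrive as JSON
        except json.JSONDecodeError:
            value = raw  # plain strings are kept as-is
        node = settings
        for part in path[:-1]:
            node = node.setdefault(part, {})
        node[path[-1]] = value
    return settings


# Sample environment mirroring the exports above (agent list shortened).
env = {
    "RISK_UNITY_CATALOG__CATALOG": "risk_stage",
    "RISK_UNITY_CATALOG__SCHEMAS__CREDIT": "credit",
    "RISK_ORCHESTRATOR__AGENTS": '["prompt_policy_agent", "feature_agent"]',
    "PATH": "/usr/bin",
}
resolved = nested_settings(env)
```

Here `resolved["unity_catalog"]["schemas"]["credit"]` is `"credit"` and `resolved["orchestrator"]["agents"]` is a Python list, which is why the agent list is exported as a JSON string.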

Key items to review and customise:

  • Unity Catalog mapping: Align catalog and schemas with your tenancy’s naming convention.
  • MLflow tracking URIs: Provide environment-specific tracking/registry URIs.
  • Lakeflow scopes and bundle paths: Update connect_config_scope and pipeline bundle locations.
  • Feature Serving endpoint: Set RISK_DEFAULT_SERVING_ENDPOINT to the endpoint you bind feature specs to.
  • Operations: Supply incident webhook URL/token if you want the monitoring agent to page on-call.

🛠️ Tip: Ship a .env per environment and load it in your CI/CD pipeline before executing bundle commands.

Local setup and tests

  1. Clone the repository and create a Python 3.10+ virtual environment.
  2. Install dependencies:
    pip install -e ".[dev]"
  3. Run the fast tests (contracts, features, agents) locally:
    pytest tests/
    Some tests rely on Spark/PySpark. When running outside Databricks, either install pyspark and set SPARK_HOME, or deselect those tests with -m "not spark".
  4. Export a Mosaic plan (used later during deployment):
    python -m risk_platform.agents.export_plan > plan.json

Publishing agents & feature specs

  1. Register Mosaic agents using the Databricks CLI / Mosaic plugin:
    databricks mosaic agents create \
      --name risk_platform.credit.prompt_policy \
      --file risk_platform/agents/prompt_policy_agent.py \
      --class PromptPolicyAgent
    # repeat for memo, synthetic, feature, modeling, evaluation, serving, monitoring
  2. Register the Mosaic plan exported earlier:
    databricks mosaic plans create \
      --name risk_platform.credit.plan \
      --file plan.json
  3. Verify Feature Serving bindings after running the FeatureAgent once:
    curl -H "Authorization: Bearer $DATABRICKS_TOKEN" \
      "https://<workspace>/api/2.0/feature-serving/feature-specs/risk_prod.credit.features_txn_7d"
    The response should list the feature columns and show the serving endpoint binding under serving-specs.
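
For teams that prefer to script this check, the same request can be prepared in Python. This is a minimal sketch that only builds the request object. The helper name is hypothetical, the URL path is copied from the curl example above, and the shape of the JSON response is not assumed.

```python
import urllib.request


def feature_spec_request(workspace_host, spec_name, token):
    """Build (but do not send) the GET request used to inspect a FeatureSpec."""
    url = (
        f"https://{workspace_host}"
        f"/api/2.0/feature-serving/feature-specs/{spec_name}"
    )
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


# Placeholder host and token; substitute real values from your secret scope.
request = feature_spec_request(
    "example-workspace.cloud.databricks.com",
    "risk_prod.credit.features_txn_7d",
    "dummy-token",
)

# To execute against a real workspace:
#   import json
#   with urllib.request.urlopen(request) as response:
#       payload = json.load(response)
```

Keeping request construction separate from the network call makes the helper easy to unit test without workspace access.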

Deploying with Databricks Asset Bundles

  1. Configure environment variables for the target workspace (see docs/DEPLOYMENT.md for the full set).
  2. Build the wheel and deploy:
    python -m build
    databricks bundle deploy -o infra/dabs/agent_bundle.yaml -t dev
  3. Run post-deploy jobs:
    databricks bundle run uc-policy -t dev   # apply UC tags & row filters
    databricks bundle run lakehouse-monitoring -t dev
    databricks bundle run orchestrator -t dev
  4. Promote to stage/prod by re-running with the appropriate -t target after updating endpoint/model/catalog variables.

Every bundle target expects the following to exist already: the registered Mosaic plan and agents, a secret scope containing the PAT, Unity Catalog row-filter and mask functions, and Lakeflow connectors.

Running the workflow

  • Scheduled orchestration: Attach jobs/run_orchestrator.py to a Databricks job pointing at the Mosaic plan. Provide ORCHESTRATOR_GOAL/ORCHESTRATOR_CONTEXT if you need ad hoc runs.
  • Agent telemetry: Each agent run creates a nested MLflow run with inputs, outputs, and metadata, simplifying risk/governance reviews.
  • Feature Serving: The FeatureAgent writes Delta tables, registers FeatureSpecs via REST, and binds them to RISK_DEFAULT_SERVING_ENDPOINT. Registration status is returned in the agent details.
  • Incident routing: MonitoringAgent polls serving metrics and Lakehouse Monitoring tables. If you set RISK_OPERATIONS__INCIDENT_WEBHOOK_URL, PagerDuty/Slack (or other webhook targets) receive JSON payloads describing the breach.
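
The incident-routing step can be pictured as two small functions: one compares observed metrics with their limits, the other wraps any breaches in a JSON webhook request. The sketch below is an assumption-laden illustration. The metric names, payload fields, and function names are invented for the example and do not describe the MonitoringAgent's actual payload schema.

```python
import json
import urllib.request


def find_breaches(metrics, thresholds):
    """Return one record per metric whose observed value exceeds its limit."""
    return [
        {"metric": name, "observed": metrics[name], "limit": limit}
        for name, limit in thresholds.items()
        if name in metrics and metrics[name] > limit
    ]


def build_incident_request(webhook_url, breaches, source="monitoring_agent"):
    """Build (but do not send) a JSON POST describing the breaches."""
    body = json.dumps({"source": source, "breaches": breaches}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical metrics and limits.
metrics = {"p95_latency_ms": 420.0, "null_rate": 0.01}
thresholds = {"p95_latency_ms": 300.0, "null_rate": 0.05}

breaches = find_breaches(metrics, thresholds)
incident = build_incident_request("https://example.invalid/webhook", breaches)
# Send with urllib.request.urlopen(incident) only when breaches is non-empty.
```

In this example only the latency metric breaches its limit, so the payload carries a single record. Guarding the send on a non-empty breach list avoids paging on-call for healthy runs.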

Customising the solution

  • Agents: Extend the existing agents or add new ones by updating risk_platform/agents/factory.py and RISK_ORCHESTRATOR__AGENTS. Make sure to register the new agent in Mosaic and include it in the plan.
  • Feature logic: Modify risk_platform/features/feature_views.py and adjust window definitions in default_feature_agent_config. Regenerate specs by re-running the FeatureAgent.
  • Lakeflow pipelines: Update risk_platform/ingestion/pipelines.py and the declarative bundle under ingestion/. Keep expectations aligned with the data contracts in risk_platform/data_contracts.
  • Governance policies: Adjust infra/uc/apply_uc_policies.py to plug in your row filters, masking functions, and tag vocabulary.
  • Operations: Point the incident notifier at your on-call system, and update runbooks in docs/RUNBOOK.md to reflect your escalation path.
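
Adding an agent typically means two things: making the class discoverable by name, and listing that name in RISK_ORCHESTRATOR__AGENTS. The sketch below shows one common way to wire this up with a registry. It is not the contents of risk_platform/agents/factory.py: the registry, the decorator, the stub classes, and the affordability_agent name are all hypothetical.

```python
import json

AGENT_REGISTRY = {}


def register_agent(name):
    """Class decorator that makes an agent discoverable by its configured name."""
    def decorator(cls):
        AGENT_REGISTRY[name] = cls
        return cls
    return decorator


@register_agent("prompt_policy_agent")
class PromptPolicyAgentStub:
    """Stand-in for an existing agent; the real class lives in the repository."""
    def run(self, context):
        return {"agent": "prompt_policy_agent", "status": "ok"}


@register_agent("affordability_agent")  # hypothetical new agent
class AffordabilityAgent:
    def run(self, context):
        return {"agent": "affordability_agent", "status": "ok"}


def build_agents(agents_env_value):
    """Instantiate agents in the order given by a JSON list of names."""
    names = json.loads(agents_env_value)
    unknown = [name for name in names if name not in AGENT_REGISTRY]
    if unknown:
        raise KeyError(f"Unregistered agents: {unknown}")
    return [AGENT_REGISTRY[name]() for name in names]


pipeline = build_agents('["prompt_policy_agent", "affordability_agent"]')
results = [agent.run(context={}) for agent in pipeline]
```

Failing fast on unknown names surfaces a typo in the environment variable at start-up rather than midway through an orchestration run.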

Reference documentation

For additional Databricks references:

Happy building!
