diff --git a/use-case-examples/Long-Context-Document-Analysis-with-Nemotron-Super/README.md b/use-case-examples/Long-Context-Document-Analysis-with-Nemotron-Super/README.md new file mode 100644 index 000000000..3a6cacd10 --- /dev/null +++ b/use-case-examples/Long-Context-Document-Analysis-with-Nemotron-Super/README.md @@ -0,0 +1,53 @@ +# Long-Context Document Analysis with Nemotron 3 Super + +Analyze large document collections in a single context window using Nemotron 3 Super's native 1M-token context capability - no chunking, no RAG pipeline required. + +## Overview + +This example demonstrates how to leverage Nemotron 3 Super's 1M native context window for practical document analysis tasks, progressing from single-document summarization to full cross-document synthesis: + +1. **Document Corpus Construction** - Build a realistic multi-document corpus for analysis +2. **Single-Document Analysis** - Summarize and extract key findings from individual documents +3. **Multi-Document Q&A** - Answer questions that require synthesizing information across multiple documents +4. **Cross-Document Synthesis** - Identify themes, contradictions, and connections across the full corpus +5. **Context Length Scaling** - Compare quality and latency at different context sizes (32K, 128K, 256K) + +## Models Used + +| Component | Model | Parameters | Context Window | Deployment | +|-----------|-------|------------|----------------|------------| +| **Document Analysis** | `nvidia/nemotron-3-super-120b-a12b` | 120B total / 12B active | **1M tokens (native)** | NVIDIA API or self-hosted (vLLM) | + +## Why Nemotron 3 Super for Long-Context Tasks? 
+ +- **1M native context window** - Trained with long-context extension via CPT methodology +- **Outperforms on RULER at 1M** - Higher accuracy than GPT-OSS and Qwen3.5 at maximum context +- **8x the context of Qwen3.5** (128K) - Process entire codebases, legal corpora, or research collections +- **Hybrid Mamba-Transformer MoE** architecture maintains quality across long sequences +- **5x throughput improvement** - Process large documents efficiently + +## Requirements + +- Python 3.10+ +- NVIDIA API Key ([get one here](https://build.nvidia.com/)) + +## Quick Start + +```bash +# Install dependencies +pip install openai tiktoken jupyter + +# Set your API key +export NVIDIA_API_KEY="your-key-here" + +# Run the notebook +jupyter notebook long_context_analysis_tutorial.ipynb +``` + +## What You'll Learn + +- How to structure prompts for effective long-context document analysis +- Building multi-document corpora that fit within context windows +- Techniques for cross-document synthesis without RAG +- How context length affects response quality and latency +- Best practices for instruction placement in long contexts diff --git a/use-case-examples/Long-Context-Document-Analysis-with-Nemotron-Super/long_context_analysis_tutorial.ipynb b/use-case-examples/Long-Context-Document-Analysis-with-Nemotron-Super/long_context_analysis_tutorial.ipynb new file mode 100644 index 000000000..9701998b0 --- /dev/null +++ b/use-case-examples/Long-Context-Document-Analysis-with-Nemotron-Super/long_context_analysis_tutorial.ipynb @@ -0,0 +1,808 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Long-Context Document Analysis with Nemotron 3 Super\n", + "\n", + "This notebook demonstrates how to use Nemotron 3 Super's **1M native context window** for practical document analysis tasks. 
Instead of chunking documents and building RAG pipelines, we load entire document collections into a single context window and let the model reason across all the content at once.\n", + "\n", + "## Models Used\n", + "\n", + "| Component | Model | Parameters | Context Window |\n", + "|-----------|-------|------------|----------------|\n", + "| **Document Analysis** | `nvidia/nemotron-3-super-120b-a12b` | 120B total / 12B active | **1M tokens (native)** |\n", + "\n", + "## What We'll Cover\n", + "\n", + "1. **Setup** - Install dependencies and configure the API\n", + "2. **Build Document Corpus** - Create a realistic multi-document corpus\n", + "3. **Single-Document Analysis** - Summarize and extract findings\n", + "4. **Multi-Document Q&A** - Answer questions requiring cross-document reasoning\n", + "5. **Cross-Document Synthesis** - Identify themes and contradictions\n", + "6. **Context Length Scaling** - Compare behavior at different context sizes\n", + "7. **Best Practices** - Prompt strategies for long contexts" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. Setup\n", + "\n", + "Install the required packages and configure the NVIDIA API endpoint." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%pip install -q openai tiktoken" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import time\n", + "import json\n", + "import tiktoken\n", + "from openai import OpenAI\n", + "\n", + "# Configure the client for NVIDIA API\n", + "# Alternative: Use a self-hosted vLLM server by changing base_url\n", + "client = OpenAI(\n", + " base_url=\"https://integrate.api.nvidia.com/v1\",\n", + " api_key=os.environ.get(\"NVIDIA_API_KEY\", \"your-key-here\"),\n", + ")\n", + "\n", + "MODEL = \"nvidia/nemotron-3-super-120b-a12b\" # NVIDIA API model ID\n", + "# For self-hosted vLLM:\n", + "# MODEL = \"nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16\"\n", + "\n", + "# Token counter for measuring context usage\n", + "enc = tiktoken.get_encoding(\"cl100k_base\")\n", + "\n", + "def count_tokens(text: str) -> int:\n", + " \"\"\"Approximate token count using cl100k_base encoding.\"\"\"\n", + " return len(enc.encode(text))\n", + "\n", + "def timed_completion(messages, max_tokens=2048, temperature=0.3):\n", + " \"\"\"Send a chat completion and return the response with timing info.\"\"\"\n", + " start = time.time()\n", + " response = client.chat.completions.create(\n", + " model=MODEL,\n", + " messages=messages,\n", + " max_tokens=max_tokens,\n", + " temperature=temperature,\n", + " )\n", + " elapsed = time.time() - start\n", + " content = response.choices[0].message.content\n", + " usage = response.usage\n", + " print(f\"Prompt tokens: {usage.prompt_tokens:,}\")\n", + " print(f\"Completion tokens: {usage.completion_tokens:,}\")\n", + " print(f\"Wall time: {elapsed:.1f}s\")\n", + " print(f\"Tokens/sec: {usage.completion_tokens / elapsed:.0f}\")\n", + " return content\n", + "\n", + "print(\"Setup complete.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. 
Build Document Corpus\n", + "\n", + "We create a synthetic corpus of technical design documents for a fictional distributed database project called **\"AuroraDB\"**. Each document covers a different subsystem, with deliberate cross-references, shared terminology, and a few intentional contradictions that only surface when reading multiple documents together.\n", + "\n", + "This simulates a real-world scenario: an engineer joining a project and needing to understand the full system from its documentation." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "DOCUMENTS = {\n", + " \"architecture_overview.md\": \"\"\"# AuroraDB Architecture Overview\n", + "\n", + "## Introduction\n", + "\n", + "AuroraDB is a distributed NewSQL database designed for global-scale transactional workloads. It combines the ACID guarantees of traditional relational databases with the horizontal scalability of NoSQL systems. The system is deployed across 14 data centers on 5 continents, serving over 2 million queries per second at peak load.\n", + "\n", + "## Core Design Principles\n", + "\n", + "1. **Strict Serializability**: All transactions observe a single total order, regardless of which data center they originate from. This is achieved through a hybrid clock protocol combining physical timestamps with logical sequence numbers.\n", + "\n", + "2. **Transparent Sharding**: Data is automatically partitioned across nodes using consistent hashing with virtual nodes. The shard map is maintained by the Placement Driver (PD) service, which rebalances shards based on load metrics collected every 30 seconds.\n", + "\n", + "3. **Multi-Version Concurrency Control (MVCC)**: Each write creates a new version of the row, tagged with the transaction's commit timestamp. Readers never block writers and vice versa. Old versions are garbage collected after the configured retention window (default: 72 hours).\n", + "\n", + "4. 
**Raft-Based Replication**: Each shard is replicated across 3 nodes (configurable to 5 for critical data) using the Raft consensus protocol. Leader election typically completes within 150ms of detecting a failure.\n", + "\n", + "## System Components\n", + "\n", + "### Query Layer\n", + "The query layer accepts SQL queries via a PostgreSQL-compatible wire protocol. It includes a cost-based optimizer that considers data locality, network latency between data centers, and the current load on each shard leader. Cross-shard queries use a two-phase commit protocol coordinated by the Transaction Manager.\n", + "\n", + "### Storage Engine\n", + "The storage engine uses a modified LSM-tree (Log-Structured Merge Tree) with tiered compaction. Write amplification is typically 10-15x, which we accept as a tradeoff for consistent write latency. The engine supports both row-oriented and columnar storage formats, selected per-table.\n", + "\n", + "### Consensus Layer\n", + "Built on a custom implementation of Multi-Raft, where multiple Raft groups share a single set of connections between nodes. This reduces the connection overhead from O(n * groups) to O(n). The consensus layer also handles membership changes, snapshot transfers, and log compaction.\n", + "\n", + "### Placement Driver (PD)\n", + "The PD is a cluster-level service that manages shard placement, load balancing, and cluster metadata. It runs as a 3-node Raft group itself. The PD collects heartbeats from all storage nodes every 10 seconds and makes placement decisions based on disk usage, CPU load, and network bandwidth.\n", + "\n", + "## Consistency Model\n", + "\n", + "AuroraDB provides strict serializability by default. For read-heavy workloads that can tolerate slightly stale data, clients can opt into \"timeline reads\" which read from the closest replica regardless of its replication lag. 
The maximum staleness for timeline reads is bounded by the `max_staleness_ms` parameter (default: 5000ms).\n", + "\n", + "Cross-region transactions use a protocol inspired by Google's Spanner TrueTime, but instead of relying on atomic clocks, we use a software-based uncertainty interval derived from NTP synchronization. The uncertainty window is typically 2-5ms within a data center and 10-50ms across regions.\n", + "\n", + "## Performance Characteristics\n", + "\n", + "- **Write latency (same region)**: p50 = 4ms, p99 = 15ms\n", + "- **Write latency (cross region)**: p50 = 80ms, p99 = 200ms\n", + "- **Read latency (leader)**: p50 = 1ms, p99 = 5ms\n", + "- **Read latency (timeline)**: p50 = 0.5ms, p99 = 2ms\n", + "- **Throughput per node**: ~50,000 QPS (mixed read/write)\n", + "\"\"\",\n", + "\n", + " \"storage_engine.md\": \"\"\"# AuroraDB Storage Engine Deep Dive\n", + "\n", + "## LSM-Tree Implementation\n", + "\n", + "The storage engine is built around a modified LSM-tree with the following layers:\n", + "\n", + "### MemTable\n", + "Incoming writes are first buffered in an in-memory skip list (the MemTable). When the MemTable reaches 64MB, it is frozen and a new MemTable is created. The frozen MemTable is then flushed to disk as an SSTable (Sorted String Table) in Level 0.\n", + "\n", + "We use a concurrent skip list implementation that allows multiple writers without locking. Each writer thread reserves a sequence number from an atomic counter before inserting into the skip list, ensuring a total order on writes.\n", + "\n", + "### Write-Ahead Log (WAL)\n", + "Every write is first appended to a WAL before being inserted into the MemTable. The WAL is synced to disk using `fdatasync()` by default. For higher throughput at the cost of durability, operators can enable group commit which batches WAL syncs every 1ms.\n", + "\n", + "The WAL is segmented into 128MB files. Old segments are deleted after the corresponding MemTable has been flushed to an SSTable. 
During recovery, we replay any WAL entries that were not yet flushed.\n", + "\n", + "### Compaction Strategy\n", + "We use a tiered compaction strategy (as opposed to leveled compaction used by systems like RocksDB). In tiered compaction:\n", + "\n", + "- Level 0 holds recently flushed SSTables (up to 4 files)\n", + "- When Level N has enough files, they are merged into a single file at Level N+1\n", + "- Each level is approximately 10x the size of the previous level\n", + "- Write amplification is 10-15x (compared to 20-30x for leveled compaction)\n", + "\n", + "The tradeoff is higher read amplification: a point lookup may need to check multiple files at each level. We mitigate this with Bloom filters (false positive rate: 0.01%) and a block cache that holds frequently accessed data blocks in memory.\n", + "\n", + "### Block Format\n", + "SSTables are divided into 16KB blocks. Each block contains:\n", + "- A sorted sequence of key-value pairs with prefix compression\n", + "- A restart index every 16 keys for binary search\n", + "- A CRC32 checksum for corruption detection\n", + "- Optional LZ4 or Zstd compression (Zstd by default for levels >= 2)\n", + "\n", + "### Columnar Storage\n", + "For analytical workloads, AuroraDB supports a columnar storage format inspired by Apache Parquet. Columnar tables store each column in a separate file with:\n", + "- Run-length encoding for repeated values\n", + "- Dictionary encoding for low-cardinality columns\n", + "- Delta encoding for timestamps and sequential integers\n", + "\n", + "Columnar tables are read-optimized and do not support point updates. Updates are applied through a merge-on-read approach where a row-oriented delta store is periodically merged into the columnar base files.\n", + "\n", + "## MVCC Implementation\n", + "\n", + "Each key-value pair in the storage engine is stored with a version timestamp. 
The full key format is:\n", + "\n", + "```\n", + "[user_key][timestamp][value_type]\n", + "```\n", + "\n", + "Where `timestamp` is encoded in descending order so that the most recent version appears first during iteration. `value_type` is either `Put` or `Delete` (tombstone).\n", + "\n", + "### Garbage Collection\n", + "Old versions are removed during compaction. A version is eligible for GC if:\n", + "1. There is a newer version of the same key\n", + "2. The version's timestamp is older than the GC safe point\n", + "3. No active transaction or snapshot references that version\n", + "\n", + "The GC safe point is computed as `current_time - retention_window`. The default retention window is 48 hours, which allows point-in-time recovery for up to 2 days.\n", + "\n", + "**Note**: The retention window was recently reduced from 72 hours to 48 hours in v3.2 to reduce storage overhead. Some documentation may still reference the old value.\n", + "\n", + "## Disk Space Management\n", + "\n", + "AuroraDB pre-allocates disk space in 1GB chunks to avoid filesystem fragmentation. The storage engine monitors available disk space and triggers an emergency compaction when free space drops below 15%. If free space drops below 5%, the node transitions to read-only mode and alerts the Placement Driver to migrate shards away.\n", + "\"\"\",\n", + "\n", + " \"transaction_protocol.md\": \"\"\"# AuroraDB Transaction Protocol\n", + "\n", + "## Overview\n", + "\n", + "AuroraDB implements a distributed transaction protocol that provides strict serializability across all shards and data centers. The protocol combines optimistic concurrency control for reads with pessimistic locking for writes.\n", + "\n", + "## Transaction Lifecycle\n", + "\n", + "### 1. Begin Transaction\n", + "When a client begins a transaction, the Transaction Manager (TM) assigns a start timestamp from the hybrid clock. This timestamp determines the snapshot the transaction will read from.\n", + "\n", + "### 2. 
Read Phase\n", + "All reads are served from the MVCC snapshot at the start timestamp. Reads do not acquire locks and never block. If a read encounters a key that has a pending write (a lock left by another transaction), it waits for the lock to be resolved rather than aborting.\n", + "\n", + "### 3. Write Phase\n", + "Writes are buffered locally until commit time. When the transaction commits, the TM sends a prewrite request to each shard that has pending writes. The prewrite:\n", + "- Acquires a lock on each key\n", + "- Writes the new value with a provisional timestamp\n", + "- Checks for write-write conflicts (another transaction has written to the same key since our start timestamp)\n", + "\n", + "If any prewrite fails due to a conflict, the entire transaction is aborted.\n", + "\n", + "### 4. Commit Phase\n", + "After all prewrites succeed, the TM obtains a commit timestamp (which must be greater than the start timestamp and all timestamps observed during the transaction). The TM then sends a commit message to the primary shard, which:\n", + "- Writes the commit record\n", + "- Replaces the provisional timestamp with the commit timestamp\n", + "- Releases all locks\n", + "\n", + "Secondary shards are notified asynchronously and apply the commit in the background.\n", + "\n", + "## Cross-Shard Transactions\n", + "\n", + "For transactions spanning multiple shards, AuroraDB uses a two-phase commit (2PC) protocol:\n", + "\n", + "**Phase 1 (Prepare)**: The TM sends prepare messages to all participating shards. Each shard votes to commit or abort based on its local conflict checks.\n", + "\n", + "**Phase 2 (Commit/Abort)**: If all shards vote to commit, the TM writes the commit decision to its local WAL and sends commit messages. 
If any shard votes to abort, all shards are told to rollback.\n", + "\n", + "### Handling Coordinator Failures\n", + "If the TM crashes during 2PC, the participating shards will hold their locks until the TM recovers and replays its decision log. To bound the lock hold time, each lock has a TTL of 10 seconds. After the TTL expires, other transactions can \"push\" the lock by contacting the TM to resolve the transaction's status.\n", + "\n", + "## Timestamp Oracle\n", + "\n", + "The hybrid clock used for timestamps combines:\n", + "- **Physical component**: Wall clock time from NTP-synchronized system clock\n", + "- **Logical component**: Monotonically increasing counter that handles clock skew\n", + "\n", + "The Timestamp Oracle (TSO) is a centralized service that allocates timestamps in batches of 10,000 to reduce network round trips. Each batch is pre-allocated to a specific data center, so local transactions can get timestamps without cross-region communication.\n", + "\n", + "### Clock Skew Handling\n", + "When a transaction spans multiple data centers, the commit timestamp must account for clock uncertainty. The TM adds the maximum observed NTP uncertainty (typically 10-50ms) to the commit timestamp to ensure that the commit appears after all operations in real time.\n", + "\n", + "This is similar to Google Spanner's TrueTime, but uses NTP instead of atomic clocks. The practical impact is that cross-region transactions have an additional latency equal to the uncertainty window.\n", + "\n", + "## Deadlock Detection\n", + "\n", + "AuroraDB uses a distributed deadlock detector that runs on the Placement Driver. Each storage node reports its wait-for edges (transaction A is waiting for transaction B's lock) to the PD every 500ms. The PD builds a global wait-for graph and breaks cycles by aborting the youngest transaction in the cycle.\n", + "\n", + "**Known limitation**: The 500ms reporting interval means deadlocks can persist for up to 1 second before detection. 
This is acceptable for our workload but may be problematic for latency-sensitive applications. A future improvement would be to use push-based reporting where nodes immediately notify the PD when a new wait edge is created.\n", + "\"\"\",\n", + "\n", + " \"replication_consensus.md\": \"\"\"# AuroraDB Replication and Consensus\n", + "\n", + "## Raft Implementation\n", + "\n", + "AuroraDB uses Multi-Raft for replication, where each shard is a separate Raft group. Our Raft implementation follows the paper closely with several optimizations:\n", + "\n", + "### Leader Election\n", + "- Election timeout: randomized between 1000-2000ms (increased from the paper's recommendation to reduce spurious elections in WAN deployments)\n", + "- Pre-vote protocol enabled to prevent disruptions from partitioned nodes\n", + "- Priority-based election: nodes in the same data center as the current leader get higher priority to maintain locality\n", + "\n", + "### Log Replication\n", + "- Batched log entries: up to 1MB per AppendEntries RPC\n", + "- Pipeline mode: the leader sends new entries without waiting for acknowledgment of previous ones\n", + "- Asynchronous log application: committed entries are applied to the state machine in a background thread\n", + "\n", + "### Snapshots\n", + "When a follower falls too far behind (more than 10,000 log entries), the leader sends a snapshot instead of replaying individual entries. Snapshots are created using the storage engine's checkpoint mechanism, which creates a point-in-time copy using hard links.\n", + "\n", + "## Multi-Region Replication\n", + "\n", + "For global deployments, AuroraDB supports two replication modes:\n", + "\n", + "### Synchronous Replication (Default)\n", + "The Raft leader waits for a majority of replicas to acknowledge before committing. 
In a 3-replica group spanning 3 data centers, this means the leader waits for one remote acknowledgment, adding one round-trip of cross-region latency to every write.\n", + "\n", + "### Follower Read\n", + "Followers can serve reads directly if the client specifies a `read_timestamp` and the follower has applied all entries up to that timestamp. This is used to implement timeline reads with bounded staleness.\n", + "\n", + "The follower checks its applied index against the leader's commit index (received in heartbeats) to determine if it is sufficiently up to date. If not, it waits for the entries to be applied or redirects the read to the leader.\n", + "\n", + "## Membership Changes\n", + "\n", + "Adding or removing replicas from a Raft group uses the joint consensus approach described in the Raft paper. The PD orchestrates membership changes to ensure that no more than one change is in progress per Raft group at any time.\n", + "\n", + "When a node is decommissioned:\n", + "1. The PD marks the node as \"draining\"\n", + "2. All Raft groups with replicas on that node are migrated to other nodes\n", + "3. The PD verifies all migrations are complete\n", + "4. The node is removed from the cluster\n", + "\n", + "## Conflict Resolution\n", + "\n", + "Since AuroraDB uses strict serializability, there are no write conflicts at the replication layer - all writes go through the Raft leader and are totally ordered. Read-write conflicts are handled at the transaction layer (see Transaction Protocol document).\n", + "\n", + "### Split Brain Prevention\n", + "Raft's quorum-based approach inherently prevents split brain. A leader can only commit entries if it has acknowledgment from a majority, and at most one leader can exist per term. 
We also use leader leases (5-second duration) to allow local reads without an additional round of Raft consensus, further improving read latency.\n", + "\"\"\",\n", + "\n", + " \"operational_runbook.md\": \"\"\"# AuroraDB Operational Runbook\n", + "\n", + "## Monitoring\n", + "\n", + "### Key Metrics to Watch\n", + "\n", + "**Storage Engine**\n", + "- `aurora_lsm_write_amplification`: Should be 10-15x. If consistently above 20x, check compaction settings.\n", + "- `aurora_memtable_size_bytes`: Monitor for unexpected growth. If MemTables aren't flushing, check disk I/O.\n", + "- `aurora_block_cache_hit_ratio`: Should be above 95%. Low hit ratio indicates working set exceeds cache size.\n", + "\n", + "**Transactions**\n", + "- `aurora_txn_conflict_rate`: Conflicts above 5% indicate contention. Consider range partitioning the hot keys.\n", + "- `aurora_txn_commit_latency_p99`: Cross-region commits should be under 250ms. Higher values suggest clock sync issues.\n", + "- `aurora_deadlock_count`: Should be near zero. Frequent deadlocks suggest lock ordering issues in the application.\n", + "\n", + "**Replication**\n", + "- `aurora_raft_leader_changes`: Frequent leader changes indicate network instability or node overload.\n", + "- `aurora_replication_lag_ms`: For timeline reads, lag should be under `max_staleness_ms` (default 5000ms).\n", + "- `aurora_snapshot_transfer_rate`: If snapshot transfers are frequent, followers may be chronically behind.\n", + "\n", + "## Common Operations\n", + "\n", + "### Scaling Out\n", + "Adding a new node:\n", + "1. Start the AuroraDB process with `--join=`\n", + "2. The node registers with the PD and waits for shard assignments\n", + "3. The PD schedules shard migrations based on the current load balance\n", + "4. 
Migration is online - the node starts serving reads for migrated shards immediately and writes after the Raft membership change completes\n", + "\n", + "### Backup and Recovery\n", + "AuroraDB supports incremental backups using SSTable-level snapshots. To perform a backup:\n", + "\n", + "```\n", + "aurora-ctl backup --dest=s3://backups/aurora/2024-01-15 --incremental\n", + "```\n", + "\n", + "The backup captures a consistent snapshot across all shards using a distributed snapshot protocol. Recovery restores from the latest full backup plus incremental deltas.\n", + "\n", + "**Note**: Point-in-time recovery is supported for up to 72 hours using the WAL archives. This differs from the MVCC retention window (48 hours) because WAL-based recovery replays transactions rather than relying on stored versions.\n", + "\n", + "### Region Failover\n", + "If an entire data center fails:\n", + "1. Raft leaders in the failed region will time out (1-2 seconds)\n", + "2. New leaders are elected in surviving regions\n", + "3. The PD detects the failed region and stops scheduling new shards there\n", + "4. Cross-region write latency increases as all writes now go to remote leaders\n", + "\n", + "Recovery time objective (RTO) for region failover: 5-10 seconds for Raft re-election, plus up to 30 seconds for the PD to update its shard map.\n", + "\n", + "## Troubleshooting\n", + "\n", + "### High Write Latency\n", + "1. Check `aurora_raft_proposal_pending`: High values mean the Raft pipeline is saturated\n", + "2. Check `aurora_lsm_level0_files`: If above 12, compaction can't keep up with write rate\n", + "3. Check WAL sync latency: Slow disk can bottleneck writes at the WAL sync step\n", + "4. Check for lock contention: Use `aurora-ctl txn list-locks` to find long-held locks\n", + "\n", + "### Stale Reads\n", + "If timeline reads return stale data beyond the configured `max_staleness_ms`:\n", + "1. Check `aurora_replication_lag_ms` on the follower serving the read\n", + "2. 
Verify network connectivity between the follower and leader\n", + "3. Check if the follower is overloaded (high CPU can delay log application)\n", + "4. Consider increasing `max_staleness_ms` or routing reads to the leader\n", + "\n", + "### Cluster Sizing Recommendations\n", + "- **Small** (< 100GB): 3 nodes, 1 region, 3 replicas per shard\n", + "- **Medium** (100GB - 1TB): 5-9 nodes, 1-2 regions, 3 replicas per shard\n", + "- **Large** (1TB - 10TB): 15-30 nodes, 2-3 regions, 3-5 replicas per shard\n", + "- **XL** (> 10TB): 50+ nodes, 3+ regions, 5 replicas for critical data\n", + "\"\"\",\n", + "\n", + " \"migration_guide_v3.md\": \"\"\"# AuroraDB v3.0 Migration Guide\n", + "\n", + "## Breaking Changes in v3.0\n", + "\n", + "### New Wire Protocol\n", + "v3.0 introduces a new binary wire protocol for inter-node communication, replacing the previous Protobuf-based protocol. The new protocol reduces serialization overhead by 40% and supports zero-copy reads.\n", + "\n", + "**Migration Impact**: All nodes must be upgraded simultaneously within a maintenance window. Mixed-version clusters are not supported for more than 1 hour during rolling upgrades.\n", + "\n", + "### Storage Format Changes\n", + "The SSTable format has been updated to v3 with the following changes:\n", + "- Block size increased from 8KB to 16KB for better compression ratios\n", + "- Added column statistics in the block footer for predicate pushdown\n", + "- New index block format with multi-level indexing for large SSTables\n", + "\n", + "v3.0 can read v2 format SSTables, but once a compaction runs, the output will be in v3 format. 
There is no downgrade path - back up your data before upgrading.\n", + "\n", + "### Configuration Changes\n", + "\n", + "| Old Parameter | New Parameter | Notes |\n", + "|---|---|---|\n", + "| `gc_retention_hours=72` | `gc_retention_hours=48` | Default reduced; adjust if you need longer retention |\n", + "| `raft_election_timeout=300ms` | `raft_election_timeout=1000ms` | Increased for WAN stability |\n", + "| `compaction_style=level` | `compaction_style=tier` | Tiered compaction is now default |\n", + "| `max_staleness_seconds=10` | `max_staleness_ms=5000` | Changed to milliseconds |\n", + "\n", + "### Deprecated Features\n", + "- **Synchronous replication mode \"all\"**: Previously, you could require ALL replicas to acknowledge a write. This mode is removed in v3.0 because it provides little benefit over majority quorum and significantly increases tail latency. Use `replication_factor=5` with majority quorum instead.\n", + "- **Table-level TTL**: Replaced by row-level TTL which offers finer granularity.\n", + "\n", + "## Upgrade Procedure\n", + "\n", + "1. **Pre-flight checks**\n", + " - Verify all nodes are running v2.8 or later (direct upgrade from v2.7 or earlier is not supported)\n", + " - Run `aurora-ctl cluster check --pre-upgrade` to validate cluster health\n", + " - Ensure at least 30% free disk space on each node (for SSTable format migration)\n", + "\n", + "2. **Backup**\n", + " - Take a full cluster backup: `aurora-ctl backup --dest=s3://backups/pre-v3-upgrade --full`\n", + " - Verify the backup can be restored in a test environment\n", + "\n", + "3. **Upgrade PD nodes first**\n", + " - Stop PD leader, upgrade binary, restart\n", + " - Wait for PD leader election (< 5 seconds)\n", + " - Upgrade remaining PD nodes one at a time\n", + "\n", + "4. 
**Upgrade storage nodes**\n", + " - Upgrade one node at a time, waiting for all Raft groups to have full quorum before proceeding\n", + " - Monitor `aurora_raft_under_replicated_groups` - should return to 0 between each node upgrade\n", + " - Expected time per node: 2-5 minutes\n", + "\n", + "5. **Post-upgrade**\n", + " - Run `aurora-ctl cluster check --post-upgrade`\n", + " - Verify SSTable format migration is progressing (check `aurora_sstable_v3_ratio`)\n", + " - SSTable migration completes during normal compaction cycles (1-7 days depending on write volume)\n", + "\"\"\"\n", + "}\n", + "\n", + "# Build the full corpus\n", + "corpus_text = \"\"\n", + "for filename, content in DOCUMENTS.items():\n", + " corpus_text += f\"\\n{'='*80}\\n\"\n", + " corpus_text += f\"FILE: {filename}\\n\"\n", + " corpus_text += f\"{'='*80}\\n\"\n", + " corpus_text += content\n", + "\n", + "total_tokens = count_tokens(corpus_text)\n", + "print(f\"Corpus: {len(DOCUMENTS)} documents\")\n", + "print(f\"Total characters: {len(corpus_text):,}\")\n", + "print(f\"Approximate tokens: {total_tokens:,}\")\n", + "print(f\"\\nDocuments:\")\n", + "for name, content in DOCUMENTS.items():\n", + " tokens = count_tokens(content)\n", + " print(f\" {name}: {tokens:,} tokens\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. Single-Document Analysis\n", + "\n", + "Start with a simple task: summarize the architecture overview document and extract key design decisions." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "single_doc = DOCUMENTS[\"architecture_overview.md\"]\n", + "\n", + "messages = [\n", + " {\n", + " \"role\": \"system\",\n", + " \"content\": \"You are a senior software architect reviewing technical documentation. 
Be precise and cite specific details from the document.\"\n", + " },\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": f\"\"\"Analyze the following architecture document and provide:\n", + "\n", + "1. A 3-sentence executive summary\n", + "2. The top 3 design tradeoffs (what was chosen and what was given up)\n", + "3. Any potential concerns or risks you see in the design\n", + "\n", + "Document:\n", + "{single_doc}\"\"\"\n", + " }\n", + "]\n", + "\n", + "print(f\"Context size: {count_tokens(str(messages)):,} tokens\\n\")\n", + "response = timed_completion(messages)\n", + "print(f\"\\n{response}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. Multi-Document Q&A\n", + "\n", + "Now load the **entire corpus** into context and ask questions that require synthesizing information across multiple documents. This is where long context shines - the model can find connections that would be missed if each document were processed separately." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "messages = [\n", + " {\n", + " \"role\": \"system\",\n", + " \"content\": \"You are a senior engineer doing a thorough review of a distributed database's documentation. You have access to the complete documentation set. When answering, cite the specific document and section where you found the information.\"\n", + " },\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": f\"\"\"Here is the complete documentation for AuroraDB:\n", + "\n", + "{corpus_text}\n", + "\n", + "---\n", + "\n", + "Answer these questions that require reasoning across multiple documents:\n", + "\n", + "1. The architecture overview says MVCC retention is 72 hours, but another document says 48 hours. Which is correct and why is there a discrepancy?\n", + "\n", + "2. The operational runbook mentions point-in-time recovery for 72 hours. Is this consistent with the MVCC retention window? 
Explain the relationship between WAL-based recovery and MVCC retention.\n", + "\n", + "3. Trace the full lifecycle of a cross-region write from the client's perspective, referencing the relevant component from each document (query layer, transaction protocol, storage engine, replication).\n", + "\n", + "4. What happens to in-flight transactions if a region fails? Synthesize information from the transaction protocol and operational runbook to give a complete answer.\"\"\"\n", + " }\n", + "]\n", + "\n", + "print(f\"Context size: {count_tokens(str(messages)):,} tokens\\n\")\n", + "response = timed_completion(messages, max_tokens=3000)\n", + "print(f\"\\n{response}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The corpus contains an **intentional contradiction**: the architecture overview states 72 hours, while the storage engine document states 48 hours (with a note explaining the change). Check that the answer to question 1 names the second document and explains why the value changed. This kind of cross-document inconsistency detection is valuable in real-world documentation review." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5. Cross-Document Synthesis\n", + "\n", + "Ask the model to identify emergent patterns, architectural concerns, and inconsistencies that only become visible when reading the full documentation set together." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "messages = [\n", + " {\n", + " \"role\": \"system\",\n", + " \"content\": \"You are a distributed systems expert performing a documentation audit. Your goal is to find inconsistencies, gaps, and architectural risks that span multiple documents. Be specific and cite exact values.\"\n", + " },\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": f\"\"\"Perform a comprehensive documentation audit of AuroraDB. 
Here is the complete documentation:\n", + "\n", + "{corpus_text}\n", + "\n", + "---\n", + "\n", + "Produce an audit report with these sections:\n", + "\n", + "### 1. Documentation Inconsistencies\n", + "Find all places where different documents state conflicting information. For each inconsistency, cite the exact values from each document.\n", + "\n", + "### 2. Missing Documentation\n", + "Identify topics that are referenced in one document but never fully explained elsewhere. What documentation gaps exist?\n", + "\n", + "### 3. Architectural Risk Assessment\n", + "Based on the full documentation set, identify the top 3 architectural risks. For each risk, explain which components are involved and what could go wrong.\n", + "\n", + "### 4. Configuration Dependency Map\n", + "Map out configuration parameters that affect multiple components. Which settings have cascading effects across the system?\"\"\"\n", + " }\n", + "]\n", + "\n", + "print(f\"Context size: {count_tokens(str(messages)):,} tokens\\n\")\n", + "response = timed_completion(messages, max_tokens=4000)\n", + "print(f\"\\n{response}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 6. Context Length Scaling\n", + "\n", + "Let's measure how the model performs with different amounts of context. We'll test the same cross-document question with progressively more documents to see how context size affects response quality and latency." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "QUESTION = \"\"\"Based on the documents provided, what is the MVCC garbage collection retention window? 
\n", + "If there are conflicting values, identify all of them and explain which is correct.\"\"\"\n", + "\n", + "doc_keys = list(DOCUMENTS.keys())\n", + "results = []\n", + "\n", + "# Test with increasing numbers of documents\n", + "for n_docs in [1, 2, 3, len(doc_keys)]:\n", + "    subset = dict(list(DOCUMENTS.items())[:n_docs])\n", + "    subset_text = \"\"\n", + "    for filename, doc_text in subset.items():\n", + "        subset_text += f\"\\n{'='*60}\\nFILE: {filename}\\n{'='*60}\\n{doc_text}\"\n", + "\n", + "    messages = [\n", + "        {\"role\": \"system\", \"content\": \"Answer precisely, citing specific documents.\"},\n", + "        {\"role\": \"user\", \"content\": f\"Documents:\\n{subset_text}\\n\\n{QUESTION}\"}\n", + "    ]\n", + "\n", + "    # Rough estimate only; the API's usage.prompt_tokens below is authoritative\n", + "    tokens = count_tokens(str(messages))\n", + "    print(f\"\\n{'='*60}\")\n", + "    print(f\"Test: {n_docs} document(s) | ~{tokens:,} tokens\")\n", + "    print(f\"Documents: {', '.join(subset.keys())}\")\n", + "    print(f\"{'='*60}\")\n", + "\n", + "    # perf_counter is a monotonic clock, better suited to measuring intervals\n", + "    start = time.perf_counter()\n", + "    response = client.chat.completions.create(\n", + "        model=MODEL,\n", + "        messages=messages,\n", + "        max_tokens=1024,\n", + "        temperature=0.1,\n", + "    )\n", + "    elapsed = time.perf_counter() - start\n", + "    # message.content may be None (e.g. if the token budget runs out first)\n", + "    answer = response.choices[0].message.content or \"\"\n", + "    usage = response.usage\n", + "\n", + "    results.append({\n", + "        \"n_docs\": n_docs,\n", + "        \"prompt_tokens\": usage.prompt_tokens,\n", + "        \"completion_tokens\": usage.completion_tokens,\n", + "        \"latency_s\": elapsed,\n", + "        \"tokens_per_sec\": usage.completion_tokens / elapsed,\n", + "    })\n", + "\n", + "    print(f\"Prompt: {usage.prompt_tokens:,} tokens | Completion: {usage.completion_tokens:,} tokens\")\n", + "    print(f\"Latency: {elapsed:.1f}s | Speed: {usage.completion_tokens / elapsed:.0f} tok/s\")\n", + "    print(f\"\\nResponse:\\n{answer[:500]}...\" if len(answer) > 500 else f\"\\nResponse:\\n{answer}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Summary 
table\n", + "print(f\"\\n{'Context Scaling Summary':=^60}\")\n", + "print(f\"{'Docs':<6} {'Prompt Tokens':<16} {'Latency (s)':<14} {'Tok/s':<10}\")\n", + "print(\"-\" * 46)\n", + "for r in results:\n", + "    print(f\"{r['n_docs']:<6} {r['prompt_tokens']:<16,} {r['latency_s']:<14.1f} {r['tokens_per_sec']:<10.0f}\")\n", + "\n", + "if len(results) >= 2:\n", + "    ratio = results[-1][\"latency_s\"] / results[0][\"latency_s\"]\n", + "    token_ratio = results[-1][\"prompt_tokens\"] / results[0][\"prompt_tokens\"]\n", + "    print(f\"\\nContext grew {token_ratio:.1f}x, latency grew {ratio:.1f}x\")\n", + "    # Only report sub-linear scaling when the measurements actually show it\n", + "    if ratio < token_ratio:\n", + "        print(\"(Sub-linear latency scaling: latency grew more slowly than context.)\")\n", + "    else:\n", + "        print(\"(Latency grew at least as fast as context in this run; single runs are noisy, so repeat before drawing conclusions.)\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Key Takeaway**: With 1 document (architecture overview only), the model can only report 72 hours as the retention window, because that is the only value it has seen. With the full corpus, it should identify the 48-hour update from v3.2 and explain the discrepancy. **More relevant context leads to more accurate answers** - this is the core value proposition of long-context models." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 7. Best Practices for Long-Context Prompting\n", + "\n", + "### Instruction Placement\n", + "\n", + "Where you place your instructions relative to the documents can affect the answer. Let's test the same question with instructions at the beginning vs. the end." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Same question and same constraint in both prompts, so only the placement differs\n", + "PLACEMENT_QUESTION = \"What is the deadlock detection interval and what are its limitations?\"\n", + "\n", + "# Instruction BEFORE documents\n", + "prompt_before = f\"\"\"Using ONLY information from the documents below, answer this question: {PLACEMENT_QUESTION}\n", + "\n", + "{corpus_text}\"\"\"\n", + "\n", + "# Instruction AFTER documents\n", + "prompt_after = f\"\"\"{corpus_text}\n", + "\n", + "---\n", + "\n", + "Using ONLY information from the documents above, answer this question: {PLACEMENT_QUESTION}\"\"\"\n", + "\n", + "print(\"=== Instructions BEFORE documents ===\")\n", + "response_before = timed_completion(\n", + " [{\"role\": \"user\", \"content\": prompt_before}],\n", + " max_tokens=512\n", + ")\n", + "print(f\"\\n{response_before}\")\n", + "\n", + "print(\"\\n\\n=== Instructions AFTER documents ===\")\n", + "response_after = timed_completion(\n", + " [{\"role\": \"user\", \"content\": prompt_after}],\n", + " max_tokens=512\n", + ")\n", + "print(f\"\\n{response_after}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Tips for Long-Context Document Analysis\n", + "\n", + "1. **Use clear document delimiters**: Separate documents with headers (`FILE: name.md`) and separator lines. This helps the model attribute information to specific sources.\n", + "\n", + "2. **Place instructions after documents**: For long contexts, putting the question after the documents tends to produce more thorough answers since the instructions are closest to where the model generates its response.\n", + "\n", + "3. **Ask for citations**: Requesting that the model cite specific documents and sections tends to improve accuracy and makes responses verifiable.\n", + "\n", + "4. **Use the system prompt for role and behavior**: Define the model's expertise and analysis style in the system message. 
Put the actual content and questions in the user message.\n", + "\n", + "5. **Leverage contradiction detection**: Long-context models excel at finding inconsistencies across documents. Explicitly ask for contradictions when auditing documentation.\n", + "\n", + "6. **Start with focused questions, then broaden**: Begin with specific questions to validate that the model is reading the documents correctly, then move to open-ended synthesis tasks.\n", + "\n", + "7. **Consider context length vs. cost**: While 1M tokens are available, use only what you need. The scaling test above shows how answer quality and latency change as context grows - use it to find the right tradeoff for your use case." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Next Steps\n", + "\n", + "- **Scale up the corpus**: Replace the synthetic documents with your own documentation, codebase, or research papers\n", + "- **Combine with tool calling**: Use the [Agentic Tool-Calling example](../Agentic-Tool-Calling-with-Nemotron-Super/) to build agents that dynamically load documents into context\n", + "- **Try different reasoning modes**: Nemotron 3 Super supports `reasoning-off`, `regular`, and `low-effort` thinking modes - experiment with these for different analysis tasks\n", + "- **Deploy with vLLM**: For production workloads, deploy Nemotron 3 Super with vLLM using the [deployment cookbook](../../usage-cookbook/Nemotron-3-Super/)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.10.0" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}