79 changes: 78 additions & 1 deletion README.md
@@ -41,6 +41,10 @@ BoltAI is a compact, local-first AI agent implemented in Rust with a companion m
- Query CLI — simple commands for search, summarization, and diagnostic output
- macOS SwiftUI front-end — drag/drop indexing, chat-style interface, and previewed document snippets
- PDF extraction support and safety measures to avoid dumping raw documents in prompts or UI
- **NLP capabilities** — Named Entity Recognition, Sentiment Analysis, and Text Summarization
  - Pattern-based NER for extracting names, locations, organizations, dates, emails, and monetary values
  - Lexicon-based sentiment analysis (positive, neutral, negative)
  - Extractive text summarization using sentence scoring

## Quickstart

@@ -78,6 +82,49 @@ BoltAI is a compact, local-first AI agent implemented in Rust with a companion m

```bash
# or open in Xcode to run the app target and inspect the UI
```

### Use NLP features

#### Named Entity Recognition (NER)

Extract entities like names, locations, organizations, dates, emails, and monetary values from text:

```bash
# Analyze a single file
./target/release/boltai ner -i document.txt

# Analyze all files in a directory
./target/release/boltai ner -i /path/to/docs

# Save results to a file
./target/release/boltai ner -i document.txt -o entities.txt
```
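
The `ner` pass is pattern-based rather than model-based: regular expressions describe entity shapes, and each pattern carries a confidence. Below is a minimal sketch of that idea, assuming the `regex` crate; the patterns, fixed scores, and `Entity` struct are illustrative stand-ins, not the exact contents of `src/nlp/ner.rs`.

```rust
use regex::Regex;

// Illustrative record; the real module exposes `word`, `label`, and `score`.
struct Entity {
    word: String,
    label: &'static str,
    score: f64,
}

fn extract_entities(text: &str) -> Vec<Entity> {
    // Unambiguous shapes (emails) get a higher fixed confidence than loose
    // ones (capitalized word pairs, which also match places and products).
    let patterns: [(&str, &str, f64); 3] = [
        (r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}", "EMAIL", 0.95),
        (r"\$\d[\d,]*(?:\.\d+)?", "MONEY", 0.90),
        (r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", "PERSON", 0.75),
    ];
    let mut out = Vec::new();
    for (pattern, label, score) in patterns {
        let re = Regex::new(pattern).expect("static pattern compiles");
        for m in re.find_iter(text) {
            out.push(Entity { word: m.as_str().to_string(), label, score });
        }
    }
    out
}

fn main() {
    let text = "Please ask John Smith to email john.smith@example.com about the $150,000 offer.";
    for e in extract_entities(text) {
        println!(" - {} ({}): score {:.3}", e.word, e.label, e.score);
    }
}
```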

#### Sentiment Analysis

Classify the sentiment of a document as positive, neutral, or negative:

```bash
# Analyze a single file
./target/release/boltai sentiment -i review.txt

# Batch analyze multiple files
./target/release/boltai sentiment -i /path/to/reviews -o sentiment_results.txt
```
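
`sentiment` is lexicon-based: the text is scored by counting hits against positive and negative word lists, and the balance is mapped to a label. A minimal sketch of that approach; the word lists and thresholds below are illustrative, not the actual lexicon in `src/nlp/sentiment.rs`.

```rust
fn analyze_sentiment(text: &str) -> (&'static str, f64) {
    // Tiny stand-in lexicons; the real ones are larger.
    let positive = ["good", "great", "excellent", "love", "fast"];
    let negative = ["bad", "poor", "terrible", "hate", "slow"];

    let (mut pos, mut neg) = (0usize, 0usize);
    for word in text.split(|c: char| !c.is_alphanumeric()) {
        let w = word.to_lowercase();
        if positive.contains(&w.as_str()) { pos += 1; }
        if negative.contains(&w.as_str()) { neg += 1; }
    }
    if pos + neg == 0 {
        return ("Neutral", 0.5); // no lexicon hits at all
    }

    // Score is the positive share of all hits; thresholds pick the label.
    let score = pos as f64 / (pos + neg) as f64;
    let label = if score > 0.6 {
        "Positive"
    } else if score < 0.4 {
        "Negative"
    } else {
        "Neutral"
    };
    (label, score)
}

fn main() {
    let (label, score) =
        analyze_sentiment("Great tool: fast indexing and excellent docs, though setup was slow.");
    println!(" - Label: {}, Score: {:.3}", label, score);
}
```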

#### Text Summarization

Generate extractive summaries of documents:

```bash
# Summarize a single file
./target/release/boltai summarize -i article.txt

# Summarize multiple files in a directory
./target/release/boltai summarize -i /path/to/articles -o summaries.txt
```
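
`summarize` is extractive: sentences are scored and the top scorers are returned verbatim, in document order. The sketch below uses document-level word frequency as the sentence score, which is one common choice; the actual scoring in `src/nlp/summarization.rs` may weight things differently.

```rust
use std::collections::HashMap;

// Lowercase and strip punctuation so "Vision" and "vision." count together.
fn norm(word: &str) -> String {
    word.chars().filter(|c| c.is_alphanumeric()).collect::<String>().to_lowercase()
}

fn summarize(text: &str, max_sentences: usize) -> String {
    // Naive sentence split on periods; good enough for a sketch.
    let sentences: Vec<&str> = text.split('.').map(str::trim).filter(|s| !s.is_empty()).collect();

    // Word frequencies over the whole document.
    let mut freq: HashMap<String, f64> = HashMap::new();
    for w in text.split_whitespace() {
        *freq.entry(norm(w)).or_insert(0.0) += 1.0;
    }

    // Score each sentence by the mean frequency of its words.
    let mut scored: Vec<(usize, f64)> = sentences
        .iter()
        .enumerate()
        .map(|(i, s)| {
            let weights: Vec<f64> = s
                .split_whitespace()
                .map(|w| freq.get(&norm(w)).copied().unwrap_or(0.0))
                .collect();
            (i, weights.iter().sum::<f64>() / weights.len().max(1) as f64)
        })
        .collect();

    // Keep the top sentences, then restore original document order.
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    let mut keep: Vec<usize> = scored.into_iter().take(max_sentences).map(|(i, _)| i).collect();
    keep.sort_unstable();
    keep.iter().map(|&i| format!("{}.", sentences[i])).collect::<Vec<_>>().join(" ")
}

fn main() {
    let text = "Deep learning drives vision and language systems. \
                Some results are incremental. Deep learning also powers search.";
    println!("{}", summarize(text, 2));
}
```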

**Supported file formats**: `.txt`, `.md`, `.csv`, `.json`, `.pdf`

## Example output (CLI)

After indexing, `boltai query` returns top-k similar documents and a short summary. Example (truncated):
@@ -93,15 +140,45 @@ BoltAI is a compact, local-first AI agent implemented in Rust with a companion m

```
BoltAI demonstrates a privacy-first local retrieval pipeline that indexes developer documentation and supports fast summarization and search. It uses TF-IDF for initial vectorization and provides clear extension points for embeddings and LLM-based abstraction.
```

### NLP Feature Examples

**Named Entity Recognition output:**
```
Named Entities found in document.txt:
- John Smith (PERSON): score 0.750
- [email protected] (EMAIL): score 0.950
- New York (LOCATION): score 0.850
- Microsoft Corporation (ORGANIZATION): score 0.800
- $150,000 (MONEY): score 0.900
- Jan 15, 2024 (DATE): score 0.900
```

**Sentiment Analysis output:**
```
Sentiment analysis for review.txt:
- Label: Positive, Score: 0.857
```

**Text Summarization output:**
```
Summary of article.txt:
Artificial intelligence has become one of the most transformative technologies.
Deep learning has achieved remarkable breakthroughs in computer vision and natural
language processing. Machine learning algorithms optimize trading strategies and
detect fraudulent transactions.
```

## Project architecture

- Rust CLI (`src/main.rs`): walks directories, extracts text (including PDFs), computes TF-IDF vectors, and writes `boltai_index.json`. (A sketch of the TF-IDF core follows this list.)
- NLP module (`src/nlp/`): provides pattern-based NER, lexicon-based sentiment analysis, and extractive text summarization.
- mac-ui SwiftUI: orchestrates indexing runs, loads a capped preview of index docs (to avoid huge JSON parsing on the main thread), and sends queries to the CLI.
- Extensibility: The CLI prompt layer is isolated to make it easy to swap the query strategy (keywords → embeddings → hybrid retrieval-augmented generation). NLP features use lightweight rule-based approaches but can be upgraded to ML models (rust-bert) when libtorch is available.
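
For orientation, the retrieval core reduces to TF-IDF weighting plus a similarity ranking over the indexed documents. A self-contained sketch, assuming cosine similarity as the ranking metric (the real pipeline adds `WORD_RE` tokenization, `boltai_index.json` persistence, and the Ollama prompt layer):

```rust
use std::collections::{HashMap, HashSet};

/// TF-IDF weights per document: term frequency scaled by log inverse
/// document frequency.
fn tfidf_vectors(docs: &[Vec<String>]) -> Vec<HashMap<String, f64>> {
    let n = docs.len() as f64;

    // Document frequency: in how many documents does each term occur?
    let mut df: HashMap<&str, f64> = HashMap::new();
    for doc in docs {
        for term in doc.iter().map(String::as_str).collect::<HashSet<_>>() {
            *df.entry(term).or_insert(0.0) += 1.0;
        }
    }

    docs.iter()
        .map(|doc| {
            let mut v: HashMap<String, f64> = HashMap::new();
            for term in doc {
                *v.entry(term.clone()).or_insert(0.0) += 1.0;
            }
            for (term, w) in v.iter_mut() {
                *w = (*w / doc.len() as f64) * (n / df[term.as_str()]).ln();
            }
            v
        })
        .collect()
}

/// Cosine similarity between two sparse vectors.
fn cosine(a: &HashMap<String, f64>, b: &HashMap<String, f64>) -> f64 {
    let dot: f64 = a.iter().filter_map(|(k, va)| b.get(k).map(|vb| va * vb)).sum();
    let norm = |v: &HashMap<String, f64>| v.values().map(|x| x * x).sum::<f64>().sqrt();
    let denom = norm(a) * norm(b);
    if denom == 0.0 { 0.0 } else { dot / denom }
}

fn main() {
    let docs: Vec<Vec<String>> = [
        "rust local index search",
        "swiftui front end preview",
        "rust tfidf search query",
    ]
    .iter()
    .map(|d| d.split_whitespace().map(String::from).collect())
    .collect();

    let vecs = tfidf_vectors(&docs);
    println!("sim(doc0, doc2) = {:.3}", cosine(&vecs[0], &vecs[2]));
}
```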

## Design decisions & trade-offs

- TF-IDF first: fast to compute, explainable, and sufficient for small-to-medium corpora. Replacing TF-IDF with dense embeddings is an intended next step for semantic search.
- Rule-based NLP: regex patterns for NER, word lexicons for sentiment analysis, and sentence scoring for summarization. Fast, with no model downloads or external services, but less accurate than ML models; can be upgraded to rust-bert/transformers when libtorch is available in the environment (see the sketch after this list).
- Local-first: prioritizes data privacy and low-latency responses at the expense of requiring local compute resources.
- Safety: the UI and CLI avoid including full raw documents in prompts and no longer print raw text as a fallback. The project logs prompts to a local debug file for reproducible tuning.
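
As a sense of that upgrade path: with libtorch installed, the lexicon scorer could be swapped for rust-bert's sentiment pipeline. A hedged sketch, assuming rust-bert's `SentimentModel` API; this is not wired into BoltAI today.

```rust
// Requires the `rust-bert` crate and a working libtorch installation.
use rust_bert::pipelines::sentiment::SentimentModel;

fn main() -> anyhow::Result<()> {
    // Downloads and caches a pretrained sentiment model on first use.
    let model = SentimentModel::new(Default::default())?;
    let inputs = ["Great tool: fast indexing and excellent docs."];
    for s in model.predict(&inputs) {
        println!(" - Label: {:?}, Score: {:.3}", s.polarity, s.score);
    }
    Ok(())
}
```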

220 changes: 220 additions & 0 deletions src/main.rs
@@ -15,6 +15,8 @@ use once_cell::sync::Lazy;
use serde::{Deserialize, Serialize};
use walkdir::WalkDir;

mod nlp;

static WORD_RE: Lazy<Regex> = Lazy::new(|| Regex::new(r"[a-zA-Z0-9']+").unwrap());

#[derive(Parser)]
@@ -43,6 +45,27 @@ enum Commands {
        #[arg(short = 'm', long = "model")]
        model: Option<String>,
    },
    /// Extract named entities from text files
    Ner {
        #[arg(short, long, help = "Input file path or directory")]
        input: PathBuf,
        #[arg(short, long, help = "Output file for results (optional)")]
        output: Option<PathBuf>,
    },
    /// Analyze sentiment of text files
    Sentiment {
        #[arg(short, long, help = "Input file path or directory")]
        input: PathBuf,
        #[arg(short, long, help = "Output file for results (optional)")]
        output: Option<PathBuf>,
    },
    /// Summarize text from files
    Summarize {
        #[arg(short, long, help = "Input file path or directory")]
        input: PathBuf,
        #[arg(short, long, help = "Output file for results (optional)")]
        output: Option<PathBuf>,
    },
}

#[derive(Serialize, Deserialize, Debug)]
@@ -403,11 +426,208 @@ fn query_with_ollama(index_file: &Path, q: &str, k: usize, model_override: Optio
    }
}

/// Runs pattern-based NER over a single file, or over every supported file
/// under a directory, printing results or writing them to `output` when given.
fn handle_ner(input: &Path, output: Option<&Path>) -> Result<()> {
    use std::io::Write;

    if input.is_file() {
        println!("Analyzing file: {}", input.display());
        let entities = nlp::extract_entities(input)?;

        let mut result = format!("Named Entities found in {}:\n", input.display());
        for entity in &entities {
            result.push_str(&format!(" - {} ({}): score {:.3}\n",
                entity.word, entity.label, entity.score));
        }

        if let Some(out_path) = output {
            let mut file = File::create(out_path)?;
            file.write_all(result.as_bytes())?;
            println!("Results written to {}", out_path.display());
        } else {
            print!("{}", result);
        }
    } else if input.is_dir() {
        // File types the NLP commands process; note the extension match is
        // case-sensitive.
        let allowed_exts = ["txt", "md", "csv", "json", "pdf"];
        let files: Vec<PathBuf> = WalkDir::new(input)
            .into_iter()
            .filter_map(|e| e.ok())
            .filter(|e| e.file_type().is_file())
            .filter(|e| {
                e.path()
                    .extension()
                    .and_then(|s| s.to_str())
                    .map(|ext| allowed_exts.contains(&ext))
                    .unwrap_or(false)
            })
            .map(|e| e.path().to_path_buf())
            .collect();

        let mut all_results = String::new();
        for file_path in files {
            println!("Analyzing: {}", file_path.display());
            match nlp::extract_entities(&file_path) {
                Ok(entities) => {
                    all_results.push_str(&format!("\nFile: {}\n", file_path.display()));
                    for entity in &entities {
                        all_results.push_str(&format!(" - {} ({}): score {:.3}\n",
                            entity.word, entity.label, entity.score));
                    }
                }
                Err(e) => {
                    eprintln!("Error processing {}: {}", file_path.display(), e);
                }
            }
        }

        if let Some(out_path) = output {
            let mut file = File::create(out_path)?;
            file.write_all(all_results.as_bytes())?;
            println!("Results written to {}", out_path.display());
        } else {
            print!("{}", all_results);
        }
    } else {
        return Err(anyhow!("Input path does not exist or is not a file/directory"));
    }

    Ok(())
}

/// Runs lexicon-based sentiment analysis, with the same file/directory
/// handling as `handle_ner`.
fn handle_sentiment(input: &Path, output: Option<&Path>) -> Result<()> {
    use std::io::Write;

    if input.is_file() {
        println!("Analyzing sentiment of file: {}", input.display());
        let sentiments = nlp::analyze_sentiment(input)?;

        let mut result = format!("Sentiment analysis for {}:\n", input.display());
        for sentiment in &sentiments {
            result.push_str(&format!(" - Label: {}, Score: {:.3}\n",
                sentiment.label, sentiment.score));
        }

        if let Some(out_path) = output {
            let mut file = File::create(out_path)?;
            file.write_all(result.as_bytes())?;
            println!("Results written to {}", out_path.display());
        } else {
            print!("{}", result);
        }
    } else if input.is_dir() {
        let allowed_exts = ["txt", "md", "csv", "json", "pdf"];
        let files: Vec<PathBuf> = WalkDir::new(input)
            .into_iter()
            .filter_map(|e| e.ok())
            .filter(|e| e.file_type().is_file())
            .filter(|e| {
                e.path()
                    .extension()
                    .and_then(|s| s.to_str())
                    .map(|ext| allowed_exts.contains(&ext))
                    .unwrap_or(false)
            })
            .map(|e| e.path().to_path_buf())
            .collect();

        let mut all_results = String::new();
        for file_path in files {
            println!("Analyzing: {}", file_path.display());
            match nlp::analyze_sentiment(&file_path) {
                Ok(sentiments) => {
                    all_results.push_str(&format!("\nFile: {}\n", file_path.display()));
                    for sentiment in &sentiments {
                        all_results.push_str(&format!(" - Label: {}, Score: {:.3}\n",
                            sentiment.label, sentiment.score));
                    }
                }
                Err(e) => {
                    eprintln!("Error processing {}: {}", file_path.display(), e);
                }
            }
        }

        if let Some(out_path) = output {
            let mut file = File::create(out_path)?;
            file.write_all(all_results.as_bytes())?;
            println!("Results written to {}", out_path.display());
        } else {
            print!("{}", all_results);
        }
    } else {
        return Err(anyhow!("Input path does not exist or is not a file/directory"));
    }

    Ok(())
}

/// Produces extractive summaries, with the same file/directory handling as
/// `handle_ner`.
fn handle_summarize(input: &Path, output: Option<&Path>) -> Result<()> {
    use std::io::Write;

    if input.is_file() {
        println!("Summarizing file: {}", input.display());
        let summary = nlp::summarize_text(input)?;

        let result = format!("Summary of {}:\n{}\n", input.display(), summary);

        if let Some(out_path) = output {
            let mut file = File::create(out_path)?;
            file.write_all(result.as_bytes())?;
            println!("Summary written to {}", out_path.display());
        } else {
            print!("{}", result);
        }
    } else if input.is_dir() {
        let allowed_exts = ["txt", "md", "csv", "json", "pdf"];
        let files: Vec<PathBuf> = WalkDir::new(input)
            .into_iter()
            .filter_map(|e| e.ok())
            .filter(|e| e.file_type().is_file())
            .filter(|e| {
                e.path()
                    .extension()
                    .and_then(|s| s.to_str())
                    .map(|ext| allowed_exts.contains(&ext))
                    .unwrap_or(false)
            })
            .map(|e| e.path().to_path_buf())
            .collect();

        let mut all_results = String::new();
        for file_path in files {
            println!("Summarizing: {}", file_path.display());
            match nlp::summarize_text(&file_path) {
                Ok(summary) => {
                    all_results.push_str(&format!("\nFile: {}\nSummary: {}\n",
                        file_path.display(), summary));
                }
                Err(e) => {
                    eprintln!("Error processing {}: {}", file_path.display(), e);
                }
            }
        }

        if let Some(out_path) = output {
            let mut file = File::create(out_path)?;
            file.write_all(all_results.as_bytes())?;
            println!("Summaries written to {}", out_path.display());
        } else {
            print!("{}", all_results);
        }
    } else {
        return Err(anyhow!("Input path does not exist or is not a file/directory"));
    }

    Ok(())
}

fn main() -> Result<()> {
    let cli = Cli::parse();
    match cli.command {
        Commands::Index { dir, out } => index_dir(&dir, &out)?,
        Commands::Query { index, q, k, model } => query_with_ollama(&index, &q, k, model)?,
        Commands::Ner { input, output } => handle_ner(&input, output.as_deref())?,
        Commands::Sentiment { input, output } => handle_sentiment(&input, output.as_deref())?,
        Commands::Summarize { input, output } => handle_summarize(&input, output.as_deref())?,
    }
    Ok(())
}
8 changes: 8 additions & 0 deletions src/nlp/mod.rs
@@ -0,0 +1,8 @@
//! NLP module for BoltAI: pattern-based NER, lexicon-based sentiment
//! analysis, and extractive text summarization.
pub mod ner;
pub mod sentiment;
pub mod summarization;

pub use ner::extract_entities;
pub use sentiment::analyze_sentiment;
pub use summarization::summarize_text;