79 changes: 78 additions & 1 deletion README.md
@@ -41,6 +41,10 @@ BoltAI is a compact, local-first AI agent implemented in Rust with a companion m
- Query CLI — simple commands for search, summarization, and diagnostic output
- macOS SwiftUI front-end — drag/drop indexing, chat-style interface, and previewed document snippets
- PDF extraction support and safety measures to avoid dumping raw documents in prompts or UI
- **NLP capabilities** — Named Entity Recognition, Sentiment Analysis, and Text Summarization
  - Pattern-based NER for extracting names, locations, organizations, dates, emails, and monetary values
  - Lexicon-based sentiment analysis (positive, neutral, negative)
  - Extractive text summarization using sentence scoring

## Quickstart

@@ -78,6 +82,49 @@ BoltAI is a compact, local-first AI agent implemented in Rust with a companion m

```bash
# or open in Xcode to run the app target and inspect the UI
```

### Use NLP features

#### Named Entity Recognition (NER)

Extract entities like names, locations, organizations, dates, emails, and monetary values from text:

```bash
# Analyze a single file
./target/release/boltai ner -i document.txt

# Analyze all files in a directory
./target/release/boltai ner -i /path/to/docs

# Save results to a file
./target/release/boltai ner -i document.txt -o entities.txt
```
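
The `ner` pass is pattern-based rather than model-based: regular expressions describe entity shapes, and each pattern carries a confidence. Below is a minimal sketch of that idea, assuming the `regex` crate; the patterns, fixed scores, and `Entity` struct are illustrative stand-ins, not the exact contents of `src/nlp/ner.rs`.

```rust
use regex::Regex;

// Illustrative record; the real module exposes `word`, `label`, and `score`.
struct Entity {
    word: String,
    label: &'static str,
    score: f64,
}

fn extract_entities(text: &str) -> Vec<Entity> {
    // Unambiguous shapes (emails) get a higher fixed confidence than loose
    // ones (capitalized word pairs, which also match places and products).
    let patterns: [(&str, &str, f64); 3] = [
        (r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}", "EMAIL", 0.95),
        (r"\$\d[\d,]*(?:\.\d+)?", "MONEY", 0.90),
        (r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", "PERSON", 0.75),
    ];
    let mut out = Vec::new();
    for (pattern, label, score) in patterns {
        let re = Regex::new(pattern).expect("static pattern compiles");
        for m in re.find_iter(text) {
            out.push(Entity { word: m.as_str().to_string(), label, score });
        }
    }
    out
}

fn main() {
    let text = "Please ask John Smith to email john.smith@example.com about the $150,000 offer.";
    for e in extract_entities(text) {
        println!(" - {} ({}): score {:.3}", e.word, e.label, e.score);
    }
}
```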

#### Sentiment Analysis

Classify the sentiment of a document as positive, neutral, or negative:

```bash
# Analyze a single file
./target/release/boltai sentiment -i review.txt

# Batch analyze multiple files
./target/release/boltai sentiment -i /path/to/reviews -o sentiment_results.txt
```
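
`sentiment` is lexicon-based: the text is scored by counting hits against positive and negative word lists, and the balance is mapped to a label. A minimal sketch of that approach; the word lists and thresholds below are illustrative, not the actual lexicon in `src/nlp/sentiment.rs`.

```rust
fn analyze_sentiment(text: &str) -> (&'static str, f64) {
    // Tiny stand-in lexicons; the real ones are larger.
    let positive = ["good", "great", "excellent", "love", "fast"];
    let negative = ["bad", "poor", "terrible", "hate", "slow"];

    let (mut pos, mut neg) = (0usize, 0usize);
    for word in text.split(|c: char| !c.is_alphanumeric()) {
        let w = word.to_lowercase();
        if positive.contains(&w.as_str()) { pos += 1; }
        if negative.contains(&w.as_str()) { neg += 1; }
    }
    if pos + neg == 0 {
        return ("Neutral", 0.5); // no lexicon hits at all
    }

    // Score is the positive share of all hits; thresholds pick the label.
    let score = pos as f64 / (pos + neg) as f64;
    let label = if score > 0.6 {
        "Positive"
    } else if score < 0.4 {
        "Negative"
    } else {
        "Neutral"
    };
    (label, score)
}

fn main() {
    let (label, score) =
        analyze_sentiment("Great tool: fast indexing and excellent docs, though setup was slow.");
    println!(" - Label: {}, Score: {:.3}", label, score);
}
```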

#### Text Summarization

Generate extractive summaries of documents:

```bash
# Summarize a single file
./target/release/boltai summarize -i article.txt

# Summarize multiple files in a directory
./target/release/boltai summarize -i /path/to/articles -o summaries.txt
```
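
`summarize` is extractive: sentences are scored and the top scorers are returned verbatim, in document order. The sketch below uses document-level word frequency as the sentence score, which is one common choice; the actual scoring in `src/nlp/summarization.rs` may weight things differently.

```rust
use std::collections::HashMap;

// Lowercase and strip punctuation so "Vision" and "vision." count together.
fn norm(word: &str) -> String {
    word.chars().filter(|c| c.is_alphanumeric()).collect::<String>().to_lowercase()
}

fn summarize(text: &str, max_sentences: usize) -> String {
    // Naive sentence split on periods; good enough for a sketch.
    let sentences: Vec<&str> = text.split('.').map(str::trim).filter(|s| !s.is_empty()).collect();

    // Word frequencies over the whole document.
    let mut freq: HashMap<String, f64> = HashMap::new();
    for w in text.split_whitespace() {
        *freq.entry(norm(w)).or_insert(0.0) += 1.0;
    }

    // Score each sentence by the mean frequency of its words.
    let mut scored: Vec<(usize, f64)> = sentences
        .iter()
        .enumerate()
        .map(|(i, s)| {
            let weights: Vec<f64> = s
                .split_whitespace()
                .map(|w| freq.get(&norm(w)).copied().unwrap_or(0.0))
                .collect();
            (i, weights.iter().sum::<f64>() / weights.len().max(1) as f64)
        })
        .collect();

    // Keep the top sentences, then restore original document order.
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    let mut keep: Vec<usize> = scored.into_iter().take(max_sentences).map(|(i, _)| i).collect();
    keep.sort_unstable();
    keep.iter().map(|&i| format!("{}.", sentences[i])).collect::<Vec<_>>().join(" ")
}

fn main() {
    let text = "Deep learning drives vision and language systems. \
                Some results are incremental. Deep learning also powers search.";
    println!("{}", summarize(text, 2));
}
```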

**Supported file formats**: `.txt`, `.md`, `.csv`, `.json`, `.pdf`

## Example output (CLI)

After indexing, `boltai query` returns top-k similar documents and a short summary. Example (truncated):
@@ -93,15 +140,45 @@ BoltAI is a compact, local-first AI agent implemented in Rust with a companion m

```
BoltAI demonstrates a privacy-first local retrieval pipeline that indexes developer documentation and supports fast summarization and search. It uses TF-IDF for initial vectorization and provides clear extension points for embeddings and LLM-based abstraction.
```

### NLP Feature Examples

**Named Entity Recognition output:**
```
Named Entities found in document.txt:
- John Smith (PERSON): score 0.750
- [email protected] (EMAIL): score 0.950
- New York (LOCATION): score 0.850
- Microsoft Corporation (ORGANIZATION): score 0.800
- $150,000 (MONEY): score 0.900
- Jan 15, 2024 (DATE): score 0.900
```

**Sentiment Analysis output:**
```
Sentiment analysis for review.txt:
- Label: Positive, Score: 0.857
```

**Text Summarization output:**
```
Summary of article.txt:
Artificial intelligence has become one of the most transformative technologies.
Deep learning has achieved remarkable breakthroughs in computer vision and natural
language processing. Machine learning algorithms optimize trading strategies and
detect fraudulent transactions.
```

## Project architecture

- Rust CLI (`src/main.rs`): walks directories, extracts text (including PDFs), computes TF-IDF vectors, and writes `boltai_index.json`. (A sketch of the TF-IDF core follows this list.)
- NLP module (`src/nlp/`): provides pattern-based NER, lexicon-based sentiment analysis, and extractive text summarization.
- mac-ui SwiftUI: orchestrates indexing runs, loads a capped preview of index docs (to avoid huge JSON parsing on the main thread), and sends queries to the CLI.
- Extensibility: The CLI prompt layer is isolated to make it easy to swap the query strategy (keywords → embeddings → hybrid retrieval-augmented generation). NLP features use lightweight rule-based approaches but can be upgraded to ML models (rust-bert) when libtorch is available.
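
For orientation, the retrieval core reduces to TF-IDF weighting plus a similarity ranking over the indexed documents. A self-contained sketch, assuming cosine similarity as the ranking metric (the real pipeline adds `WORD_RE` tokenization, `boltai_index.json` persistence, and the Ollama prompt layer):

```rust
use std::collections::{HashMap, HashSet};

/// TF-IDF weights per document: term frequency scaled by log inverse
/// document frequency.
fn tfidf_vectors(docs: &[Vec<String>]) -> Vec<HashMap<String, f64>> {
    let n = docs.len() as f64;

    // Document frequency: in how many documents does each term occur?
    let mut df: HashMap<&str, f64> = HashMap::new();
    for doc in docs {
        for term in doc.iter().map(String::as_str).collect::<HashSet<_>>() {
            *df.entry(term).or_insert(0.0) += 1.0;
        }
    }

    docs.iter()
        .map(|doc| {
            let mut v: HashMap<String, f64> = HashMap::new();
            for term in doc {
                *v.entry(term.clone()).or_insert(0.0) += 1.0;
            }
            for (term, w) in v.iter_mut() {
                *w = (*w / doc.len() as f64) * (n / df[term.as_str()]).ln();
            }
            v
        })
        .collect()
}

/// Cosine similarity between two sparse vectors.
fn cosine(a: &HashMap<String, f64>, b: &HashMap<String, f64>) -> f64 {
    let dot: f64 = a.iter().filter_map(|(k, va)| b.get(k).map(|vb| va * vb)).sum();
    let norm = |v: &HashMap<String, f64>| v.values().map(|x| x * x).sum::<f64>().sqrt();
    let denom = norm(a) * norm(b);
    if denom == 0.0 { 0.0 } else { dot / denom }
}

fn main() {
    let docs: Vec<Vec<String>> = [
        "rust local index search",
        "swiftui front end preview",
        "rust tfidf search query",
    ]
    .iter()
    .map(|d| d.split_whitespace().map(String::from).collect())
    .collect();

    let vecs = tfidf_vectors(&docs);
    println!("sim(doc0, doc2) = {:.3}", cosine(&vecs[0], &vecs[2]));
}
```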

## Design decisions & trade-offs

- TF-IDF first: fast to compute, explainable, and sufficient for small-to-medium corpora. Replacing TF-IDF with dense embeddings is an intended next step for semantic search.
- Rule-based NLP: regex patterns for NER, word lexicons for sentiment analysis, and sentence scoring for summarization. Fast, with no model downloads or external services, but less accurate than ML models; can be upgraded to rust-bert/transformers when libtorch is available in the environment (see the sketch after this list).
- Local-first: prioritizes data privacy and low-latency responses at the expense of requiring local compute resources.
- Safety: the UI and CLI avoid including full raw documents in prompts and no longer print raw text as a fallback. The project logs prompts to a local debug file for reproducible tuning.
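
As a sense of that upgrade path: with libtorch installed, the lexicon scorer could be swapped for rust-bert's sentiment pipeline. A hedged sketch, assuming rust-bert's `SentimentModel` API; this is not wired into BoltAI today.

```rust
// Requires the `rust-bert` crate and a working libtorch installation.
use rust_bert::pipelines::sentiment::SentimentModel;

fn main() -> anyhow::Result<()> {
    // Downloads and caches a pretrained sentiment model on first use.
    let model = SentimentModel::new(Default::default())?;
    let inputs = ["Great tool: fast indexing and excellent docs."];
    for s in model.predict(&inputs) {
        println!(" - Label: {:?}, Score: {:.3}", s.polarity, s.score);
    }
    Ok(())
}
```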

220 changes: 220 additions & 0 deletions src/main.rs
@@ -15,6 +15,8 @@ use once_cell::sync::Lazy;
use serde::{Deserialize, Serialize};
use walkdir::WalkDir;

mod nlp;

static WORD_RE: Lazy<Regex> = Lazy::new(|| Regex::new(r"[a-zA-Z0-9']+").unwrap());

#[derive(Parser)]
@@ -43,6 +45,27 @@ enum Commands {
        #[arg(short = 'm', long = "model")]
        model: Option<String>,
    },
    /// Extract named entities from text files
    Ner {
        #[arg(short, long, help = "Input file path or directory")]
        input: PathBuf,
        #[arg(short, long, help = "Output file for results (optional)")]
        output: Option<PathBuf>,
    },
    /// Analyze sentiment of text files
    Sentiment {
        #[arg(short, long, help = "Input file path or directory")]
        input: PathBuf,
        #[arg(short, long, help = "Output file for results (optional)")]
        output: Option<PathBuf>,
    },
    /// Summarize text from files
    Summarize {
        #[arg(short, long, help = "Input file path or directory")]
        input: PathBuf,
        #[arg(short, long, help = "Output file for results (optional)")]
        output: Option<PathBuf>,
    },
}

#[derive(Serialize, Deserialize, Debug)]
@@ -403,11 +426,208 @@ fn query_with_ollama(index_file: &Path, q: &str, k: usize, model_override: Optio
    }
}

/// Runs pattern-based NER over a single file, or over every supported file
/// under a directory, printing results or writing them to `output` when given.
fn handle_ner(input: &Path, output: Option<&Path>) -> Result<()> {
    use std::io::Write;

    if input.is_file() {
        println!("Analyzing file: {}", input.display());
        let entities = nlp::extract_entities(input)?;

        let mut result = format!("Named Entities found in {}:\n", input.display());
        for entity in &entities {
            result.push_str(&format!(" - {} ({}): score {:.3}\n",
                entity.word, entity.label, entity.score));
        }

        if let Some(out_path) = output {
            let mut file = File::create(out_path)?;
            file.write_all(result.as_bytes())?;
            println!("Results written to {}", out_path.display());
        } else {
            print!("{}", result);
        }
    } else if input.is_dir() {
        // File types the NLP commands process; note the extension match is
        // case-sensitive.
        let allowed_exts = ["txt", "md", "csv", "json", "pdf"];
        let files: Vec<PathBuf> = WalkDir::new(input)
            .into_iter()
            .filter_map(|e| e.ok())
            .filter(|e| e.file_type().is_file())
            .filter(|e| {
                e.path()
                    .extension()
                    .and_then(|s| s.to_str())
                    .map(|ext| allowed_exts.contains(&ext))
                    .unwrap_or(false)
            })
            .map(|e| e.path().to_path_buf())
            .collect();

        let mut all_results = String::new();
        for file_path in files {
            println!("Analyzing: {}", file_path.display());
            match nlp::extract_entities(&file_path) {
                Ok(entities) => {
                    all_results.push_str(&format!("\nFile: {}\n", file_path.display()));
                    for entity in &entities {
                        all_results.push_str(&format!(" - {} ({}): score {:.3}\n",
                            entity.word, entity.label, entity.score));
                    }
                }
                Err(e) => {
                    eprintln!("Error processing {}: {}", file_path.display(), e);
                }
            }
        }

        if let Some(out_path) = output {
            let mut file = File::create(out_path)?;
            file.write_all(all_results.as_bytes())?;
            println!("Results written to {}", out_path.display());
        } else {
            print!("{}", all_results);
        }
    } else {
        return Err(anyhow!("Input path does not exist or is not a file/directory"));
    }

    Ok(())
}

/// Runs lexicon-based sentiment analysis, with the same file/directory
/// handling as `handle_ner`.
fn handle_sentiment(input: &Path, output: Option<&Path>) -> Result<()> {
    use std::io::Write;

    if input.is_file() {
        println!("Analyzing sentiment of file: {}", input.display());
        let sentiments = nlp::analyze_sentiment(input)?;

        let mut result = format!("Sentiment analysis for {}:\n", input.display());
        for sentiment in &sentiments {
            result.push_str(&format!(" - Label: {}, Score: {:.3}\n",
                sentiment.label, sentiment.score));
        }

        if let Some(out_path) = output {
            let mut file = File::create(out_path)?;
            file.write_all(result.as_bytes())?;
            println!("Results written to {}", out_path.display());
        } else {
            print!("{}", result);
        }
    } else if input.is_dir() {
        let allowed_exts = ["txt", "md", "csv", "json", "pdf"];
        let files: Vec<PathBuf> = WalkDir::new(input)
            .into_iter()
            .filter_map(|e| e.ok())
            .filter(|e| e.file_type().is_file())
            .filter(|e| {
                e.path()
                    .extension()
                    .and_then(|s| s.to_str())
                    .map(|ext| allowed_exts.contains(&ext))
                    .unwrap_or(false)
            })
            .map(|e| e.path().to_path_buf())
            .collect();

        let mut all_results = String::new();
        for file_path in files {
            println!("Analyzing: {}", file_path.display());
            match nlp::analyze_sentiment(&file_path) {
                Ok(sentiments) => {
                    all_results.push_str(&format!("\nFile: {}\n", file_path.display()));
                    for sentiment in &sentiments {
                        all_results.push_str(&format!(" - Label: {}, Score: {:.3}\n",
                            sentiment.label, sentiment.score));
                    }
                }
                Err(e) => {
                    eprintln!("Error processing {}: {}", file_path.display(), e);
                }
            }
        }

        if let Some(out_path) = output {
            let mut file = File::create(out_path)?;
            file.write_all(all_results.as_bytes())?;
            println!("Results written to {}", out_path.display());
        } else {
            print!("{}", all_results);
        }
    } else {
        return Err(anyhow!("Input path does not exist or is not a file/directory"));
    }

    Ok(())
}

/// Produces extractive summaries, with the same file/directory handling as
/// `handle_ner`.
fn handle_summarize(input: &Path, output: Option<&Path>) -> Result<()> {
    use std::io::Write;

    if input.is_file() {
        println!("Summarizing file: {}", input.display());
        let summary = nlp::summarize_text(input)?;

        let result = format!("Summary of {}:\n{}\n", input.display(), summary);

        if let Some(out_path) = output {
            let mut file = File::create(out_path)?;
            file.write_all(result.as_bytes())?;
            println!("Summary written to {}", out_path.display());
        } else {
            print!("{}", result);
        }
    } else if input.is_dir() {
        let allowed_exts = ["txt", "md", "csv", "json", "pdf"];
        let files: Vec<PathBuf> = WalkDir::new(input)
            .into_iter()
            .filter_map(|e| e.ok())
            .filter(|e| e.file_type().is_file())
            .filter(|e| {
                e.path()
                    .extension()
                    .and_then(|s| s.to_str())
                    .map(|ext| allowed_exts.contains(&ext))
                    .unwrap_or(false)
            })
            .map(|e| e.path().to_path_buf())
            .collect();

        let mut all_results = String::new();
        for file_path in files {
            println!("Summarizing: {}", file_path.display());
            match nlp::summarize_text(&file_path) {
                Ok(summary) => {
                    all_results.push_str(&format!("\nFile: {}\nSummary: {}\n",
                        file_path.display(), summary));
                }
                Err(e) => {
                    eprintln!("Error processing {}: {}", file_path.display(), e);
                }
            }
        }

        if let Some(out_path) = output {
            let mut file = File::create(out_path)?;
            file.write_all(all_results.as_bytes())?;
            println!("Summaries written to {}", out_path.display());
        } else {
            print!("{}", all_results);
        }
    } else {
        return Err(anyhow!("Input path does not exist or is not a file/directory"));
    }

    Ok(())
}

fn main() -> Result<()> {
    let cli = Cli::parse();
    match cli.command {
        Commands::Index { dir, out } => index_dir(&dir, &out)?,
        Commands::Query { index, q, k, model } => query_with_ollama(&index, &q, k, model)?,
        Commands::Ner { input, output } => handle_ner(&input, output.as_deref())?,
        Commands::Sentiment { input, output } => handle_sentiment(&input, output.as_deref())?,
        Commands::Summarize { input, output } => handle_summarize(&input, output.as_deref())?,
    }
    Ok(())
}
8 changes: 8 additions & 0 deletions src/nlp/mod.rs
@@ -0,0 +1,8 @@
//! NLP module for BoltAI: pattern-based NER, lexicon-based sentiment
//! analysis, and extractive text summarization.
pub mod ner;
pub mod sentiment;
pub mod summarization;

pub use ner::extract_entities;
pub use sentiment::analyze_sentiment;
pub use summarization::summarize_text;