Skip to content

Conversation

@frederick-douglas-pearce
Copy link
Contributor

FAQ Automation System with AI-Powered Triage

This PR introduces a comprehensive automated FAQ management system that uses Retrieval-Augmented Generation (RAG) and LLM-based triage to intelligently process new FAQ proposals submitted via GitHub issues.

🎯 Overview

The system automatically analyzes new FAQ proposals and determines whether to:

  • NEW: Create a new FAQ entry (question not covered)
  • UPDATE: Merge proposal with existing FAQ (adds valuable context)
  • DUPLICATE: Mark as duplicate (already fully answered)

🚀 Features

AI-Powered Decision Making

  • Uses minsearch for semantic retrieval of similar FAQs
  • Leverages OpenAI GPT-4 with structured outputs for intelligent triage
  • Analyzes proposals against top 5 most similar existing FAQs
  • Makes context-aware decisions with detailed rationale

Automated Workflow

  • GitHub issue template for structured FAQ proposals
  • Automatic PR creation for NEW/UPDATE actions
  • Automatic issue closure with explanation for DUPLICATE actions
  • Complete git workflow (branch creation, commits, PR generation)

Developer Experience

  • Comprehensive test suite (unit + integration tests)
  • CLI tool for local testing and debugging
  • Detailed documentation (README + CONTRIBUTING)
  • Type-safe with Pydantic validation

📦 What's Included

1. FAQ Automation Module (faq_automation/)

Core Functions (core.py)

  • parse_frontmatter(): Extract YAML frontmatter from markdown
  • write_frontmatter(): Write structured FAQ files
  • read_questions(): Load and parse all FAQs from a course
  • generate_document_id(): Create collision-resistant 10-char IDs
  • find_question_files(): Map document IDs to file paths
  • find_largest_sort_order(): Determine next sort order number

RAG Agent (rag_agent.py)

  • FAQAgent: Main agent class for processing proposals
  • FAQDecision: Pydantic model for structured LLM outputs
  • process_faq_proposal(): Convenience function for single proposals
  • Implements complete RAG pipeline (retrieval → analysis → decision)

GitHub Actions Integration (actions.py)

  • create_new_faq_file(): Generate new FAQ markdown files
  • update_existing_faq_file(): Update existing FAQ content
  • generate_pr_body(): Create detailed PR descriptions
  • generate_duplicate_comment(): Create helpful duplicate comments

CLI Tool (cli.py)

  • Command-line interface for GitHub Actions
  • Issue body parsing with validation
  • JSON output for workflow consumption
  • Error handling and logging

2. GitHub Integration

Issue Template (.github/ISSUE_TEMPLATE/faq-proposal.yml)

Fields:

  • Course selection (machine-learning-zoomcamp)
  • Question (required)
  • Answer (required)
  • Validation checklist

Workflow (.github/workflows/faq-automation.yml)

Trigger: Issue opened with 'faq-proposal' label

Steps:

  1. Extract question/answer from issue
  2. Run FAQ automation CLI
  3. Handle NEW/UPDATE: Create PR
  4. Handle DUPLICATE: Comment and close
  5. Error handling with notifications

3. Testing

Unit Tests

  • test_faq_automation.py: Core function tests (frontmatter parsing, document ID generation, search result filtering)
  • test_faq_actions.py: Action function tests (PR body generation, duplicate comment generation)
  • test_cli_parsing.py: CLI parsing tests (issue body parsing, multi-line content handling, error cases)

4. Documentation

README.md Updates

  • Complete feature overview
  • Quick start guide
  • Architecture documentation
  • FAQ automation explanation
  • Development setup instructions

CONTRIBUTING.md (New)

  • Step-by-step contributor guide
  • FAQ proposal best practices
  • Writing guidelines for Q&A
  • File structure explanation
  • Manual contribution workflow

5. Dependencies

New Dependencies (added to pyproject.toml)

  • minsearch>=0.4.1 - Lightweight text search
  • openai>=1.0.0 - LLM integration
  • pydantic>=2.0.0 - Data validation

🔄 How It Works

User Flow

  1. User creates GitHub issue using "FAQ Proposal" template
  2. Fills out course, question, and answer fields
  3. Submits issue (automatically gets faq-proposal label)

Automation Flow

Issue Created
    ↓
Extract Q&A → Load Existing FAQs → Build Search Index
    ↓
Search Similar FAQs (top 5) → Send to GPT-4 with Prompt
    ↓
LLM Decision (NEW/UPDATE/DUPLICATE)
    ↓
├─ NEW: Create branch → Generate file → Commit → Push → Create PR
├─ UPDATE: Create branch → Modify file → Commit → Push → Create PR
└─ DUPLICATE: Post comment with link → Close issue

LLM Decision Process

The system sends GPT-4:

  • The proposed Q&A
  • Top 5 similar existing FAQs
  • Available course sections
  • Detailed instructions and rules

GPT-4 returns structured output with action, rationale, document_id, section_id, section_rationale, order, question, proposed_content, filename_slug, and warnings.

🛠️ Technical Details

RAG Pipeline

  1. Indexing: minsearch indexes all FAQs with text fields (section, question, answer)
  2. Retrieval: Semantic search finds top-K similar FAQs
  3. Augmentation: Search results + metadata sent to LLM
  4. Generation: LLM produces structured decision with reasoning

File Naming Convention

{NNN}_{document_id}_{slug}.md
001_abc1234567_when-does-course-start.md

Frontmatter Structure

---
id: abc1234567
question: 'When does the course start?'
sort_order: 1
---

📋 Setup Requirements

Before Merging

  1. Add OpenAI API Key

    • Go to Settings → Secrets and variables → Actions
    • Add OPENAI_API_KEY secret
  2. Verify Permissions

    • Workflow needs: contents: write, issues: write, pull-requests: write
    • Already configured in workflow file

After Merging

  1. Test the Workflow

    • Create a test issue using the FAQ Proposal template
    • Verify automation runs correctly
    • Check PR creation or duplicate detection
  2. Monitor Initial Runs

    • Review first few automated PRs
    • Adjust prompts if needed
    • Fine-tune decision thresholds

🧪 Testing

Run the test suite:

# All tests
make test

# Unit tests only
make test-unit

# Specific test file
pytest tests/unit/test_faq_automation.py -v

Test locally with CLI:

# Create test issue file
cat > test_issue.txt <<'TESTEOF'
### Question
How do I test the FAQ automation?

### Answer
Create a test issue using the FAQ Proposal template.
TESTEOF

# Run automation
python -m faq_automation.cli \
  --issue-body "$(cat test_issue.txt)" \
  --issue-number 999 \
  --course machine-learning-zoomcamp

🎓 Course Support

Initial Support: machine-learning-zoomcamp

Future Expansion: The system is designed to support all courses. To add a new course:

  1. Update issue template dropdown
  2. Ensure course has _metadata.yaml
  3. Workflow automatically handles any course directory

📊 Expected Impact

For Contributors

  • Lower barrier: Simple issue form vs. understanding repo structure
  • Faster feedback: Immediate automated analysis
  • Better quality: AI helps refine questions and placement

For Maintainers

  • Reduced manual work: Auto-triage and PR creation
  • Consistency: Structured decisions with rationale
  • Scalability: Handles multiple simultaneous proposals

For Users

  • Better FAQs: Duplicate detection prevents redundancy
  • Richer content: UPDATE action merges knowledge
  • Faster updates: Automated process speeds up FAQ additions

🔍 Example Outputs

NEW Decision PR

Title: [FAQ Bot] NEW: How do I troubleshoot Docker networking issues?

Body:
✨ FAQ NEW

Course: machine-learning-zoomcamp
Section: module-1 (Docker-related troubleshooting fits in Module 1)
Related Issue: #42

Question: How do I troubleshoot Docker networking issues?

Decision Rationale: This specific networking troubleshooting scenario is not covered in existing FAQs.

Placement Details:
- Section ID: module-1
- Sort Order: End of section
- Filename Slug: troubleshoot-docker-networking-issues

DUPLICATE Comment

🔄 Duplicate FAQ Entry

Thank you for your proposal! After analyzing existing FAQs, this question appears to already be covered.

Matching FAQ
Question: When does the Machine Learning Zoomcamp course start?
Document ID: 9e508f2212
Section: general

Rationale: The existing FAQ fully answers when the course starts, including registration details and Telegram channel information.

Where to Find This FAQ
- Live Site: https://datatalks.club/faq/machine-learning-zoomcamp.html#9e508f2212
- Source File: _questions/machine-learning-zoomcamp/general/

---
🤖 This issue has been automatically closed by FAQ Bot.
If you believe this is an error, please reopen and mention a maintainer.

🚦 Breaking Changes

None - this is a new feature addition.

📝 Notes

  • Workflow only triggers on issues with faq-proposal label (applied automatically by template)
  • PRs created by FAQ Bot require manual review and approval (intentionally not auto-merged)
  • The system currently uses GPT-4; can be changed to other models via --model parameter
  • All file operations preserve existing frontmatter (IDs, custom fields, etc.)

✅ Checklist

  • Code implements all proposed features
  • Comprehensive test coverage added
  • Documentation updated (README + CONTRIBUTING)
  • Dependencies added to pyproject.toml
  • GitHub Actions workflow configured
  • Issue template created
  • No breaking changes to existing functionality

🔗 Related

  • Implements functionality from notebooks/rag.ipynb
  • Complements existing static site generator (generate_website.py)
  • Uses same frontmatter format and directory structure

Ready to Review! 🎉

Once merged and configured with OpenAI API key, the FAQ automation system will be live and ready to process proposals.

This commit introduces a comprehensive FAQ automation system that uses RAG
and LLM-based triage to intelligently process new FAQ proposals.

Features:
- AI-powered FAQ proposal analysis (NEW/UPDATE/DUPLICATE decisions)
- Automated PR creation for approved changes
- GitHub issue template for structured FAQ proposals
- Complete test suite with unit and integration tests
- Comprehensive documentation (README, CONTRIBUTING)

Components:
- faq_automation/: Python module with core logic
  - core.py: FAQ processing utilities
  - rag_agent.py: LLM-based decision agent using OpenAI
  - actions.py: GitHub Actions integration helpers
  - cli.py: Command-line interface for workflow

- .github/workflows/faq-automation.yml: GitHub Actions workflow
- .github/ISSUE_TEMPLATE/faq-proposal.yml: Structured issue template
- tests/: Comprehensive test coverage
- CONTRIBUTING.md: Contributor guidelines
- README.md: Updated with full documentation

Dependencies added:
- minsearch: Lightweight text search for FAQ retrieval
- openai: LLM integration for decision making
- pydantic: Structured output validation

The system processes FAQ proposals through:
1. Issue submission via GitHub template
2. Retrieval of similar existing FAQs
3. LLM analysis and decision (NEW/UPDATE/DUPLICATE)
4. Automated PR creation or issue closure with feedback

Supports: machine-learning-zoomcamp (initial course)
Can be extended to support all courses in the future
@frederick-douglas-pearce
Copy link
Contributor Author

frederick-douglas-pearce commented Oct 18, 2025

Created this WIP PR using claude-code. Needs testing before it can be merged.

@frederick-douglas-pearce frederick-douglas-pearce marked this pull request as draft October 18, 2025 07:36
@alexeygrigorev
Copy link
Member

That's a lot of code and text =)

Have you tested it?

@alexeygrigorev
Copy link
Member

Also can you add a link to the contributing guide to the FAQ pages at the top?

And we probably need an issue template where people can select the course they are continuing to. Maybe we can have a drop down list? Not sure if it's possible

But at least we need a format so it's easy to extract it from the text

@frederick-douglas-pearce
Copy link
Contributor Author

frederick-douglas-pearce commented Oct 20, 2025

Thank you for your comments, @alexeygrigorev! I will work on addressing them. FYI, this is just an early draft that I added, made with claude-code, and needs a lot of testing still. I've already caught a few bugs but many more to come I'm sure.
Note, another student is interested in collaborating so I wanted to put this very verbose PR up so they can get familiar with what is going on.
The README.md file is very long and I'm happy to shorten it, modify it, etc however you'd like. I thought claude did a good job of explaining the site in it, and as I said, someone else wants to onboard to help out, so I'm going to leave it for now, until they have a chance to read it.

@frederick-douglas-pearce
Copy link
Contributor Author

frederick-douglas-pearce commented Oct 20, 2025

Also can you add a link to the contributing guide to the FAQ pages at the top?

And we probably need an issue template where people can select the course they are continuing to. Maybe we can have a drop down list? Not sure if it's possible

But at least we need a format so it's easy to extract it from the text
@alexeygrigorev, I've completed these two tasks that you requested:

  • I added a link to the contributing guide at the top of the FAQ pages. It is just one line of text with a link, nothing fancy. Let me know if there are changes you'd like
  • An issue template is included in the .github/ISSUE_TEMPLATE folder. It includes a dropdown to select one of the four courses currently available. I've tested it by manually triggering the faq automation workflow and it worked ok. If new courses are added, they will need to be added to the template file. I could make an issue for this, and I'm sure it could be automated, but not sure that is worth the effort. Thoughts?
contributing_link_example

- Update CLI default model from gpt-4 to gpt-5-nano
- Update RAG agent default model to gpt-5-nano
- Update GitHub Actions workflow to use gpt-5-nano
- Fix setuptools package configuration
- Fix minsearch version requirement (0.0.7)
Added a simple text banner with link to CONTRIBUTING.md at the top
of each course page to encourage user contributions.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Set font size to 1.17em to match the question heading size for
better visibility and consistency.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
- Change course field from input to dropdown menu
- Add all 4 available courses as options:
  - machine-learning-zoomcamp
  - data-engineering-zoomcamp
  - llm-zoomcamp
  - mlops-zoomcamp
- Update uv.lock to sync with FAQ automation dependencies

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
- Add workflow_dispatch trigger with issue_number input
- Support both automatic (issue opened) and manual execution
- Fetch issue data when manually triggered
- Update error handler to use correct issue number

This allows testing the workflow on feature branches without merging.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@alexeygrigorev
Copy link
Member

Is it ready for merge?

- Remove incorrect 'python -m' prefix from uv commands in Makefile
- Update parse_issue_body to stop collecting content at any ### section
- Ensures Checklist and other sections are excluded from parsed answer
- All tests now passing
- Remove Quick Start section (missing necessary environment setup)
- Update all GPT-4 references to GPT-5
- Development section now contains all necessary setup instructions
@frederick-douglas-pearce
Copy link
Contributor Author

I've made significant updates to the README.md file in response to @alexeygrigorev 's comments, including testing all code snippets included to make sure they work. Moving on to other comments next...

- Document test_faq_automation.py (core functions)
- Document test_cli_parsing.py (issue body parsing)
- Document test_faq_actions.py (GitHub Actions integration)
- Update test coverage section with test counts
- Add example commands for running FAQ automation tests
- Reorganize test_faq_automation.py into classes (TestParseFrontmatter, TestWriteFrontmatter, TestGenerateDocumentId, TestKeepRelevant)
- Reorganize test_cli_parsing.py into TestParseIssueBody class
- Reorganize test_faq_actions.py into classes (TestGeneratePRBody, TestGenerateDuplicateComment)
- Follow the established test structure pattern from test_sorting.py
- All 102 unit tests passing
- Add example for running specific FAQ automation test method
- Add example for running specific CLI parsing test method
- Show proper class-based test structure in examples
- Use 'make test' commands as primary examples
- Show 'uv run pytest' commands as alternatives
- Remove '--extra dev' flag for consistency with Makefile
- All test commands now consistent across documentation
Update faq_automation/rag_agent.py to use the correct OpenAI API syntax
from the notebook prototype:
- Changed beta.chat.completions.parse to responses.parse
- Changed messages parameter to input
- Changed response_format parameter to text_format
- Updated response parsing to extract from response.output

All 116 tests pass.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Eliminates code duplication between JavaScript and Python parsing by
using a single Python implementation for all issue body parsing.

Changes:
- Add parse_full_issue_body() to extract course, question, and answer
- Create scripts/extract_issue_fields.py for GitHub Actions integration
- Simplify workflow to use Python parsing instead of JS
- Add 6 comprehensive tests for parse_full_issue_body()
- Update tests/README.md with new test documentation

Benefits:
- Single source of truth for parsing logic
- All parsing code is testable
- Easier maintenance (one language, one implementation)
- No duplication between JS and Python

Test results: All 122 tests pass (was 116, added 6 new tests)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Creates a shared GitHub Actions helper module to eliminate bash
scripting for writing to GITHUB_OUTPUT environment variable.

Changes:
- Create faq_automation/github_actions.py with write_github_output()
  - Supports both multiline (heredoc) and single-line formats
  - Handles local testing mode (prints to stdout)
  - Environment detection helpers (is_github_actions, get_github_output_path)

- Update scripts/extract_issue_fields.py to use shared function
  - Import write_github_output from github_actions module
  - Remove duplicate implementation

- Create scripts/write_faq_decision_output.py
  - Reads faq_decision.json
  - Writes to GITHUB_OUTPUT using Python
  - Replaces bash: echo "decision=$(jq -c .)" >> $GITHUB_OUTPUT

- Update .github/workflows/faq-automation.yml
  - Replace bash output logic with Python script call
  - Cleaner, more maintainable workflow

- Add comprehensive tests (10 new tests)
  - test_github_actions.py with 3 test classes
  - Multiline and single-line output formats
  - Local testing mode behavior
  - Environment detection

- Update tests/README.md documentation

Benefits:
- All GitHub Actions integration uses Python
- Single source of truth for output writing
- Fully testable (no bash to test)
- Consistent approach across all scripts
- Better error handling and validation

Test results: All 132 tests pass (was 122, added 10 new tests)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Replace 'uv pip install --system -e .' with 'uv sync --no-dev' for:
- Consistency with local development workflow
- Modern uv best practices
- Declarative dependency management
- Faster installation (skips dev dependencies not needed in automation)

Benefits:
- Uses same uv sync approach as README recommends locally
- Only installs production dependencies needed for FAQ automation
- Automatic virtual environment management
- Better reproducibility

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Removes code duplication and simplifies workflow by having CLI parse
course from issue body instead of extracting it separately.

Changes:
- Remove --course argument from CLI (faq_automation/cli.py)
  - Always use parse_full_issue_body() to extract course, question, answer
  - Update all references from args.course to parsed course variable

- Simplify workflow (.github/workflows/faq-automation.yml)
  - Keep "Fetch issue body" step for clean separation
  - Remove "Extract issue fields with Python" step entirely (~26 lines removed)
  - Simplify "Process FAQ with AI" step (~14 lines removed)
  - Pass full issue body directly to CLI without reconstruction

- Delete scripts/extract_issue_fields.py (no longer needed)

- Update README.md example
  - Add course field to test_issue.txt example
  - Remove --course argument from CLI command

Benefits:
- Workflow reduced from 3 steps to 2 steps
- Removed ~40 lines from workflow file
- Deleted 1 script file (67 lines)
- Simpler CLI interface (one less argument)
- Single parsing path (no conditionals)
- Easier to maintain and understand

All 132 tests pass ✅

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
After switching to 'uv sync --no-dev', Python commands need to run
within the uv-managed virtual environment using 'uv run'.

Changes:
- Use 'uv run python -m faq_automation.cli' to run CLI module
- Use 'uv run scripts/write_faq_decision_output.py' for script
  (leverages shebang line for cleaner syntax)

This ensures Python commands execute with the correct dependencies
installed by uv sync.

Fixes: ModuleNotFoundError: No module named 'yaml'

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Removed parse_issue_body() function from CLI as it is no longer used in production.
After removing the --course argument, the CLI always uses parse_full_issue_body()
which extracts course, question, and answer from the full issue body.

Changes:
- Removed parse_issue_body() function from faq_automation/cli.py (58 lines)
- Removed TestParseIssueBody class from tests/unit/test_cli_parsing.py (79 lines)
- Updated import statement in test file
- Updated tests/README.md to reflect new test count (132 → 127 tests)

All 127 tests pass.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
- Updated description to clarify tests cover both site generator and FAQ automation
- Added guidance for adding tests to FAQ automation system
- Specified which test files to use for different FAQ automation components
- Added notes about testing with real issue bodies and mocking external dependencies

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Added 26 integration tests covering the complete end-to-end FAQ automation
workflow, bringing total test count from 143 to 153.

Test coverage includes:
- FAQ agent integration (5 tests): initialization, search, and proposal processing
- File creation and updates (3 tests): creating new FAQs and updating existing ones
- PR and comment generation (5 tests): generating outputs for all decision types
- CLI integration (2 tests): parsing issue bodies and full CLI execution
- Error handling and edge cases (4 tests): empty sections, non-existent docs, etc.
- Site generator integration (3 tests): verifying created files work with generate_website.py
- End-to-end workflows (3 tests): complete NEW/UPDATE/DUPLICATE flows

All tests use mocked OpenAI API responses for consistency and speed, while
performing real file I/O to verify format compatibility with the site generator.

Updated tests/README.md with comprehensive documentation of the new test suite.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Modified FAQ automation workflow to include "Closes #<issue>" in PR body.
This uses GitHub's native auto-close feature to automatically close the
originating issue when the FAQ bot's PR is merged.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Updated README.md and CONTRIBUTING.md to clarify that issues with
NEW or UPDATE actions are automatically closed when their associated
PRs are merged, using GitHub's native "Closes #issue" feature.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@frederick-douglas-pearce frederick-douglas-pearce marked this pull request as ready for review October 26, 2025 11:39
Removed workflow_dispatch trigger and all related logic for manual workflow
execution. The workflow now only triggers automatically on issue creation
with the faq-proposal label, which simplifies the codebase and reduces
maintenance burden.

Changes:
- Removed workflow_dispatch trigger and inputs section
- Simplified job condition to only check for faq-proposal label
- Removed fallback logic for manual issue number input
- Streamlined issue body fetching (always uses context.payload.issue)
- Cleaned up error handler to assume issue context

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@frederick-douglas-pearce
Copy link
Contributor Author

This PR is finally ready for final review and merge:

  • the GH action workflow has had all file manipulation, rag chat and any other non-github functionality moved to python with extensive unit and integration tests added (all 153 tests pass)
  • the readme, contributing and tests/readme files have all been updated
  • all faq-automation labeled issues close automatically if action=duplicate (with explanation comment), or upon the auto-generated PR from action=new or update being merged.

@alexeygrigorev alexeygrigorev merged commit 9fe9cef into DataTalksClub:main Oct 27, 2025
@alexeygrigorev
Copy link
Member

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants