amb-sitemap-parser

A modern TypeScript library for parsing sitemaps and extracting educational metadata from web pages. Built with pure ESM for Node.js 20+.

Features

  • 📋 Parse XML sitemaps (regular sitemaps and sitemap indexes)
  • 🌐 Concurrent page fetching with rate limiting
  • 🔍 Extract JSON-LD metadata from HTML pages
  • 📊 Filter educational content automatically
  • 🎯 Full TypeScript support with exported types
  • 🚀 Pure ESM for modern JavaScript environments
  • 🧪 Tested with Vitest

Installation

npm install amb-sitemap-parser

Requirements

  • Node.js >= 20.0.0
  • ESM-compatible project (set "type": "module" in package.json)
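
For reference, a minimal package.json sketch for an ESM project (the "my-app" name is hypothetical):

{
  "name": "my-app",
  "type": "module"
}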

CLI Usage

The package includes a command-line interface for parsing sitemaps and extracting educational metadata.

Installation

# Install globally for CLI usage
npm install -g amb-sitemap-parser

# Or use with npx (no installation needed)
npx amb-sitemap-parser --help

Commands

parse - Parse a sitemap and list URLs

# Basic usage
amb-sitemap-parser parse https://example.com/sitemap.xml

# Limit to first 10 URLs
amb-sitemap-parser parse https://example.com/sitemap.xml --limit 10

# Pretty-print JSON output
amb-sitemap-parser parse https://example.com/sitemap.xml --pretty

# Combine options
amb-sitemap-parser parse https://example.com/sitemap.xml --limit 20 --pretty

Output:

{
  "urls": ["url1", "url2", "url3"],
  "isIndex": false,
  "count": 3
}

extract - Extract educational metadata

Full pipeline: parse sitemap → fetch pages → extract metadata → filter educational content

# Basic usage
amb-sitemap-parser extract https://example.com/sitemap.xml

# With verbose logging (logs to stderr)
amb-sitemap-parser extract https://example.com/sitemap.xml --verbose

# Limit URLs and adjust concurrency
amb-sitemap-parser extract https://example.com/sitemap.xml --limit 20 --max-concurrency 10

# Pretty-print output
amb-sitemap-parser extract https://example.com/sitemap.xml --pretty

# Custom timeout (in milliseconds)
amb-sitemap-parser extract https://example.com/sitemap.xml --timeout 60000

# Save to file
amb-sitemap-parser extract https://example.com/sitemap.xml > results.json

Output:

{
  "metadata": [
    {
      "url": "https://example.com/course",
      "title": "Introduction to Python",
      "description": "Learn Python basics",
      "jsonLdData": [
        {
          "@type": "Course",
          "name": "Introduction to Python"
        }
      ],
      "extractionTime": 45
    }
  ],
  "summary": {
    "totalUrls": 100,
    "fetched": 95,
    "withMetadata": 78,
    "educational": 42
  }
}

fetch - Fetch URL(s) directly and extract metadata

Extract educational metadata from specific URLs without needing a sitemap.

# Fetch a single URL
amb-sitemap-parser fetch https://example.com/course

# Fetch multiple URLs
amb-sitemap-parser fetch https://example.com/course1 https://example.com/course2

# With verbose logging
amb-sitemap-parser fetch https://example.com/course --verbose

# Adjust concurrency for multiple URLs
amb-sitemap-parser fetch https://example.com/course1 https://example.com/course2 --max-concurrency 10

# Pretty-print output
amb-sitemap-parser fetch https://example.com/course --pretty

# Custom timeout (in milliseconds)
amb-sitemap-parser fetch https://example.com/course --timeout 60000

# Save to file
amb-sitemap-parser fetch https://example.com/course > result.json

Output: (same format as extract command)

{
  "metadata": [
    {
      "url": "https://example.com/course",
      "title": "Introduction to Python",
      "description": "Learn Python basics",
      "jsonLdData": [
        {
          "@type": "Course",
          "name": "Introduction to Python"
        }
      ],
      "extractionTime": 45
    }
  ],
  "summary": {
    "totalUrls": 1,
    "fetched": 1,
    "withMetadata": 1,
    "educational": 1
  }
}

CLI Options Reference

parse command options:

  • -l, --limit <number> - Limit to first N URLs
  • -p, --pretty - Pretty-print JSON output

extract command options:

  • -l, --limit <number> - Limit URLs to process
  • -c, --max-concurrency <number> - Maximum concurrent requests (default: 5)
  • -t, --timeout <number> - Request timeout in milliseconds (default: 30000)
  • -o, --output <filepath> - Save metadata to JSONL file (one JSON-LD object per line)
  • --jsonl - Stream one AMB resource per line to stdout (cannot be combined with --output; see the integration section below)
  • -p, --pretty - Pretty-print JSON output
  • -v, --verbose - Show progress logs (written to stderr)

fetch command options:

  • -c, --max-concurrency <number> - Maximum concurrent requests (default: 5)
  • -t, --timeout <number> - Request timeout in milliseconds (default: 30000)
  • -o, --output <filepath> - Save metadata to JSONL file (one JSON-LD object per line)
  • --jsonl - Stream one AMB resource per line to stdout (cannot be combined with --output; see the integration section below)
  • -p, --pretty - Pretty-print JSON output
  • -v, --verbose - Show progress logs (written to stderr)

JSONL Output Format

When using the --output option, metadata is saved in JSON Lines (JSONL) format - one JSON-LD object per line. This format is ideal for streaming, piping, and processing with standard Unix tools.

How It Works

Each JSON-LD object from a page becomes a separate line in the file:

{"url":"https://example.com/course1","jsonLd":{"@type":"Course","name":"Python 101"}}
{"url":"https://example.com/course1","jsonLd":{"@type":"LearningResource","name":"Exercise 1"}}
{"url":"https://example.com/course2","jsonLd":{"@type":"Course","name":"JavaScript Basics"}}

Key benefits:

  • ✅ Each line is a complete, independent record
  • ✅ Stream-friendly (process one record at a time)
  • ✅ Perfect for Unix pipelines (jq, grep, awk)
  • ✅ Easy to parallelize processing
  • ✅ If a page has multiple JSON-LD objects, each gets its own line

Output Behavior

With --output flag:

amb-sitemap-parser extract https://example.com/sitemap.xml --output results.jsonl
  • File (results.jsonl): Contains JSONL records (one JSON-LD object per line)
  • stdout: Summary statistics only
    {
      "summary": {
        "totalUrls": 100,
        "fetched": 95,
        "withMetadata": 78,
        "educational": 45,
        "recordsWritten": 67
      }
    }

Without --output flag:

amb-sitemap-parser extract https://example.com/sitemap.xml
  • stdout: Full metadata array + summary (the default behavior)

CLI Examples

# Quick test: parse and see first 5 URLs
amb-sitemap-parser parse https://example.com/sitemap.xml --limit 5 --pretty

# Extract metadata from first 10 URLs with verbose logging
amb-sitemap-parser extract https://example.com/sitemap.xml --limit 10 --verbose --pretty

# High-performance extraction: 20 concurrent requests
amb-sitemap-parser extract https://example.com/sitemap.xml --max-concurrency 20

# Save to JSONL file
amb-sitemap-parser extract https://example.com/sitemap.xml --output results.jsonl --verbose

# Save from direct URL fetch
amb-sitemap-parser fetch https://example.com/course --output course.jsonl

# Pipeline with jq for further processing
amb-sitemap-parser extract https://example.com/sitemap.xml | jq '.metadata[] | select(.jsonLd != null)'

JSONL Piping Examples

The JSONL format enables powerful command-line workflows:

# Save metadata to file
amb-sitemap-parser extract https://example.com/sitemap.xml -o results.jsonl -v

# Count total records
wc -l < results.jsonl

# Filter by type: only Courses
cat results.jsonl | jq 'select(.jsonLd."@type" == "Course")'

# Extract all course names
cat results.jsonl | jq -r '.jsonLd.name' | sort | uniq

# Count by type
cat results.jsonl | jq -r '.jsonLd."@type"' | sort | uniq -c

# Find courses from specific domain
cat results.jsonl | jq 'select(.url | contains("example.com"))'

# Get URLs with specific property
cat results.jsonl | jq -r 'select(.jsonLd.educationalLevel) | .url'

# Process one record at a time
cat results.jsonl | while IFS= read -r line; do
  echo "$line" | jq '.jsonLd.name'
done

# Parallel processing with GNU parallel
cat results.jsonl | parallel --pipe -L1 'jq ".jsonLd.name"'

# Convert to CSV (name and URL only)
cat results.jsonl | jq -r '[.jsonLd.name, .url] | @csv'

# Filter and save to new file
cat results.jsonl | jq 'select(.jsonLd."@type" == "Course")' > courses-only.jsonl

Integration with AMB-Nostr-Converter

Convert extracted AMB resources directly to Nostr events using the --jsonl flag for seamless streaming between tools.

Prerequisites

Install the AMB-Nostr converter:

npm install -g @edufeed-org/amb-nostr-converter
# Or use npx for on-demand usage

Basic Pipeline

The --jsonl flag outputs one AMB resource per line to stdout, perfect for piping:

# Extract and convert a single resource
amb-sitemap-parser extract https://example.com/sitemap.xml --jsonl --limit 1 | \
  npx amb-convert amb:nostr --pretty

# Extract from direct URL
amb-sitemap-parser fetch https://example.com/course --jsonl | \
  npx amb-convert amb:nostr --pretty

Processing Multiple Resources

Convert all resources from a sitemap to Nostr events:

⚠️ Important: When using a while read loop with piped input, amb-convert can consume stdin from the outer pipeline, causing most lines to be skipped. Always save to a temp file first, then process:

# ✅ CORRECT: Save to temp file first, then process
TEMP_FILE=$(mktemp)
trap "rm -f $TEMP_FILE" EXIT
amb-sitemap-parser extract https://example.com/sitemap.xml --jsonl > "$TEMP_FILE"
while IFS= read -r resource; do 
  echo "$resource" | npx amb-convert amb:nostr 2>/dev/null
done < "$TEMP_FILE" > nostr-events.jsonl

# ❌ WRONG: Direct piping loses most lines due to stdin consumption
# amb-sitemap-parser extract url --jsonl | while read resource; do ...

Why does this happen? When amb-convert runs inside the while read loop, it reads from stdin and inadvertently consumes input from the outer pipeline that was meant for the read command. This causes most lines to be skipped.
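
A minimal shell demonstration of the effect, with cat standing in for any stdin-reading command:

printf 'line1\nline2\nline3\n' | while IFS= read -r line; do
  echo "read got: $line"
  cat > /dev/null  # like amb-convert, this drains the rest of the pipeline
done
# Prints only "read got: line1"; line2 and line3 were swallowed by cat.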

With verbose logging:

TEMP_FILE=$(mktemp)
trap "rm -f $TEMP_FILE" EXIT
amb-sitemap-parser extract https://example.com/sitemap.xml --jsonl -v > "$TEMP_FILE"  # -v logs go to stderr, so progress stays visible
while IFS= read -r resource; do 
  echo "$resource" | npx amb-convert amb:nostr 2>/dev/null
done < "$TEMP_FILE" > nostr-events.jsonl

Signing Nostr Events

Sign events with your Nostr private key:

# Save to temp file first (required to avoid stdin consumption issues)
TEMP_FILE=$(mktemp)
trap "rm -f $TEMP_FILE" EXIT
amb-sitemap-parser extract https://example.com/sitemap.xml --jsonl > "$TEMP_FILE"

# Using nsec (bech32 format)
while IFS= read -r resource; do 
  echo "$resource" | npx amb-convert amb:nostr --nsec $NOSTR_NSEC 2>/dev/null
done < "$TEMP_FILE" > signed-events.jsonl

# Using hex private key
while IFS= read -r resource; do 
  echo "$resource" | npx amb-convert amb:nostr --private-key $NOSTR_PRIVATE_KEY 2>/dev/null
done < "$TEMP_FILE" > signed-events.jsonl

Complete Workflow Examples

# 1. Test with a single URL first (single resource is OK to pipe directly)
amb-sitemap-parser fetch https://example.com/course --jsonl -v | \
  npx amb-convert amb:nostr --pretty

# 2. Process multiple resources (MUST use temp file approach)
TEMP_FILE=$(mktemp)
trap "rm -f $TEMP_FILE" EXIT
amb-sitemap-parser extract https://example.com/sitemap.xml --jsonl -l 10 -v > "$TEMP_FILE"
while IFS= read -r resource; do 
  echo "$resource" | npx amb-convert amb:nostr --nsec $NOSTR_NSEC 2>/dev/null
done < "$TEMP_FILE" > events.jsonl

# 3. Full sitemap conversion with signed events
TEMP_FILE=$(mktemp)
trap "rm -f $TEMP_FILE" EXIT
amb-sitemap-parser extract https://example.com/sitemap.xml --jsonl --max-concurrency 10 -v 2> extraction.log > "$TEMP_FILE"
while IFS= read -r resource; do 
  echo "$resource" | npx amb-convert amb:nostr --nsec $NOSTR_NSEC --pretty 2>/dev/null
done < "$TEMP_FILE" > all-events.jsonl

# 4. Filter and convert only Course types (jq reads from files here, so there is no stdin conflict)
amb-sitemap-parser extract https://example.com/sitemap.xml --jsonl > amb-resources.jsonl
jq -c 'select(.type | contains(["Course"]))' amb-resources.jsonl > courses.jsonl
while IFS= read -r resource; do 
  echo "$resource" | npx amb-convert amb:nostr 2>/dev/null
done < courses.jsonl > courses-events.jsonl

# 5. Save AMB resources and convert separately (recommended approach)
amb-sitemap-parser extract https://example.com/sitemap.xml -o amb-resources.jsonl -v
while IFS= read -r resource; do 
  echo "$resource" | npx amb-convert amb:nostr --nsec $NOSTR_NSEC 2>/dev/null
done < amb-resources.jsonl > nostr-events.jsonl

Why --jsonl Flag?

The --jsonl flag is specifically designed for tool integration:

Without --jsonl (default):

{
  "metadata": [
    { "url": "...", "jsonLd": [{...}] },
    { "url": "...", "jsonLd": [{...}] }
  ],
  "summary": {...}
}

❌ Complex nested structure, hard to pipe

With --jsonl:

{AMB resource 1}
{AMB resource 2}
{AMB resource 3}

✅ One resource per line, perfect for streaming

Integration Benefits

  • Streaming: Process resources one at a time, low memory usage
  • Composable: Combine with standard Unix tools (jq, grep, awk)
  • Resumable: Stop and resume processing at any point
  • Flexible: Filter, transform, and route data as needed
  • Observable: Use verbose mode (-v) to monitor progress via stderr

Troubleshooting

Issue: "Cannot use --output and --jsonl together"

# ❌ Wrong
amb-sitemap-parser extract url --output file.jsonl --jsonl

# ✅ Correct (save to file)
amb-sitemap-parser extract url --output file.jsonl

# ✅ Correct (pipe to stdout)
amb-sitemap-parser extract url --jsonl > file.jsonl

Issue: No output when piping

# Make sure to use --jsonl flag for streaming
amb-sitemap-parser extract url --jsonl | npx amb-convert amb:nostr

Issue: Need to see progress while piping

# Use --verbose flag (logs go to stderr, data to stdout)
amb-sitemap-parser extract url --jsonl -v | npx amb-convert amb:nostr

Programmatic Usage (Node.js)

Quick Start

import { SitemapParser, PageFetcher, MetadataExtractor } from 'amb-sitemap-parser';

// Parse a sitemap
const parser = new SitemapParser();
const sitemap = await parser.parseFromUrl('https://example.com/sitemap.xml');

// Fetch pages
const fetcher = new PageFetcher({ maxConcurrency: 5 });
const results = await fetcher.fetchPages(sitemap.urls.slice(0, 10));

// Extract metadata
const extractor = new MetadataExtractor();
const metadata = await extractor.extractFromPages(results);

// Filter educational content
const educational = MetadataExtractor.filterEducationalMetadata(metadata);
console.log(educational);

API Reference

SitemapParser

Parse XML sitemaps and extract URLs.

const parser = new SitemapParser({
  logger: (msg, level) => console.log(`[${level}] ${msg}`)
});

// Parse from an XML string
const fromString = await parser.parseSitemap(xmlContent);

// Parse from a URL
const fromUrl = await parser.parseFromUrl('https://example.com/sitemap.xml');

// Validate a sitemap URL
const isValid = SitemapParser.isValidSitemapUrl(url);

// Filter educational URLs
const filtered = SitemapParser.filterEducationalUrls(fromUrl.urls);

PageFetcher

Fetch web pages with concurrency control and rate limiting.

const fetcher = new PageFetcher({
  maxConcurrency: 5,          // maximum parallel requests
  timeout: 30000,             // per-request timeout (ms)
  delayBetweenRequests: 100,  // rate-limiting delay between requests (ms)
  retryAttempts: 2,
  retryDelay: 1000,           // wait between retry attempts (ms)
  logger: (msg, level) => console.log(`[${level}] ${msg}`)
});

// Fetch multiple pages
const results = await fetcher.fetchPages(urls);

// Fetch single page
const result = await fetcher.fetchSinglePage('https://example.com/page');

// Validate URL
const isValid = PageFetcher.isValidUrl(url);

// Filter valid URLs
const validUrls = PageFetcher.filterValidUrls(urls);

MetadataExtractor

Extract metadata including JSON-LD from HTML pages.

const extractor = new MetadataExtractor({
  validateSchema: false,
  logger: (msg, level) => console.log(`[${level}] ${msg}`)
});

// Extract from multiple pages
const metadata = await extractor.extractFromPages(fetchResults);

// Extract from a single page
const single = await extractor.extractFromPage(fetchResult);

// Filter educational metadata
const educational = MetadataExtractor.filterEducationalMetadata(metadata);

// Check if has valid content
const hasContent = MetadataExtractor.hasValidContent(metadata);

Types

All types are exported and can be imported:

import type {
  SitemapUrl,
  ParsedSitemap,
  FetchResult,
  FetchOptions,
  ExtractedMetadata,
  LoggerFunction,
} from 'amb-sitemap-parser';
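
As a sketch of how the types compose (assuming ExtractedMetadata exposes the title field shown in the CLI output above, and LoggerFunction matches the (message, level) shape used throughout this README):

import type { ExtractedMetadata, LoggerFunction } from 'amb-sitemap-parser';

// Hypothetical helper: log the title of each extracted page.
function logTitles(items: ExtractedMetadata[], log: LoggerFunction): void {
  for (const item of items) {
    log(`title: ${item.title ?? '(none)'}`, 'info');
  }
}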

Tree-Shakeable Imports

Import only what you need for smaller bundle sizes:

import { SitemapParser } from 'amb-sitemap-parser/sitemap';
import { PageFetcher } from 'amb-sitemap-parser/fetcher';
import { MetadataExtractor } from 'amb-sitemap-parser/extractor';

Examples

Complete Workflow

import { SitemapParser, PageFetcher, MetadataExtractor } from 'amb-sitemap-parser';

async function processSitemap(sitemapUrl: string) {
  // Initialize components
  const parser = new SitemapParser();
  const fetcher = new PageFetcher({ maxConcurrency: 5 });
  const extractor = new MetadataExtractor();

  // Parse sitemap
  const sitemap = await parser.parseFromUrl(sitemapUrl);
  console.log(`Found ${sitemap.urls.length} URLs`);

  // Limit to first 50 URLs
  const urlsToProcess = sitemap.urls.slice(0, 50);

  // Fetch pages
  const results = await fetcher.fetchPages(urlsToProcess);
  console.log(`Fetched ${results.filter(r => r.success).length} pages successfully`);

  // Extract metadata
  const metadata = await extractor.extractFromPages(results);
  
  // Filter educational content
  const educational = MetadataExtractor.filterEducationalMetadata(metadata);
  console.log(`Found ${educational.length} educational resources`);

  return educational;
}
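
The workflow can then be invoked directly (top-level await works in an ESM module):

const educational = await processSitemap('https://example.com/sitemap.xml');
console.log(JSON.stringify(educational, null, 2));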

With Custom Logger

const logger = (message: string, level: 'info' | 'warn' | 'error') => {
  const timestamp = new Date().toISOString();
  console.log(`[${timestamp}] [${level.toUpperCase()}] ${message}`);
};

const parser = new SitemapParser({ logger });
const fetcher = new PageFetcher({ logger, maxConcurrency: 3 });
const extractor = new MetadataExtractor({ logger });

Development

# Install dependencies
npm install

# Build the library
npm run build

# Run tests
npm test

# Run tests in watch mode
npm run test:watch

# Run tests with UI
npm run test:ui

# Generate coverage report
npm run test:coverage

# Lint code
npm run lint

# Format code
npm run format

License

MIT

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
