A modern TypeScript library for parsing sitemaps and extracting educational metadata from web pages. Built with pure ESM for Node.js 20+.
- 📋 Parse XML sitemaps (regular sitemaps and sitemap indexes)
- 🌐 Concurrent page fetching with rate limiting
- 🔍 Extract JSON-LD metadata from HTML pages
- 📊 Filter educational content automatically
- 🎯 Full TypeScript support with exported types
- 🚀 Pure ESM for modern JavaScript environments
- 🧪 Tested with Vitest
npm install amb-sitemap-parser
Requirements:
- Node.js >= 20.0.0
- ESM-compatible project (set "type": "module" in package.json)
The package includes a command-line interface for parsing sitemaps and extracting educational metadata.
# Install globally for CLI usage
npm install -g amb-sitemap-parser
# Or use with npx (no installation needed)
npx amb-sitemap-parser --help
Parse a sitemap and list its URLs:
# Basic usage
amb-sitemap-parser parse https://example.com/sitemap.xml
# Limit to first 10 URLs
amb-sitemap-parser parse https://example.com/sitemap.xml --limit 10
# Pretty-print JSON output
amb-sitemap-parser parse https://example.com/sitemap.xml --pretty
# Combine options
amb-sitemap-parser parse https://example.com/sitemap.xml --limit 20 --pretty
Output:
{
"urls": ["url1", "url2", "url3"],
"isIndex": false,
"count": 3
}
Full pipeline: parse sitemap → fetch pages → extract metadata → filter educational content.
# Basic usage
amb-sitemap-parser extract https://example.com/sitemap.xml
# With verbose logging (logs to stderr)
amb-sitemap-parser extract https://example.com/sitemap.xml --verbose
# Limit URLs and adjust concurrency
amb-sitemap-parser extract https://example.com/sitemap.xml --limit 20 --max-concurrency 10
# Pretty-print output
amb-sitemap-parser extract https://example.com/sitemap.xml --pretty
# Custom timeout (in milliseconds)
amb-sitemap-parser extract https://example.com/sitemap.xml --timeout 60000
# Save to file
amb-sitemap-parser extract https://example.com/sitemap.xml > results.json
Output:
{
"metadata": [
{
"url": "https://example.com/course",
"title": "Introduction to Python",
"description": "Learn Python basics",
"jsonLdData": [
{
"@type": "Course",
"name": "Introduction to Python"
}
],
"extractionTime": 45
}
],
"summary": {
"totalUrls": 100,
"fetched": 95,
"withMetadata": 78,
"educational": 42
}
}
Extract educational metadata from specific URLs without needing a sitemap.
# Fetch a single URL
amb-sitemap-parser fetch https://example.com/course
# Fetch multiple URLs
amb-sitemap-parser fetch https://example.com/course1 https://example.com/course2
# With verbose logging
amb-sitemap-parser fetch https://example.com/course --verbose
# Adjust concurrency for multiple URLs
amb-sitemap-parser fetch https://example.com/course1 https://example.com/course2 --max-concurrency 10
# Pretty-print output
amb-sitemap-parser fetch https://example.com/course --pretty
# Custom timeout (in milliseconds)
amb-sitemap-parser fetch https://example.com/course --timeout 60000
# Save to file
amb-sitemap-parser fetch https://example.com/course > result.json
Output (same format as the extract command):
{
"metadata": [
{
"url": "https://example.com/course",
"title": "Introduction to Python",
"description": "Learn Python basics",
"jsonLdData": [
{
"@type": "Course",
"name": "Introduction to Python"
}
],
"extractionTime": 45
}
],
"summary": {
"totalUrls": 1,
"fetched": 1,
"withMetadata": 1,
"educational": 1
}
}
Options for parse:
- -l, --limit <number> - Limit to first N URLs
- -p, --pretty - Pretty-print JSON output

Options for extract:
- -l, --limit <number> - Limit URLs to process
- -c, --max-concurrency <number> - Maximum concurrent requests (default: 5)
- -t, --timeout <number> - Request timeout in milliseconds (default: 30000)
- -o, --output <filepath> - Save metadata to JSONL file (one JSON-LD object per line)
- --jsonl - Output one AMB resource per line to stdout
- -p, --pretty - Pretty-print JSON output
- -v, --verbose - Show progress logs (written to stderr)

Options for fetch:
- -c, --max-concurrency <number> - Maximum concurrent requests (default: 5)
- -t, --timeout <number> - Request timeout in milliseconds (default: 30000)
- -o, --output <filepath> - Save metadata to JSONL file (one JSON-LD object per line)
- --jsonl - Output one AMB resource per line to stdout
- -p, --pretty - Pretty-print JSON output
- -v, --verbose - Show progress logs (written to stderr)
When using the --output option, metadata is saved in JSON Lines (JSONL) format - one JSON-LD object per line. This format is ideal for streaming, piping, and processing with standard Unix tools.
Each JSON-LD object from a page becomes a separate line in the file:
{"url":"https://example.com/course1","jsonLd":{"@type":"Course","name":"Python 101"}}
{"url":"https://example.com/course1","jsonLd":{"@type":"LearningResource","name":"Exercise 1"}}
{"url":"https://example.com/course2","jsonLd":{"@type":"Course","name":"JavaScript Basics"}}Key benefits:
- ✅ Each line is a complete, independent record
- ✅ Stream-friendly (process one record at a time; see the sketch after this list)
- ✅ Perfect for Unix pipelines (jq, grep, awk)
- ✅ Easy to parallelize processing
- ✅ If a page has multiple JSON-LD objects, each gets its own line
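If you prefer to consume a JSONL file from Node.js rather than the shell, here is a minimal sketch (not part of this library) that streams records one at a time with Node's built-in readline. It assumes a local results.jsonl with the {url, jsonLd} shape shown above:

```typescript
// Sketch: stream a JSONL file one record at a time (Node 20+, ESM).
// Assumes results.jsonl was produced with --output and each line looks like
// {"url": "...", "jsonLd": {...}} as in the example records above.
import { createReadStream } from 'node:fs';
import { createInterface } from 'node:readline';

const rl = createInterface({ input: createReadStream('results.jsonl') });

for await (const line of rl) {
  if (!line.trim()) continue;              // skip blank lines
  const record = JSON.parse(line);         // one complete, independent record
  if (record.jsonLd?.['@type'] === 'Course') {
    console.log(record.jsonLd.name);       // e.g. "Python 101"
  }
}
```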
With --output flag:
amb-sitemap-parser extract https://example.com/sitemap.xml --output results.jsonl
- File (results.jsonl): Contains JSONL records (one JSON-LD object per line)
- stdout: Summary statistics only
{ "summary": { "totalUrls": 100, "fetched": 95, "withMetadata": 78, "educational": 45, "recordsWritten": 67 } }
Without --output flag:
amb-sitemap-parser extract https://example.com/sitemap.xml
- stdout: Full metadata array + summary (default behavior)
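To consume that default output programmatically, a small sketch (assuming the report was saved as results.json, as in the earlier redirect example) could read both the metadata array and the summary. Field names follow the example output above and should be treated as illustrative:

```typescript
// Sketch: load the default (non-JSONL) extract output saved as results.json.
import { readFileSync } from 'node:fs';

const report = JSON.parse(readFileSync('results.json', 'utf8'));
console.log(`${report.summary.educational} of ${report.summary.totalUrls} URLs look educational`);
for (const item of report.metadata) {
  console.log(item.url, item.title ?? '(no title)');
}
```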
# Quick test: parse and see first 5 URLs
amb-sitemap-parser parse https://example.com/sitemap.xml --limit 5 --pretty
# Extract metadata from first 10 URLs with verbose logging
amb-sitemap-parser extract https://example.com/sitemap.xml --limit 10 --verbose --pretty
# High-performance extraction: 20 concurrent requests
amb-sitemap-parser extract https://example.com/sitemap.xml --max-concurrency 20
# Save to JSONL file
amb-sitemap-parser extract https://example.com/sitemap.xml --output results.jsonl --verbose
# Save from direct URL fetch
amb-sitemap-parser fetch https://example.com/course --output course.jsonl
# Pipeline with jq for further processing
amb-sitemap-parser extract https://example.com/sitemap.xml | jq '.metadata[] | select(.jsonLd != null)'
The JSONL format enables powerful command-line workflows:
# Save metadata to file
amb-sitemap-parser extract https://example.com/sitemap.xml -o results.jsonl -v
# Count total records
wc -l < results.jsonl
# Filter by type: only Courses
cat results.jsonl | jq 'select(.jsonLd."@type" == "Course")'
# Extract all course names
cat results.jsonl | jq -r '.jsonLd.name' | sort | uniq
# Count by type
cat results.jsonl | jq -r '.jsonLd."@type"' | sort | uniq -c
# Find courses from specific domain
cat results.jsonl | jq 'select(.url | contains("example.com"))'
# Get URLs with specific property
cat results.jsonl | jq -r 'select(.jsonLd.educationalLevel) | .url'
# Process one record at a time
cat results.jsonl | while IFS= read -r line; do
echo "$line" | jq '.jsonLd.name'
done
# Parallel processing with GNU parallel
cat results.jsonl | parallel --pipe -L1 'jq ".jsonLd.name"'
# Convert to CSV (name and URL only)
cat results.jsonl | jq -r '[.jsonLd.name, .url] | @csv'
# Filter and save to new file
cat results.jsonl | jq 'select(.jsonLd."@type" == "Course")' > courses-only.jsonl
Convert extracted AMB resources directly to Nostr events using the --jsonl flag for seamless streaming between tools.
Install the AMB-Nostr converter:
npm install -g @edufeed-org/amb-nostr-converter
# Or use npx for on-demand usage
The --jsonl flag outputs one AMB resource per line to stdout, perfect for piping:
# Extract and convert a single resource
amb-sitemap-parser extract https://example.com/sitemap.xml --jsonl --limit 1 | \
npx amb-convert amb:nostr --pretty
# Extract from direct URL
amb-sitemap-parser fetch https://example.com/course --jsonl | \
npx amb-convert amb:nostr --pretty
Convert all resources from a sitemap to Nostr events:
⚠️ Important: When using a while read loop with piped input, amb-convert can consume stdin from the outer pipeline, causing most lines to be skipped. Always save to a temp file first, then process:
# ✅ CORRECT: Save to temp file first, then process
TEMP_FILE=$(mktemp)
trap "rm -f $TEMP_FILE" EXIT
amb-sitemap-parser extract https://example.com/sitemap.xml --jsonl > "$TEMP_FILE"
while IFS= read -r resource; do
echo "$resource" | npx amb-convert amb:nostr 2>/dev/null
done < "$TEMP_FILE" > nostr-events.jsonl
# ❌ WRONG: Direct piping loses most lines due to stdin consumption
# amb-sitemap-parser extract url --jsonl | while read resource; do ...
Why does this happen? When amb-convert runs inside the while read loop, it reads from stdin and inadvertently consumes input from the outer pipeline that was meant for the read command. This causes most lines to be skipped.
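If shell quoting and stdin handling get in the way, a hedged alternative is to drive the conversion from Node.js so each amb-convert invocation gets its own stdin. This is only a sketch; it assumes the AMB resources were first saved to amb-resources.jsonl (for example with --jsonl > amb-resources.jsonl):

```typescript
// Sketch: convert saved AMB resources to Nostr events without a shell loop.
// Each amb-convert call receives exactly one resource on stdin, so nothing
// competes for the outer input. Spawning npx per line is slow for large files.
import { readFileSync, appendFileSync } from 'node:fs';
import { execFileSync } from 'node:child_process';

const resources = readFileSync('amb-resources.jsonl', 'utf8').split('\n').filter(Boolean);

for (const resource of resources) {
  const event = execFileSync('npx', ['amb-convert', 'amb:nostr'], {
    input: resource,
    encoding: 'utf8',
  });
  appendFileSync('nostr-events.jsonl', event.trimEnd() + '\n');
}
```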
With verbose logging:
TEMP_FILE=$(mktemp)
trap "rm -f $TEMP_FILE" EXIT
# Verbose logs already go to stderr, so progress stays visible while JSONL data goes to the file
amb-sitemap-parser extract https://example.com/sitemap.xml --jsonl -v > "$TEMP_FILE"
while IFS= read -r resource; do
echo "$resource" | npx amb-convert amb:nostr 2>/dev/null
done < "$TEMP_FILE" > nostr-events.jsonlSign events with your Nostr private key:
# Save to temp file first (required to avoid stdin consumption issues)
TEMP_FILE=$(mktemp)
trap "rm -f $TEMP_FILE" EXIT
amb-sitemap-parser extract https://example.com/sitemap.xml --jsonl > "$TEMP_FILE"
# Using nsec (bech32 format)
while IFS= read -r resource; do
echo "$resource" | npx amb-convert amb:nostr --nsec $NOSTR_NSEC 2>/dev/null
done < "$TEMP_FILE" > signed-events.jsonl
# Using hex private key
while IFS= read -r resource; do
echo "$resource" | npx amb-convert amb:nostr --private-key $NOSTR_PRIVATE_KEY 2>/dev/null
done < "$TEMP_FILE" > signed-events.jsonl# 1. Test with a single URL first (single resource is OK to pipe directly)
amb-sitemap-parser fetch https://example.com/course --jsonl -v | \
npx amb-convert amb:nostr --pretty
# 2. Process multiple resources (MUST use temp file approach)
TEMP_FILE=$(mktemp)
trap "rm -f $TEMP_FILE" EXIT
amb-sitemap-parser extract https://example.com/sitemap.xml --jsonl -l 10 -v > "$TEMP_FILE"
while IFS= read -r resource; do
echo "$resource" | npx amb-convert amb:nostr --nsec $NOSTR_NSEC 2>/dev/null
done < "$TEMP_FILE" > events.jsonl
# 3. Full sitemap conversion with signed events
TEMP_FILE=$(mktemp)
trap "rm -f $TEMP_FILE" EXIT
amb-sitemap-parser extract https://example.com/sitemap.xml --jsonl --max-concurrency 10 -v 2> extraction.log > "$TEMP_FILE"
while IFS= read -r resource; do
echo "$resource" | npx amb-convert amb:nostr --nsec $NOSTR_NSEC --pretty 2>/dev/null
done < "$TEMP_FILE" > all-events.jsonl
# 4. Filter and convert only Course types (jq doesn't consume stdin, so intermediate file is safe)
amb-sitemap-parser extract https://example.com/sitemap.xml --jsonl > amb-resources.jsonl
jq -c 'select(.type | contains(["Course"]))' amb-resources.jsonl > courses.jsonl
while IFS= read -r resource; do
echo "$resource" | npx amb-convert amb:nostr 2>/dev/null
done < courses.jsonl > courses-events.jsonl
# 5. Save AMB resources and convert separately (recommended approach)
amb-sitemap-parser extract https://example.com/sitemap.xml -o amb-resources.jsonl -v
while IFS= read -r resource; do
echo "$resource" | npx amb-convert amb:nostr --nsec $NOSTR_NSEC 2>/dev/null
done < amb-resources.jsonl > nostr-events.jsonlThe --jsonl flag is specifically designed for tool integration:
Without --jsonl (default):
{
"metadata": [
{ "url": "...", "jsonLd": [{...}] },
{ "url": "...", "jsonLd": [{...}] }
],
"summary": {...}
}
❌ Complex nested structure, hard to pipe
With --jsonl:
{AMB resource 1}
{AMB resource 2}
{AMB resource 3}
✅ One resource per line, perfect for streaming
- ✅ Streaming: Process resources one at a time with low memory usage (see the sketch after this list)
- ✅ Composable: Combine with standard Unix tools (jq, grep, awk)
- ✅ Resumable: Stop and resume processing at any point
- ✅ Flexible: Filter, transform, and route data as needed
- ✅ Observable: Use verbose mode (-v) to monitor progress via stderr
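As a streaming example, the sketch below (a hypothetical filter-courses.mjs, not shipped with this package) reads resources from stdin one line at a time and keeps only Course-typed entries, mirroring the jq filter used earlier. It assumes each AMB resource carries a type array:

```typescript
// filter-courses.mjs (hypothetical): keep only Course-typed AMB resources.
// Usage: amb-sitemap-parser extract <sitemap-url> --jsonl | node filter-courses.mjs
import { createInterface } from 'node:readline';

const rl = createInterface({ input: process.stdin });

for await (const line of rl) {
  if (!line.trim()) continue;
  const resource = JSON.parse(line);
  if (Array.isArray(resource.type) && resource.type.includes('Course')) {
    console.log(JSON.stringify(resource));   // one resource per line, as received
  }
}
```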
Issue: "Cannot use --output and --jsonl together"
# ❌ Wrong
amb-sitemap-parser extract url --output file.jsonl --jsonl
# ✅ Correct (save to file)
amb-sitemap-parser extract url --output file.jsonl
# ✅ Correct (pipe to stdout)
amb-sitemap-parser extract url --jsonl > file.jsonl
Issue: No output when piping
# Make sure to use --jsonl flag for streaming
amb-sitemap-parser extract url --jsonl | npx amb-convert amb:nostr
Issue: Need to see progress while piping
# Use --verbose flag (logs go to stderr, data to stdout)
amb-sitemap-parser extract url --jsonl -v | npx amb-convert amb:nostr 2>&1
Library usage:
import { SitemapParser, PageFetcher, MetadataExtractor } from 'amb-sitemap-parser';
// Parse a sitemap
const parser = new SitemapParser();
const sitemap = await parser.parseFromUrl('https://example.com/sitemap.xml');
// Fetch pages
const fetcher = new PageFetcher({ maxConcurrency: 5 });
const results = await fetcher.fetchPages(sitemap.urls.slice(0, 10));
// Extract metadata
const extractor = new MetadataExtractor();
const metadata = await extractor.extractFromPages(results);
// Filter educational content
const educational = MetadataExtractor.filterEducationalMetadata(metadata);
console.log(educational);
SitemapParser: Parse XML sitemaps and extract URLs.
const parser = new SitemapParser({
logger: (msg, level) => console.log(`[${level}] ${msg}`)
});
// Parse from string
const sitemap = await parser.parseSitemap(xmlContent);
// Parse from URL
const sitemap = await parser.parseFromUrl('https://example.com/sitemap.xml');
// Validate URL
const isValid = SitemapParser.isValidSitemapUrl(url);
// Filter educational URLs
const filtered = SitemapParser.filterEducationalUrls(sitemap.urls);
PageFetcher: Fetch web pages with concurrency control and rate limiting.
const fetcher = new PageFetcher({
maxConcurrency: 5,
timeout: 30000,
delayBetweenRequests: 100,
retryAttempts: 2,
retryDelay: 1000,
logger: (msg, level) => console.log(`[${level}] ${msg}`)
});
// Fetch multiple pages
const results = await fetcher.fetchPages(urls);
// Fetch single page
const result = await fetcher.fetchSinglePage('https://example.com/page');
// Validate URL
const isValid = PageFetcher.isValidUrl(url);
// Filter valid URLs
const validUrls = PageFetcher.filterValidUrls(urls);
MetadataExtractor: Extract metadata, including JSON-LD, from HTML pages.
const extractor = new MetadataExtractor({
validateSchema: false,
logger: (msg, level) => console.log(`[${level}] ${msg}`)
});
// Extract from multiple pages
const metadata = await extractor.extractFromPages(fetchResults);
// Extract from single page
const metadata = await extractor.extractFromPage(fetchResult);
// Filter educational metadata
const educational = MetadataExtractor.filterEducationalMetadata(metadata);
// Check if has valid content
const hasContent = MetadataExtractor.hasValidContent(metadata);
All types are exported and can be imported:
import type {
SitemapUrl,
ParsedSitemap,
FetchResult,
FetchOptions,
ExtractedMetadata,
LoggerFunction,
} from 'amb-sitemap-parser';
Import only what you need for smaller bundle sizes:
import { SitemapParser } from 'amb-sitemap-parser/sitemap';
import { PageFetcher } from 'amb-sitemap-parser/fetcher';
import { MetadataExtractor } from 'amb-sitemap-parser/extractor';
A complete example:
import { SitemapParser, PageFetcher, MetadataExtractor } from 'amb-sitemap-parser';
async function processSitemap(sitemapUrl: string) {
// Initialize components
const parser = new SitemapParser();
const fetcher = new PageFetcher({ maxConcurrency: 5 });
const extractor = new MetadataExtractor();
// Parse sitemap
const sitemap = await parser.parseFromUrl(sitemapUrl);
console.log(`Found ${sitemap.urls.length} URLs`);
// Limit to first 50 URLs
const urlsToProcess = sitemap.urls.slice(0, 50);
// Fetch pages
const results = await fetcher.fetchPages(urlsToProcess);
console.log(`Fetched ${results.filter(r => r.success).length} pages successfully`);
// Extract metadata
const metadata = await extractor.extractFromPages(results);
// Filter educational content
const educational = MetadataExtractor.filterEducationalMetadata(metadata);
console.log(`Found ${educational.length} educational resources`);
return educational;
}
With a custom logger:
const logger = (message: string, level: 'info' | 'warn' | 'error') => {
const timestamp = new Date().toISOString();
console.log(`[${timestamp}] [${level.toUpperCase()}] ${message}`);
};
const parser = new SitemapParser({ logger });
const fetcher = new PageFetcher({ logger, maxConcurrency: 3 });
const extractor = new MetadataExtractor({ logger });
Development:
# Install dependencies
npm install
# Build the library
npm run build
# Run tests
npm test
# Run tests in watch mode
npm run test:watch
# Run tests with UI
npm run test:ui
# Generate coverage report
npm run test:coverage
# Lint code
npm run lint
# Format code
npm run format
License: MIT
Contributions are welcome! Please feel free to submit a Pull Request.