
Commit 208c8e9

Feat/blog mcp (#92)
* Add blog-mcp module with content extraction and deduplication features
  - Introduced a new workspace for the blog-mcp module, including package.json and README.
  - Implemented content extraction using Firecrawl and state persistence with Supabase.
  - Added tools for processing URLs, checking updates, and managing watermarks.
  - Created necessary TypeScript configurations and utility functions for content normalization and fingerprinting.
  - Updated bun.lock and package.json to include new dependencies for the blog-mcp module.

* Integrate Supabase MCP binding for storage operations
  - Added SUPABASE binding in wrangler.toml for enhanced database integration.
  - Updated StateSchema to reflect the use of Supabase via MCP binding.
  - Refactored Supabase client implementation to utilize the new binding for state persistence, including content management and watermark handling.
  - Adjusted client retrieval in blog tools to accommodate the new storage method.

* feat(content-scraper): add initial implementation for content extraction and summarization MCP
  - Introduced package.json for project dependencies and scripts.
  - Added README.md detailing features, setup instructions, and available tools.
  - Implemented TypeScript configuration in tsconfig.json.
  - Created Vite configuration for building the project.
  - Developed core server logic in main.ts, including state schema and environment setup.
  - Implemented content processing utilities in content.ts for normalization and summarization.
  - Added Firecrawl API client in firecrawl.ts for content extraction.
  - Integrated Supabase for state persistence in supabase.ts.
  - Developed tools for processing URLs, checking updates, and managing watermarks in scraper.ts.
  - Generated types for Deco integration in deco.gen.ts.
  - Established project structure with organized directories for server, shared, and tools.

* refactor(content-scraper): transition to n8n for content scraping and remove unused components
  - Updated the content scraper to utilize n8n for web content extraction.
  - Removed Firecrawl API client and related content processing utilities.
  - Simplified the state schema and eliminated contract-related types and bindings.
  - Streamlined the tools for scraping content, focusing on the new n8n integration.
  - Cleaned up the project by deleting unused files and code related to previous implementations.

* refactor(content-scraper): remove deprecated types and interfaces from deco.gen.ts
  - Eliminated unused types and interfaces from the generated TypeScript file.
  - Cleaned up the file to streamline the codebase and improve maintainability.
  - Noted that the file remains auto-generated and should not be manually edited.

* refactor(content-scraper): update scraper tool to use createPrivateTool
  - Replaced deprecated createTool with createPrivateTool in scraper.ts.
  - Adjusted the function signature of scrapeContentTool to improve clarity and maintainability.

* refactor(content-scraper): remove deco.gen.ts file
  - Deleted the deco.gen.ts file as part of the ongoing cleanup and refactoring efforts.
  - This removal aligns with the previous commit that eliminated deprecated types and interfaces, further streamlining the codebase.

* refactor(content-scraper): reintroduce deco.gen.ts with updated types and schemas
  - Added a new deco.gen.ts file containing generated types for MCP and an empty StateSchema.
  - Defined the Env interface for environment variables and included an empty Scopes object.
  - This update aligns with the ongoing efforts to streamline the codebase while providing necessary type definitions for future development.

* refactor(content-scraper): rename blog-mcp to content-scraper and update dependencies
  - Renamed the workspace from "blog-mcp" to "content-scraper" in package.json and bun.lock.
  - Updated dependencies in content-scraper's package.json, including @decocms/runtime to version ^1.1.0 and zod to version ^4.0.0.
  - Introduced app.json for the content-scraper with connection details and description.
  - Removed outdated Vite and Wrangler configuration files to streamline the project structure.
  - Adjusted TypeScript configuration to reflect the new project structure and dependencies.

* refactor(content-scraper): update build scripts in package.json
  - Introduced a new build script for the server targeting Bun, allowing for a more streamlined build process.
  - Updated the existing build script to run the new server build command, enhancing project structure and maintainability.

* refactor(content-scraper): update scraper tool to use dynamic n8n webhook URL
  - Modified the scrapeContentTool to retrieve the n8n webhook URL from the environment state instead of a hardcoded value.
  - Updated the StateSchema to include the n8nWebhookUrl, enhancing flexibility and configurability for content scraping.
1 parent 761895d · commit 208c8e9

11 files changed

Lines changed: 1272 additions & 80 deletions


bun.lock

Lines changed: 921 additions & 79 deletions
Some generated files are not rendered by default.

content-scraper/README.md

Lines changed: 158 additions & 0 deletions
@@ -0,0 +1,158 @@
# Content Scraper MCP

An MCP for extracting, deduplicating, and summarizing web content using Firecrawl and Supabase.

## Features

- **Content Extraction**: Uses Firecrawl to extract title, body, author, and date from URLs
- **Fingerprint Deduplication**: Generates a SHA-256 hash of title + body to identify unique content
- **State Persistence**: Stores records in Supabase to avoid reprocessing
- **Per-Domain Watermarks**: Tracks when each domain was last processed
- **Insight-Focused Summaries**: Generates short summaries by extracting key sentences

## Setup

### 1. Supabase: Create Tables

Run the following in your Supabase project's SQL Editor:

```sql
-- Processed content table
CREATE TABLE scraped_content (
  id UUID DEFAULT gen_random_uuid() PRIMARY KEY,
  url TEXT UNIQUE NOT NULL,
  fingerprint TEXT NOT NULL,
  domain TEXT NOT NULL,
  title TEXT NOT NULL,
  first_seen_at TIMESTAMPTZ NOT NULL,
  last_seen_at TIMESTAMPTZ NOT NULL,
  updated_count INTEGER DEFAULT 0,
  created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Indexes for performance
CREATE INDEX idx_scraped_content_domain ON scraped_content(domain);
CREATE INDEX idx_scraped_content_fingerprint ON scraped_content(fingerprint);
CREATE INDEX idx_scraped_content_url ON scraped_content(url);

-- Per-domain watermarks table
CREATE TABLE scraper_watermarks (
  domain TEXT PRIMARY KEY,
  last_processed_at TIMESTAMPTZ NOT NULL,
  created_at TIMESTAMPTZ DEFAULT NOW()
);
```
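This README doesn't show the queries the MCP runs against these tables. As an illustrative sketch only, the insert-or-update step of the deduplication flow (described later in this README) could be expressed as a single Postgres upsert:

```sql
-- Illustrative only, not the MCP's actual query: new URLs insert; changed
-- fingerprints update the row and bump updated_count; unchanged rows are skipped.
INSERT INTO scraped_content (url, fingerprint, domain, title, first_seen_at, last_seen_at)
VALUES ($1, $2, $3, $4, NOW(), NOW())
ON CONFLICT (url) DO UPDATE
  SET fingerprint   = EXCLUDED.fingerprint,
      title         = EXCLUDED.title,
      last_seen_at  = NOW(),
      updated_count = scraped_content.updated_count + 1
  WHERE scraped_content.fingerprint IS DISTINCT FROM EXCLUDED.fingerprint;
```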
### 2. Firecrawl API Key

Get your API key at https://firecrawl.dev

### 3. Install the MCP

When installing, fill in:

- `firecrawlApiKey`: Your Firecrawl API key
- `supabaseUrl`: Your Supabase project URL (e.g., https://xxx.supabase.co)
- `supabaseKey`: Service role key, or an anon key with RLS configured

## Available Tools

### `process_urls`

Processes a list of URLs:

- Extracts clean content using Firecrawl
- Generates a unique fingerprint (SHA-256 of normalized title + body)
- Checks whether the content already exists in Supabase
- Saves new content, or updates the record if the fingerprint changed
- Returns an insight-focused summary

**Input:**

```json
{
  "urls": ["https://example.com/article-1", "https://example.com/article-2"],
  "generateSummaries": true
}
```

**Output:**

```json
{
  "processed": [
    {
      "url": "https://example.com/article-1",
      "status": "new",
      "title": "Article Title",
      "summary": "Key insights from the content...",
      "fingerprint": "abc123...",
      "domain": "example.com"
    }
  ],
  "stats": {
    "total": 2,
    "new": 1,
    "updated": 0,
    "unchanged": 1,
    "errors": 0
  }
}
```

### `check_updates`

Checks the status of previously processed URLs without re-extracting:

**Input:**

```json
{
  "domain": "example.com"
}
```

### `get_watermarks`

Gets per-domain watermarks (when each domain was last processed):

**Input:**

```json
{
  "domain": "example.com"
}
```
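The response shape is not documented here; a plausible output, mirroring the `scraper_watermarks` columns (the field names are an assumption):

```json
{
  "domain": "example.com",
  "last_processed_at": "2024-01-15T10:30:00Z"
}
```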
## Deduplication Logic

1. **Normalization**: title and body are normalized (lowercased, whitespace collapsed, Unicode normalized)
2. **Fingerprint**: SHA-256 of the normalized `title|body` text (see the sketch below)
3. **Check**:
   - URL does not exist → **new** content
   - URL exists but the fingerprint differs → **update**
   - URL exists and the fingerprint matches → **skip**
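A minimal TypeScript sketch of steps 1 and 2, using the Web Crypto API available in Cloudflare Workers and Bun (the exact normalization order is an assumption):

```ts
// Normalize text as described above: Unicode-normalize, lowercase,
// collapse whitespace. The order of operations is an assumption.
function normalize(text: string): string {
  return text
    .normalize("NFKC")
    .toLowerCase()
    .replace(/\s+/g, " ")
    .trim();
}

// SHA-256 fingerprint of the normalized `title|body` string, hex-encoded.
async function fingerprint(title: string, body: string): Promise<string> {
  const input = `${normalize(title)}|${normalize(body)}`;
  const digest = await crypto.subtle.digest(
    "SHA-256",
    new TextEncoder().encode(input),
  );
  return [...new Uint8Array(digest)]
    .map((byte) => byte.toString(16).padStart(2, "0"))
    .join("");
}
```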
## Development

```bash
cd content-scraper
bun install
bun run dev     # Local development
bun run deploy  # Production deploy
```

## Architecture

```
content-scraper/
├── server/
│   ├── main.ts          # Entry point and StateSchema
│   ├── lib/
│   │   ├── firecrawl.ts # Firecrawl API client
│   │   ├── supabase.ts  # Supabase client for persistence
│   │   ├── content.ts   # Normalization, fingerprint, summary
│   │   └── types.ts     # Shared types
│   └── tools/
│       ├── index.ts     # Exports all tools
│       └── scraper.ts   # Processing tools
├── shared/
│   └── deco.gen.ts      # Generated types
├── package.json
├── wrangler.toml
└── tsconfig.json
```

content-scraper/app.json

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
```json
{
  "scopeName": "deco",
  "name": "content-scraper",
  "friendlyName": "Content Scraper",
  "connection": {
    "type": "HTTP",
    "url": "https://content-scraper.decocache.com/mcp"
  },
  "description": "Scrape web content from URLs using n8n workflow automation.",
  "icon": "https://assets.decocache.com/mcp/content-scraper-icon.svg",
  "unlisted": false
}
```

content-scraper/package.json

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
```json
{
  "name": "mcp-content-scraper",
  "version": "1.0.0",
  "description": "Content extraction, deduplication and summarization MCP using Firecrawl and Supabase",
  "private": true,
  "type": "module",
  "scripts": {
    "dev": "deco dev --vite",
    "configure": "deco configure",
    "gen": "deco gen --output=shared/deco.gen.ts",
    "deploy": "npm run build && deco deploy ./dist/server",
    "check": "tsc --noEmit",
    "build:server": "NODE_ENV=production bun build server/main.ts --target=bun --outfile=dist/server/main.js",
    "build": "bun run build:server"
  },
  "dependencies": {
    "@decocms/runtime": "^1.1.0",
    "@supabase/supabase-js": "^2.49.0",
    "zod": "^4.0.0"
  },
  "devDependencies": {
    "@cloudflare/vite-plugin": "^1.13.4",
    "@cloudflare/workers-types": "^4.20251014.0",
    "@decocms/mcps-shared": "1.0.0",
    "@mastra/core": "^0.24.0",
    "@modelcontextprotocol/sdk": "^1.21.0",
    "@types/mime-db": "^1.43.6",
    "deco-cli": "^0.26.0",
    "typescript": "^5.7.2",
    "vite": "7.2.0",
    "wrangler": "^4.28.0"
  },
  "engines": {
    "node": ">=22.0.0"
  }
}
```

content-scraper/server/main.ts

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
```ts
/**
 * Content Scraper MCP
 *
 * Simple MCP that scrapes web content via n8n webhook.
 */
import { serve } from "@decocms/mcps-shared/serve";
import { withRuntime } from "@decocms/runtime";
import { tools } from "./tools/index.ts";
import { type Env, StateSchema } from "./types/env.ts";

export type { Env };
export { StateSchema };

const runtime = withRuntime<Env, typeof StateSchema>({
  configuration: {
    scopes: [],
    state: StateSchema,
  },
  tools,
});

serve(runtime.fetch);
```
content-scraper/server/tools/index.ts

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
```ts
/**
 * Central export point for all tools.
 */
import { scraperTools } from "./scraper.ts";

// Export all tools
export const tools = [...scraperTools];

// Re-export domain-specific tools for direct access if needed
export { scraperTools } from "./scraper.ts";
```
content-scraper/server/tools/scraper.ts

Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
```ts
/**
 * Content scraping tool via n8n webhook.
 */
import { z } from "zod";
import { createPrivateTool } from "@decocms/runtime/tools";
import type { Env } from "../types/env.ts";

/**
 * Call the n8n webhook to scrape content from a URL.
 */
export const scrapeContentTool = (env: Env) =>
  createPrivateTool({
    id: "scrape_content",
    description:
      "Scrape content from a URL using the n8n workflow. " +
      "Extracts and processes web content through an automated pipeline.",
    inputSchema: z.object({
      url: z.string().url().describe("The URL to scrape content from"),
    }),
    outputSchema: z.object({
      success: z.boolean(),
      data: z.unknown().optional(),
      error: z.string().optional(),
    }),
    execute: async ({ context: input }) => {
      try {
        const { n8nWebhookUrl } = env.DECO_CHAT_REQUEST_CONTEXT.state;
        const url = new URL(n8nWebhookUrl);
        url.searchParams.set("url", input.url);

        const response = await fetch(url.toString());

        if (!response.ok) {
          return {
            success: false,
            error: `Webhook returned ${response.status}: ${response.statusText}`,
          };
        }

        const data = await response.json();

        return {
          success: true,
          data,
        };
      } catch (error) {
        return {
          success: false,
          error: error instanceof Error ? error.message : "Unknown error",
        };
      }
    },
  });

/**
 * Export all scraper tools
 */
export const scraperTools = [scrapeContentTool];
```
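Since the tool only issues a GET request with the target as a `url` query parameter, the configured webhook can be exercised directly; a hypothetical example with a placeholder n8n endpoint:

```bash
curl -G "https://your-n8n-instance.example.com/webhook/scrape-content" \
  --data-urlencode "url=https://example.com/article-1"
```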
content-scraper/server/types/env.ts

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
```ts
/**
 * Environment Type Definitions
 */
import { type DefaultEnv } from "@decocms/runtime";
import { z } from "zod";

export const StateSchema = z.object({
  n8nWebhookUrl: z.string().url().describe("URL do webhook N8N para scraping"),
});

type State = z.infer<typeof StateSchema>;

export type Env = DefaultEnv<typeof StateSchema> & {
  DECO_CHAT_REQUEST_CONTEXT: {
    state: State;
  };
};
```
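At install time this schema corresponds to a single configuration field; a hypothetical value (the endpoint is a placeholder):

```json
{
  "n8nWebhookUrl": "https://your-n8n-instance.example.com/webhook/scrape-content"
}
```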

content-scraper/tsconfig.json

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
```jsonc
{
  "compilerOptions": {
    "target": "ES2022",
    "useDefineForClassFields": true,
    "lib": ["ES2023"],
    "module": "ESNext",
    "skipLibCheck": true,

    /* Bundler mode */
    "moduleResolution": "bundler",
    "allowImportingTsExtensions": true,
    "isolatedModules": true,
    "verbatimModuleSyntax": false,
    "moduleDetection": "force",
    "noEmit": true,
    "allowJs": true,

    /* Linting */
    "strict": true,
    "noUnusedLocals": true,
    "noUnusedParameters": true,
    "noFallthroughCasesInSwitch": true,
    "noUncheckedSideEffectImports": true,

    /* Path Aliases */
    "baseUrl": ".",
    "paths": {
      "server/*": ["./server/*"]
    },

    /* Types */
    "types": ["@types/node"]
  },
  "include": ["server"]
}
```

package.json

Lines changed: 1 addition & 0 deletions
```diff
@@ -21,6 +21,7 @@
   },
   "workspaces": [
     "apify",
+    "content-scraper",
     "data-for-seo",
     "datajud",
     "gemini-pro-vision",
```
