Merged · 14 commits
1,000 changes: 921 additions & 79 deletions bun.lock

Large diffs are not rendered by default.

158 changes: 158 additions & 0 deletions content-scraper/README.md
@@ -0,0 +1,158 @@
# Content Scraper MCP

MCP for web content extraction, deduplication, and summarization using Firecrawl and Supabase.

## Features

- **Content Extraction**: Uses Firecrawl to extract title, body, author, and date from URLs
- **Fingerprint Deduplication**: Generates a SHA-256 hash of title + body to identify unique content
- **State Persistence**: Stores records in Supabase to avoid reprocessing
- **Per-Domain Watermarks**: Tracks when each domain was last processed
- **Insight-Focused Summaries**: Generates short summaries by extracting key phrases

## Setup

### 1. Supabase - Create Tables

Run the following in your Supabase project's SQL Editor:

```sql
-- Processed content table
CREATE TABLE scraped_content (
  id UUID DEFAULT gen_random_uuid() PRIMARY KEY,
  url TEXT UNIQUE NOT NULL,
  fingerprint TEXT NOT NULL,
  domain TEXT NOT NULL,
  title TEXT NOT NULL,
  first_seen_at TIMESTAMPTZ NOT NULL,
  last_seen_at TIMESTAMPTZ NOT NULL,
  updated_count INTEGER DEFAULT 0,
  created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Indexes for performance
CREATE INDEX idx_scraped_content_domain ON scraped_content(domain);
CREATE INDEX idx_scraped_content_fingerprint ON scraped_content(fingerprint);
CREATE INDEX idx_scraped_content_url ON scraped_content(url);

-- Per-domain watermarks table
CREATE TABLE scraper_watermarks (
  domain TEXT PRIMARY KEY,
  last_processed_at TIMESTAMPTZ NOT NULL,
  created_at TIMESTAMPTZ DEFAULT NOW()
);
```
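
Since `url` is declared `UNIQUE`, the persistence layer can upsert by URL. A minimal supabase-js sketch of that idea (this is not the MCP's actual persistence code; credentials and values are placeholders):

```typescript
import { createClient } from "@supabase/supabase-js";

// Placeholder credentials; in the MCP these come from the installation state.
const supabase = createClient("https://xxx.supabase.co", "service-role-key");

// Insert a new record, or overwrite the existing row for the same URL.
const { error } = await supabase.from("scraped_content").upsert(
  {
    url: "https://example.com/article-1",
    fingerprint: "abc123...",
    domain: "example.com",
    title: "Article Title",
    first_seen_at: new Date().toISOString(),
    last_seen_at: new Date().toISOString(),
  },
  { onConflict: "url" },
);
```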

### 2. Firecrawl API Key

Get your API key at https://firecrawl.dev

### 3. Install the MCP

When installing, fill in the following (an illustrative example follows this list):
- `firecrawlApiKey`: your Firecrawl API key
- `supabaseUrl`: your Supabase project URL (e.g. https://xxx.supabase.co)
- `supabaseKey`: service role key, or anon key with RLS configured
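
An illustrative configuration (all values are placeholders):

```json
{
  "firecrawlApiKey": "fc-xxxxxxxxxxxx",
  "supabaseUrl": "https://xxx.supabase.co",
  "supabaseKey": "eyJhbGciOiJIUzI1NiIs..."
}
```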

## Available Tools

### `process_urls`

Processes a list of URLs:
- Extracts clean content using Firecrawl
- Generates a unique fingerprint (SHA-256 of normalized title + body)
- Checks whether the URL already exists in Supabase
- Saves new content, or updates it if the fingerprint changed
- Returns an insight-focused summary

**Input:**
```json
{
  "urls": ["https://example.com/article-1", "https://example.com/article-2"],
  "generateSummaries": true
}
```

**Output:**
```json
{
  "processed": [
    {
      "url": "https://example.com/article-1",
      "status": "new",
      "title": "Article Title",
      "summary": "Key insights from the content...",
      "fingerprint": "abc123...",
      "domain": "example.com"
    }
  ],
  "stats": {
    "total": 2,
    "new": 1,
    "updated": 0,
    "unchanged": 1,
    "errors": 0
  }
}
```

### `check_updates`

Checks the status of previously processed URLs without re-scraping:

**Input:**
```json
{
  "domain": "example.com"
}
```
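
The output shape is not documented here; a hypothetical response, assuming the tool returns the stored records for that domain, might look like:

```json
{
  "urls": [
    {
      "url": "https://example.com/article-1",
      "fingerprint": "abc123...",
      "last_seen_at": "2025-01-15T12:00:00Z"
    }
  ]
}
```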

### `get_watermarks`

Gets watermarks (last processed time) per domain:

**Input:**
```json
{
  "domain": "example.com"
}
```

## Deduplication Logic

1. **Normalization**: title and body are normalized (lowercased, whitespace collapsed, Unicode normalized)
2. **Fingerprint**: SHA-256 of the normalized `title|body` text
3. **Check**:
   - If the URL does not exist → **new** content
   - If the URL exists but the fingerprint differs → **update**
   - If the URL exists and the fingerprint matches → **skip** (see the sketch below)
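
A minimal sketch of steps 1-2, using the Web Crypto API available in Bun and Cloudflare Workers. Names and the NFKC normalization form are assumptions; per the architecture section, the actual implementation lives in `server/lib/content.ts`:

```typescript
/** Lowercase, collapse whitespace, and normalize Unicode (NFKC assumed). */
function normalize(text: string): string {
  return text.normalize("NFKC").toLowerCase().replace(/\s+/g, " ").trim();
}

/** SHA-256 fingerprint of the normalized `title|body` pair, hex-encoded. */
async function fingerprint(title: string, body: string): Promise<string> {
  const input = `${normalize(title)}|${normalize(body)}`;
  const digest = await crypto.subtle.digest(
    "SHA-256",
    new TextEncoder().encode(input),
  );
  return [...new Uint8Array(digest)]
    .map((b) => b.toString(16).padStart(2, "0"))
    .join("");
}
```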

## Development

```bash
cd content-scraper
bun install
bun run dev      # Local development
bun run deploy   # Deploy to production
```

## Architecture

```
content-scraper/
├── server/
│   ├── main.ts              # Entry point and StateSchema
│   ├── lib/
│   │   ├── firecrawl.ts     # Firecrawl API client
│   │   ├── supabase.ts      # Supabase client for persistence
│   │   ├── content.ts       # Normalization, fingerprint, summary
│   │   └── types.ts         # Shared types
│   └── tools/
│       ├── index.ts         # Exports all tools
│       └── scraper.ts       # Processing tools
├── shared/
│   └── deco.gen.ts          # Generated types
├── package.json
├── wrangler.toml
└── tsconfig.json
```
12 changes: 12 additions & 0 deletions content-scraper/app.json
@@ -0,0 +1,12 @@
{
  "scopeName": "deco",
  "name": "content-scraper",
  "friendlyName": "Content Scraper",
  "connection": {
    "type": "HTTP",
    "url": "https://content-scraper.decocache.com/mcp"
  },
  "description": "Scrape web content from URLs using n8n workflow automation.",
  "icon": "https://assets.decocache.com/mcp/content-scraper-icon.svg",
  "unlisted": false
}
37 changes: 37 additions & 0 deletions content-scraper/package.json
@@ -0,0 +1,37 @@
{
  "name": "mcp-content-scraper",
  "version": "1.0.0",
  "description": "Content extraction, deduplication and summarization MCP using Firecrawl and Supabase",
  "private": true,
  "type": "module",
  "scripts": {
    "dev": "deco dev --vite",
    "configure": "deco configure",
    "gen": "deco gen --output=shared/deco.gen.ts",
    "deploy": "npm run build && deco deploy ./dist/server",
    "check": "tsc --noEmit",
    "build:server": "NODE_ENV=production bun build server/main.ts --target=bun --outfile=dist/server/main.js",
    "build": "bun run build:server"
  },
  "dependencies": {
    "@decocms/runtime": "^1.1.0",
    "@supabase/supabase-js": "^2.49.0",
    "zod": "^4.0.0"
  },
  "devDependencies": {
    "@cloudflare/vite-plugin": "^1.13.4",
    "@cloudflare/workers-types": "^4.20251014.0",
    "@decocms/mcps-shared": "1.0.0",
    "@mastra/core": "^0.24.0",
    "@modelcontextprotocol/sdk": "^1.21.0",
    "@types/mime-db": "^1.43.6",
    "deco-cli": "^0.26.0",
    "typescript": "^5.7.2",
    "vite": "7.2.0",
    "wrangler": "^4.28.0"
  },
  "engines": {
    "node": ">=22.0.0"
  }
}

22 changes: 22 additions & 0 deletions content-scraper/server/main.ts
@@ -0,0 +1,22 @@
/**
 * Content Scraper MCP
 *
 * Simple MCP that scrapes web content via n8n webhook.
 */
import { serve } from "@decocms/mcps-shared/serve";
import { withRuntime } from "@decocms/runtime";
import { tools } from "./tools/index.ts";
import { type Env, StateSchema } from "./types/env.ts";

export type { Env };
export { StateSchema };

const runtime = withRuntime<Env, typeof StateSchema>({
  configuration: {
    scopes: [],
    state: StateSchema,
  },
  tools,
});

serve(runtime.fetch);
10 changes: 10 additions & 0 deletions content-scraper/server/tools/index.ts
@@ -0,0 +1,10 @@
/**
 * Central export point for all tools.
 */
import { scraperTools } from "./scraper.ts";

// Export all tools
export const tools = [...scraperTools];

// Re-export domain-specific tools for direct access if needed
export { scraperTools } from "./scraper.ts";
58 changes: 58 additions & 0 deletions content-scraper/server/tools/scraper.ts
@@ -0,0 +1,58 @@
/**
 * Content scraping tool via n8n webhook.
 */
import { z } from "zod";
import { createPrivateTool } from "@decocms/runtime/tools";
import type { Env } from "../types/env.ts";

/**
 * Call the n8n webhook to scrape content from a URL.
 */
export const scrapeContentTool = (env: Env) =>
  createPrivateTool({
    id: "scrape_content",
    description:
      "Scrape content from a URL using the n8n workflow. " +
      "Extracts and processes web content through an automated pipeline.",
    inputSchema: z.object({
      url: z.string().url().describe("The URL to scrape content from"),
    }),
    outputSchema: z.object({
      success: z.boolean(),
      data: z.unknown().optional(),
      error: z.string().optional(),
    }),
    execute: async ({ context: input }) => {
      try {
        const { n8nWebhookUrl } = env.DECO_CHAT_REQUEST_CONTEXT.state;
        const url = new URL(n8nWebhookUrl);
        url.searchParams.set("url", input.url);

        const response = await fetch(url.toString());

        if (!response.ok) {
          return {
            success: false,
            error: `Webhook returned ${response.status}: ${response.statusText}`,
          };
        }

        const data = await response.json();

        return {
          success: true,
          data,
        };
      } catch (error) {
        return {
          success: false,
          error: error instanceof Error ? error.message : "Unknown error",
        };
      }
    },
  });

/**
 * Export all scraper tools
 */
export const scraperTools = [scrapeContentTool];
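
Note that the tool issues a plain GET with the target passed as a `url` query parameter, so the n8n workflow is expected to read it from the webhook's query string. An equivalent manual call (webhook URL is a placeholder) would be:

```bash
curl "https://your-n8n.example.com/webhook/scrape?url=https://example.com/article-1"
```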
17 changes: 17 additions & 0 deletions content-scraper/server/types/env.ts
@@ -0,0 +1,17 @@
/**
 * Environment Type Definitions
 */
import { type DefaultEnv } from "@decocms/runtime";
import { z } from "zod";

export const StateSchema = z.object({
  n8nWebhookUrl: z.string().url().describe("N8N webhook URL for scraping"),
});

type State = z.infer<typeof StateSchema>;

export type Env = DefaultEnv<typeof StateSchema> & {
  DECO_CHAT_REQUEST_CONTEXT: {
    state: State;
  };
};
35 changes: 35 additions & 0 deletions content-scraper/tsconfig.json
@@ -0,0 +1,35 @@
{
  "compilerOptions": {
    "target": "ES2022",
    "useDefineForClassFields": true,
    "lib": ["ES2023"],
    "module": "ESNext",
    "skipLibCheck": true,

    /* Bundler mode */
    "moduleResolution": "bundler",
    "allowImportingTsExtensions": true,
    "isolatedModules": true,
    "verbatimModuleSyntax": false,
    "moduleDetection": "force",
    "noEmit": true,
    "allowJs": true,

    /* Linting */
    "strict": true,
    "noUnusedLocals": true,
    "noUnusedParameters": true,
    "noFallthroughCasesInSwitch": true,
    "noUncheckedSideEffectImports": true,

    /* Path Aliases */
    "baseUrl": ".",
    "paths": {
      "server/*": ["./server/*"]
    },

    /* Types */
    "types": ["@types/node"]
  },
  "include": ["server"]
}
1 change: 1 addition & 0 deletions package.json
@@ -21,6 +21,7 @@
   },
   "workspaces": [
     "apify",
+    "content-scraper",
     "data-for-seo",
     "datajud",
     "gemini-pro-vision",
2 changes: 1 addition & 1 deletion shared/package.json
@@ -20,7 +20,7 @@
"./serve": "./serve.ts"
},
"devDependencies": {
"@decocms/runtime": "0.25.1",
"@decocms/runtime": "^1.1.0",
"@types/bun": "^1.2.14",
"vite": "7.2.0",
"zod": "^4.0.0"