Feat/blog mcp (#92)

viniciusventura29 · web-flow · commit 208c8e91c33c · 2026-01-07T17:23:43.000-03:00
* Add blog-mcp module with content extraction and deduplication features

- Introduced a new workspace for the blog-mcp module, including package.json and README.
- Implemented content extraction using Firecrawl and state persistence with Supabase.
- Added tools for processing URLs, checking updates, and managing watermarks.
- Created necessary TypeScript configurations and utility functions for content normalization and fingerprinting.
- Updated bun.lock and package.json to include new dependencies for the blog-mcp module.

* Integrate Supabase MCP binding for storage operations

- Added SUPABASE binding in wrangler.toml for enhanced database integration.
- Updated StateSchema to reflect the use of Supabase via MCP binding.
- Refactored Supabase client implementation to utilize the new binding for state persistence, including content management and watermark handling.
- Adjusted client retrieval in blog tools to accommodate the new storage method.

* feat(content-scraper): add initial implementation for content extraction and summarization MCP

- Introduced package.json for project dependencies and scripts.
- Added README.md detailing features, setup instructions, and available tools.
- Implemented TypeScript configuration in tsconfig.json.
- Created Vite configuration for building the project.
- Developed core server logic in main.ts, including state schema and environment setup.
- Implemented content processing utilities in content.ts for normalization and summarization.
- Added Firecrawl API client in firecrawl.ts for content extraction.
- Integrated Supabase for state persistence in supabase.ts.
- Developed tools for processing URLs, checking updates, and managing watermarks in scraper.ts.
- Generated types for Deco integration in deco.gen.ts.
- Established project structure with organized directories for server, shared, and tools.

* refactor(content-scraper): transition to n8n for content scraping and remove unused components

- Updated the content scraper to utilize n8n for web content extraction.
- Removed Firecrawl API client and related content processing utilities.
- Simplified the state schema and eliminated contract-related types and bindings.
- Streamlined the tools for scraping content, focusing on the new n8n integration.
- Cleaned up the project by deleting unused files and code related to previous implementations.

* refactor(content-scraper): remove deprecated types and interfaces from deco.gen.ts

- Eliminated unused types and interfaces from the generated TypeScript file.
- Cleaned up the file to streamline the codebase and improve maintainability.
- Kept the file marked as auto-generated; it should not be edited manually.

* refactor(content-scraper): update scraper tool to use createPrivateTool

- Replaced deprecated createTool with createPrivateTool in scraper.ts.
- Adjusted function signature for scrapeContentTool to improve clarity and maintainability.

* refactor(content-scraper): remove deco.gen.ts file

- Deleted the deco.gen.ts file as part of the ongoing cleanup and refactoring efforts.
- This removal aligns with the previous commit to eliminate deprecated types and interfaces, further streamlining the codebase.

* refactor(content-scraper): reintroduce deco.gen.ts with updated types and schemas

- Added a new deco.gen.ts file containing generated types for MCP and an empty StateSchema.
- Defined the Env interface for environment variables and included an empty Scopes object.
- This update aligns with the ongoing efforts to streamline the codebase while providing necessary type definitions for future development.

* refactor(content-scraper): rename blog-mcp to content-scraper and update dependencies

- Renamed the workspace from "blog-mcp" to "content-scraper" in package.json and bun.lock.
- Updated dependencies in content-scraper's package.json, including @decocms/runtime to version ^1.1.0 and zod to version ^4.0.0.
- Introduced app.json for the content-scraper with connection details and description.
- Removed outdated Vite and Wrangler configuration files to streamline the project structure.
- Adjusted TypeScript configuration to reflect the new project structure and dependencies.

* refactor(content-scraper): update build scripts in package.json

- Introduced a new build script for the server targeting Bun, allowing for a more streamlined build process.
- Updated the existing build script to run the new server build command, enhancing project structure and maintainability.

* refactor(content-scraper): update scraper tool to use dynamic n8n webhook URL

- Modified the scrapeContentTool to retrieve the n8n webhook URL from the environment state instead of a hardcoded value.
- Updated the StateSchema to include the n8nWebhookUrl, enhancing flexibility and configurability for content scraping.
diff --git a/bun.lock b/bun.lock
diff --git a/content-scraper/README.md b/content-scraper/README.md
@@ -0,0 +1,158 @@
+# Content Scraper MCP
+
+MCP for extracting, deduplicating, and summarizing web content using Firecrawl and Supabase.
+
+## Features
+
+- **Content Extraction**: Uses Firecrawl to extract title, body, author, and date from URLs
+- **Fingerprint Deduplication**: Generates a SHA-256 hash of title + body to identify unique content
+- **State Persistence**: Stores records in Supabase to avoid reprocessing
+- **Per-Domain Watermarks**: Tracks the last time each domain was processed
+- **Insight-Focused Summaries**: Generates short summaries by extracting key sentences
+
+## Setup
+
+### 1. Supabase - Create Tables
+
+Run the following in the SQL Editor of your Supabase project:
+
+```sql
+-- Processed content table
+CREATE TABLE scraped_content (
+  id UUID DEFAULT gen_random_uuid() PRIMARY KEY,
+  url TEXT UNIQUE NOT NULL,
+  fingerprint TEXT NOT NULL,
+  domain TEXT NOT NULL,
+  title TEXT NOT NULL,
+  first_seen_at TIMESTAMPTZ NOT NULL,
+  last_seen_at TIMESTAMPTZ NOT NULL,
+  updated_count INTEGER DEFAULT 0,
+  created_at TIMESTAMPTZ DEFAULT NOW()
+);
+
+-- Indexes for performance
+CREATE INDEX idx_scraped_content_domain ON scraped_content(domain);
+CREATE INDEX idx_scraped_content_fingerprint ON scraped_content(fingerprint);
+CREATE INDEX idx_scraped_content_url ON scraped_content(url);
+
+-- Per-domain watermarks table
+CREATE TABLE scraper_watermarks (
+  domain TEXT PRIMARY KEY,
+  last_processed_at TIMESTAMPTZ NOT NULL,
+  created_at TIMESTAMPTZ DEFAULT NOW()
+);
+```
+
+### 2. Firecrawl API Key
+
+Get your API key at https://firecrawl.dev
+
+### 3. Install the MCP
+
+When installing, fill in:
+- `firecrawlApiKey`: Your Firecrawl API key
+- `supabaseUrl`: Your Supabase project URL (e.g. https://xxx.supabase.co)
+- `supabaseKey`: Service role key, or anon key with RLS configured
+
+## Available Tools
+
+### `process_urls`
+
+Processes a list of URLs:
+- Extracts clean content using Firecrawl
+- Generates a unique fingerprint (SHA-256 of normalized title + body)
+- Checks whether the URL already exists in Supabase
+- Saves new content, or updates it if the fingerprint changed
+- Returns an insight-focused summary
+
+**Input:**
+```json
+{
+  "urls": ["https://example.com/article-1", "https://example.com/article-2"],
+  "generateSummaries": true
+}
+```
+
+**Output:**
+```json
+{
+  "processed": [
+    {
+      "url": "https://example.com/article-1",
+      "status": "new",
+      "title": "Article Title",
+      "summary": "Key insights from the content...",
+      "fingerprint": "abc123...",
+      "domain": "example.com"
+    }
+  ],
+  "stats": {
+    "total": 2,
+    "new": 1,
+    "updated": 0,
+    "unchanged": 1,
+    "errors": 0
+  }
+}
+```
+
+### `check_updates`
+
+Checks the status of previously processed URLs without re-extracting them:
+
+**Input:**
+```json
+{
+  "domain": "example.com"
+}
+```
+
+### `get_watermarks`
+
+Gets watermarks (last processed time) per domain:
+
+**Input:**
+```json
+{
+  "domain": "example.com"
+}
+```
+
+## Deduplication Logic
+
+1. **Normalization**: title and body are normalized (lowercase, collapsed whitespace, normalized Unicode)
+2. **Fingerprint**: SHA-256 of the normalized text `title|body`
+3. **Check**:
+   - If the URL does not exist → **new** content
+   - If the URL exists but the fingerprint differs → **update**
+   - If the URL exists and the fingerprint matches → **skip**
+
+## Development
+
+```bash
+cd content-scraper
+bun install
+bun run dev     # Local development
+bun run deploy  # Deploy to production
+```
+
+## Architecture
+
+```
+content-scraper/
+├── server/
+│   ├── main.ts              # Entry point and StateSchema
+│   ├── lib/
+│   │   ├── firecrawl.ts     # Firecrawl API client
+│   │   ├── supabase.ts      # Supabase client for persistence
+│   │   ├── content.ts       # Normalization, fingerprint, summary
+│   │   └── types.ts         # Shared types
+│   └── tools/
+│       ├── index.ts         # Exports all tools
+│       └── scraper.ts       # Processing tools
+├── shared/
+│   └── deco.gen.ts          # Generated types
+├── package.json
+├── wrangler.toml
+└── tsconfig.json
+```
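The README's deduplication logic (normalize, fingerprint, compare) can be sketched as below. This is an illustrative sketch, not code from this commit: the `content.ts` that the README mentions is not part of the diff, so `normalize`, `fingerprint`, and `classify` are hypothetical names, and the choice of NFKC is an assumption, since the README only says "normalized Unicode".

```typescript
import { createHash } from "node:crypto";

type Status = "new" | "updated" | "unchanged";

// Assumption: NFKC normalization; the README does not name the Unicode form.
function normalize(text: string): string {
  return text.normalize("NFKC").toLowerCase().replace(/\s+/g, " ").trim();
}

// SHA-256 (hex) of the normalized text "title|body", as the README describes.
function fingerprint(title: string, body: string): string {
  const input = `${normalize(title)}|${normalize(body)}`;
  return createHash("sha256").update(input, "utf8").digest("hex");
}

// stored: fingerprint previously saved for this URL, or undefined if the URL is unknown.
function classify(stored: string | undefined, current: string): Status {
  if (stored === undefined) return "new";
  return stored === current ? "unchanged" : "updated";
}

const first = fingerprint("Hello   World", "Some BODY text");
const same = fingerprint("hello world", "some body text");
const edited = fingerprint("hello world", "some edited text");

console.log(classify(undefined, first)); // URL not seen before
console.log(classify(first, same)); // only case and whitespace differ
console.log(classify(first, edited)); // body changed
```

Because normalization runs before hashing, differences in letter case or whitespace alone do not produce a new fingerprint; only a change in the normalized text does.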
diff --git a/content-scraper/app.json b/content-scraper/app.json
@@ -0,0 +1,12 @@
+{
+  "scopeName": "deco",
+  "name": "content-scraper",
+  "friendlyName": "Content Scraper",
+  "connection": {
+    "type": "HTTP",
+    "url": "https://content-scraper.decocache.com/mcp"
+  },
+  "description": "Scrape web content from URLs using n8n workflow automation.",
+  "icon": "https://assets.decocache.com/mcp/content-scraper-icon.svg",
+  "unlisted": false
+}
diff --git a/content-scraper/package.json b/content-scraper/package.json
@@ -0,0 +1,37 @@
+{
+  "name": "mcp-content-scraper",
+  "version": "1.0.0",
+  "description": "Content extraction, deduplication and summarization MCP using Firecrawl and Supabase",
+  "private": true,
+  "type": "module",
+  "scripts": {
+    "dev": "deco dev --vite",
+    "configure": "deco configure",
+    "gen": "deco gen --output=shared/deco.gen.ts",
+    "deploy": "npm run build && deco deploy ./dist/server",
+    "check": "tsc --noEmit",
+    "build:server": "NODE_ENV=production bun build server/main.ts --target=bun --outfile=dist/server/main.js",
+    "build": "bun run build:server"
+  },
+  "dependencies": {
+    "@decocms/runtime": "^1.1.0",
+    "@supabase/supabase-js": "^2.49.0",
+    "zod": "^4.0.0"
+  },
+  "devDependencies": {
+    "@cloudflare/vite-plugin": "^1.13.4",
+    "@cloudflare/workers-types": "^4.20251014.0",
+    "@decocms/mcps-shared": "1.0.0",
+    "@mastra/core": "^0.24.0",
+    "@modelcontextprotocol/sdk": "^1.21.0",
+    "@types/mime-db": "^1.43.6",
+    "deco-cli": "^0.26.0",
+    "typescript": "^5.7.2",
+    "vite": "7.2.0",
+    "wrangler": "^4.28.0"
+  },
+  "engines": {
+    "node": ">=22.0.0"
+  }
+}
+
diff --git a/content-scraper/server/main.ts b/content-scraper/server/main.ts
@@ -0,0 +1,22 @@
+/**
+ * Content Scraper MCP
+ *
+ * Simple MCP that scrapes web content via n8n webhook.
+ */
+import { serve } from "@decocms/mcps-shared/serve";
+import { withRuntime } from "@decocms/runtime";
+import { tools } from "./tools/index.ts";
+import { type Env, StateSchema } from "./types/env.ts";
+
+export type { Env };
+export { StateSchema };
+
+const runtime = withRuntime<Env, typeof StateSchema>({
+  configuration: {
+    scopes: [],
+    state: StateSchema,
+  },
+  tools,
+});
+
+serve(runtime.fetch);
diff --git a/content-scraper/server/tools/index.ts b/content-scraper/server/tools/index.ts
@@ -0,0 +1,10 @@
+/**
+ * Central export point for all tools.
+ */
+import { scraperTools } from "./scraper.ts";
+
+// Export all tools
+export const tools = [...scraperTools];
+
+// Re-export domain-specific tools for direct access if needed
+export { scraperTools } from "./scraper.ts";
diff --git a/content-scraper/server/tools/scraper.ts b/content-scraper/server/tools/scraper.ts
@@ -0,0 +1,58 @@
+/**
+ * Content scraping tool via n8n webhook.
+ */
+import { z } from "zod";
+import { createPrivateTool } from "@decocms/runtime/tools";
+import type { Env } from "../types/env.ts";
+
+/**
+ * Call the n8n webhook to scrape content from a URL.
+ */
+export const scrapeContentTool = (env: Env) =>
+  createPrivateTool({
+    id: "scrape_content",
+    description:
+      "Scrape content from a URL using the n8n workflow. " +
+      "Extracts and processes web content through an automated pipeline.",
+    inputSchema: z.object({
+      url: z.string().url().describe("The URL to scrape content from"),
+    }),
+    outputSchema: z.object({
+      success: z.boolean(),
+      data: z.unknown().optional(),
+      error: z.string().optional(),
+    }),
+    execute: async ({ context: input }) => {
+      try {
+        const { n8nWebhookUrl } = env.DECO_CHAT_REQUEST_CONTEXT.state;
+        const url = new URL(n8nWebhookUrl);
+        url.searchParams.set("url", input.url);
+
+        const response = await fetch(url.toString());
+
+        if (!response.ok) {
+          return {
+            success: false,
+            error: `Webhook returned ${response.status}: ${response.statusText}`,
+          };
+        }
+
+        const data = await response.json();
+
+        return {
+          success: true,
+          data,
+        };
+      } catch (error) {
+        return {
+          success: false,
+          error: error instanceof Error ? error.message : "Unknown error",
+        };
+      }
+    },
+  });
+
+/**
+ * Export all scraper tools
+ */
+export const scraperTools = [scrapeContentTool];
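The `scrape_content` tool above sends the target URL to the webhook as a `url` query parameter. A minimal sketch of that URL construction in isolation follows; `buildWebhookUrl` and the `n8n.example.com` address are hypothetical and used only for illustration.

```typescript
// Hypothetical helper mirroring the URL construction inside scrape_content.
function buildWebhookUrl(webhookUrl: string, targetUrl: string): string {
  const url = new URL(webhookUrl);
  // searchParams.set percent-encodes the value, so "?" and "&" inside the
  // target URL cannot leak into the webhook's own query string.
  url.searchParams.set("url", targetUrl);
  return url.toString();
}

const target = "https://example.com/post?id=1&lang=en";
const built = buildWebhookUrl(
  "https://n8n.example.com/webhook/scrape", // assumed webhook address
  target,
);

// Decoding the parameter on the receiving side yields the original URL.
const roundTrip = new URL(built).searchParams.get("url");
console.log(roundTrip === target);
```

Using `URL` and `searchParams` rather than string concatenation keeps any query parameters already present on the configured webhook URL intact.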
diff --git a/content-scraper/server/types/env.ts b/content-scraper/server/types/env.ts
@@ -0,0 +1,17 @@
+/**
+ * Environment Type Definitions
+ */
+import { type DefaultEnv } from "@decocms/runtime";
+import { z } from "zod";
+
+export const StateSchema = z.object({
+  n8nWebhookUrl: z.string().url().describe("n8n webhook URL used for scraping"),
+});
+
+type State = z.infer<typeof StateSchema>;
+
+export type Env = DefaultEnv<typeof StateSchema> & {
+  DECO_CHAT_REQUEST_CONTEXT: {
+    state: State;
+  };
+};
diff --git a/content-scraper/tsconfig.json b/content-scraper/tsconfig.json
@@ -0,0 +1,35 @@
+{
+  "compilerOptions": {
+    "target": "ES2022",
+    "useDefineForClassFields": true,
+    "lib": ["ES2023"],
+    "module": "ESNext",
+    "skipLibCheck": true,
+
+    /* Bundler mode */
+    "moduleResolution": "bundler",
+    "allowImportingTsExtensions": true,
+    "isolatedModules": true,
+    "verbatimModuleSyntax": false,
+    "moduleDetection": "force",
+    "noEmit": true,
+    "allowJs": true,
+
+    /* Linting */
+    "strict": true,
+    "noUnusedLocals": true,
+    "noUnusedParameters": true,
+    "noFallthroughCasesInSwitch": true,
+    "noUncheckedSideEffectImports": true,
+
+    /* Path Aliases */
+    "baseUrl": ".",
+    "paths": {
+      "server/*": ["./server/*"]
+    },
+
+    /* Types */
+    "types": ["@types/node"]
+  },
+  "include": ["server"]
+}
diff --git a/package.json b/package.json
@@ -21,6 +21,7 @@
   },
   "workspaces": [
     "apify",
+    "content-scraper",
     "data-for-seo",
     "datajud",
     "gemini-pro-vision",
diff --git a/shared/package.json b/shared/package.json
@@ -20,7 +20,7 @@
         "./serve": "./serve.ts"
     },
     "devDependencies": {
-        "@decocms/runtime": "0.25.1",
+        "@decocms/runtime": "^1.1.0",
         "@types/bun": "^1.2.14",
         "vite": "7.2.0",
         "zod": "^4.0.0"