9 changes: 9 additions & 0 deletions .gitignore
@@ -3,3 +3,12 @@
/opencli
test-results*.md
twitter-downloads/

# Local project files
CHANGELOG.md
/output/
/test/
.env

# Python
__pycache__/
136 changes: 136 additions & 0 deletions AGENTS.md
@@ -0,0 +1,136 @@
# AGENTS.md
## Core Rule

Use Serena first for code intelligence on non-trivial coding tasks, and use bounded subagents for complex engineering work.

Do not claim Serena or subagents were used unless they actually were. If a required tool is unavailable, say so and continue with the smallest safe fallback.

## Serena Workflow

At the start of any non-trivial coding task (see definitions below), unfamiliar-code task, bug investigation, shared-symbol change, or cross-file change, run the workflow below.

Definitions:
- **Non-trivial**: any change that touches ≥1 function with external dependencies, spans ≥3 files, or requires architectural reasoning
- **Trivial**: typo fixes, one-line config changes, single-file docs edits with no code-path impact

Do not run the full Serena workflow for trivial tasks unless the code path is unfamiliar or risky.

1. Check Serena availability.
2. Run `serena.get_current_config`.
3. If the active Serena project does not match the repository root, run `serena.activate_project`.
4. Run `serena.check_onboarding_performed`.
5. If onboarding is missing, run `serena.onboarding`.
6. Read only relevant Serena memories.
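
A minimal sketch of this preflight, assuming a hypothetical `mcp` client wrapper; only the `serena.*` tool names come from the steps above, everything else is illustrative:

```python
# Hypothetical sketch: `mcp` and its methods are illustrative stand-ins,
# not a real Serena client API.
def serena_preflight(mcp, repo_root: str) -> bool:
    if not mcp.available("serena"):                        # step 1
        print("Serena MCP is unavailable; falling back to built-in search/read tools.")
        return False                                       # continue with rg/file reads
    config = mcp.call("serena.get_current_config")         # step 2
    if config.get("project_root") != repo_root:            # step 3
        mcp.call("serena.activate_project", project=repo_root)
    if not mcp.call("serena.check_onboarding_performed"):  # step 4
        mcp.call("serena.onboarding")                      # step 5
    return True                                            # step 6: read relevant memories
```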

If Serena is unavailable, say:

> Serena MCP is unavailable; falling back to built-in search/read tools.

Then continue with targeted `rg`, file reads, and normal verification.

## Serena Navigation

Prefer Serena before broad file reads:

1. `serena.get_symbols_overview` for unfamiliar files.
2. `serena.find_symbol` for functions, classes, handlers, schemas, adapters, providers, components, exported APIs, and config objects.
3. `serena.find_referencing_symbols` before changing shared/public symbols.
4. `serena.find_implementations` for interfaces, adapters, providers, and polymorphic dispatch.
5. `serena.get_diagnostics_for_file` after meaningful edits.

Use raw `rg`, grep, or full-file reads only when:

- the target is not code,
- the symbol name is unknown,
- Serena cannot resolve the result,
- Serena has already narrowed the search area,
- or the task is trivial enough that Serena overhead exceeds value.

Do not read entire large files first.

## Editing Rules

Before editing:

- Map the real call path.
- Check references for shared/exported symbols.
- Pick the smallest safe patch.
- Avoid unrelated files.
- Prefer symbol-level edits for whole functions/classes/methods.
- Add or update tests when behavior changes.

After editing:

1. Run the smallest relevant verification first (see Verification Tiers below).
2. Then run broader checks if the change is cross-file or high-risk.
3. Summarize changed files, reason, and verification result.

Verification Tiers:
- **Tier 1 (local)**: Single unit test or type check for the edited function/method
- **Tier 2 (module)**: All tests in the affected package/directory
- **Tier 3 (integration)**: Cross-module or end-to-end verification for cross-file/high-risk changes
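
For concreteness, a hypothetical mapping for a Python project (commands and test paths are illustrative, not from this repository):

```python
# Hypothetical tier-to-command mapping; test paths are placeholders.
VERIFICATION_TIERS = {
    1: "pytest tests/test_module.py::test_edited_function",  # local: just the touched code
    2: "pytest tests/module/",                               # module: the affected package
    3: "pytest tests/",                                      # integration: the full suite
}
```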

## Subagent Policy

Use subagents for:

- cross-file or cross-module changes,
- unknown root cause,
- refactors,
- security/auth changes,
- data-loss or migration risk,
- queue/worker/scraper/infra changes,
- PR or adversarial review,
- bugs where investigation, review, and fix can be separated.

Do not use subagents for:

- direct Q&A,
- typo fixes,
- one-file trivial edits,
- simple config changes,
- tasks where overhead exceeds value.

If subagents are unavailable, say so and continue in the parent agent using the same sequence manually: explore read-only, review risks, patch only if needed, then verify.

## Subagent Roles

- `explorer`: read-only. Map execution paths, symbols, references, data flow, likely owners, and risky files.
- `reviewer`: read-only. Look for correctness bugs, regressions, race conditions, idempotency issues, auth/security problems, migration/data-loss risks, missing tests, and rollback gaps.
- `fixer`: may edit only after the code path is understood. Keep the patch small, avoid unrelated files, use Serena reference checks, and verify targeted changes.

Subagents may recommend actions, but must not broaden scope, introduce new architecture, or modify unrelated modules without parent approval.

## Subagent Flow

For complex tasks:

1. Spawn `explorer` first.
2. Spawn `reviewer` in parallel only when risk review helps.
3. Wait for read-only findings.
4. Summarize the evidence.
5. Spawn `fixer` only if a patch is needed.
6. Run verification.
7. For high-risk changes, run one final reviewer pass.
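
A minimal sketch of this orchestration, assuming hypothetical `spawn`, `summarize`, and `verify` helpers (none of these are a real subagent API):

```python
import asyncio

# Hypothetical sketch; spawn() returning an awaitable is an assumption.
# Budget per the limits below: at most 1 explorer, 1 reviewer, 1 fixer.
async def run_complex_task(task, risky: bool) -> None:
    jobs = [spawn("explorer", task)]              # step 1: always explore first
    if risky:
        jobs.append(spawn("reviewer", task))      # step 2: parallel risk review
    findings = await asyncio.gather(*jobs)        # step 3: wait for read-only results
    evidence = summarize(findings)                # step 4
    if evidence.needs_patch:
        patch = await spawn("fixer", evidence)    # step 5: only when a patch is needed
        await verify(patch)                       # step 6
```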

Default limits:

- `explorer`: at most 1 before editing
- `reviewer`: at most 1 in parallel with explorer or after
- `fixer`: at most 1, only after read-only findings are complete
- Do not create more subagents unless the user explicitly asks or a P0/P1 risk remains unresolved.
- Maximum total: 3 subagents per task (2 read-only + 1 fixer)

Subagents must return:

- scope inspected,
- Serena tools used,
- key symbols/files,
- findings,
- risks,
- recommended next action,
- confidence level.

Parent Codex owns the final decision.
21 changes: 21 additions & 0 deletions CHANGELOG_pipeline.md
@@ -0,0 +1,21 @@
# JD Structured Extraction Pipeline — Changelog

## [0.1.0] — 2026-05-03

### Added

- **Pipeline orchestrator** (`jd_pipeline.py`): CLI tool (`--input`, `--dry-run`, `--limit`) that reads `output/final.json`, preprocesses JDs, extracts structured JSON via local LLM, and upserts results into Supabase.
- **LLM client** (`jd_pipeline_llm.py`): Async batch client for llama.cpp `/chat/completions` with grammar-constrained generation (`json_schema`), 3-attempt retry (standard → repair with validation feedback → minimal; sketched after this list), dynamic timeout, semaphore-limited concurrency, and latency tracking.
- **Database client** (`jd_pipeline_db.py`): Atomic `claim_job` / `upsert_job_structured` / `mark_dead_letter` / `reap_stale_processing` RPCs, extraction_runs bookkeeping, `.env` auto-loading.
- **Config** (`jd_pipeline_config.py`): Version constants, schema definitions (`JD_SCHEMA` + `MINIMAL_SCHEMA`), LLM/Supabase connection params, token limits, context-size tiers.
- **Preprocessor** (`jd_pipeline_preprocess.py`): LinkedIn boilerplate removal, NFKC normalization, control-char strip, SHA-256 hashing.
- **Supabase migrations** (6 files + RPC grants): `jobs` columns for structured extraction, `extraction_runs` table, `dead_letter_records` with stage/error tracking, atomic RPC functions with run-id guards.
- **Per-run reporting**: Console summary with failed-jobs detail (URL, stage, error class, message) + JSON report file in `output/`.
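
A minimal sketch of the three-attempt ladder named in the LLM-client entry above, assuming hypothetical `call_llm` and `validate` helpers; only `JD_SCHEMA` and `MINIMAL_SCHEMA` come from `jd_pipeline_config.py`, everything else is illustrative rather than the actual client internals:

```python
# Hypothetical sketch: call_llm and validate stand in for the real client internals.
# validate() is assumed to return a list of schema violations (empty when valid).
async def extract_structured(jd_text: str) -> dict | None:
    # Attempt 1: standard prompt, grammar-constrained to the full schema.
    result = await call_llm(jd_text, schema=JD_SCHEMA)
    errors = validate(result, JD_SCHEMA)
    if not errors:
        return result
    # Attempt 2: repair prompt that feeds the validation errors back to the model.
    result = await call_llm(jd_text, schema=JD_SCHEMA, repair_feedback=errors)
    if not validate(result, JD_SCHEMA):
        return result
    # Attempt 3: minimal schema, so a partial record beats a dead-lettered job.
    result = await call_llm(jd_text, schema=MINIMAL_SCHEMA)
    return result if not validate(result, MINIMAL_SCHEMA) else None
```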

### Fixed

- `dead_letter_records.reason`, `source_schema`, and `source_job_id` made nullable, preventing write failures when these fields are absent.
- `skills.maxItems` raised from 30 → 50 to accommodate verbose model output.
- System prompt improved: explicit rules for skills (technical only, max 25), summary (1–3 sentences), experience_level, employment_type.
- Stale-processing reaper threshold made adjustable; 172 stuck jobs were reaped successfully.
- Duplicate counter increments removed: `PipelineStats.record_*()` methods are now the single source of truth.
199 changes: 199 additions & 0 deletions adapters/linkedin/recommended.yaml
@@ -0,0 +1,199 @@
site: linkedin
name: recommended
description: "全量爬取 LinkedIn 推荐职位列表,自动翻页获取所有推荐岗位"
meta_title: "Top job picks for you | LinkedIn"
meta_description: ""
meta_keywords: ""
tags: [linkedin, jobs, recommended, career, recruitment]
domain: www.linkedin.com
strategy: header
browser: true
timeoutSeconds: 1200

args:
limit:
type: int
required: false
default: 200
description: "返回结果数量上限 (0=无限制,爬取全部)"
start:
type: int
required: false
default: 0
description: "分页偏移量"
with_jd:
type: bool
required: false
default: false
description: "是否输出职位详情字段 jd (false=仅列表字段,true=抓取职位描述)"

columns: [rank, title, company, location, workplace_type, salary, posted_time, applicant_count, easy_apply, url, external_url, jd]

pipeline:
- navigate:
url: "https://www.linkedin.com/jobs/collections/recommended/"
settleMs: 5000

- evaluate: |
(async () => {
const allUrls = performance.getEntriesByType('resource').map(e => e.name);
let apiMatch = allUrls.find(u => u.includes('/voyager/api/graphql') && u.includes('jobCollectionSlug') && u.includes('recommended'));
if (!apiMatch) {
apiMatch = allUrls.find(u => u.includes('/voyager/api/graphql') && u.includes('jobCards'));
}
if (!apiMatch) return [];

// Voyager API calls authenticate with a csrf-token header taken from the JSESSIONID cookie (quotes stripped).
const jsession = document.cookie.split(';').map(p => p.trim())
.find(p => p.startsWith('JSESSIONID='))?.slice('JSESSIONID='.length);
if (!jsession) throw new Error('LinkedIn JSESSIONID cookie not found. Please sign in.');
const csrf = jsession.replace(/^"|"$/g, '');

const parsed = new URL(apiMatch);
const queryId = parsed.searchParams.get('queryId') || '';

const limit = Number(args.limit ?? 200);
const withJd = args.with_jd === true || args.with_jd === 'true';
let start = args.start || 0;
const BATCH = 24;
const allItems = [];
const cleanText = (text) => String(text || '').replace(/\s+/g, ' ').trim();
const sleep = (ms) => new Promise(r => setTimeout(r, ms));

while (true) {
const remaining = limit > 0 ? limit - allItems.length : BATCH;
const count = Math.min(BATCH, remaining);
if (count <= 0) break;

const vars = `(count:${count},jobCollectionSlug:recommended,query:(origin:GENERIC_JOB_COLLECTIONS_LANDING),start:${start})`;
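// Keep (, ), :, and , unencoded: the Voyager GraphQL endpoint parses variables in Restli 2.0 syntax (hence the x-restli-protocol-version header below).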
const fetchUrl = `/voyager/api/graphql?variables=${encodeURIComponent(vars).replace(/%3A/gi, ':').replace(/%2C/gi, ',').replace(/%28/gi, '(').replace(/%29/gi, ')')}&queryId=${queryId}`;

const resp = await fetch(fetchUrl, {
credentials: 'include',
headers: {
'csrf-token': csrf,
'x-restli-protocol-version': '2.0.0',
},
});

if (!resp.ok) break;

const json = await resp.json();
const elements = json?.data?.jobsDashJobCardsByJobCollections?.elements || [];

if (elements.length === 0) break;

for (const element of elements) {
const card = element?.jobCard?.jobPostingCard;
if (!card) continue;

const urn = card.preDashNormalizedJobPostingUrn || card.entityUrn || '';
const jobId = urn.match(/(\d+)/)?.[1] || '';

const listedItem = (card.footerItems || []).find(i => i?.type === 'LISTED_DATE' && i?.timeAt);
const postedTime = listedItem?.timeAt ? new Date(listedItem.timeAt).toISOString().slice(0, 10) : '';

const easyApply = (card.footerItems || []).some(i => i.type === 'EASY_APPLY_TEXT') ? 'true' : 'false';

// Extract workplace type from location string (e.g. "London (On-site)")
const locText = card.secondaryDescription?.text || '';
const workplaceMatch = locText.match(/\((Remote|Hybrid|On-site)\)/i);
const workplaceType = workplaceMatch ? workplaceMatch[1] : '';

// Clean location by removing workplace type suffix
const location = locText.replace(/\s*\((Remote|Hybrid|On-site)\)\s*/i, '').trim();

// Check for salary in tertiaryDescription
const salary = card.tertiaryDescription?.text || '';
const url = jobId ? 'https://www.linkedin.com/jobs/view/' + jobId : '';

allItems.push({
title: card.title?.text || card.jobPostingTitle || '',
company: card.primaryDescription?.text || '',
location: location,
workplace_type: workplaceType,
salary: salary,
posted_time: postedTime,
applicant_count: '',
easy_apply: easyApply,
url: url,
external_url: '',
job_id: jobId,
jd: '',
});
}

if (elements.length < count) break;
start += elements.length;

if (limit > 0 && allItems.length >= limit) break;
}

const extractExternalApplyUrl = (json) => {
const offsiteApply = json?.applyMethod?.['com.linkedin.voyager.jobs.OffsiteApply'];
return offsiteApply?.companyApplyUrl || '';
};

const fetchJobDetails = async (jobId) => {
if (!jobId) return { jd: '', external_url: '' };
const url = `/voyager/api/jobs/jobPostings/${jobId}`;
const headers = { 'csrf-token': csrf, 'x-restli-protocol-version': '2.0.0' };
for (let attempt = 0; attempt < 4; attempt++) {
try {
const resp = await fetch(url, { credentials: 'include', headers });
if (resp.ok) {
const json = await resp.json();
return {
jd: cleanText(json?.description?.text || json?.description || ''),
external_url: extractExternalApplyUrl(json),
};
}

// Retry on transient / throttling responses.
if ([429, 500, 502, 503, 504].includes(resp.status)) {
await sleep(250 * Math.pow(2, attempt));
continue;
}

// Non-retryable.
return { jd: '', external_url: '' };
} catch (_) {
await sleep(250 * Math.pow(2, attempt));
}
}
return { jd: '', external_url: '' };
};

const detailItems = allItems.filter(item => withJd || item.easy_apply === 'false');
// LinkedIn will sometimes throttle detail calls when scraping in bulk (e.g. --limit 0).
// Lower concurrency and retry on transient failures to avoid dropping external_url/jd.
const detailConcurrency = 8;
for (let i = 0; i < detailItems.length; i += detailConcurrency) {
const batch = detailItems.slice(i, i + detailConcurrency);
const details = await Promise.all(batch.map(item => fetchJobDetails(item.job_id)));
details.forEach((detail, index) => {
batch[index].external_url = detail.external_url;
if (withJd) {
batch[index].jd = detail.jd;
}
});
}

return allItems.slice(0, limit > 0 ? limit : undefined).map((item, i) => ({
rank: i + 1,
...item,
}));
})()

- map:
rank: ${{ item.rank }}
title: ${{ item.title | default("N/A") }}
company: ${{ item.company | default("N/A") }}
location: ${{ item.location | default("N/A") }}
workplace_type: ${{ item.workplace_type | default("N/A") }}
salary: ${{ item.salary | default("N/A") }}
posted_time: ${{ item.posted_time | default("N/A") }}
applicant_count: ${{ item.applicant_count | default("N/A") }}
easy_apply: ${{ item.easy_apply | default("false") }}
url: ${{ item.url }}
external_url: ${{ item.external_url | default("") }}
jd: ${{ item.jd | default("") }}
4 changes: 3 additions & 1 deletion crates/autocli-browser/src/daemon_client.rs
@@ -17,8 +17,10 @@ const RETRY_DELAYS_MS: [u64; 4] = [200, 500, 1000, 2000];
impl DaemonClient {
/// Create a new client pointing at the given port on localhost.
pub fn new(port: u16) -> Self {
// 5-minute timeout: linkedin --limit 0 --with_jd can take several minutes
// due to scrolling the full job list and fetching descriptions for each.
let client = reqwest::Client::builder()
.timeout(Duration::from_secs(30))
.timeout(Duration::from_secs(300))
.build()
.expect("failed to build reqwest client");
Self {