9 changes: 9 additions & 0 deletions .gitignore
@@ -3,3 +3,12 @@
/opencli
test-results*.md
twitter-downloads/

# Local project files
CHANGELOG.md
/output/
/test/
.env

# Python
__pycache__/
136 changes: 136 additions & 0 deletions AGENTS.md
@@ -0,0 +1,136 @@
# AGENTS.md
## Core Rule

Use Serena first for code intelligence on non-trivial coding tasks, and use bounded subagents for complex engineering work.

Do not claim Serena or subagents were used unless they actually were. If a required tool is unavailable, say so and continue with the smallest safe fallback.

## Serena Workflow

At the start of any non-trivial coding task (see definitions below), unfamiliar-code task, bug investigation, shared-symbol change, or cross-file change, run the workflow below.

Definitions:
- **Non-trivial**: any change that touches ≥1 function with external dependencies, spans ≥3 files, or requires architectural reasoning
- **Trivial**: typo fixes, one-line config changes, single-file docs edits with no code-path impact

Do not run the full Serena workflow for trivial tasks unless the code path is unfamiliar or risky.

1. Check Serena availability.
2. Run `serena.get_current_config`.
3. If the active Serena project does not match the repository root, run `serena.activate_project`.
4. Run `serena.check_onboarding_performed`.
5. If onboarding is missing, run `serena.onboarding`.
6. Read only relevant Serena memories.
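
A minimal sketch of this preflight, assuming a hypothetical `mcp` client wrapper; only the `serena.*` tool names come from the steps above, everything else is illustrative:

```python
# Hypothetical sketch: `mcp` and its methods are illustrative stand-ins,
# not a real Serena client API.
def serena_preflight(mcp, repo_root: str) -> bool:
    if not mcp.available("serena"):                        # step 1
        print("Serena MCP is unavailable; falling back to built-in search/read tools.")
        return False                                       # continue with rg/file reads
    config = mcp.call("serena.get_current_config")         # step 2
    if config.get("project_root") != repo_root:            # step 3
        mcp.call("serena.activate_project", project=repo_root)
    if not mcp.call("serena.check_onboarding_performed"):  # step 4
        mcp.call("serena.onboarding")                      # step 5
    return True                                            # step 6: read relevant memories
```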

If Serena is unavailable, say:

> Serena MCP is unavailable; falling back to built-in search/read tools.

Then continue with targeted `rg`, file reads, and normal verification.

## Serena Navigation

Prefer Serena before broad file reads:

1. `serena.get_symbols_overview` for unfamiliar files.
2. `serena.find_symbol` for functions, classes, handlers, schemas, adapters, providers, components, exported APIs, and config objects.
3. `serena.find_referencing_symbols` before changing shared/public symbols.
4. `serena.find_implementations` for interfaces, adapters, providers, and polymorphic dispatch.
5. `serena.get_diagnostics_for_file` after meaningful edits.

Use raw `rg`, grep, or full-file reads only when:

- the target is not code,
- the symbol name is unknown,
- Serena cannot resolve the result,
- Serena has already narrowed the search area,
- or the task is trivial enough that Serena overhead exceeds value.

Do not read entire large files first.

## Editing Rules

Before editing:

- Map the real call path.
- Check references for shared/exported symbols.
- Pick the smallest safe patch.
- Avoid unrelated files.
- Prefer symbol-level edits for whole functions/classes/methods.
- Add or update tests when behavior changes.

After editing:

1. Run the smallest relevant verification first (see Verification Tiers below).
2. Then run broader checks if the change is cross-file or high-risk.
3. Summarize changed files, reason, and verification result.

Verification Tiers:
- **Tier 1 (local)**: Single unit test or type check for the edited function/method
- **Tier 2 (module)**: All tests in the affected package/directory
- **Tier 3 (integration)**: Cross-module or end-to-end verification for cross-file/high-risk changes
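
For concreteness, a hypothetical mapping for a Python project (commands and test paths are illustrative, not from this repository):

```python
# Hypothetical tier-to-command mapping; test paths are placeholders.
VERIFICATION_TIERS = {
    1: "pytest tests/test_module.py::test_edited_function",  # local: just the touched code
    2: "pytest tests/module/",                               # module: the affected package
    3: "pytest tests/",                                      # integration: the full suite
}
```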

## Subagent Policy

Use subagents for:

- cross-file or cross-module changes,
- unknown root cause,
- refactors,
- security/auth changes,
- data-loss or migration risk,
- queue/worker/scraper/infra changes,
- PR or adversarial review,
- bugs where investigation, review, and fix can be separated.

Do not use subagents for:

- direct Q&A,
- typo fixes,
- one-file trivial edits,
- simple config changes,
- tasks where overhead exceeds value.

If subagents are unavailable, say so and continue in the parent agent using the same sequence manually: explore read-only, review risks, patch only if needed, then verify.

## Subagent Roles

- `explorer`: read-only. Map execution paths, symbols, references, data flow, likely owners, and risky files.
- `reviewer`: read-only. Look for correctness bugs, regressions, race conditions, idempotency issues, auth/security problems, migration/data-loss risks, missing tests, and rollback gaps.
- `fixer`: may edit only after the code path is understood. Keep the patch small, avoid unrelated files, use Serena reference checks, and verify targeted changes.

Subagents may recommend actions, but must not broaden scope, introduce new architecture, or modify unrelated modules without parent approval.

## Subagent Flow

For complex tasks:

1. Spawn `explorer` first.
2. Spawn `reviewer` in parallel only when risk review helps.
3. Wait for read-only findings.
4. Summarize the evidence.
5. Spawn `fixer` only if a patch is needed.
6. Run verification.
7. For high-risk changes, run one final reviewer pass.
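
A minimal sketch of this orchestration, assuming hypothetical `spawn`, `summarize`, and `verify` helpers (none of these are a real subagent API):

```python
import asyncio

# Hypothetical sketch; spawn() returning an awaitable is an assumption.
# Budget per the limits below: at most 1 explorer, 1 reviewer, 1 fixer.
async def run_complex_task(task, risky: bool) -> None:
    jobs = [spawn("explorer", task)]              # step 1: always explore first
    if risky:
        jobs.append(spawn("reviewer", task))      # step 2: parallel risk review
    findings = await asyncio.gather(*jobs)        # step 3: wait for read-only results
    evidence = summarize(findings)                # step 4
    if evidence.needs_patch:
        patch = await spawn("fixer", evidence)    # step 5: only when a patch is needed
        await verify(patch)                       # step 6
```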

Default limits:

- `explorer`: at most 1 before editing
- `reviewer`: at most 1 in parallel with explorer or after
- `fixer`: at most 1, only after read-only findings are complete
- Do not create more subagents unless the user explicitly asks or a P0/P1 risk remains unresolved.
- Maximum total: 3 subagents per task (2 read-only + 1 fixer)

Subagents must return:

- scope inspected,
- Serena tools used,
- key symbols/files,
- findings,
- risks,
- recommended next action,
- confidence level.

Parent Codex owns the final decision.
21 changes: 21 additions & 0 deletions CHANGELOG_pipeline.md
@@ -0,0 +1,21 @@
# JD Structured Extraction Pipeline — Changelog

## [0.1.0] — 2026-05-03

### Added

- **Pipeline orchestrator** (`jd_pipeline.py`): CLI tool (`--input`, `--dry-run`, `--limit`) that reads `output/final.json`, preprocesses JDs, extracts structured JSON via local LLM, and upserts results into Supabase.
- **LLM client** (`jd_pipeline_llm.py`): Async batch client for llama.cpp `/chat/completions` with grammar-constrained generation (`json_schema`), 3-attempt retry (standard → repair with validation feedback → minimal; sketched after this list), dynamic timeout, semaphore-limited concurrency, and latency tracking.
- **Database client** (`jd_pipeline_db.py`): Atomic `claim_job` / `upsert_job_structured` / `mark_dead_letter` / `reap_stale_processing` RPCs, extraction_runs bookkeeping, `.env` auto-loading.
- **Config** (`jd_pipeline_config.py`): Version constants, schema definitions (`JD_SCHEMA` + `MINIMAL_SCHEMA`), LLM/Supabase connection params, token limits, context-size tiers.
- **Preprocessor** (`jd_pipeline_preprocess.py`): LinkedIn boilerplate removal, NFKC normalization, control-char strip, SHA-256 hashing.
- **Supabase migrations** (6 files + RPC grants): `jobs` columns for structured extraction, `extraction_runs` table, `dead_letter_records` with stage/error tracking, atomic RPC functions with run-id guards.
- **Per-run reporting**: Console summary with failed-jobs detail (URL, stage, error class, message) + JSON report file in `output/`.
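
A minimal sketch of the three-attempt ladder named in the LLM-client entry above, assuming hypothetical `call_llm` and `validate` helpers; only `JD_SCHEMA` and `MINIMAL_SCHEMA` come from `jd_pipeline_config.py`, everything else is illustrative rather than the actual client internals:

```python
# Hypothetical sketch: call_llm and validate stand in for the real client internals.
# validate() is assumed to return a list of schema violations (empty when valid).
async def extract_structured(jd_text: str) -> dict | None:
    # Attempt 1: standard prompt, grammar-constrained to the full schema.
    result = await call_llm(jd_text, schema=JD_SCHEMA)
    errors = validate(result, JD_SCHEMA)
    if not errors:
        return result
    # Attempt 2: repair prompt that feeds the validation errors back to the model.
    result = await call_llm(jd_text, schema=JD_SCHEMA, repair_feedback=errors)
    if not validate(result, JD_SCHEMA):
        return result
    # Attempt 3: minimal schema, so a partial record beats a dead-lettered job.
    result = await call_llm(jd_text, schema=MINIMAL_SCHEMA)
    return result if not validate(result, MINIMAL_SCHEMA) else None
```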

### Fixed

- `dead_letter_records.reason`, `source_schema`, and `source_job_id` made nullable, preventing write failures when these fields are absent.
- `skills.maxItems` raised from 30 → 50 to accommodate verbose model output.
- System prompt improved: explicit rules for skills (technical only, max 25), summary (1–3 sentences), experience_level, employment_type.
- Stale-processing reaper threshold made adjustable; 172 stuck jobs were reaped successfully.
- Duplicate counter increments removed: `PipelineStats.record_*()` methods are now the single source of truth.
199 changes: 199 additions & 0 deletions adapters/linkedin/recommended.yaml
@@ -0,0 +1,199 @@
site: linkedin
name: recommended
description: "全量爬取 LinkedIn 推荐职位列表,自动翻页获取所有推荐岗位"
meta_title: "Top job picks for you | LinkedIn"
meta_description: ""
meta_keywords: ""
tags: [linkedin, jobs, recommended, career, recruitment]
domain: www.linkedin.com
strategy: header
browser: true
timeoutSeconds: 1200

args:
limit:
type: int
required: false
default: 200
description: "返回结果数量上限 (0=无限制,爬取全部)"
start:
type: int
required: false
default: 0
description: "分页偏移量"
with_jd:
type: bool
required: false
default: false
description: "是否输出职位详情字段 jd (false=仅列表字段,true=抓取职位描述)"

columns: [rank, title, company, location, workplace_type, salary, posted_time, applicant_count, easy_apply, url, external_url, jd]

pipeline:
- navigate:
url: "https://www.linkedin.com/jobs/collections/recommended/"
settleMs: 5000

- evaluate: |
(async () => {
const allUrls = performance.getEntriesByType('resource').map(e => e.name);
let apiMatch = allUrls.find(u => u.includes('/voyager/api/graphql') && u.includes('jobCollectionSlug') && u.includes('recommended'));
if (!apiMatch) {
apiMatch = allUrls.find(u => u.includes('/voyager/api/graphql') && u.includes('jobCards'));
}
if (!apiMatch) return [];

// Voyager API calls authenticate with a csrf-token header taken from the JSESSIONID cookie (quotes stripped).
const jsession = document.cookie.split(';').map(p => p.trim())
.find(p => p.startsWith('JSESSIONID='))?.slice('JSESSIONID='.length);
if (!jsession) throw new Error('LinkedIn JSESSIONID cookie not found. Please sign in.');
const csrf = jsession.replace(/^"|"$/g, '');

const parsed = new URL(apiMatch);
const queryId = parsed.searchParams.get('queryId') || '';

const limit = Number(args.limit ?? 200);
const withJd = args.with_jd === true || args.with_jd === 'true';
let start = args.start || 0;
const BATCH = 24;
const allItems = [];
const cleanText = (text) => String(text || '').replace(/\s+/g, ' ').trim();
const sleep = (ms) => new Promise(r => setTimeout(r, ms));

while (true) {
const remaining = limit > 0 ? limit - allItems.length : BATCH;
const count = Math.min(BATCH, remaining);
if (count <= 0) break;

const vars = `(count:${count},jobCollectionSlug:recommended,query:(origin:GENERIC_JOB_COLLECTIONS_LANDING),start:${start})`;
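// Keep (, ), :, and , unencoded: the Voyager GraphQL endpoint parses variables in Restli 2.0 syntax (hence the x-restli-protocol-version header below).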
const fetchUrl = `/voyager/api/graphql?variables=${encodeURIComponent(vars).replace(/%3A/gi, ':').replace(/%2C/gi, ',').replace(/%28/gi, '(').replace(/%29/gi, ')')}&queryId=${queryId}`;

const resp = await fetch(fetchUrl, {
credentials: 'include',
headers: {
'csrf-token': csrf,
'x-restli-protocol-version': '2.0.0',
},
});

if (!resp.ok) break;

const json = await resp.json();
const elements = json?.data?.jobsDashJobCardsByJobCollections?.elements || [];

if (elements.length === 0) break;

for (const element of elements) {
const card = element?.jobCard?.jobPostingCard;
if (!card) continue;

const urn = card.preDashNormalizedJobPostingUrn || card.entityUrn || '';
const jobId = urn.match(/(\d+)/)?.[1] || '';

const listedItem = (card.footerItems || []).find(i => i?.type === 'LISTED_DATE' && i?.timeAt);
const postedTime = listedItem?.timeAt ? new Date(listedItem.timeAt).toISOString().slice(0, 10) : '';

const easyApply = (card.footerItems || []).some(i => i.type === 'EASY_APPLY_TEXT') ? 'true' : 'false';

// Extract workplace type from location string (e.g. "London (On-site)")
const locText = card.secondaryDescription?.text || '';
const workplaceMatch = locText.match(/\((Remote|Hybrid|On-site)\)/i);
const workplaceType = workplaceMatch ? workplaceMatch[1] : '';

// Clean location by removing workplace type suffix
const location = locText.replace(/\s*\((Remote|Hybrid|On-site)\)\s*/i, '').trim();

// Check for salary in tertiaryDescription
const salary = card.tertiaryDescription?.text || '';
const url = jobId ? 'https://www.linkedin.com/jobs/view/' + jobId : '';

allItems.push({
title: card.title?.text || card.jobPostingTitle || '',
company: card.primaryDescription?.text || '',
location: location,
workplace_type: workplaceType,
salary: salary,
posted_time: postedTime,
applicant_count: '',
easy_apply: easyApply,
url: url,
external_url: '',
job_id: jobId,
jd: '',
});
}

if (elements.length < count) break;
start += elements.length;

if (limit > 0 && allItems.length >= limit) break;
}

const extractExternalApplyUrl = (json) => {
const offsiteApply = json?.applyMethod?.['com.linkedin.voyager.jobs.OffsiteApply'];
return offsiteApply?.companyApplyUrl || '';
};

const fetchJobDetails = async (jobId) => {
if (!jobId) return { jd: '', external_url: '' };
const url = `/voyager/api/jobs/jobPostings/${jobId}`;
const headers = { 'csrf-token': csrf, 'x-restli-protocol-version': '2.0.0' };
for (let attempt = 0; attempt < 4; attempt++) {
try {
const resp = await fetch(url, { credentials: 'include', headers });
if (resp.ok) {
const json = await resp.json();
return {
jd: cleanText(json?.description?.text || json?.description || ''),
external_url: extractExternalApplyUrl(json),
};
}

// Retry on transient / throttling responses.
if ([429, 500, 502, 503, 504].includes(resp.status)) {
await sleep(250 * Math.pow(2, attempt));
continue;
}

// Non-retryable.
return { jd: '', external_url: '' };
} catch (_) {
await sleep(250 * Math.pow(2, attempt));
}
}
return { jd: '', external_url: '' };
};

const detailItems = allItems.filter(item => withJd || item.easy_apply === 'false');
// LinkedIn will sometimes throttle detail calls when scraping in bulk (e.g. --limit 0).
// Lower concurrency and retry on transient failures to avoid dropping external_url/jd.
const detailConcurrency = 8;
for (let i = 0; i < detailItems.length; i += detailConcurrency) {
const batch = detailItems.slice(i, i + detailConcurrency);
const details = await Promise.all(batch.map(item => fetchJobDetails(item.job_id)));
details.forEach((detail, index) => {
batch[index].external_url = detail.external_url;
if (withJd) {
batch[index].jd = detail.jd;
}
});
}

return allItems.slice(0, limit > 0 ? limit : undefined).map((item, i) => ({
rank: i + 1,
...item,
}));
})()

- map:
rank: ${{ item.rank }}
title: ${{ item.title | default("N/A") }}
company: ${{ item.company | default("N/A") }}
location: ${{ item.location | default("N/A") }}
workplace_type: ${{ item.workplace_type | default("N/A") }}
salary: ${{ item.salary | default("N/A") }}
posted_time: ${{ item.posted_time | default("N/A") }}
applicant_count: ${{ item.applicant_count | default("N/A") }}
easy_apply: ${{ item.easy_apply | default("false") }}
url: ${{ item.url }}
external_url: ${{ item.external_url | default("") }}
jd: ${{ item.jd | default("") }}
4 changes: 3 additions & 1 deletion crates/autocli-browser/src/daemon_client.rs
@@ -17,8 +17,10 @@ const RETRY_DELAYS_MS: [u64; 4] = [200, 500, 1000, 2000];
impl DaemonClient {
/// Create a new client pointing at the given port on localhost.
pub fn new(port: u16) -> Self {
// 5-minute timeout: linkedin --limit 0 --with_jd can take several minutes
// due to scrolling the full job list and fetching descriptions for each.
let client = reqwest::Client::builder()
.timeout(Duration::from_secs(30))
.timeout(Duration::from_secs(300))
.build()
.expect("failed to build reqwest client");
Self {