feat(linkedin): add recommended jobs adapter with GraphQL pagination support#51
Open
RickSanchez88E wants to merge 16 commits into nashsu:main from
Conversation
Adds a `linkedin recommended` adapter for crawling LinkedIn's JYMBII-recommended jobs via the GraphQL API. Supports automatic pagination, Easy Apply detection via the `footerItems` `EASY_APPLY_TEXT` type, workplace type parsing, and unlimited mode (`--limit 0`). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…quest signatures, pagination, and test commands
Local LLM (qwen3) → structured JSON → Supabase pipeline:
- 5-module Python pipeline: config, preprocess, LLM, db, orchestrator
- Grammar-constrained generation via llama.cpp `json_schema`
- 3-attempt retry at temp=0: standard → repair → minimal
- Atomic claim/upsert via Supabase RPC functions
- Stale processing reaper, dead-letter queue, `extraction_runs` tracking
- Per-run report: console summary + failed-jobs detail + JSON report

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
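The three-attempt retry ladder above can be sketched as follows. The strategy names come from the commit message; `call_llm`, `extract_with_retries`, and their signatures are hypothetical names for illustration, not the pipeline's actual API:

```python
import json

# Three attempts at temperature 0, each with a progressively simpler
# prompt strategy, as described in the commit message.
STRATEGIES = ["standard", "repair", "minimal"]

def extract_with_retries(call_llm, raw_text, schema):
    """Try each strategy in order; return the first reply that parses as JSON."""
    last_error = None
    for strategy in STRATEGIES:
        try:
            reply = call_llm(raw_text, strategy=strategy, temperature=0,
                             json_schema=schema)
            # Generation is grammar-constrained, but verify the JSON anyway.
            return json.loads(reply)
        except (json.JSONDecodeError, RuntimeError) as exc:
            last_error = exc
    raise RuntimeError(f"all strategies failed: {last_error}")
```

A failed record would then go to the dead-letter queue rather than being retried indefinitely.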
`linkedin recommended --limit 0 --with_jd` triggers long-running commands that scroll the full job list and fetch a description for each job, which can exceed the previous 30-second HTTP timeout. Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Add clean_linkedin_jobs.py pipeline that extracts URLs from multiple fields, normalizes URLs, validates LinkedIn records (require easy_apply or external_url), and maps apply_url/source_channel/apply_type correctly. Includes:
- clean_linkedin_jobs.py: HTML cleaning, URL extraction cascade, salary parsing, batch dedup, dead-letter queue
- sync_autocli_jobs.py: Supabase RPC upsert with source_channel/apply_type
- 23 unit tests with TDD (clean + sync + validation + URL mapping)
- 5 migrations: schema, url_hash, source_channel/apply_type, drop url_hash unique constraint, old data cleanup
- daemon health check wait in main.rs

bad_count invariant: 776 → 0 (after cleanup + pipeline fix)
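The validation rule ("require easy_apply or external_url") and the apply-field mapping can be sketched like this. Field names mirror the commit message, but treat the exact record shape as an assumption:

```python
def is_valid_linkedin_record(rec: dict) -> bool:
    """A record must be appliable: either Easy Apply or an external apply URL."""
    return bool(rec.get("easy_apply")) or bool(rec.get("external_url"))

def map_apply_fields(rec: dict) -> dict:
    """Derive apply_url / apply_type / source_channel for the Supabase upsert."""
    if rec.get("easy_apply"):
        # Easy Apply happens on LinkedIn itself, so the job URL is the apply URL.
        return {"apply_type": "easy_apply", "apply_url": rec.get("url"),
                "source_channel": "linkedin"}
    return {"apply_type": "external", "apply_url": rec.get("external_url"),
            "source_channel": "linkedin"}
```

Records failing `is_valid_linkedin_record` would land in the dead-letter queue instead of being upserted.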
Chrome's debugger can detach mid-command on SPA pages (e.g. LinkedIn), returning "Detached while handling command". This error was not in the retry list, so the extension gave up immediately instead of re-attaching and retrying. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The extension's `WINDOW_IDLE_TIMEOUT` (30s) would fire during evaluate steps that run longer than the timeout (e.g. `--limit 0` fetching all LinkedIn recommended jobs). Added an `activeCommands` counter per workspace so the idle timer only starts when no commands are in flight.

Added `scripts/autocli-baseline.sh` with 8 pre-flight checks (autocli binary, Chrome process, daemon, extension, LinkedIn reachability, DNS, output dir, disk space), structured timestamped logging, and `--json` output. Includes a 13-test suite at `scripts/test_baseline.sh`.
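The `activeCommands` gating pattern can be illustrated with a minimal Python sketch (the extension itself is JavaScript; class and method names here are invented for illustration). The key idea is that the idle countdown is cancelled while any command is in flight and re-armed only when the counter returns to zero:

```python
import threading

class Workspace:
    """Sketch: the idle timer runs only while no commands are in flight."""
    WINDOW_IDLE_TIMEOUT = 30.0  # seconds, mirroring the extension's default

    def __init__(self, on_idle, timeout=None):
        self._timeout = timeout or self.WINDOW_IDLE_TIMEOUT
        self._on_idle = on_idle
        self._lock = threading.Lock()
        self.active_commands = 0
        self._timer = None
        self._arm()

    def _arm(self):
        self._timer = threading.Timer(self._timeout, self._on_idle)
        self._timer.daemon = True
        self._timer.start()

    def begin_command(self):
        with self._lock:
            self.active_commands += 1
            if self._timer is not None:
                self._timer.cancel()   # pause the idle countdown mid-command
                self._timer = None

    def end_command(self):
        with self._lock:
            self.active_commands -= 1
            if self.active_commands == 0:
                self._arm()            # all commands done: restart the countdown
```

With this gate, a long `--limit 0` evaluate step cannot be killed by the idle timer no matter how long it runs.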
`check_extension_freshness` compares dist/background.js mtime against a refresh marker file (.baseline-last-refresh). On first run (no marker) it warns; when dist is newer than last refresh it fails with a clear hint to use --refresh-extension. `--refresh-extension` uses browser-harness CDP to navigate to chrome://extensions, find the AutoCLI card, and click its reload button, then updates the marker. Test suite now has 15 tests covering all freshness scenarios.
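The freshness check reduces to an mtime comparison between the build output and the marker file. A minimal sketch, with the paths taken from the description and the three-state return value as an assumption about how the baseline script reports results:

```python
import os

def check_extension_freshness(dist_js="dist/background.js",
                              marker=".baseline-last-refresh"):
    """Compare the built extension's mtime against the last-refresh marker.

    Returns 'warn' on first run (no marker yet), 'fail' when dist has been
    rebuilt since the last refresh (hint: --refresh-extension), else 'ok'.
    """
    if not os.path.exists(marker):
        return "warn"
    if os.path.getmtime(dist_js) > os.path.getmtime(marker):
        return "fail"
    return "ok"
```

`--refresh-extension` would then touch the marker file after clicking the reload button, resetting the comparison.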
sync_autocli_jobs.py looked for "apply_type" key in raw records, but LinkedIn raw data uses "easy_apply". Records from this pipeline were silently defaulted to apply_type='unknown'. Added a fallback check for the "easy_apply" field to correctly classify LinkedIn easy-apply jobs. Also ran a SQL migration to fix 271 existing rows that were affected.
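The fallback described above can be sketched as a small classifier. The `easy_apply` fallback is from this fix; the `external_url` branch is an assumption borrowed from the sibling cleaning pipeline, included only to show where other sources would slot in:

```python
def classify_apply_type(raw: dict) -> str:
    """Prefer an explicit apply_type; fall back to LinkedIn's raw fields
    instead of silently defaulting to 'unknown'."""
    if "apply_type" in raw:
        return raw["apply_type"]
    if raw.get("easy_apply"):        # LinkedIn raw records use this key
        return "easy_apply"
    if raw.get("external_url"):      # assumed: external-apply records
        return "external"
    return "unknown"
```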
Description
Adds a new `linkedin recommended` command that crawls LinkedIn's personalized job recommendation feed (the JYMBII algorithm, at `/jobs/collections/recommended/`). Unlike the existing `linkedin search` adapter (REST Voyager API), this endpoint uses GraphQL (`/voyager/api/graphql`) and requires a browser session.

File: `adapters/linkedin/recommended.yaml`

Technical Details
- Query ID: `voyagerJobsDashJobCards.*` (version-hashed, discovered dynamically via the Performance API)
- Auth: `strategy: header` with the CSRF token extracted from the `JSESSIONID` cookie
- Pagination via the `start` offset; `--limit 0` crawls until no more items (`limit > 0 ? limit - fetched : BATCH` loop)
- Easy Apply detection: `footerItems[].type === "EASY_APPLY_TEXT"` (not `easyApplyUrl`, which doesn't exist in this API)
- Workplace type parsed from the parentheses in `secondaryDescription.text`, e.g. `"London (Hybrid)"` → `workplace_type: "Hybrid"`

Output Columns

`rank`, `title`, `company`, `location`, `workplace_type`, `salary`, `posted_time`, `applicant_count`, `easy_apply`, `url`
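A minimal sketch of the pagination contract from the Technical Details above: there is no `totalCount`, so `--limit 0` keeps fetching until the server returns an empty batch. `fetch_page` and `BATCH` are illustrative stand-ins for the adapter's request layer, not real names from the code:

```python
BATCH = 25  # hypothetical page size

def crawl_recommended(fetch_page, limit=0):
    """Paginate via the `start` offset; limit=0 means unlimited."""
    jobs, start = [], 0
    while True:
        # Mirrors the adapter's `limit > 0 ? limit - fetched : BATCH` expression.
        count = (limit - len(jobs)) if limit > 0 else BATCH
        batch = fetch_page(start=start, count=count)
        if not batch:
            break                      # empty batch is the only end-of-feed signal
        jobs.extend(batch)
        start += len(batch)
        if limit > 0 and len(jobs) >= limit:
            break
    return jobs[:limit] if limit > 0 else jobs
```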
How to Test
Prerequisites: Chrome must be open with LinkedIn signed in, and the AutoCLI Chrome extension must be installed.
Known Quirks / Pitfalls
- URL encoding: LinkedIn requires colons and parentheses to remain raw (not URL-encoded) in GraphQL variables. Full `encodeURIComponent` causes HTTP 400. The adapter uses a partial-encode-then-decode approach.
- No `totalCount` field: `--limit 0` fetches incrementally until the server returns an empty batch.
- `applicant_count`: unlike the REST search API, this GraphQL endpoint's `jobPostingCard` doesn't include an applicant count. The column is preserved but always returns `"N/A"`.
- No `easyApplyUrl` field: Easy Apply detection uses the `footerItems` type instead, verified via a 200-job crawl with an ~30% Easy Apply rate.
- The query ID is discovered at runtime via `performance.getEntriesByType('resource')`, so there is no hardcoded ID to maintain.
- Workplace type parsing matches `On-site`/`Hybrid`/`Remote` and strips it from the location field.
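The partial-encode-then-decode quirk can be illustrated in Python (the adapter itself runs in the browser; the exact set of characters restored to raw form is an assumption based on the description, and the comma is included because GraphQL variable syntax uses it as a separator):

```python
from urllib.parse import quote

def encode_graphql_variables(raw: str) -> str:
    """Percent-encode everything, then restore the characters LinkedIn
    expects to stay raw. Fully encoding them yields HTTP 400 here."""
    encoded = quote(raw, safe="")
    # Restore parentheses, colons, and commas to their raw form.
    for pct, ch in (("%28", "("), ("%29", ")"), ("%3A", ":"), ("%2C", ",")):
        encoded = encoded.replace(pct, ch)
    return encoded
```

Everything else (spaces, slashes, non-ASCII) stays percent-encoded, which is why this is "partial" rather than simply skipping encoding.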