
feat(linkedin): add recommended jobs adapter with GraphQL pagination support #51

Open
RickSanchez88E wants to merge 16 commits into nashsu:main from RickSanchez88E:main

Conversation

@RickSanchez88E

Description

Adds a new linkedin recommended command that crawls LinkedIn's personalized job-recommendation feed (the JYMBII, "Jobs You May Be Interested In", algorithm at /jobs/collections/recommended/). Unlike the existing linkedin search adapter, which uses the REST Voyager API, this endpoint is served over GraphQL (/voyager/api/graphql) and requires an authenticated browser session.

File: adapters/linkedin/recommended.yaml

Technical Details

  • API: LinkedIn uses GraphQL with queryId voyagerJobsDashJobCards.* (version-hashed, discovered dynamically via Performance API)
  • Auth: strategy: header with CSRF token extracted from JSESSIONID cookie
  • Pagination: Batches of 24 items, automatic multi-page crawl via start offset
  • Unlimited mode: --limit 0 crawls until the server returns an empty batch; with a positive limit, each request fetches min(limit - fetched, BATCH) items
  • Easy Apply detection: Checks footerItems[].type === "EASY_APPLY_TEXT" (not easyApplyUrl which doesn't exist in this API)
  • Workplace type: Parsed from the parentheses in secondaryDescription.text, e.g. "London (Hybrid)" → workplace_type: "Hybrid"
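
The batching and --limit 0 behavior above can be sketched as follows. This is a minimal sketch of the loop shape only; fetch_batch is a hypothetical stand-in for the adapter's actual GraphQL request step:

```python
# Sketch of the adapter's pagination loop (fetch_batch is a placeholder).
BATCH = 24  # LinkedIn returns recommended jobs in batches of 24


def crawl_recommended(fetch_batch, limit=200):
    """Fetch pages via the `start` offset until `limit` is reached.

    limit == 0 means unlimited: keep going until an empty batch.
    fetch_batch(start, count) -> list of job dicts (hypothetical helper).
    """
    results = []
    start = 0
    while True:
        if limit > 0:
            count = min(BATCH, limit - len(results))
            if count <= 0:
                break  # requested number of jobs collected
        else:
            count = BATCH
        batch = fetch_batch(start, count)
        if not batch:
            break  # server exhausted: stop even in unlimited mode
        results.extend(batch)
        start += len(batch)
    return results
```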

Output Columns

rank, title, company, location, workplace_type, salary, posted_time, applicant_count, easy_apply, url

Usage

# Default 200 results
autocli linkedin recommended -f json

# Specify count
autocli linkedin recommended --limit 50 -f json

# Unlimited (crawls all available)
autocli linkedin recommended --limit 0 -f json

# Table format
autocli linkedin recommended --limit 20

# CSV
autocli linkedin recommended --limit 100 -f csv

How to Test

Prerequisites: Chrome must be open with LinkedIn signed in, and the AutoCLI Chrome extension must be installed.

# Quick smoke test (5 results)
autocli linkedin recommended --limit 5

# Verify Easy Apply detection (should see "true" values)
autocli linkedin recommended --limit 24 -f json | grep easy_apply

# Verify pagination (should return exactly 50)
autocli linkedin recommended --limit 50 -f json | python3 -c "import json,sys; d=json.load(sys.stdin); print(f'{len(d)} results')"

# Diagnostic (verify auth & API discovery)
autocli linkedin recommended --limit 3 -f json

Known Quirks / Pitfalls

  1. GraphQL variable encoding: LinkedIn requires colons (:) and parentheses to remain raw (not URL-encoded) in GraphQL variables. Full encodeURIComponent causes HTTP 400. The adapter uses a partial-encode-then-decode approach.
  2. No total count: The API doesn't return a totalCount field. --limit 0 fetches incrementally until the server returns an empty batch.
  3. No applicant_count: Unlike the REST search API, this GraphQL endpoint's jobPostingCard doesn't include applicant count. Column is preserved but always returns "N/A".
  4. No easyApplyUrl field: Easy Apply detection uses footerItems type — verified via 200-job crawl with ~30% Easy Apply rate.
  5. Dynamic queryId: The GraphQL queryId includes a version hash that may change. The adapter discovers it dynamically via performance.getEntriesByType('resource'), so no hardcoded ID to maintain.
  6. Workplace type parsing: Workplace type is embedded in the location string in parentheses. Regex extracts On-site/Hybrid/Remote and strips it from the location field.
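
Quirks 1 and 6 can be illustrated with a small sketch. The exact safe-character set and the location-string format here are assumptions inferred from the notes above, not documented LinkedIn behavior:

```python
import re
from urllib.parse import quote


def encode_graphql_variables(raw: str) -> str:
    """Percent-encode a GraphQL variables string while leaving colons,
    parentheses, and commas raw (full encodeURIComponent-style encoding
    triggers HTTP 400). The safe-character set is an assumption based
    on the PR notes, not LinkedIn documentation.
    """
    return quote(raw, safe="():,")


def split_workplace(location: str):
    """Extract On-site/Hybrid/Remote from a trailing parenthetical,
    e.g. "London (Hybrid)" -> ("London", "Hybrid")."""
    m = re.match(r"^(.*?)\s*\((On-site|Hybrid|Remote)\)\s*$", location)
    if not m:
        return location, None
    return m.group(1), m.group(2)
```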

Rick Sanchez and others added 16 commits April 29, 2026 02:18
Adds `linkedin recommended` adapter for crawling LinkedIn JYMBII algorithm
recommended jobs via GraphQL API. Supports automatic pagination, Easy Apply
detection via footerItems EASY_APPLY_TEXT, workplace type parsing, and
unlimited mode (--limit 0).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…quest signatures, pagination, and test commands
Local LLM (qwen3) → structured JSON → Supabase pipeline:
- 5-module Python pipeline: config, preprocess, LLM, db, orchestrator
- Grammar-constrained generation via llama.cpp json_schema
- 3-attempt retry at temp=0: standard → repair → minimal
- Atomic claim/upsert via Supabase RPC functions
- Stale processing reaper, dead-letter queue, extraction_runs tracking
- Per-run report: console summary + failed-jobs detail + JSON report
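
The 3-attempt retry ladder might look roughly like this. A sketch only: llm_call is a placeholder, and the real pipeline validates via llama.cpp's grammar-constrained generation rather than a plain json.loads check:

```python
import json

# The strategy names come from the commit message; everything else is
# illustrative.
ATTEMPTS = ["standard", "repair", "minimal"]


def extract_with_retry(llm_call, job_text):
    """Try each prompt strategy in order (all at temperature 0);
    return the first result that parses as JSON."""
    last_err = None
    for strategy in ATTEMPTS:
        raw = llm_call(job_text, strategy=strategy)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as e:
            last_err = e  # fall through to the next strategy
    raise RuntimeError(f"all attempts failed: {last_err}")
```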

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
linkedin recommended --limit 0 --with_jd triggers long-running commands
that scroll the full job list and fetch descriptions for each, which
can exceed the previous 30-second HTTP timeout.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Add clean_linkedin_jobs.py pipeline that extracts URLs from multiple
fields, normalizes URLs, validates LinkedIn records (requiring easy_apply
or external_url), and maps apply_url/source_channel/apply_type correctly.

Includes:
- clean_linkedin_jobs.py: HTML cleaning, URL extraction cascade,
  salary parsing, batch dedup, dead letter queue
- sync_autocli_jobs.py: Supabase RPC upsert with source_channel/apply_type
- 23 unit tests with TDD (clean + sync + validation + URL mapping)
- 5 migrations: schema, url_hash, source_channel/apply_type,
  drop url_hash unique constraint, old data cleanup
- daemon health check wait in main.rs

bad_count invariant: 776 -> 0 (after cleanup + pipeline fix)
Chrome debugger can detach mid-command on SPA pages (e.g. LinkedIn),
returning "Detached while handling command". This error was not in the
retry list, causing the extension to give up immediately instead of
re-attaching and retrying.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Extension `WINDOW_IDLE_TIMEOUT` (30s) would fire during evaluate steps
that run longer than the timeout (e.g. --limit 0 fetching all LinkedIn
recommended jobs). Added activeCommands counter per workspace so the
idle timer only starts when no commands are in-flight.
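
The gating amounts to a per-workspace counter. The extension itself is JavaScript; this Python sketch (with illustrative names) just shows the invariant that the idle timer may only arm when no commands are in flight:

```python
class Workspace:
    """Sketch of the per-workspace activeCommands gating (names are
    illustrative; the real logic lives in the JS extension)."""

    def __init__(self):
        self.active_commands = 0    # commands currently in flight
        self.idle_timer_armed = False

    def command_started(self):
        self.active_commands += 1
        self.idle_timer_armed = False  # cancel any pending idle timeout

    def command_finished(self):
        self.active_commands = max(0, self.active_commands - 1)
        if self.active_commands == 0:
            self.idle_timer_armed = True  # only now may the 30s timer start
```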

Added `scripts/autocli-baseline.sh` with 8 pre-flight checks (autocli
binary, Chrome process, daemon, extension, LinkedIn reachability, DNS,
output dir, disk space) with structured timestamped logging and --json
output. Includes 13-test suite at `scripts/test_baseline.sh`.
`check_extension_freshness` compares dist/background.js mtime against a
refresh marker file (.baseline-last-refresh). On first run (no marker)
it warns; when dist is newer than last refresh it fails with a clear
hint to use --refresh-extension.

`--refresh-extension` uses browser-harness CDP to navigate to
chrome://extensions, find the AutoCLI card, and click its reload button,
then updates the marker.

Test suite now has 15 tests covering all freshness scenarios.
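
The freshness check boils down to an mtime comparison; a minimal sketch, assuming the paths from the commit message and illustrative return values:

```python
import os


def check_extension_freshness(dist="dist/background.js",
                              marker=".baseline-last-refresh"):
    """Return 'warn' on first run (no marker yet), 'fail' if dist was
    rebuilt after the last recorded refresh, else 'ok'. Return values
    are illustrative; the real script emits structured log lines."""
    if not os.path.exists(marker):
        return "warn"   # first run: no refresh recorded yet
    if os.path.getmtime(dist) > os.path.getmtime(marker):
        return "fail"   # dist newer than last --refresh-extension
    return "ok"
```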
sync_autocli_jobs.py looked for "apply_type" key in raw records, but
LinkedIn raw data uses "easy_apply". Records from this pipeline were
silently defaulted to apply_type='unknown'. Added a fallback check for
the "easy_apply" field to correctly classify LinkedIn easy-apply jobs.

Also ran a SQL migration to fix 271 existing rows that were affected.
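
The fallback amounts to checking easy_apply when apply_type is absent. A sketch with illustrative classification labels (the field names come from the commit message; the real script's label strings may differ):

```python
def classify_apply_type(record: dict) -> str:
    """Classify a raw job record's apply type, falling back to the
    LinkedIn-specific "easy_apply" field when "apply_type" is missing."""
    if "apply_type" in record:
        return record["apply_type"]
    # LinkedIn raw data carries "easy_apply" instead of "apply_type"
    if record.get("easy_apply") in (True, "true", "True"):
        return "easy_apply"
    if "easy_apply" in record:
        return "external"  # present but falsy: an external-apply job
    return "unknown"       # neither field: previously the silent default
```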