Version/v0.3#24
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 63406a2e56
with httpx.stream(
    "GET",
    current_url,
Bind icon proxy fetches to the validated address
For attacker-controlled icon hosts, this fetch re-resolves current_url after _validate_icon_proxy_url has already done its safety check, so DNS rebinding can return a public address during validation and a loopback/link-local address for the actual httpx.stream connection. Because this endpoint proxies arbitrary user-supplied icon URLs, that still allows SSRF against internal services despite the private-address guard; the fetch needs to use the validated resolved address (or otherwise prevent a second unsafe resolution) for every redirect hop.
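One way to read the suggestion above is to resolve the icon host once, reject the URL if any answer is non-public, and then connect to the validated IP while sending the original host name in the `Host` header. The sketch below is a minimal illustration under those assumptions; `pin_icon_request`, `system_resolver`, and `is_public_address` are hypothetical names, not functions from this repository, and the real `_validate_icon_proxy_url` may work differently.

```python
import ipaddress
import socket
from typing import Callable, Iterable
from urllib.parse import urlsplit, urlunsplit

# A resolver maps a host name to the IP addresses it currently resolves to.
Resolver = Callable[[str], Iterable[str]]


def system_resolver(host: str) -> list[str]:
    """Resolve a host name once with the OS resolver (assumed default)."""
    infos = socket.getaddrinfo(host, None, type=socket.SOCK_STREAM)
    return [info[4][0] for info in infos]


def is_public_address(address: str) -> bool:
    """True only for globally routable addresses (no loopback, private, link-local)."""
    return ipaddress.ip_address(address).is_global


def pin_icon_request(
    url: str, resolve: Resolver = system_resolver
) -> tuple[str, dict[str, str]]:
    """Resolve once, validate every answer, and return (ip_based_url, headers).

    The caller fetches the returned URL, so no second DNS lookup happens
    between validation and connection.
    """
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError("unsupported icon URL")

    addresses = list(resolve(parts.hostname))
    if not addresses or not all(is_public_address(a) for a in addresses):
        raise ValueError("icon host resolves to a non-public address")

    ip = addresses[0]
    host_literal = f"[{ip}]" if ":" in ip else ip  # bracket IPv6 literals
    netloc = host_literal if parts.port is None else f"{host_literal}:{parts.port}"
    pinned = urlunsplit((parts.scheme, netloc, parts.path or "/", parts.query, ""))

    host_header = (
        parts.hostname if parts.port is None else f"{parts.hostname}:{parts.port}"
    )
    return pinned, {"Host": host_header}
```

The same check would have to run again for every redirect hop, since each `Location` target is a new attacker-influenced URL. For HTTPS, the TLS handshake still needs the original host name for SNI and certificate verification, which this sketch does not handle.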
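Current task: rework the home page. Strip out the clutter and keep only the simple crawled-blog statistics plus one big, cool search box for checking whether your own blog has been crawled. If the search finds it, you can click straight through, so write a blog detail page that shows the blog's basic information.

The first pass of the blog detail page is mostly done. I also have a further idea: keep the user's browsing path so they can keep clicking from one blog to the next? But that feels somewhat complicated on the frontend 🤔 and the idea isn't fully thought out, so skip it for now.

Oh wait, why is pytest failing? No big deal. Fixing it.

Next I'd like to fill in some infrastructure: record each user's search history, record the 9 URLs shown on each refresh of the random-blog page, whether the user clicked any of them, the total click count per URL, and so on. Maybe add one or more tables to log these events, as groundwork for a later recommendation system or model training? But I've never worked on anything like this ><, I don't know the best way to do it, and just thinking about it feels complicated. It's also the kind of thing that should be settled up front, since changing it back and forth wastes time and can invalidate historical data qwq. Ugh, could some expert just build this whole part for me...

Anyway, fill it in first:
Use normalized_url as the key. Since the user system isn't complete yet, record everything under -1 as the guest id for now? Looks fine to me.

Oh, can the user's IP be recorded?

Or save the data locally first? That way, when the user later logs in to an account, the local data can be read directly...?

Maybe finish the user system properly before wrapping up this version)

While at it, add statistics too) Put some back-office management in the admin page, such as user click-through rate, visit rate, and the like.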
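The event log discussed in this description could start from something like the following sketch. It assumes SQLite and an append-only design; the table and function names (`reco_request`, `reco_impression`, `reco_click`, `record_batch`, `record_click`, `click_through_rate`) are made up for illustration and are not taken from this PR's migrations.

```python
import sqlite3

GUEST_USER_ID = -1  # placeholder guest id, as proposed in the description

# Hypothetical schema: one row per refresh, one row per URL shown, one row per click.
SCHEMA = """
CREATE TABLE reco_request (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    user_id INTEGER NOT NULL DEFAULT -1,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE reco_impression (
    request_id INTEGER NOT NULL REFERENCES reco_request(id),
    position INTEGER NOT NULL,
    normalized_url TEXT NOT NULL,
    PRIMARY KEY (request_id, position)
);
CREATE TABLE reco_click (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    request_id INTEGER NOT NULL REFERENCES reco_request(id),
    normalized_url TEXT NOT NULL,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""


def open_log() -> sqlite3.Connection:
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def record_batch(conn: sqlite3.Connection, urls: list[str],
                 user_id: int = GUEST_USER_ID) -> int:
    """Log one refresh and the URLs it displayed; return the request id."""
    cur = conn.execute("INSERT INTO reco_request (user_id) VALUES (?)", (user_id,))
    request_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO reco_impression (request_id, position, normalized_url) "
        "VALUES (?, ?, ?)",
        [(request_id, position, url) for position, url in enumerate(urls)],
    )
    return request_id


def record_click(conn: sqlite3.Connection, request_id: int, url: str) -> None:
    """Log a click on one of the URLs shown for a given refresh."""
    conn.execute(
        "INSERT INTO reco_click (request_id, normalized_url) VALUES (?, ?)",
        (request_id, url),
    )


def click_through_rate(conn: sqlite3.Connection, url: str) -> float:
    """Clicks divided by impressions for one normalized URL (0.0 if never shown)."""
    shown = conn.execute(
        "SELECT COUNT(*) FROM reco_impression WHERE normalized_url = ?", (url,)
    ).fetchone()[0]
    clicks = conn.execute(
        "SELECT COUNT(*) FROM reco_click WHERE normalized_url = ?", (url,)
    ).fetchone()[0]
    return clicks / shown if shown else 0.0
```

Storing raw events and deriving counts at query time means totals such as per-URL clicks never need to be migrated when the statistics change, which is one way to address the worry about invalidating historical data.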
Pull request overview
This PR is a broad “v0.3” iteration that expands HeyBlog’s crawler + persistence boundary (durable seeds, acceptance vs crawl status, richer crawl error reporting), adds user lifecycle capabilities (email verification + password reset with SMTP/noop delivery), and upgrades the frontend to support recommendation/event tracking plus improved visualization tooling (benchmark graph + rendering progress).
Changes:
- Add durable seed table + acceptance status fields; simplify crawler queue claiming; persist crawl error kind/message; validate favicon URLs before storing.
- Introduce user lifecycle flows (pending registrations, verification/reset tokens, SMTP delivery) and recommendation request/impression/interaction tracking with admin hourly stats.
- Frontend updates for tracked interactions, admin gating via verified admin session, blog detail navigation, and visualization benchmark + rendering progress UI.
Reviewed changes
Copilot reviewed 79 out of 82 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tracker/log-system.md | Remove legacy logging tracker doc |
| tests/test_visualization_benchmark.py | Add benchmark graph generator test |
| tests/test_site_metadata.py | Update favicon extraction expectations |
| tests/test_runtime.py | Update runtime queue behavior tests |
| tests/test_pipeline.py | Add seed + icon + edge regression tests |
| tests/test_graph_projection.py | Expand core-view seed strategy tests |
| tests/test_filters.py | Update TLD-blocking behavior tests |
| shared/http_clients/persistence_http.py | Extend persistence HTTP API client |
| shared/config.py | Add public URL + email/SMTP settings |
| seed.csv | Update default seed list |
| scripts/run_visualization_benchmark.sh | Add script to run benchmark locally |
| scripts/generate_visualization_benchmark.py | Add deterministic benchmark graph generator |
| readme.md | Add project timeline + quick-start note |
| pyproject.toml | Add extract-favicon + ruff deps |
| persistence_api/models.py | Add seeds/users/tokens/audit/reco/admin stats models |
| persistence_api/main.py | Add endpoints for seeds, auth lifecycle, reco, admin stats |
| persistence_api/graph_projection.py | Adjust available graph + seed strategy limiting |
| persistence_api/email_delivery.py | Add SMTP/noop email delivery adapter |
| memory/MEMORY.md | Remove memory index doc |
| memory/filter-chain-two-phase-architecture.md | Remove filter-chain memory doc |
| frontend/src/types/graph.ts | Extend types for detail/reco/admin/auth |
| frontend/src/pages/VisualizationPage.tsx | Add benchmark mode + render progress + compact/full toggle |
| frontend/src/pages/RandomBlogPage.tsx | Switch to random-batch API + interaction tracking |
| frontend/src/pages/ProfilePage.tsx | Add verify/forgot/reset flows and UI states |
| frontend/src/pages/AdminPage.tsx | Add hourly stats + prefer admin session token |
| frontend/src/pages/AboutPage.tsx | Update styling + minor link UI |
| frontend/src/lib/icon.ts | Add multi-candidate favicon + proxy helpers |
| frontend/src/lib/icon.test.ts | Add icon helper unit tests |
| frontend/src/lib/blogInteractions.ts | Add visitor/session + event tracking helpers |
| frontend/src/lib/benchmarkGraph.ts | Add loader for static benchmark graph |
| frontend/src/lib/auth.ts | Add hasStoredAdminSession helper |
| frontend/src/components/SubmitBlogDialog.tsx | Switch to user-seed submission (no email) |
| frontend/src/components/Navigation.tsx | Show admin nav only for verified admin session |
| frontend/src/components/MissingBlogConfirmDialog.tsx | Add confirm dialog for missing blog search |
| frontend/src/components/GraphVisualization.test.tsx | Expand force-graph tests for new behavior |
| frontend/src/components/BlogExternalLink.tsx | Add tracked external-link component |
| frontend/src/components/BlogDetailPanel.tsx | Use tracked external link for URL |
| frontend/src/components/BlogDetailLink.tsx | Add tracked blog-detail navigation button |
| frontend/src/components/BlogCard.tsx | Track external opens from cards |
| frontend/src/App.tsx | Add blog detail route + guarded admin route |
| frontend/server.py | Forward cache-control from proxied API |
| frontend/package.json | Add react-force-graph-2d dependency |
| frontend/package-lock.json | Lockfile updates for new dependency |
| docker-compose.yml | Add public/email/SMTP env wiring for persistence |
| doc/service-architecture.md | Update reset semantics + remove dedup mention |
| doc/public-admin-boundary.md | Update public/admin capabilities + auth boundary |
| doc/crawler-url-filtering.md | Update queue claim + icon validation + error-kind docs |
| doc/config-reference.md | Document new public/email/SMTP config |
| crawler/runtime/service.py | Remove priority fairness; simple waiting-queue claim |
| crawler/README.md | Update runtime responsibility description |
| crawler/main.py | Remove priority fairness wiring |
| crawler/filters.py | Remove unused imports after rule changes |
| crawler/crawling/pipeline.py | Simplify claim loop; persist crawl error kind/message |
| crawler/crawling/orchestrator.py | Validate icon URLs; persist edges for duplicate targets |
| crawler/crawling/metadata.py | Use extract-favicon; stop synthesizing fallback favicon |
| crawler/crawling/fetching/httpx_fetcher.py | Add icon URL validation request |
| crawler/crawling/fetching/base.py | Add validate_icon_url interface |
| crawler/crawling/decisions/rule_helpers.py | Allow .org; keep .gov/.edu blocked |
| crawler/crawling/decisions/chain.py | Add implicit success deciders when config omits them |
| crawler/crawling/bootstrap.py | Replay durable seeds; persist seed provenance |
| alembic/versions/20260611_02_add_admin_hourly_stats.py | Add admin hourly stats table migration |
| alembic/versions/20260611_01_add_pending_user_registrations.py | Add pending registrations migration |
| alembic/versions/20260609_01_extend_user_system.py | Extend users + add tokens/audit migrations |
| alembic/versions/20260607_04_drop_recommendation_blog_id_columns.py | Normalize reco event schema migration |
| alembic/versions/20260607_03_drop_deprecated_ingestion_and_dedup_tables.py | Drop deprecated ingestion/dedup tables |
| alembic/versions/20260607_02_add_blog_interaction_entrance_fields.py | Add entrance metadata migration |
| alembic/versions/20260607_01_add_recommendation_event_tables.py | Add recommendation request/impression/interaction tables |
| alembic/versions/20260606_01_add_seed_table.py | Add durable seeds table migration |
| alembic/versions/20260602_01_add_blog_acceptance_status.py | Split acceptance vs crawl execution migration |
| alembic/versions/20260423_02_add_blog_business_key_schema.py | Remove deprecated FK rewrites |
| .env.example | Add public/email/SMTP example vars |
Files not reviewed (1)
- frontend/package-lock.json: Language not supported
public_base_url=os.getenv("HEYBLOG_PUBLIC_BASE_URL", "http://127.0.0.1:3000").rstrip("/"),
email_provider=os.getenv("HEYBLOG_EMAIL_PROVIDER", "disabled").strip().lower() or "disabled",
email_from=os.getenv("HEYBLOG_EMAIL_FROM", "").strip(),
email_dev_expose_tokens=_parse_bool_env("HEYBLOG_EMAIL_DEV_EXPOSE_TOKENS", default=True),
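I don't really know what I changed either.
Probably a big pile of miscellaneous frontend and backend logic.
Oh, I also fixed the edge storage logic and separated crawler status from blog status in the database to avoid semantic confusion.
Not finished yet. I plan to fix up the frontend a bit more and get the blog detail page out, and then 0.3 is about done.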