Version/v0.3#24

Merged
ligeaaa merged 36 commits into main from version/v0.3 on Jun 11, 2026
Conversation

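ligeaaa commented Jun 5, 2026

Member

I'm not even sure what all changed.
Probably a big miscellaneous pile of frontend and backend logic.
Oh, and I also fixed the edge storage logic and separated the crawler status from the blog status in the database, to avoid semantic confusion.

It's not finished yet. I plan to polish the frontend a bit more and get the blog detail page built; that should about wrap up 0.3.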
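
To illustrate the status split mentioned above, here is a minimal sketch of keeping crawl execution state and blog acceptance state in two separate fields. All class names, field names, and enum values are hypothetical and are not taken from the repository's models.

```python
from dataclasses import dataclass
from enum import Enum


class CrawlStatus(Enum):
    """Hypothetical states of the crawler's work on a blog."""
    WAITING = "waiting"
    RUNNING = "running"
    FINISHED = "finished"
    FAILED = "failed"


class AcceptanceStatus(Enum):
    """Hypothetical states of the blog itself (is it a listed blog?)."""
    PENDING = "pending"
    ACCEPTED = "accepted"
    REJECTED = "rejected"


@dataclass
class BlogRecord:
    normalized_url: str
    crawl_status: CrawlStatus = CrawlStatus.WAITING
    acceptance_status: AcceptanceStatus = AcceptanceStatus.PENDING


def needs_crawl(blog: BlogRecord) -> bool:
    # Only the crawl field decides whether the crawler should pick this up.
    return blog.crawl_status is CrawlStatus.WAITING


def is_listable(blog: BlogRecord) -> bool:
    # Only the acceptance field decides whether the blog is shown publicly.
    return blog.acceptance_status is AcceptanceStatus.ACCEPTED
```

With two fields, a blog whose crawl failed can still be accepted, and a rejected blog can still have a finished crawl, which a single combined status column cannot express.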

chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 63406a2e56

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread backend/main.py
Comment on lines +178 to +180
with httpx.stream(
"GET",
current_url,

P1: Bind icon proxy fetches to the validated address

For attacker-controlled icon hosts, this fetch re-resolves current_url after _validate_icon_proxy_url has already done its safety check, so DNS rebinding can return a public address during validation and a loopback/link-local address for the actual httpx.stream connection. Because this endpoint proxies arbitrary user-supplied icon URLs, that still allows SSRF against internal services despite the private-address guard; the fetch needs to use the validated resolved address (or otherwise prevent a second unsafe resolution) for every redirect hop.
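
A minimal sketch of the "resolve once, validate, then connect to that address" idea described in this comment. The function names and the injectable `resolver` parameter are assumptions for illustration, not code from `backend/main.py`; only the standard-library `socket` and `ipaddress` modules are used.

```python
import ipaddress
import socket
from typing import Callable, List


def is_public_address(ip: str) -> bool:
    """Return True only if the address is outside internal/special ranges."""
    addr = ipaddress.ip_address(ip)
    return not (
        addr.is_private
        or addr.is_loopback
        or addr.is_link_local
        or addr.is_multicast
        or addr.is_reserved
        or addr.is_unspecified
    )


def resolve_host(host: str, port: int) -> List[str]:
    """Resolve a hostname to its IP addresses with a single DNS lookup."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return [info[4][0] for info in infos]


def pin_validated_address(
    host: str,
    port: int,
    resolver: Callable[[str, int], List[str]] = resolve_host,
) -> str:
    """Resolve once, reject the host if ANY address is unsafe, return one IP.

    The caller is expected to connect to the returned IP directly instead of
    resolving the hostname a second time.
    """
    addresses = resolver(host, port)
    if not addresses:
        raise ValueError(f"no addresses for {host!r}")
    for ip in addresses:
        if not is_public_address(ip):
            raise ValueError(f"{host!r} resolves to non-public address {ip}")
    return addresses[0]
```

The same check would have to run again for every redirect hop, and the actual connection would need to target the returned IP while still presenting the original hostname for the `Host` header and TLS verification. How to wire that into the HTTP client is deliberately left out of this sketch.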

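ligeaaa commented Jun 5, 2026

Member Author

Current task: rework the home page, strip out the clutter, and keep only simple crawled-blog statistics plus one big, cool search box! The search box is for checking whether your own blog has been crawled. If it is found, you can click through, so I'll also write a blog detail page that shows the blog's basic information.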

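ligeaaa commented Jun 7, 2026

Member Author

The initial work on the blog detail page is mostly done. I also have a further idea: keep the user's browsing path so they can keep clicking from one blog to the next? But that feels somewhat complex on the frontend 🤔 and the idea isn't fully thought through, so I'll skip it for now.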

ligeaaa commented Jun 7, 2026

Member Author

Oh wait, why is pytest failing? No big deal. I'll fix it.

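ligeaaa commented Jun 7, 2026

Member Author

Next I'd like to shore up some infrastructure: record users' search history, record the 9 URLs shown on each refresh of the random-blog page, whether the user clicked any of them and which one, the total click count per URL, and so on. Maybe add one or more database tables to record this log information, as groundwork for a later recommendation system or model training? But I've never worked on anything like this ><, I don't know the best way to do it, and it feels complicated just thinking about it. This kind of thing is also best thought through and settled up front; changing it back and forth wastes time and can easily invalidate historical data qwq. Ugh, could some expert please just build this part for me...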

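ligeaaa commented Jun 7, 2026

Member Author

Next I'd like to shore up some infrastructure: record users' search history, record the 9 URLs shown on each refresh of the random-blog page, whether the user clicked any of them and which one, the total click count per URL, and so on. Maybe add one or more database tables to record this log information, as groundwork for a later recommendation system or model training? But I've never worked on anything like this ><, I don't know the best way to do it, and it feels complicated just thinking about it. This kind of thing is also best thought through and settled up front; changing it back and forth wastes time and can easily invalidate historical data qwq. Ugh, could some expert please just build this part for me...

Anyway, fill in these first:

  1. Record the 9 URLs shown on each refresh of the random-blog page, whether the user clicked any of them, and the order of the clicks
  2. The total click count for each URL

Use normalized_url as the key. Since the user system is still incomplete, use -1 as the guest id for everyone for now? Looks fine to me.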
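
A rough sketch of the two items above, using an in-memory SQLite database. The table names, column names, and helper functions are all hypothetical; the recommendation event tables this PR actually adds may be shaped differently. The guest id of -1 and the `normalized_url` key follow the comment above.

```python
import sqlite3
from typing import Sequence

GUEST_USER_ID = -1  # placeholder id for all visitors until accounts exist

SCHEMA = """
CREATE TABLE random_batch (
    batch_id   INTEGER PRIMARY KEY AUTOINCREMENT,
    user_id    INTEGER NOT NULL,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE batch_impression (
    batch_id       INTEGER NOT NULL REFERENCES random_batch(batch_id),
    position       INTEGER NOT NULL,
    normalized_url TEXT NOT NULL,
    PRIMARY KEY (batch_id, position)
);
CREATE TABLE batch_click (
    click_id       INTEGER PRIMARY KEY AUTOINCREMENT,
    batch_id       INTEGER NOT NULL REFERENCES random_batch(batch_id),
    normalized_url TEXT NOT NULL,
    click_order    INTEGER NOT NULL
);
"""


def record_batch(conn: sqlite3.Connection, urls: Sequence[str],
                 user_id: int = GUEST_USER_ID) -> int:
    """Store one refresh of the random-blog page and the URLs it showed."""
    cur = conn.execute("INSERT INTO random_batch (user_id) VALUES (?)", (user_id,))
    batch_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO batch_impression (batch_id, position, normalized_url) "
        "VALUES (?, ?, ?)",
        [(batch_id, pos, url) for pos, url in enumerate(urls)],
    )
    return batch_id


def record_click(conn: sqlite3.Connection, batch_id: int, normalized_url: str) -> int:
    """Store a click and return its 1-based order within the batch."""
    (seen,) = conn.execute(
        "SELECT COUNT(*) FROM batch_click WHERE batch_id = ?", (batch_id,)
    ).fetchone()
    order = seen + 1
    conn.execute(
        "INSERT INTO batch_click (batch_id, normalized_url, click_order) "
        "VALUES (?, ?, ?)",
        (batch_id, normalized_url, order),
    )
    return order


def total_clicks(conn: sqlite3.Connection, normalized_url: str) -> int:
    """Total clicks for one URL across all batches."""
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM batch_click WHERE normalized_url = ?",
        (normalized_url,),
    ).fetchone()
    return count
```

Deriving the per-URL total from the click rows, rather than keeping a separate counter column, keeps the raw events as the single source of truth, so later schema changes are less likely to invalidate historical data.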

ligeaaa commented Jun 7, 2026

Member Author

Oh, can the user's IP be recorded?

ligeaaa commented Jun 7, 2026

Member Author

Or save the data locally first? That way, when the user later logs in to an account, the local data can be read directly...?

ligeaaa commented Jun 7, 2026

Member Author

How about finishing the user system properly before closing out this version :)

ligeaaa commented Jun 9, 2026

Member Author

Might as well add statistics too :) Add some back-office management to the admin page, such as user click-through rate, visit rate, and so on.

Copilot AI left a comment

Contributor

Pull request overview

This PR is a broad “v0.3” iteration that expands HeyBlog’s crawler + persistence boundary (durable seeds, acceptance vs crawl status, richer crawl error reporting), adds user lifecycle capabilities (email verification + password reset with SMTP/noop delivery), and upgrades the frontend to support recommendation/event tracking plus improved visualization tooling (benchmark graph + rendering progress).

Changes:

  • Add durable seed table + acceptance status fields; simplify crawler queue claiming; persist crawl error kind/message; validate favicon URLs before storing.
  • Introduce user lifecycle flows (pending registrations, verification/reset tokens, SMTP delivery) and recommendation request/impression/interaction tracking with admin hourly stats.
  • Frontend updates for tracked interactions, admin gating via verified admin session, blog detail navigation, and visualization benchmark + rendering progress UI.
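
As an illustration of the hourly admin stats mentioned in the list above, here is a small sketch that buckets tracked events by UTC hour. The event shape (a timestamp plus a kind string) and both function names are assumptions; the PR's actual aggregation lives in the persistence service and is not shown in this thread.

```python
from collections import Counter
from datetime import datetime, timezone
from typing import Iterable, Tuple


def hour_bucket(ts: datetime) -> datetime:
    """Truncate a timezone-aware timestamp to the start of its UTC hour."""
    return ts.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)


def hourly_counts(events: Iterable[Tuple[datetime, str]]) -> Counter:
    """Count events per (hour, kind) pair."""
    counts: Counter = Counter()
    for ts, kind in events:
        counts[(hour_bucket(ts), kind)] += 1
    return counts
```

A click-through rate for a given hour could then be computed as the click count divided by the impression count for that bucket, guarding against a zero denominator.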

Reviewed changes

Copilot reviewed 79 out of 82 changed files in this pull request and generated 4 comments.

Summary per file:

| File | Description |
| --- | --- |
| tracker/log-system.md | Remove legacy logging tracker doc |
| tests/test_visualization_benchmark.py | Add benchmark graph generator test |
| tests/test_site_metadata.py | Update favicon extraction expectations |
| tests/test_runtime.py | Update runtime queue behavior tests |
| tests/test_pipeline.py | Add seed + icon + edge regression tests |
| tests/test_graph_projection.py | Expand core-view seed strategy tests |
| tests/test_filters.py | Update TLD-blocking behavior tests |
| shared/http_clients/persistence_http.py | Extend persistence HTTP API client |
| shared/config.py | Add public URL + email/SMTP settings |
| seed.csv | Update default seed list |
| scripts/run_visualization_benchmark.sh | Add script to run benchmark locally |
| scripts/generate_visualization_benchmark.py | Add deterministic benchmark graph generator |
| readme.md | Add project timeline + quick-start note |
| pyproject.toml | Add extract-favicon + ruff deps |
| persistence_api/models.py | Add seeds/users/tokens/audit/reco/admin stats models |
| persistence_api/main.py | Add endpoints for seeds, auth lifecycle, reco, admin stats |
| persistence_api/graph_projection.py | Adjust available graph + seed strategy limiting |
| persistence_api/email_delivery.py | Add SMTP/noop email delivery adapter |
| memory/MEMORY.md | Remove memory index doc |
| memory/filter-chain-two-phase-architecture.md | Remove filter-chain memory doc |
| frontend/src/types/graph.ts | Extend types for detail/reco/admin/auth |
| frontend/src/pages/VisualizationPage.tsx | Add benchmark mode + render progress + compact/full toggle |
| frontend/src/pages/RandomBlogPage.tsx | Switch to random-batch API + interaction tracking |
| frontend/src/pages/ProfilePage.tsx | Add verify/forgot/reset flows and UI states |
| frontend/src/pages/AdminPage.tsx | Add hourly stats + prefer admin session token |
| frontend/src/pages/AboutPage.tsx | Update styling + minor link UI |
| frontend/src/lib/icon.ts | Add multi-candidate favicon + proxy helpers |
| frontend/src/lib/icon.test.ts | Add icon helper unit tests |
| frontend/src/lib/blogInteractions.ts | Add visitor/session + event tracking helpers |
| frontend/src/lib/benchmarkGraph.ts | Add loader for static benchmark graph |
| frontend/src/lib/auth.ts | Add hasStoredAdminSession helper |
| frontend/src/components/SubmitBlogDialog.tsx | Switch to user-seed submission (no email) |
| frontend/src/components/Navigation.tsx | Show admin nav only for verified admin session |
| frontend/src/components/MissingBlogConfirmDialog.tsx | Add confirm dialog for missing blog search |
| frontend/src/components/GraphVisualization.test.tsx | Expand force-graph tests for new behavior |
| frontend/src/components/BlogExternalLink.tsx | Add tracked external-link component |
| frontend/src/components/BlogDetailPanel.tsx | Use tracked external link for URL |
| frontend/src/components/BlogDetailLink.tsx | Add tracked blog-detail navigation button |
| frontend/src/components/BlogCard.tsx | Track external opens from cards |
| frontend/src/App.tsx | Add blog detail route + guarded admin route |
| frontend/server.py | Forward cache-control from proxied API |
| frontend/package.json | Add react-force-graph-2d dependency |
| frontend/package-lock.json | Lockfile updates for new dependency |
| docker-compose.yml | Add public/email/SMTP env wiring for persistence |
| doc/service-architecture.md | Update reset semantics + remove dedup mention |
| doc/public-admin-boundary.md | Update public/admin capabilities + auth boundary |
| doc/crawler-url-filtering.md | Update queue claim + icon validation + error-kind docs |
| doc/config-reference.md | Document new public/email/SMTP config |
| crawler/runtime/service.py | Remove priority fairness; simple waiting-queue claim |
| crawler/README.md | Update runtime responsibility description |
| crawler/main.py | Remove priority fairness wiring |
| crawler/filters.py | Remove unused imports after rule changes |
| crawler/crawling/pipeline.py | Simplify claim loop; persist crawl error kind/message |
| crawler/crawling/orchestrator.py | Validate icon URLs; persist edges for duplicate targets |
| crawler/crawling/metadata.py | Use extract-favicon; stop synthesizing fallback favicon |
| crawler/crawling/fetching/httpx_fetcher.py | Add icon URL validation request |
| crawler/crawling/fetching/base.py | Add validate_icon_url interface |
| crawler/crawling/decisions/rule_helpers.py | Allow .org; keep .gov/.edu blocked |
| crawler/crawling/decisions/chain.py | Add implicit success deciders when config omits them |
| crawler/crawling/bootstrap.py | Replay durable seeds; persist seed provenance |
| alembic/versions/20260611_02_add_admin_hourly_stats.py | Add admin hourly stats table migration |
| alembic/versions/20260611_01_add_pending_user_registrations.py | Add pending registrations migration |
| alembic/versions/20260609_01_extend_user_system.py | Extend users + add tokens/audit migrations |
| alembic/versions/20260607_04_drop_recommendation_blog_id_columns.py | Normalize reco event schema migration |
| alembic/versions/20260607_03_drop_deprecated_ingestion_and_dedup_tables.py | Drop deprecated ingestion/dedup tables |
| alembic/versions/20260607_02_add_blog_interaction_entrance_fields.py | Add entrance metadata migration |
| alembic/versions/20260607_01_add_recommendation_event_tables.py | Add recommendation request/impression/interaction tables |
| alembic/versions/20260606_01_add_seed_table.py | Add durable seeds table migration |
| alembic/versions/20260602_01_add_blog_acceptance_status.py | Split acceptance vs crawl execution migration |
| alembic/versions/20260423_02_add_blog_business_key_schema.py | Remove deprecated FK rewrites |
| .env.example | Add public/email/SMTP example vars |
Files not reviewed (1)
  • frontend/package-lock.json: Language not supported

Comment thread shared/config.py Outdated
public_base_url=os.getenv("HEYBLOG_PUBLIC_BASE_URL", "http://127.0.0.1:3000").rstrip("/"),
email_provider=os.getenv("HEYBLOG_EMAIL_PROVIDER", "disabled").strip().lower() or "disabled",
email_from=os.getenv("HEYBLOG_EMAIL_FROM", "").strip(),
email_dev_expose_tokens=_parse_bool_env("HEYBLOG_EMAIL_DEV_EXPOSE_TOKENS", default=True),
Comment thread docker-compose.yml
Comment thread .env.example
@ligeaaa ligeaaa merged commit 05e97ab into main Jun 11, 2026
2 checks passed