Skip to content

feat(teams): persist aadGroupId to Mongo for parity with other platforms#211

Merged
alan5543 merged 3 commits into
mainfrom
feat/teams-aadgroupid-persistence
Jun 1, 2026
Merged

feat(teams): persist aadGroupId to Mongo for parity with other platforms#211
alan5543 merged 3 commits into
mainfrom
feat/teams-aadgroupid-persistence

Conversation

@alan5543
Copy link
Copy Markdown
Member

@alan5543 alan5543 commented Jun 1, 2026

Recreated from #208, which GitHub auto-closed when its base branch (feat/teams-graph-ingestion) was deleted after #206 merged. Branch has been rebased onto main; the diff is identical to the reviewed #208.

Why

In the screenshot the user reported, every other platform shows up in the sidebar (Discord 4/13, Mattermost 1/11, Slack 1/2, Discord-TRIP 1/3) but the Teams workspace is gone, and the Manage Channels modal for Teams shows "No channels found". Investigation showed it's not a UI filter at all — listChannels returned [] because the bot had no aadGroupId to query Microsoft Graph with.

Platform listChannels source Survives Redis loss / bot restart?
Slack bot token + conversations.list ✅ token in Mongo
Discord bot token + /users/@me/guilds then /guilds/{id}/channels ✅ token in Mongo
Mattermost bot token + /users/me/teams//channels ✅ token in Mongo
Teams (before) aadGroupId observed from Bot Framework webhook, cached in chat-adapter Redis only ❌ Redis wipe = identity gone until next webhook; 30-day TTL on the cache
Teams (this PR) aadGroupId observed from webhook, persisted to Mongo via fire-and-forget POST; bot startup seeds the in-memory Map from Mongo ✅ identical to the other three

PR #207 (Redis AOF + persistent volume) reduces the failure window for dev compose but doesn't fix the asymmetry: prod managed Redis already has AOF, and we'd still fail on a TTL expiry, an adapter rebuild before the next webhook, or an explicit cache flush. Mongo persistence is the actual parity fix.

What

  1. Pydantic model (PlatformConnection.teams_known_team_ids: list[str], default []). No DB migration — legacy docs decode through the default.
  2. Store: add_teams_known_team_id(connection_id, aad_group_id) upserts via $addToSet so concurrent webhooks can't double-insert.
  3. Internal API:
    • GET /api/internal/connections/credentials now returns teams_known_team_ids so the bot's startup loader sees it.
    • POST /api/internal/connections/{id}/teams-known-team-ids accepts a validated AAD group GUID, 404s on unknown connection, 422 when the connection isn't a Teams row, 400 on a malformed GUID. Behind the existing require_bridge gate.
  4. Bot bridge (bot/src/bridge.ts):
    • New exported seedTeamsKnownTeamIds(connId, ids[]) for startup hydration. Also flips teamsColdStartScanned[connId] so a stale Redis cache can't race the hydrated state.
    • recordTeamsConversation fires a fire-and-forget POST whenever a NEW aadGroupId arrives (dedup in-memory before queueing). Backend dedups again via $addToSet. 5s timeout, swallowed errors so a slow backend can't block the webhook handler.
  5. Bot startup (bot/src/index.ts): syncConnectionsFromBackend calls seedTeamsKnownTeamIds after registering each Teams adapter.

Test plan

  • 9/9 new bot unit tests (bridge.teams-persistence.test.ts): seed hydrates / no-op on empty / per-connection scoped / idempotent; write-through POSTs once / dedups same id / re-fires different id / rejects malformed GUID / no-op when aadGroupId absent.
  • 88/88 bot bridge tests pass (was 79).
  • 14/14 tests/test_platform_store.py tests pass: default [], round-trip preserves ids, legacy docs decode, $addToSet operator + updated_at touch asserted, returns None on missing connection.
  • bot npm run lint: 0 errors. tsc --noEmit clean.
  • Python ruff check clean on all three changed files.
  • Live restart drill (operator, post-merge):
    1. Post one Teams message in any installed channel so the bot observes an aadGroupId.
    2. Confirm Mongo platform_connections row has the id in teams_known_team_ids.
    3. docker compose down -v && docker compose up -d.
    4. Confirm GET /api/connections/{teams_id}/channels returns the channels with NO further webhook.

Behavior change

  • New field on PlatformConnection; legacy rows decode cleanly via Pydantic default.
  • New internal endpoint behind require_bridge.
  • Bot's startup connection-sync log now reports seeded N known team id(s) for Teams connection ... when ids are hydrated.

🤖 Generated with Claude Code

alan5543 and others added 2 commits June 1, 2026 00:08
Other platforms (Slack/Discord/Mattermost) bootstrap channel listings
from a bot token stored in Mongo — identity survives bot restart and
Redis loss without an inbound webhook. Teams was the outlier: there is
no app-only Microsoft Graph endpoint to enumerate "teams this app is
installed in", so the team's AAD group id was observed from Bot
Framework activities and cached ONLY in the chat-adapter's Redis
(`chat-sdk:cache:teams:channelContext:*`). A Redis container restart or
30-day cache TTL erased that identity and the Teams workspace vanished
from the sidebar until the next inbound webhook reseeded it.

This change makes Teams self-bootstrap exactly like the others:

  1. Pydantic model: new `teams_known_team_ids: list[str]` on
     `PlatformConnection` (default `[]`). No DB migration — legacy docs
     decode through the default.

  2. Store: `add_teams_known_team_id` upserts via `$addToSet` so
     concurrent writes from multiple webhook deliveries can't double-
     insert without holding a lock.

  3. Internal API: the existing
     `GET /api/internal/connections/credentials` now returns
     `teams_known_team_ids` so the bot's startup loader sees them. A new
     `POST /api/internal/connections/{id}/teams-known-team-ids` accepts
     a validated AAD group GUID and persists it. Both behind the
     existing `require_bridge` gate.

  4. Bot bridge: `recordTeamsConversation` now fires a fire-and-forget
     POST whenever a NEW aadGroupId arrives for a connection (dedup
     in-memory before queueing). Backend dedups again via `$addToSet`.

  5. Bot startup: `syncConnectionsFromBackend` calls a new exported
     `seedTeamsKnownTeamIds(connId, ids)` after registering each Teams
     adapter, hydrating the in-memory `teamsKnownTeamIds` Map straight
     from Mongo. `seedTeamsKnownTeamIds` also flips
     `teamsColdStartScanned` for that connection so the Redis scan path
     can't race the hydrated state.

Result: a `docker compose down -v && up -d` (or any Redis/bot restart)
returns the Teams workspace + channels to the sidebar on first
listChannels — no webhook required. Existing connections benefit on
their FIRST inbound activity after deploy: the write-through fires,
Mongo gets the team-id, every subsequent restart hydrates cleanly.

Verified:
- 9/9 new bot unit tests (bridge.teams-persistence.test.ts): seed
  hydrates / no-op on empty / per-connection scoped / idempotent;
  write-through POSTs once / dedups same id / re-fires different id /
  rejects malformed GUID / no-op when aadGroupId absent.
- 88/88 bot bridge tests pass (was 79).
- 14/14 platform_store tests pass: default `[]`, round-trip preserves
  ids, legacy docs decode, `$addToSet` operator + `updated_at` touch
  asserted, returns None on missing connection.
- bot lint: 0 errors. tsc --noEmit clean.
- Python ruff clean on all three changed files.

Constraint: Bot Framework hands the team identity via webhooks only;
  no app-only Graph endpoint for "list installed teams" exists.
Constraint: The fire-and-forget POST must not block webhook
  processing — 5s timeout + caught errors + dedup.
Rejected: Reading the chat-adapter Redis cache as primary source |
  cache TTL (30d) and restart-fragility was the original bug.
Rejected: Migrating PlatformConnection schema | Pydantic default `[]`
  covers legacy rows, no down-revision needed.
Confidence: high
Scope-risk: narrow — additive (model field, one store method, one new
  endpoint, one bridge export, one startup seed call).
Directive: `seedTeamsKnownTeamIds` MUST also flip
  `teamsColdStartScanned[connectionId]` so a stale Redis cache can't
  race the hydrated Map state. Do not separate these without re-
  testing the restart drill.
Not-tested: Live `docker compose down -v && up -d` drill — requires
  rebuilt bot + backend images on a live stack; deferred to merge
  validation by the operator. The unit test surface covers the
  helpers; the smoke test plan is in the PR body.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cold-start Redis scan in `resolveTeamIds` pre-populates the
in-memory `teamsKnownTeamIds` Map when an existing connection has
channelContext entries cached from a prior run. After that, every
subsequent webhook short-circuits the write-through in
`recordTeamsConversation` (because its dedup checks the in-memory
state), so an existing connection upgraded to this PR would never
seed `teams_known_team_ids` in Mongo and bot-restart-survives-Redis-
wipe never engages.

Fix: when the cold-start scan adds a NEW id to the Map, also fire the
fire-and-forget write-through. Backend `$addToSet` keeps it
idempotent. The new-connection path is unchanged — first webhook
still seeds Mongo via the existing `recordTeamsConversation` branch.

Verified:
- 88/88 bridge tests still pass (no behavior change for new connections).
- tsc --noEmit clean.

Confidence: high
Scope-risk: narrow — single block inside the existing scan loop.
Directive: keep the in-memory `teamSet.add(tId)` AFTER the `wasNew`
  check; reversing the order silently never persists.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread bot/src/bridge.ts Fixed
Two CI fixes surfaced by the first full-suite run on this branch (it only
got smoke tests while stacked on #206):

- ruff format: connections.py had a 3-line wrap ruff collapses to 1
- CodeQL js/tainted-format-string (HIGH): the persist-failure warn passed
  an interpolated template string as console.warn's first arg alongside a
  second arg — Node treats arg[0] as a printf format string when more args
  follow, so a connectionId containing %s/%j would hijack substitution.
  Switched to the constant-format-string + %s args pattern already used
  elsewhere in this file (e.g. connection route error logging).

Confidence: high
Scope-risk: narrow

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alan5543 alan5543 merged commit 94d06b7 into main Jun 1, 2026
9 checks passed
@alan5543 alan5543 deleted the feat/teams-aadgroupid-persistence branch June 1, 2026 04:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants