# Design: Self-Service Data Upload (Issue #86)

**Date:** 2026-02-24
**Author:** Claude Code

---

## Overview

Allow admin and IR users to upload institutional data files directly from the dashboard without
needing direct database or server access. Two upload paths: course enrollment CSVs (end-to-end
to Postgres) and PDP cohort/AR files (to Supabase Storage + GitHub Actions ML pipeline trigger).

---

## Scope

**In scope:**
- Course enrollment CSV → `course_enrollments` Postgres table (upsert)
- PDP Cohort CSV / PDP AR (.xlsx) → Supabase Storage + GitHub Actions `repository_dispatch`
- Preview step (first 10 rows + column validation) before commit
- Role guard: `admin` and `ir` only

**Out of scope:**
- Upload history log (future issue)
- Column remapping UI (columns must match the known schema)
- ML experiment tracking / MLflow (future issue)
- Auto-triggering the ML pipeline without a server (GitHub Actions is the trigger mechanism)

---

## Pages & Routing

**New page:** `codebenders-dashboard/app/admin/upload/page.tsx`

**Role guard:** Add to `ROUTE_PERMISSIONS` in `lib/roles.ts`:
```ts
{ prefix: "/admin", roles: ["admin", "ir"] },
{ prefix: "/api/admin", roles: ["admin", "ir"] },
```
Middleware already enforces this pattern via the `x-user-role` header — no other auth code is needed.

**Nav link:** Add "Upload Data" to `nav-header.tsx`, visible only to admin/ir roles.

**New API routes:**
- `POST /api/admin/upload/preview` — parse a sample of the file, return the first 10 rows + a validation summary
- `POST /api/admin/upload/commit` — full ingest (course → Postgres; PDP/AR → Storage + Actions)

---

## UI Flow (3 States)

### State 1 — Select & Drop
- Dropdown: file type (`Course Enrollment CSV` | `PDP Cohort CSV` | `PDP AR File (.xlsx)`)
- Drag-and-drop zone (click to pick; `.csv` for course/cohort, `.csv` + `.xlsx` for AR)
- "Preview" button → calls `/api/admin/upload/preview`

### State 2 — Preview
- Shows: detected file type, estimated row count, first 10 rows in a table
- Validation banner: lists missing required columns or warnings
- "Confirm & Upload" → calls `/api/admin/upload/commit`
- "Back" link to return to State 1

### State 3 — Result
- Course enrollments: `{ inserted, skipped, errors[] }` summary card
- PDP/AR: "File accepted — ML pipeline queued in GitHub Actions" + link to the Actions run
- "Upload another file" resets to State 1
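The three states above form a small state machine. A minimal sketch of the allowed transitions, assuming the page tracks a single `step` value (the `Step` type and `canTransition` helper are illustrative names, not the final implementation):

```typescript
type Step = "select" | "preview" | "result";

// Allowed transitions: Preview moves forward, "Back" returns from
// preview, and "Upload another file" resets from result.
const TRANSITIONS: Record<Step, Step[]> = {
  select: ["preview"],
  preview: ["result", "select"],
  result: ["select"],
};

function canTransition(from: Step, to: Step): boolean {
  return TRANSITIONS[from].includes(to);
}
```

Keeping the transitions in one table makes it harder for the UI to reach an inconsistent state (e.g. showing a result card before a preview was confirmed).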

---

## API Routes

### `POST /api/admin/upload/preview`

**Input:** `multipart/form-data` with `file` and `fileType` fields

**Logic:**
1. Parse the first 50 rows with `csv-parse` (CSV) or `xlsx` (Excel)
2. Validate that the required columns exist for the given `fileType`
3. Return `{ columns, sampleRows (first 10), rowCount (estimated), warnings[] }`
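The column check in step 2 might look like the following sketch. The required-column lists come from the schemas later in this document; `REQUIRED_COLUMNS`, the file-type keys, and `validateColumns` are illustrative names:

```typescript
// Required columns per file type (from the "Required Column Schemas" section).
const REQUIRED_COLUMNS: Record<string, string[]> = {
  course: ["student_guid", "course_prefix", "course_number", "academic_year", "academic_term"],
  pdpCohort: ["Institution_ID", "Cohort", "Student_GUID", "Cohort_Term"],
  pdpAr: ["Institution_ID", "Cohort", "Student_GUID"],
};

// Returns the list of required columns missing from the parsed header row.
function validateColumns(fileType: string, columns: string[]): string[] {
  const required = REQUIRED_COLUMNS[fileType] ?? [];
  const present = new Set(columns.map((c) => c.trim()));
  return required.filter((c) => !present.has(c));
}
```

An empty return value means the preview can proceed; a non-empty one feeds the validation banner in State 2.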

### `POST /api/admin/upload/commit`

**Input:** Same multipart form

**Course enrollment path:**
1. Stream-parse the full CSV with the `csv-parse` async iterator
2. Batch-upsert 500 rows at a time into `course_enrollments` via `pg`
3. Conflict target: `(student_guid, course_prefix, course_number, academic_term)`
4. Return `{ inserted, skipped, errors[] }`
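Steps 2-3 can be sketched as a pure batching helper plus a parameterized upsert builder. The helper names are illustrative, and the column list is abbreviated to the required columns:

```typescript
// Split parsed rows into batches of 500 for the upsert loop.
function chunk<T>(rows: T[], size = 500): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < rows.length; i += size) out.push(rows.slice(i, i + size));
  return out;
}

const COLS = ["student_guid", "course_prefix", "course_number", "academic_year", "academic_term"];

// Build a parameterized multi-row INSERT ... ON CONFLICT statement for one
// batch; the flattened row values are passed separately to pg's query().
function buildUpsert(batchSize: number): string {
  const values = Array.from({ length: batchSize }, (_, r) =>
    `(${COLS.map((_, c) => `$${r * COLS.length + c + 1}`).join(", ")})`
  ).join(", ");
  return (
    `INSERT INTO course_enrollments (${COLS.join(", ")}) VALUES ${values} ` +
    `ON CONFLICT (student_guid, course_prefix, course_number, academic_term) ` +
    `DO UPDATE SET academic_year = EXCLUDED.academic_year`
  );
}
```

Using numbered placeholders rather than string interpolation keeps the upsert safe against injection from uploaded CSV values.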

**PDP/AR path:**
1. Upload the file to the Supabase Storage bucket `pdp-uploads` via `@supabase/supabase-js`
2. Call the GitHub API `POST /repos/{owner}/{repo}/dispatches` with:
   ```json
   { "event_type": "ml-pipeline", "client_payload": { "file_path": "<storage-path>" } }
   ```
3. Return `{ status: "processing", actionsUrl: "https://github.com/{owner}/{repo}/actions" }`
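The dispatch request in step 2 can be sketched as below. Note one assumption: dispatching from the Next.js API requires a token with `repo` scope supplied to the route (the auto-provided `GITHUB_TOKEN` only exists inside a workflow run, so a PAT or GitHub App token must be configured server-side). The builder name is illustrative:

```typescript
// Build the repository_dispatch request; the caller passes a server-side
// token (a PAT or app token -- NOT the workflow-only GITHUB_TOKEN).
function buildDispatch(owner: string, repo: string, filePath: string, token: string) {
  return {
    url: `https://api.github.com/repos/${owner}/${repo}/dispatches`,
    init: {
      method: "POST",
      headers: {
        Accept: "application/vnd.github+json",
        Authorization: `Bearer ${token}`,
      },
      body: JSON.stringify({
        event_type: "ml-pipeline",
        client_payload: { file_path: filePath },
      }),
    },
  };
}
// Usage in the commit route: const { url, init } = buildDispatch(...); await fetch(url, init);
```

GitHub returns `204 No Content` on a successful dispatch, so the route reports success without a run ID and links to the Actions list instead.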

**Role enforcement:** Read the `x-user-role` header (set by middleware); return 403 if not admin/ir.
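A minimal sketch of that check, assuming the role string has already been read from the `x-user-role` header; `isAuthorized` is an illustrative name:

```typescript
const ALLOWED_ROLES = new Set(["admin", "ir"]);

// Missing header (null) is treated the same as an unauthorized role.
function isAuthorized(role: string | null): boolean {
  return role !== null && ALLOWED_ROLES.has(role);
}
// In a route handler:
//   const role = req.headers.get("x-user-role");
//   if (!isAuthorized(role)) return new Response("Forbidden", { status: 403 });
```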

---

## GitHub Actions Workflow

**File:** `.github/workflows/ml-pipeline.yml`

**Trigger:** `repository_dispatch` with `event_type: ml-pipeline`

**Steps:**
1. Check out the repo
2. Set up Python with `venv`
3. Install dependencies (`pip install -r requirements.txt`)
4. Download the uploaded file from Supabase Storage using the `SUPABASE_SERVICE_KEY` secret
5. Run `venv/bin/python ai_model/complete_ml_pipeline.py --input <downloaded-file-path>`
6. Upload `ML_PIPELINE_REPORT.txt` as a GitHub Actions artifact (retained 90 days)

**Required secrets:** `SUPABASE_URL`, `SUPABASE_SERVICE_KEY` (`GITHUB_TOKEN` is auto-provided inside the workflow and does not need to be configured)
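The steps above might map to a workflow along these lines. This is a sketch, not the final file: the job name, action versions, and the `curl` download command are assumptions, and the Supabase Storage endpoint shown is the authenticated object-download route:

```yaml
# Sketch of .github/workflows/ml-pipeline.yml (illustrative, not final)
name: ML Pipeline
on:
  repository_dispatch:
    types: [ml-pipeline]

jobs:
  run-pipeline:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: |
          python -m venv venv
          venv/bin/pip install -r requirements.txt
      - name: Download input from Supabase Storage
        env:
          SUPABASE_URL: ${{ secrets.SUPABASE_URL }}
          SUPABASE_SERVICE_KEY: ${{ secrets.SUPABASE_SERVICE_KEY }}
        run: |
          # Fetch the object named in the dispatch payload
          curl -fsS \
            -H "apikey: $SUPABASE_SERVICE_KEY" \
            -H "Authorization: Bearer $SUPABASE_SERVICE_KEY" \
            "$SUPABASE_URL/storage/v1/object/pdp-uploads/${{ github.event.client_payload.file_path }}" \
            -o input-file
      - name: Run ML pipeline
        run: venv/bin/python ai_model/complete_ml_pipeline.py --input input-file
      - uses: actions/upload-artifact@v4
        with:
          name: ml-pipeline-report
          path: ML_PIPELINE_REPORT.txt
          retention-days: 90
```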

---

## Required Column Schemas

### Course Enrollment CSV
Must include: `student_guid`, `course_prefix`, `course_number`, `academic_year`, `academic_term`
Optional: all other `course_enrollments` columns; filled as NULL if absent

### PDP Cohort CSV
Must include: `Institution_ID`, `Cohort`, `Student_GUID`, `Cohort_Term`

### PDP AR File (.xlsx)
Must include: `Institution_ID`, `Cohort`, `Student_GUID` (only the first sheet is parsed)

---

## New Packages

| Package | Purpose |
|---------|---------|
| `csv-parse` | Streaming CSV parsing (async iterator mode) |
| `xlsx` | Excel (.xlsx) parsing |

---

## New Files

| File | Purpose |
|------|---------|
| `codebenders-dashboard/app/admin/upload/page.tsx` | Upload UI page |
| `codebenders-dashboard/app/api/admin/upload/preview/route.ts` | Preview API route |
| `codebenders-dashboard/app/api/admin/upload/commit/route.ts` | Commit API route |
| `.github/workflows/ml-pipeline.yml` | GitHub Actions ML pipeline workflow |

---

## Supabase Changes

**Storage bucket:** Create a `pdp-uploads` bucket (private, authenticated access only).
No new database migrations are required — the `course_enrollments` table already exists.

**Bucket policy:** Only the service role key can read/write. Signed URLs are used for the pipeline download.

---

## Constraints & Known Limitations

- Triggering the ML pipeline via GitHub Actions means a ~30-60 s delay before the pipeline starts
- Vercel enforces a 4.5 MB request body limit on serverless functions — large files should use direct upload to Supabase Storage in a future iteration
- No upload history log in this version (deferred)
- Column remapping is out of scope — files must match the known schema