Commit 6128310

jayana-cpc and PineND authored
allow private bucket access w/ cloudflare auth (#1098)
* cloudflare backups worker
* cloudflare backups worker w/ private bucket access
* clarified docs
* fix backups-worker TypeScript types
* verify Cloudflare Access JWT for private backups and add public backup notice
* change from allowlist to blocklist
* update docs

---------

Co-authored-by: Pine Nguyen <pinenguyen@berkeley.edu>
1 parent f0a8903 commit 6128310

13 files changed

Lines changed: 1975 additions & 8 deletions

apps/docs/src/core/data/README.md

Lines changed: 9 additions & 0 deletions
````diff
@@ -9,6 +9,15 @@ At its core, Berkeleytime serves as a data aggregation platform. We work directl
 
 Understanding the data sources Berkeleytime has access to is imperative for building streamlined services.
 
+## Backups and Access
+
+Production backups may contain sensitive data:
+
+- Public backups are **redacted** and are not a comprehensive dataset.
+- Full backups require Cloudflare Access.
+
+For details, see [Runbooks](../infrastructure/runbooks.md#fetch-mongo-backups).
+
 ## API Central
 
 The EIS maintains many [RESTful](https://en.wikipedia.org/wiki/REST) APIs that consolidate data from various other sources, and provides documentation in the form of [Swagger OpenAPI v3 specifications](https://swagger.io/specification/) for each API. [API Central](https://developers.api.berkeley.edu/) serves as a portal for requesting access to individual APIs, interactive documentation, and managing API usage. Berkeleytime only has access to and utilizes the APIs necessary for servicing students.
````

apps/docs/src/core/infrastructure/README.md

Lines changed: 4 additions & 0 deletions
````diff
@@ -11,3 +11,7 @@ Software infrastructure refers to the services and tools that create an underlyi
 
 > [!IMPORTANT]
 > We aim to use a **small** set of **existing** infrastructure solutions with large communities. This philosophy reduces the [cognitive load](https://thevaluable.dev/cognitive-load-theory-software-developer/) on each developer and simplifies the onboarding process, both of which are valuable for creating long-lasting software in a team where developers are typically cycled out after only ~4 years.
+
+## Backups
+
+Mongo backups are served from `https://backups.berkeleytime.com`. Download steps live in [Runbooks](./runbooks.md#fetch-mongo-backups).
````

apps/docs/src/core/infrastructure/runbooks.md

Lines changed: 35 additions & 0 deletions
````diff
@@ -22,6 +22,41 @@
 k create job --from cronjob/bt-prod-datapuller-courses bt-prod-datapuller-courses-manual-01
 ```
 
+## Fetch Mongo Backups
+
+Backups are served at `https://backups.berkeleytime.com`:
+
+- Public: `GET /public/*`
+- Private: `GET /private/*`
+
+### Public backup (no auth)
+
+Public backups are meant for local development and include only a redacted subset of the `bt` database. The public backup includes these collections:
+
+- `classes`
+- `courses`
+- `terms`
+- `sections`
+- `gradeDistributions`
+- `enrollmentHistories`
+- `enrollmenttimeframes`
+
+```sh
+curl -f -o "prod_public_backup-YYYYMMDD.gz" \
+"https://backups.berkeleytime.com/public/daily/prod_public_backup-YYYYMMDD.gz"
+```
+
+### Private backup (Cloudflare Access)
+
+```sh
+brew install cloudflare/cloudflare/cloudflared
+cloudflared access login https://backups.berkeleytime.com
+
+cloudflared access curl \
+"https://backups.berkeleytime.com/private/hourly/prod_backup-YYYYMMDDHH.gz" \
+-o "prod_backup-YYYYMMDDHH.gz"
+```
+
 ## Deploying a New Environment Variable with sealed-secrets
 
 Useful when adding new environment variables to `.env`. To ensure our env variables can be deployed to GitHub without their true value being leaked, they should be encrypted before being pushed to GitHub.
````
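
The two download URLs added to the runbook differ only in route, cadence, and timestamp width. The sketch below makes that pattern explicit. It is illustrative only: `backup_url` and `BACKUPS_HOST` are hypothetical names that do not appear in this commit, and the digit checks assume the `YYYYMMDD` and `YYYYMMDDHH` placeholders stand for plain digit strings.

```sh
#!/bin/sh
# Hypothetical helper, not part of the repository.
BACKUPS_HOST="https://backups.berkeleytime.com"

# backup_url KIND STAMP
#   public  + YYYYMMDD   -> daily redacted backup (no auth)
#   private + YYYYMMDDHH -> hourly full backup (Cloudflare Access required)
backup_url() {
  kind="$1"
  stamp="$2"
  case "$kind" in
    public)
      # Accept exactly eight digits.
      case "$stamp" in
        [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;
        *) echo "expected YYYYMMDD, got: $stamp" >&2; return 1 ;;
      esac
      echo "$BACKUPS_HOST/public/daily/prod_public_backup-$stamp.gz"
      ;;
    private)
      # Accept exactly ten digits.
      case "$stamp" in
        [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;
        *) echo "expected YYYYMMDDHH, got: $stamp" >&2; return 1 ;;
      esac
      echo "$BACKUPS_HOST/private/hourly/prod_backup-$stamp.gz"
      ;;
    *)
      echo "unknown kind: $kind" >&2
      return 1
      ;;
  esac
}
```

With such a helper, the public download could be written as `curl -f -o backup.gz "$(backup_url public 20250101)"`, and the private one as `cloudflared access curl "$(backup_url private 2025010113)" -o backup.gz`.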

apps/docs/src/getting-started/local-development.md

Lines changed: 2 additions & 1 deletion
````diff
@@ -90,7 +90,8 @@ A seeded database is required for some pages on the frontend.
 docker compose up -d
 
 # Download the data
-curl -f -o "prod-backup.gz" "https://backups.berkeleytime.com/daily/prod_public_backup-$(TZ=America/Los_Angeles date -v -6H +%Y%m%d).gz"
+curl -f -o "prod-backup.gz" "https://backups.berkeleytime.com/public/daily/prod_public_backup-$(TZ=America/Los_Angeles date -v -6H +%Y%m%d).gz"
+printf "\033[33mNotice: Public backups are redacted and are not a comprehensive dataset. Use private backups (Cloudflare Access required) for full data.\033[0m\n"
 
 # Copy the data and restore
 docker cp ./prod-backup.gz berkeleytime-mongodb-1:/tmp/prod-backup.gz
````
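
The `date -v -6H` adjustment in the download command is BSD/macOS syntax; GNU `date`, the default on most Linux distributions, rejects `-v`. A possible portable variant is sketched below. `backup_date` is a hypothetical name, and the GNU fallback assumes a `date` that accepts `-d '6 hours ago'`.

```sh
#!/bin/sh
# Hypothetical helper: prints the YYYYMMDD stamp for "six hours ago" in
# Pacific time, mirroring the expression used in the download command.
backup_date() {
  # Try the BSD/macOS form first; GNU date exits non-zero on -v,
  # in which case fall back to the GNU relative-date form.
  TZ=America/Los_Angeles date -v -6H +%Y%m%d 2>/dev/null ||
    TZ=America/Los_Angeles date -d '6 hours ago' +%Y%m%d
}

# Print the URL the download step would request (no network access here).
echo "https://backups.berkeleytime.com/public/daily/prod_public_backup-$(backup_date).gz"
```

The printed URL can then be passed to `curl -f -o "prod-backup.gz"` exactly as in the documented command.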

infra/base/templates/backup-prod-mongo-public.yaml

Lines changed: 19 additions & 7 deletions
````diff
@@ -39,13 +39,25 @@ spec:
 kubectl exec --namespace=bt \
 "$prod_pod" -- mongodump \
 --db bt \
---collection courses \
---collection classes \
---collection sections \
---collection terms \
---collection gradeDistributions \
---collection enrollmentHistories \
---collection enrollmenttimeframes \
+--excludeCollection users \
+--excludeCollection plan \
+--excludeCollection planTerm \
+--excludeCollection label \
+--excludeCollection majorReq \
+--excludeCollection rating \
+--excludeCollection aggregatedMetrics \
+--excludeCollection schedules \
+--excludeCollection collections \
+--excludeCollection pods \
+--excludeCollection banners \
+--excludeCollection bannerviewcounts \
+--excludeCollection classviewcounts \
+--excludeCollection CuratedClass \
+--excludeCollection clickevents \
+--excludeCollection routeRedirects \
+--excludeCollection semester-roles \
+--excludeCollection staff-members \
+--excludeCollection targetedmessages \
 --archive=/tmp/prod_public_backup.gz \
 --gzip
 
````
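
The switch from `--collection` to `--excludeCollection` changes what happens to collections the template does not mention: under a blocklist, anything not explicitly excluded ends up in the public dump. The sketch below models that selection rule in plain shell, without calling `mongodump`. `exported_collections` and the `newFeature` collection are hypothetical, and the short lists are examples, not the job's real configuration.

```sh
#!/bin/sh
# exported_collections "<all collection names>" "<excluded names>"
# Prints, one per line, the names a blocklist-style dump would include.
# Assumes collection names contain no whitespace.
exported_collections() {
  for name in $1; do
    case " $2 " in
      *" $name "*) ;;        # on the blocklist: skipped
      *) echo "$name" ;;     # everything else is dumped
    esac
  done
}

# "newFeature" is a made-up collection that was never added to the blocklist.
exported_collections "courses classes users newFeature" "users plan"
```

In this example `users` is filtered out while `newFeature` is exported, so a blocklist of this kind probably needs revisiting whenever a collection holding sensitive data is introduced.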
Lines changed: 65 additions & 0 deletions
````diff
@@ -0,0 +1,65 @@
+# backups-worker – accessing backups
+
+This Worker serves Mongo backups from R2 at `https://backups.berkeleytime.com`:
+
+- `GET /public/*` → `prod-mongo-public-backups`
+- `GET /private/*` → `prod-mongo-backups`
+
+Behavior notes:
+
+- Only `GET` and `HEAD` are allowed.
+- `/private/*` requires Cloudflare Access, and the Worker cryptographically verifies `cf-access-jwt-assertion`.
+- Legacy public paths like `/daily/*` still resolve from the public bucket for compatibility.
+
+## Required private-route auth config
+
+Set these Worker variables for JWT verification:
+
+- `CLOUDFLARE_ACCESS_TEAM_DOMAIN` (for example: `your-team.cloudflareaccess.com`)
+- `CLOUDFLARE_ACCESS_AUDIENCE` (Access app AUD tag; comma-separated if you need multiple values)
+
+If either variable is missing or the token is invalid, `/private/*` returns `403`.
+
+## 1. Install `cloudflared`
+
+```bash
+brew install cloudflare/cloudflare/cloudflared
+```
+
+## 2. Fetch a public backup
+
+Public does **not** require authentication:
+
+```bash
+curl -f -o "prod_public_backup-YYYYMMDD.gz" \
+"https://backups.berkeleytime.com/public/daily/prod_public_backup-YYYYMMDD.gz"
+printf "\033[33mNotice: Public backups are redacted and are not a comprehensive dataset. Use private backups (Cloudflare Access required) for full data.\033[0m\n"
+```
+
+Replace `YYYYMMDD` with the date.
+
+## 3. Log in for private backups
+
+Private backups require Cloudflare Access.
+
+```bash
+cloudflared access login https://backups.berkeleytime.com
+```
+
+## 4. Fetch a private backup
+
+After logging in:
+
+```bash
+cloudflared access curl \
+"https://backups.berkeleytime.com/private/hourly/prod_backup-YYYYMMDDHH.gz" \
+-o "prod_backup-YYYYMMDDHH.gz"
+```
+
+For monthly persistent backups:
+
+```bash
+cloudflared access curl \
+"https://backups.berkeleytime.com/private/persistent/prod_backup-YYYYMMDD.gz" \
+-o "prod_backup-YYYYMMDD.gz"
+```
````
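
The routing rules in the Worker README's behavior notes can be summarized as a small decision function. The sketch below is a shell model of the documented behavior only, not the Worker's TypeScript source. `resolve_bucket` is a hypothetical name, `/daily/*` is the only legacy path modeled because it is the only one the README names, and the Cloudflare Access JWT check on `/private/*` is deliberately left out.

```sh
#!/bin/sh
# resolve_bucket METHOD PATH
# Prints the R2 bucket the README says a request is served from,
# or returns non-zero when the request falls outside the documented rules.
resolve_bucket() {
  # Only GET and HEAD are allowed.
  case "$1" in
    GET|HEAD) ;;
    *) echo "method not allowed: $1" >&2; return 1 ;;
  esac
  case "$2" in
    /private/*)
      # The real Worker also verifies cf-access-jwt-assertion here.
      echo "prod-mongo-backups"
      ;;
    /public/*|/daily/*)
      # /daily/* is the legacy public path kept for compatibility.
      echo "prod-mongo-public-backups"
      ;;
    *)
      echo "no documented route for: $2" >&2
      return 1
      ;;
  esac
}

resolve_bucket GET /public/daily/prod_public_backup-YYYYMMDD.gz
resolve_bucket HEAD /daily/prod_public_backup-YYYYMMDD.gz
```

Both example calls print `prod-mongo-public-backups`, which matches the compatibility note: the old `/daily/*` URLs and the new `/public/daily/*` URLs read from the same bucket.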
