BigSet

Live, queryable datasets that update automatically.


Think of it like a spreadsheet that fills itself in — you describe the dataset you want (YC companies currently hiring, insurance quotes in your area, restaurants serving a specific brand), and BigSet builds it, keeps it fresh, and lets you query it with SQL.

Built on TinyFish APIs.

✨ Why BigSet?

At the end of the day, the only thing that matters is data. Every decision, every agent, every product — it all comes down to having the right data at the right time.

So what if you could just… ask for it? Describe the dataset you want — in plain English — and have it built, structured, and kept fresh automatically. No scrapers to maintain. No pipelines to babysit. No waking up to broken cron jobs because some site changed a div.

You describe it. BigSet collects it. Your agents query it with SQL. It stays up to date on your schedule — every 30 minutes, every hour, whatever you need. And if something breaks, a healer agent patches it before you even notice.

Any dataset. Any source. Always fresh. That's the idea.


🚀 Quick Start

Prerequisites: Docker, Make, Node.js with npm (for the npx command in step 5), and a free Clerk account

1. Clone and set up Clerk

git clone https://github.com/tinyfish-io/bigset.git
cd bigset

Create a Clerk application at dashboard.clerk.com, then go to JWT Templates and enable the Convex template.

2. Configure env files

# Root .env — used by Docker for the frontend container
cp .env.example .env
# Fill in NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY and CLERK_SECRET_KEY

# Frontend .env.local — used by Next.js and Convex CLI
cp frontend/.env.example frontend/.env.local
# Fill in all three Clerk values (publishable key, secret key, and JWT issuer domain)

Optional: to enable PostHog product analytics, session replay, and error tracking, set NEXT_PUBLIC_POSTHOG_KEY and NEXT_PUBLIC_POSTHOG_HOST. Leave both blank to disable analytics; the app then treats every event as a no-op.
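A blank Clerk key is an easy mistake to make when copying the example files. The sketch below shows one way to check dotenv-style text for required keys that are missing or empty. The helper names (parseEnv, missingEnvKeys) and the sample value are illustrative and not part of the repository; only the two root .env key names come from the step above.

```typescript
// Minimal dotenv-style parser: KEY=VALUE lines, '#' comments, optional quotes.
// Illustrative helper, not part of the BigSet repository.
function parseEnv(text: string): Map<string, string> {
  const vars = new Map<string, string>();
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "" || line.startsWith("#")) continue;
    const eq = line.indexOf("=");
    if (eq <= 0) continue; // not a KEY=VALUE line
    const key = line.slice(0, eq).trim();
    let value = line.slice(eq + 1).trim();
    // Strip one pair of matching surrounding quotes, if present.
    const first = value.charAt(0);
    if (value.length >= 2 && (first === '"' || first === "'") && value.endsWith(first)) {
      value = value.slice(1, -1);
    }
    vars.set(key, value);
  }
  return vars;
}

// Returns the required keys that are absent or empty.
function missingEnvKeys(text: string, required: string[]): string[] {
  const vars = parseEnv(text);
  return required.filter((key) => (vars.get(key) ?? "") === "");
}

// Sample root .env contents with a placeholder value and one blank key.
const rootEnv = [
  "# Root .env",
  "NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY=pk_test_example",
  "CLERK_SECRET_KEY=",
].join("\n");

const missing = missingEnvKeys(rootEnv, [
  "NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY",
  "CLERK_SECRET_KEY",
]);
console.log(missing); // lists CLERK_SECRET_KEY, the key still left blank
```

To run the same check against real files, read the contents of .env or frontend/.env.local and pass in the key names from the matching .env.example.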

3. Start everything

make dev

This starts all Docker services, waits for Convex to be healthy, and deploys Convex functions automatically.
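The "wait until healthy, then deploy" ordering matters because deploying functions to a backend that is not accepting connections yet would fail. How the Makefile implements the wait is not shown here, so the following is only a generic sketch of the pattern: poll a health check with a bounded number of attempts. The function name waitUntilHealthy, its parameters, and the stand-in check are assumptions made for illustration.

```typescript
// Polls an arbitrary health check until it succeeds or attempts run out.
// `check`, `attempts`, and `delayMs` are hypothetical names for illustration.
async function waitUntilHealthy(
  check: () => Promise<boolean>,
  attempts = 30,
  delayMs = 1000,
): Promise<boolean> {
  for (let i = 0; i < attempts; i++) {
    try {
      if (await check()) return true;
    } catch {
      // Treat a thrown error (e.g. connection refused) as "not ready yet".
    }
    if (i < attempts - 1) {
      // Pause between attempts, but not after the last one.
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
  return false;
}

// Usage with a stand-in check that becomes healthy on the third call.
let calls = 0;
const fakeCheck = async (): Promise<boolean> => ++calls >= 3;

waitUntilHealthy(fakeCheck, 5, 10).then((healthy) => {
  console.log(healthy ? "service is up; safe to deploy functions" : "gave up waiting");
});
```

Bounding the attempts means a service that never comes up produces a clear failure instead of a hang.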

4. Generate Convex admin key (first time only)

docker compose exec convex ./generate_admin_key.sh

Paste the output into frontend/.env.local as CONVEX_SELF_HOSTED_ADMIN_KEY, then re-run make dev.

5. Load curated public datasets

The landing page and the dashboard's "Curated" section read from a set of 9 system-owned datasets. Load them with:

cd frontend
npx convex run publicSeed:seedPublicDatasets

The script is idempotent — rerunning it skips datasets that already exist (matched by a stable seedKey, so renaming a curated dataset never creates a duplicate). To add a 10th curated dataset, append it to PUBLIC_DATASETS in frontend/convex/publicSeed.ts with a fresh seedKey and rerun the command. To replace existing curated content in place, pass force: true:

npx convex run publicSeed:seedPublicDatasets '{"force":true}'
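The skip-or-replace behavior described above can be modeled in a few lines. This is an in-memory sketch of idempotent seeding keyed by a stable seedKey, not the code in frontend/convex/publicSeed.ts and not Convex's API. The SeedDataset shape, the seedDatasets function, and the sample entries are all hypothetical.

```typescript
// In-memory model of idempotent seeding keyed by a stable seedKey.
// All types and names are hypothetical; the real logic lives in
// frontend/convex/publicSeed.ts and writes to Convex, which this sketch does not.
interface SeedDataset {
  seedKey: string; // stable identifier, never changes
  name: string; // display name, free to change
}

type SeedResult = { created: string[]; replaced: string[]; skipped: string[] };

function seedDatasets(
  store: Map<string, SeedDataset>,
  seeds: SeedDataset[],
  force = false,
): SeedResult {
  const result: SeedResult = { created: [], replaced: [], skipped: [] };
  for (const seed of seeds) {
    const exists = store.has(seed.seedKey);
    if (exists && !force) {
      result.skipped.push(seed.seedKey); // already present: leave untouched
      continue;
    }
    store.set(seed.seedKey, { ...seed }); // insert, or overwrite in place
    (exists ? result.replaced : result.created).push(seed.seedKey);
  }
  return result;
}

const store = new Map<string, SeedDataset>();
const original: SeedDataset[] = [{ seedKey: "example-a", name: "Example A" }];
const renamed: SeedDataset[] = [{ seedKey: "example-a", name: "Example A (renamed)" }];

const first = seedDatasets(store, original); // creates example-a
const second = seedDatasets(store, renamed); // same seedKey: skipped, no duplicate
const forced = seedDatasets(store, renamed, true); // force: replaced in place
console.log(first, second, forced, store.size);
```

Matching on seedKey rather than on the display name is what lets a rename rerun safely: the store still holds one entry per key, and only the forced run changes its contents.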

Open localhost:3500 and click Get started to sign in.

Note: Backend env needs no setup — backend/.env.example has correct defaults. If you edit Convex functions in frontend/convex/, run make convex-push to deploy the changes.


🛠 Tech Stack

| Layer | Tech |
| --- | --- |
| Frontend | Next.js 16, React 19, Tailwind 4 |
| Backend | Fastify, TypeScript (agent runner) |
| Auth | Clerk |
| Database | Convex (self-hosted) |
| Data Collection | TinyFish APIs (Search, Fetch, Browser) |
| Table view | TanStack Table + react-window virtualization |
| Analytics | PostHog: events, session replay, error tracking (optional) |

📁 Project Structure

bigset/
├── frontend/            Next.js 16 — UI + Convex schema & functions
│   ├── convex/          Convex functions, schema, and auth config
│   └── .env.local       Clerk + Convex keys (not committed)
├── backend/             Fastify — agent runner, writes to Convex via HTTP
├── .env                 Clerk keys for docker-compose (not committed)
├── docker-compose.dev.yml
└── Makefile

🏗 Building in Public

BigSet is a work in progress. We're building in the open because the best ideas come from the people who actually want to use the thing.

We'd love your feedback, ideas, or help building. Come say hi.

🤝 Contributing

Contributions are very welcome — whether it's code, feedback, or just telling us what datasets you'd want to build.

  1. Fork the repo
  2. Create a branch (git checkout -b my-feature)
  3. Make your changes
  4. Run bash scripts/verify-authz.sh to confirm the authorization layer still holds
  5. Open a PR

If you're not sure where to start, open an issue or come say hi.

📄 License

AGPL-3.0
