
brefsearch-data

Data processing scripts for the brefsearch project.

This repository contains all the scripts and tools needed to extract subtitles from YouTube videos, generate thumbnails and animated previews, and push the data to Algolia for search indexing.

Installation

yarn install

Data Structure

{
  "episode": {
    "basename": "S01E65_brefJaiUnNouveauPote",
    "duration": "2:06",
    "id": "ZbGrzgcO68A",
    "index": 65,
    "name": "Bref. J'ai un nouveau pote.",
    "slug": "brefJaiUnNouveauPote",
    "viewCount": 9350261
  },
  "subtitle": {
    "content": "— Tu sais donc pas que c'est pas bien d'être raciste ?\n— Nom de Zeus, Marty !",
    "index": 13,
    "mostReplayedScore": 3,
    "start": 35
  },
  "media": {
    "height": 720,
    "lqip": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAA[…]XOfz/2JXQfoi2/FXAAAAAElFTkSuQmCC",
    "previewPath": "previews/S01E65_brefJaiUnNouveauPote/035.mp4",
    "thumbnailPath": "thumbnails/S01E65_brefJaiUnNouveauPote/035.png",
    "width": 1280
  }
}
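The media paths are not arbitrary: they combine the episode basename with the subtitle start time, zero-padded to three digits. A minimal sketch of that derivation (the mediaPaths helper is illustrative, not an actual function in this repo):

// Illustrative only: derive media paths from the record fields above.
// "previews/S01E65_brefJaiUnNouveauPote/035.mp4" is the episode basename
// plus the subtitle start time (35s), zero-padded to three digits.
function mediaPaths(episode, subtitle) {
  const timestamp = String(subtitle.start).padStart(3, "0");
  return {
    previewPath: `previews/${episode.basename}/${timestamp}.mp4`,
    thumbnailPath: `thumbnails/${episode.basename}/${timestamp}.png`,
  };
}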

Data Pipeline

Sources

Files in ./data/source/ are the source of truth from which all the data is regenerated when running yarn run data:generate.

  • ./data/source/episodes contains the episode metadata (title, index, YouTube ID). I don't expect this to change, ever.
  • ./data/source/subtitles contains subtitle files (in .vtt format) for all episodes. You're encouraged to update those files if you spot any typos.
  • ./data/source/popularity contains files with popularity metrics (view count and most-replayed heatmap) for each episode. Those change over time and should be updated regularly through yarn run data:update-popularity (see the sketch after this list).
  • ./data/source/images is a symlink to the brefsearch-images repository, expected to be a sibling of this repo. It contains all the static thumbnails and animated previews used in the front-end. Its content should be regenerated with yarn run data:update-images whenever a timing is updated in the subtitles.
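As a rough illustration of what the popularity update involves, assuming the script queries the YouTube Data API v3 with a YOUTUBE_API_KEY environment variable (fetchViewCount is a hypothetical name, and the most-replayed heatmap is not exposed by the official API, so it must come from elsewhere):

// Hypothetical sketch: fetch the current view count for one episode
// through the YouTube Data API v3. Requires a YOUTUBE_API_KEY env var.
async function fetchViewCount(videoId) {
  const key = process.env.YOUTUBE_API_KEY;
  const url = `https://www.googleapis.com/youtube/v3/videos?part=statistics&id=${videoId}&key=${key}`;
  const response = await fetch(url);
  const { items } = await response.json();
  return Number(items[0].statistics.viewCount);
}

// e.g. fetchViewCount("ZbGrzgcO68A") for the episode documented above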

Generated

The files located in ./data/generated are snapshots of the data at a given point in time, generated from the sources above. You should not manually edit those files, as any change will be overwritten by the next run of yarn run data:generate.
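To make the shape of that step concrete, here is a sketch of what generating records could look like (buildRecords and its inputs are illustrative, not the repo's actual code; lqip, width and height are omitted for brevity):

// Illustrative sketch of the generate step: merge one episode's metadata,
// parsed subtitle cues and popularity metrics into records matching the
// data structure documented above.
const pad = (seconds) => String(seconds).padStart(3, "0");

function buildRecords(episode, cues, popularity) {
  return cues.map((cue, index) => ({
    episode: { ...episode, viewCount: popularity.viewCount },
    subtitle: {
      content: cue.content,
      index,
      // assumes the heatmap maps a start time to a replay score
      mostReplayedScore: popularity.heatmap[cue.start] ?? 0,
      start: cue.start,
    },
    media: {
      previewPath: `previews/${episode.basename}/${pad(cue.start)}.mp4`,
      thumbnailPath: `thumbnails/${episode.basename}/${pad(cue.start)}.png`,
    },
  }));
}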

Pushed

  • Running yarn run data:push-algolia will parse all files in ./data/generated, generate an array of records following the data structure documented above, and push them to an Algolia index (a sketch follows this list).
  • Running yarn run data:push-images will push all images to a private server so they can be served by the front-end UI.
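For illustration, pushing records with the Algolia JavaScript client (v5) could look like the following; the credentials, index name and objectID scheme are placeholders, not the project's real values:

// Illustrative sketch of the Algolia push, using placeholder credentials.
import { algoliasearch } from "algoliasearch";

const client = algoliasearch("YOUR_APP_ID", "YOUR_ADMIN_API_KEY");

async function pushRecords(records) {
  // A stable objectID per subtitle line keeps re-pushes idempotent
  const objects = records.map((record) => ({
    objectID: `${record.episode.basename}-${record.subtitle.index}`,
    ...record,
  }));
  await client.saveObjects({ indexName: "brefsearch", objects });
}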

Initial data dump

Before getting the repo into its current state, I had to do a bit of manual processing. Below is a general overview of the process. I don't have scripts for you to redo the same thing automatically, as it was a lot of manual one-off scripting.

I first "manually" downloaded the YouTube videos and audio files using yt-dlp:

# Download videos (for thumbnails/previews)
cd ./tmp/mp4 && yt-dlp "https://www.youtube.com/playlist?list=PLlFikkv2B2ffwYiFQJmcao3RKtw1DFMz5"

# Download audio only (for subtitle generation), converting to mp3
cd ./tmp/mp3 && yt-dlp --extract-audio --audio-format mp3 "https://www.youtube.com/playlist?list=PLlFikkv2B2ffwYiFQJmcao3RKtw1DFMz5"

I named my files following the expected pattern: S01E75_brefJaiToutCasse.mp3.

I uploaded my .mp3 files to HappyScribe to generate .vtt subtitle files and saved them in data/source/subtitles/. I proofread them all by watching the episodes in VLC with the subtitles loaded.
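For reference, .vtt files are plain text with timestamped cues. A minimal example using the subtitle line documented above (the cue identifier and end time are invented for illustration):

WEBVTT

13
00:00:35.000 --> 00:00:39.000
— Tu sais donc pas que c'est pas bien d'être raciste ?
— Nom de Zeus, Marty !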

Related Repositories

  • brefsearch-images, expected to be a sibling of this repo, holds the static thumbnails and animated previews served to the front-end.

License

MIT
