⑆ Bitcrusher API

Bitcrusher is a lightweight REST API that serves FastText word embeddings with full out-of-vocabulary (OOV) support via subword n-gram hashing — the same way FastText does it natively.

Overview

The FastText Common Crawl model (600B tokens) was quantized using IVF+PQ (inverted file index + product quantization) and stored in SQLite. At query time, vectors are reconstructed from PQ codes and IVF centroids in memory — no multi-gigabyte model file, no warmup delay, and a fraction of the RAM footprint of the raw float32 model.
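To make the reconstruction step concrete, here is a minimal sketch. It assumes the common residual form of IVF+PQ, where each vector is stored as a coarse centroid ID plus one codebook index per sub-vector, and decoding adds the selected codebook entries back onto the centroid. The names `PqIndex`, `EncodedVector`, and `reconstruct` are hypothetical, and the service's actual storage layout and decoding may differ.

```typescript
// Hypothetical in-memory layout for illustration; not Bitcrusher's actual schema.
interface PqIndex {
  centroids: number[][];   // IVF coarse centroids, each of length dim
  codebooks: number[][][]; // codebooks[m][k] is a sub-vector of length dim / M
}

interface EncodedVector {
  centroidId: number; // IVF cell the vector was assigned to
  codes: number[];    // one codebook index per sub-quantizer
}

// Decode: start from the coarse centroid, then add each quantized residual chunk.
function reconstruct(index: PqIndex, enc: EncodedVector): number[] {
  const centroid = index.centroids[enc.centroidId];
  if (centroid === undefined) throw new RangeError("unknown centroid id");
  const out = centroid.slice();
  const subDim = out.length / index.codebooks.length;
  enc.codes.forEach((code, m) => {
    const entry = index.codebooks[m]?.[code];
    if (entry === undefined) throw new RangeError("unknown PQ code");
    for (let j = 0; j < subDim; j++) {
      out[m * subDim + j] += entry[j];
    }
  });
  return out;
}
```

With this layout, each stored vector costs one centroid ID plus one small integer per sub-quantizer instead of 300 floats, which is where the memory savings come from.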

OOV vectors are generated using the same FNV-1a subword n-gram algorithm as FastText, so results are consistent with the original implementation.

✱ Features

⚡ Instant Startup: Quantized data loads from SQLite once, then everything is served from memory
🗜️ IVF+PQ Quantization: Dramatically smaller memory footprint than raw float32 vectors
🎯 FastText Common Crawl: 600B-token model, 300-dimensional vectors
🔧 Full OOV Support: Subword n-gram hashing matches FastText's own OOV behavior
📦 Batch Endpoint: Up to 100 words per request

❚ API Endpoints

❙ Health Check

GET /health

Returns the service status. The service reports unhealthy until the quantized data has finished loading at startup.

❙ Single Word Vector

GET /v1/word/{word}

Returns the 300-dimensional vector for a single word. OOV words receive a generated vector via subword hashing.

Constraints: Max 128 characters. The word is lowercased and trailing punctuation is stripped before lookup.

Response:

[0.1234, -0.5678, 0.9012, ...]

Error responses:

400 Bad Request — empty, whitespace, or too-long input
429 Too Many Requests — rate limit exceeded
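A client can mirror the constraints above to avoid requests that would come back as 400. The sketch below is a hedged approximation: `normalizeWord` and `wordUrl` are hypothetical helper names, the punctuation set is a guess (the server's exact rule is not documented here), and length is measured in JavaScript string units, which may not match how the server counts characters.

```typescript
const MAX_WORD_LENGTH = 128; // limit stated in the constraints above

// Assumption: "trailing punctuation" is approximated with a small ASCII set.
// Handy for client-side cache keys; the server applies its own normalization.
function normalizeWord(input: string): string {
  return input.trim().toLowerCase().replace(/[.,!?;:"')\]}]+$/, "");
}

// Returns the request URL, or null when the input is empty, whitespace-only,
// or too long and would be rejected anyway.
function wordUrl(baseUrl: string, input: string): string | null {
  const trimmed = input.trim();
  if (trimmed.length === 0 || trimmed.length > MAX_WORD_LENGTH) return null;
  // encodeURIComponent keeps characters such as "/" or "?" from altering the path.
  return `${baseUrl}/v1/word/${encodeURIComponent(trimmed)}`;
}
```

For example, `wordUrl("https://bitcrusher.neobit.gg", "café")` yields a URL ending in `/v1/word/caf%C3%A9`.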

❙ Batch Word Vectors

POST /v1/words
Content-Type: application/json

Returns vectors for multiple words in a single request. Duplicate words are deduplicated before lookup.

Constraints: Max 100 words per request, max 128 characters per word, max 32KB request body.

Request:

["hello", "world", "fasttext"]

Response:

[
  { "word": "hello",    "vector": [0.1234, -0.5678, 0.9012, ...] },
  { "word": "world",    "vector": [0.2345, -0.6789, 0.1123, ...] },
  { "word": "fasttext", "vector": [0.3456, -0.7890, 0.2234, ...] }
]

Error responses:

400 Bad Request — missing body, empty array, over limit, or invalid word
429 Too Many Requests — rate limit exceeded

❚ Rate Limits

All limits apply per IP address over a 1-minute sliding window:

| Endpoint | Limit |
| --- | --- |
| Global (all routes) | 120 req/min |
| `GET /v1/word/{word}` | 120 req/min |
| `POST /v1/words` | 30 req/min |

Rejected requests receive 429 Too Many Requests with a Retry-After: 60 header.
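A client can honor the `Retry-After` header instead of retrying immediately. The following is a minimal sketch, not part of the service: `retryDelayMs`, `fetchWithRetry`, and the `HttpReply` interface are hypothetical names, and the retry count of 3 is an arbitrary choice.

```typescript
// Minimal shape of a fetch-style response; the standard Response type satisfies it.
interface HttpReply {
  status: number;
  headers: { get(name: string): string | null };
}

// Parse a Retry-After value given in seconds; fall back when absent or malformed
// (an HTTP-date value also ends up on the fallback path in this sketch).
function retryDelayMs(headerValue: string | null, fallbackMs: number = 60000): number {
  const text = headerValue === null ? "" : headerValue.trim();
  if (text === "") return fallbackMs;
  const seconds = Number(text);
  return Number.isFinite(seconds) && seconds >= 0 ? seconds * 1000 : fallbackMs;
}

// Repeat the request while the server answers 429, waiting as instructed.
async function fetchWithRetry<T extends HttpReply>(
  doFetch: () => Promise<T>,
  maxAttempts: number = 3,
): Promise<T> {
  let res = await doFetch();
  for (let attempt = 1; res.status === 429 && attempt < maxAttempts; attempt++) {
    const delay = retryDelayMs(res.headers.get("Retry-After"));
    await new Promise<void>((resolve) => setTimeout(resolve, delay));
    res = await doFetch();
  }
  return res;
}

// Usage: fetchWithRetry(() => fetch("https://bitcrusher.neobit.gg/v1/word/hello"))
```

Passing the request as a function keeps the helper independent of any particular HTTP client and lets it rebuild the request on each attempt.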

❚ Usage Examples

cURL

# Single word
curl https://bitcrusher.neobit.gg/v1/word/hello

# Batch request
curl -X POST https://bitcrusher.neobit.gg/v1/words \
  -H "Content-Type: application/json" \
  -d '["hello", "world", "fasttext"]'

JavaScript / TypeScript

// Single word
const wordRes = await fetch('https://bitcrusher.neobit.gg/v1/word/hello');
const vector = await wordRes.json(); // number[]

// Batch request
const batchRes = await fetch('https://bitcrusher.neobit.gg/v1/words', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify(['hello', 'world', 'fasttext'])
});
const results = await batchRes.json(); // { word: string, vector: number[] }[]

C#

using System.Net.Http.Json; // GetFromJsonAsync, PostAsJsonAsync, ReadFromJsonAsync

using var client = new HttpClient();

// Single word
var vector = await client.GetFromJsonAsync<float[]>(
    "https://bitcrusher.neobit.gg/v1/word/hello");

// Batch request
var response = await client.PostAsJsonAsync(
    "https://bitcrusher.neobit.gg/v1/words",
    new[] { "hello", "world", "fasttext" });
var results = await response.Content.ReadFromJsonAsync<WordVector[]>();

// Type declarations must come after all top-level statements
record WordVector(string Word, float[] Vector);

⑇ Out-of-Vocabulary (OOV) Handling

When a word isn't in the vocabulary, Bitcrusher generates a vector using the FastText subword approach:

Boundary markers: Wraps the word as <word> (same as FastText)
N-gram extraction: Generates all character n-grams of length 3–6
FNV-1a hashing: Hashes each n-gram using FastText's own algorithm, bucketed into 2M slots
Subword lookup: Retrieves the quantized vector for each matching bucket
Averaging: Averages all found subword vectors into the final OOV vector
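The first three steps above can be sketched as follows. This is an illustration rather than the service's code: `charNgrams`, `fasttextHash`, and `bucketFor` are hypothetical names. The hash is 32-bit FNV-1a over UTF-8 bytes, with each byte sign-extended before the XOR, as FastText's C++ implementation does.

```typescript
const MIN_N = 3;
const MAX_N = 6;
const BUCKETS = 2000000; // "2M slots" from the list above

// All character n-grams of length 3 to 6 from the boundary-wrapped word.
function charNgrams(word: string): string[] {
  const chars = Array.from(`<${word}>`); // split by code point, not UTF-16 unit
  const ngrams: string[] = [];
  for (let start = 0; start < chars.length; start++) {
    for (let n = MIN_N; n <= MAX_N && start + n <= chars.length; n++) {
      ngrams.push(chars.slice(start, start + n).join(""));
    }
  }
  return ngrams;
}

// 32-bit FNV-1a. Bytes >= 0x80 are treated as signed 8-bit values,
// mirroring the int8_t cast in FastText's hash function.
function fasttextHash(text: string): number {
  const bytes = new TextEncoder().encode(text);
  let h = 2166136261;
  for (let i = 0; i < bytes.length; i++) {
    const signed = bytes[i] >= 0x80 ? bytes[i] - 256 : bytes[i];
    h = (h ^ signed) >>> 0;
    h = Math.imul(h, 16777619) >>> 0;
  }
  return h;
}

// Map an n-gram to one of the subword buckets.
function bucketFor(ngram: string): number {
  return fasttextHash(ngram) % BUCKETS;
}
```

For the word `hi`, `charNgrams` produces `<hi`, `<hi>`, and `hi>`; each is then hashed to a bucket whose quantized vector is looked up.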

Words that are entirely punctuation or reduce to empty after normalization return a zero vector.
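The averaging step, including the zero-vector case, might look like the sketch below. `averageVectors` is a hypothetical helper, and treating "no subword vectors found" the same as the empty-word case is an assumption made for this example.

```typescript
// Component-wise mean of the subword vectors that were found.
// Assumption: with nothing to average, a zero vector is returned,
// mirroring the zero-vector behavior described above.
function averageVectors(vectors: number[][], dim: number = 300): number[] {
  const mean = new Array<number>(dim).fill(0);
  if (vectors.length === 0) return mean;
  for (const vec of vectors) {
    for (let i = 0; i < dim; i++) {
      mean[i] += vec[i];
    }
  }
  return mean.map((x) => x / vectors.length);
}
```

Callers that compute cosine similarity should check for the zero vector first, since its norm is 0 and the similarity is undefined.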

❚ Data Source & Licensing

Vectors come from the FastText Common Crawl model (600B tokens) trained by Facebook Research.

Contributing

Personal project, open for suggestions and improvements. Raise an issue or reach out at hey@edyburr.com.

