⑆ Bitcrusher API

Bitcrusher is a lightweight REST API that serves FastText word embeddings with full out-of-vocabulary (OOV) support via subword n-gram hashing — the same way FastText does it natively.

Overview

The FastText Common Crawl model (600B tokens) was quantized using IVF+PQ (inverted file index + product quantization) and stored in SQLite. At query time, vectors are reconstructed from PQ codes and IVF centroids in memory — no multi-gigabyte model file, no warmup delay, and a fraction of the RAM footprint of the raw float32 model.
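As a rough illustration of the reconstruction step, the sketch below rebuilds one vector from an IVF centroid and a list of PQ codes. It assumes residual product quantization (the PQ codewords encode the difference from the centroid), which the text suggests but does not confirm. The function name `reconstruct` and the `codebooks` layout are hypothetical and not taken from Bitcrusher's source.

```javascript
// Hypothetical layout (an assumption, not Bitcrusher's actual schema):
//   centroid  - the IVF centroid of the cell the word was assigned to
//   codebooks - codebooks[m] is the list of codewords for sub-space m
//   codes     - codes[m] is the index of the chosen codeword in codebooks[m]
function reconstruct(centroid, codebooks, codes) {
  const dim = centroid.length;
  const subDim = dim / codebooks.length; // dimensions covered by each sub-space
  const out = new Float32Array(dim);
  for (let m = 0; m < codebooks.length; m++) {
    const codeword = codebooks[m][codes[m]];
    for (let j = 0; j < subDim; j++) {
      const i = m * subDim + j;
      // Residual PQ: approximate vector = centroid + quantized residual
      out[i] = centroid[i] + codeword[j];
    }
  }
  return out;
}
```

For scale, a raw 300-dimensional float32 vector takes 300 × 4 = 1,200 bytes, while a PQ code takes one small integer per sub-space plus a centroid id. The number of sub-spaces Bitcrusher uses is not stated here.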

OOV vectors are generated using the same FNV-1a subword n-gram algorithm as FastText, so results are consistent with the original implementation.
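For reference, here is a minimal sketch of the 32-bit FNV-1a hash in the form FastText uses it, with each UTF-8 byte sign-extended before the XOR. The function names `fnv1a` and `subwordBucket` are illustrative, and the default of 2,000,000 buckets follows the "2M slots" figure quoted later in this README. Whether Bitcrusher's code is organized this way is an assumption.

```javascript
// 32-bit FNV-1a over the UTF-8 bytes of `text`.
// Each byte is sign-extended before the XOR, mirroring the int8_t cast in
// FastText's C++ hash. For pure-ASCII input this equals plain FNV-1a.
function fnv1a(text) {
  let h = 2166136261; // FNV offset basis
  for (const byte of new TextEncoder().encode(text)) {
    const signed = (byte << 24) >> 24; // reinterpret the byte as int8
    h = (h ^ signed) >>> 0;            // XOR, kept as an unsigned 32-bit value
    h = Math.imul(h, 16777619) >>> 0;  // multiply by the FNV prime, wrap to uint32
  }
  return h;
}

// Bucket index in [0, buckets) for one n-gram.
function subwordBucket(ngram, buckets = 2000000) {
  return fnv1a(ngram) % buckets;
}
```

The sign extension only matters for non-ASCII bytes (values of 128 and above), which is where a plain FNV-1a implementation would disagree with FastText.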

✱ Features

  • Instant Startup: Quantized data loads from SQLite once, then everything is served from memory
  • 🗜️ IVF+PQ Quantization: Dramatically smaller memory footprint than raw float32 vectors
  • 🎯 FastText Common Crawl: 600B-token model, 300-dimensional vectors
  • 🔧 Full OOV Support: Subword n-gram hashing matches FastText's own OOV behavior
  • 📦 Batch Endpoint: Up to 100 words per request

❚ API Endpoints

❙ Health Check

GET /health

Returns the service status. Reports an unhealthy status until the quantized data finishes loading on startup.

❙ Single Word Vector

GET /v1/word/{word}

Returns the 300-dimensional vector for a single word. OOV words receive a generated vector via subword hashing.

Constraints: Max 128 characters. The word is lowercased and trailing punctuation is stripped before lookup.

Response:

[0.1234, -0.5678, 0.9012, ...]

Error responses:

  • 400 Bad Request — empty, whitespace, or too-long input
  • 429 Too Many Requests — rate limit exceeded
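A client can mirror the constraints above before sending a request, which avoids spending rate-limit budget on inputs that would return 400. The sketch below is an approximation: the helper names are hypothetical, the exact punctuation set the server strips is not documented here (Unicode `\p{P}` is an assumption), and so is the way the server counts characters.

```javascript
const BASE_URL = "https://bitcrusher.neobit.gg";
const MAX_WORD_LENGTH = 128;

// Approximates the server-side normalization: lowercase, then strip trailing
// punctuation. The \p{P} character class is an assumption.
function normalizeWord(word) {
  return word.toLowerCase().replace(/\p{P}+$/u, "");
}

// Returns a reason string for input the API is documented to reject with 400,
// or null when the word looks acceptable.
function validateWord(word) {
  if (word.trim().length === 0) return "word is empty or whitespace";
  // Counts UTF-16 code units; the server may count characters differently.
  if (word.length > MAX_WORD_LENGTH) return "word exceeds 128 characters";
  return null;
}

// Builds the request URL, percent-encoding characters such as "/" or "+".
function wordUrl(word) {
  return `${BASE_URL}/v1/word/${encodeURIComponent(word)}`;
}
```

Percent-encoding matters for words containing reserved URL characters; `wordUrl("c++")`, for example, yields a path ending in `c%2B%2B`.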

❙ Batch Word Vectors

POST /v1/words
Content-Type: application/json

Returns vectors for multiple words in a single request. Duplicate words are deduplicated before lookup.

Constraints: Max 100 words per request, max 128 characters per word, max 32KB request body.

Request:

["hello", "world", "fasttext"]

Response:

[
  { "word": "hello",    "vector": [0.1234, -0.5678, 0.9012, ...] },
  { "word": "world",    "vector": [0.2345, -0.6789, 0.1123, ...] },
  { "word": "fasttext", "vector": [0.3456, -0.7890, 0.2234, ...] }
]

Error responses:

  • 400 Bad Request — missing body, empty array, over limit, or invalid word
  • 429 Too Many Requests — rate limit exceeded
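For word lists longer than the 100-word limit, one option is to deduplicate on the client and split the list into several requests. The following is a sketch under that assumption. `toBatches` and `fetchVectors` are hypothetical helper names, and keying the result by the `word` field assumes the response echoes each word in a form the caller can match against its input.

```javascript
const WORDS_URL = "https://bitcrusher.neobit.gg/v1/words";
const MAX_BATCH_SIZE = 100;

// Deduplicates the input and splits it into arrays of at most `size` words.
function toBatches(words, size = MAX_BATCH_SIZE) {
  const unique = [...new Set(words)];
  const batches = [];
  for (let i = 0; i < unique.length; i += size) {
    batches.push(unique.slice(i, i + size));
  }
  return batches;
}

// Sends one POST per batch, sequentially, and collects word -> vector pairs.
// `fetchImpl` is injectable so the function can be exercised without a network.
async function fetchVectors(words, fetchImpl = fetch) {
  const vectors = new Map();
  for (const batch of toBatches(words)) {
    const res = await fetchImpl(WORDS_URL, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(batch),
    });
    if (!res.ok) throw new Error(`Batch request failed with status ${res.status}`);
    for (const { word, vector } of await res.json()) {
      vectors.set(word, vector);
    }
  }
  return vectors;
}
```

Sending batches one after another, rather than in parallel, keeps the request rate easier to reason about against the per-minute limits.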

❚ Rate Limits

All limits apply per IP address over a 1-minute sliding window:

Endpoint              Limit
Global (all routes)   120 req/min
GET /v1/word/{word}   120 req/min
POST /v1/words        30 req/min

Rejected requests receive 429 Too Many Requests with a Retry-After: 60 header.
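A client can honor the `Retry-After` header instead of retrying immediately. The sketch below shows one way to do that. The helper names, the retry count, and the 60-second fallback are illustrative choices rather than part of the API.

```javascript
// Converts a Retry-After header value (in seconds) to milliseconds.
// Falls back to `fallbackMs` when the header is missing or not a number.
function retryDelayMs(retryAfter, fallbackMs = 60000) {
  if (retryAfter === null || retryAfter.trim() === "") return fallbackMs;
  const seconds = Number(retryAfter);
  return Number.isFinite(seconds) && seconds >= 0 ? seconds * 1000 : fallbackMs;
}

// Repeats a request while the server answers 429, up to `maxRetries` extra
// attempts, sleeping for the advertised delay between attempts.
async function fetchWithRetry(url, init = {}, maxRetries = 2) {
  for (let attempt = 0; ; attempt++) {
    const res = await fetch(url, init);
    if (res.status !== 429 || attempt >= maxRetries) return res;
    const delay = retryDelayMs(res.headers.get("Retry-After"));
    await new Promise((resolve) => setTimeout(resolve, delay));
  }
}
```

With 30 batch requests per minute and up to 100 words per request, a single IP can look up at most 3,000 words per minute through the batch endpoint.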

❚ Usage Examples

cURL

# Single word
curl https://bitcrusher.neobit.gg/v1/word/hello

# Batch request
curl -X POST https://bitcrusher.neobit.gg/v1/words \
  -H "Content-Type: application/json" \
  -d '["hello", "world", "fasttext"]'

JavaScript / TypeScript

// Single word
const res = await fetch('https://bitcrusher.neobit.gg/v1/word/hello');
const vector = await res.json(); // number[]

// Batch request (separate variable name, since `res` is already declared above)
const batchRes = await fetch('https://bitcrusher.neobit.gg/v1/words', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify(['hello', 'world', 'fasttext'])
});
const results = await batchRes.json(); // { word: string, vector: number[] }[]

C#

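using System.Net.Http;
using System.Net.Http.Json; // GetFromJsonAsync, PostAsJsonAsync, ReadFromJsonAsync

using var client = new HttpClient();

// Single word
var vector = await client.GetFromJsonAsync<float[]>(
    "https://bitcrusher.neobit.gg/v1/word/hello");

// Batch request
var response = await client.PostAsJsonAsync(
    "https://bitcrusher.neobit.gg/v1/words",
    new[] { "hello", "world", "fasttext" });
var results = await response.Content.ReadFromJsonAsync<WordVector[]>();

// Type declarations must come after all top-level statements.
// Property names are matched case-insensitively against the JSON keys.
record WordVector(string Word, float[] Vector);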

⑇ Out-of-Vocabulary (OOV) Handling

When a word isn't in the vocabulary, Bitcrusher generates a vector using the FastText subword approach:

  1. Boundary markers: Wraps the word as <word> (same as FastText)
  2. N-gram extraction: Generates all character n-grams of length 3–6
  3. FNV-1a hashing: Hashes each n-gram using FastText's own algorithm, bucketed into 2M slots
  4. Subword lookup: Retrieves the quantized vector for each matching bucket
  5. Averaging: Averages all found subword vectors into the final OOV vector

Words that are entirely punctuation or reduce to empty after normalization return a zero vector.
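The n-gram extraction and averaging steps above can be sketched as follows. This is a simplified illustration, not Bitcrusher's code: `charNgrams` and `oovVector` are hypothetical names, and `lookupSubword` stands in for the hash-and-fetch step (it should return a vector for an n-gram, or `undefined` when its bucket has no stored vector).

```javascript
// Wraps the word in boundary markers and returns all character n-grams of
// length minN..maxN, iterating over code points rather than UTF-16 units.
function charNgrams(word, minN = 3, maxN = 6) {
  const chars = Array.from(`<${word}>`);
  const ngrams = [];
  for (let start = 0; start < chars.length; start++) {
    for (let n = minN; n <= maxN && start + n <= chars.length; n++) {
      ngrams.push(chars.slice(start, start + n).join(""));
    }
  }
  return ngrams;
}

// Averages the vectors of all n-grams that `lookupSubword` can resolve.
// Returns a zero vector when no n-gram is found.
function oovVector(word, lookupSubword, dim = 300) {
  const sum = new Float64Array(dim);
  let found = 0;
  for (const ngram of charNgrams(word)) {
    const vec = lookupSubword(ngram);
    if (!vec) continue; // no vector stored for this n-gram's bucket
    for (let i = 0; i < dim; i++) sum[i] += vec[i];
    found++;
  }
  if (found > 0) {
    for (let i = 0; i < dim; i++) sum[i] /= found;
  }
  return sum;
}
```

For example, `charNgrams("hi")` returns `["<hi", "<hi>", "hi>"]`. An empty word yields `<>`, which is too short for any 3-gram, so `oovVector` returns the zero vector in that case.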

❚ Data Source & Licensing

Vectors come from the FastText Common Crawl model (600B tokens) trained by Facebook Research.

Contributing

Personal project, open for suggestions and improvements. Raise an issue or reach out at hey@edyburr.com.
