Complete API reference for ApexBase Python SDK.
- ApexClient - Main client class
- ResultView - Query results
- Constants - Module constants
- File Reading Table Functions - read_csv / read_parquet / read_json
- Set Operations - UNION / INTERSECT / EXCEPT
- Vector Search - topk_distance / batch_topk_distance / SQL explode_rename
The main entry point for ApexBase operations.
ApexClient(
dirpath: str = None,
batch_size: int = 1000,
drop_if_exists: bool = False,
enable_cache: bool = True,
cache_size: int = 10000,
prefer_arrow_format: bool = True,
durability: Literal['fast', 'safe', 'max'] = 'fast',
_auto_manage: bool = True
)

Parameters:
- dirpath: Data directory path (default: current directory)
- batch_size: Batch size for bulk operations
- drop_if_exists: If True, delete existing data on open
- enable_cache: Enable query result caching
- cache_size: Maximum cache entries
- prefer_arrow_format: Prefer Arrow format for internal transfers
- durability: Persistence level - 'fast' (async), 'safe' (sync), 'max' (fsync every write)
Example:
from apexbase import ApexClient
# Basic usage
client = ApexClient("./data")
# With durability options
client = ApexClient("./data", durability="safe")
# Clean start (drop existing)
client = ApexClient.create_clean("./data")

ApexBase supports multiple isolated databases within a single root directory. Each named database is stored as a subdirectory; the 'default' database maps to the root directory (backward-compatible).
use_database(database: str = 'default') -> ApexClient

Switch to a named database. Creates the database subdirectory if it does not exist. Resets the current table to None.
Parameters:
database: Database name. 'default' (or '') maps to the root directory.
Returns: self (for method chaining)
Examples:
# Switch to analytics database
client.use_database("analytics")
# Switch back to default (root-level tables)
client.use_database("default")
# Method chaining
client.use_database("hr").create_table("employees")

use(database: str = 'default', table: str = None) -> ApexClient

Switch to a named database and optionally select or create a table in one call. If table is specified and does not exist, it is created automatically.
Parameters:
- database: Database name (default = root-level).
- table: Table name to select. If None, only the database is switched.
Returns: self (for method chaining)
Examples:
# Switch database only
client.use(database="analytics")
# Switch database and select an existing table
client.use(database="analytics", table="events")
# Switch database and auto-create table if missing
client.use(database="new_db", table="new_table")
client.store({"key": "value"})

list_databases() -> List[str]

Return a sorted list of all available databases. 'default' is always included.
Example:
dbs = client.list_databases()
print(dbs)  # ['analytics', 'default', 'hr']

current_database: str  # Property

Return the name of the currently active database. Returns 'default' when operating on root-level tables.
Example:
client.use_database("analytics")
print(client.current_database)  # 'analytics'

All SQL operations support the standard database.table qualified name syntax. The active database context only affects unqualified table references; qualified references always resolve to the correct database regardless of context.
Supported operations:
# SELECT across databases
client.execute("SELECT * FROM default.users")
client.execute("SELECT * FROM analytics.events WHERE cnt > 10")
# JOIN across databases
client.execute("""
SELECT u.name, e.event
FROM default.users u
JOIN analytics.events e ON u.id = e.user_id
""")
# INSERT into a different database
client.execute("INSERT INTO analytics.events (name, cnt) VALUES ('click', 1)")
# UPDATE in a different database
client.execute("UPDATE default.users SET age = 31 WHERE name = 'Alice'")
# DELETE from a different database
client.execute("DELETE FROM default.users WHERE age < 18")
# DDL across databases
client.execute("CREATE TABLE analytics.summary (total INT)")
client.execute("DROP TABLE IF EXISTS analytics.old_table")

create_table(table_name: str, schema: dict = None) -> None

Create a new table, optionally with a pre-defined schema.
Parameters:
- table_name: Name of the table to create.
- schema: Optional dict mapping column names to type strings. Pre-defining the schema avoids type inference on the first insert, providing a performance benefit for bulk loading.
Supported types: int8, int16, int32, int64, uint8, uint16, uint32, uint64, float32, float64, bool, string, binary
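To fail fast on a typo in a schema dict before calling create_table, the accepted type strings can be checked up front. The helper below is illustrative only and not part of the SDK:

```python
# Illustrative helper (not part of the SDK): validate a schema dict
# against the type strings accepted by create_table().
SUPPORTED_TYPES = {
    "int8", "int16", "int32", "int64",
    "uint8", "uint16", "uint32", "uint64",
    "float32", "float64", "bool", "string", "binary",
}

def validate_schema(schema: dict) -> None:
    """Raise ValueError if any column uses an unknown type string."""
    for col, dtype in schema.items():
        if dtype not in SUPPORTED_TYPES:
            raise ValueError(f"column {col!r}: unsupported type {dtype!r}")

validate_schema({"order_id": "int64", "price": "float64"})  # passes silently
```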
Examples:
# Without schema
client.create_table("users")
# With pre-defined schema
client.create_table("orders", schema={
"order_id": "int64",
"product": "string",
"price": "float64",
"paid": "bool"
})

drop_table(table_name: str) -> None

Drop a table and all its data.
Example:
client.drop_table("old_table")

use_table(table_name: str) -> None

Switch to a different table for subsequent operations.
Example:
client.use_table("users")

list_tables() -> List[str]

Return a list of all table names.
Example:
tables = client.list_tables()
print(tables)  # ['users', 'orders']

current_table: Optional[str]  # Property

Get the name of the currently active table. Returns None if no table is selected.
Example:
print(client.current_table)  # 'users'

store(data) -> None

Store data in the active table. Requires a table to be selected via create_table() or use_table() first. Accepts multiple formats:
- Single dict: {"name": "Alice", "age": 30}
- List of dicts: [{"name": "A"}, {"name": "B"}]
- Dict of columns: {"name": ["A", "B"], "age": [20, 30]}
- pandas DataFrame
- polars DataFrame
- PyArrow Table
Examples:
# Single record
client.store({"name": "Alice", "age": 30})
# Multiple records
client.store([
{"name": "Bob", "age": 25},
{"name": "Charlie", "age": 35}
])
# Columnar format (fastest for bulk)
client.store({
"name": ["David", "Eve"],
"age": [28, 32]
})

from_pandas(df: pd.DataFrame, table_name: str = None) -> ApexClient

Import data from a pandas DataFrame. Returns self for chaining.
Parameters:
- df: pandas DataFrame to import
- table_name: Optional. If provided, auto-creates/selects the table before importing.
Example:
import pandas as pd
df = pd.DataFrame({"name": ["A", "B"], "age": [20, 30]})
client.from_pandas(df, table_name="users")

from_polars(df: pl.DataFrame, table_name: str = None) -> ApexClient

Import data from a polars DataFrame. Returns self for chaining.
Parameters:
- df: polars DataFrame to import
- table_name: Optional. If provided, auto-creates/selects the table before importing.
Example:
import polars as pl
df = pl.DataFrame({"name": ["A", "B"], "age": [20, 30]})
client.from_polars(df, table_name="users")

from_pyarrow(table: pa.Table, table_name: str = None) -> ApexClient

Import data from a PyArrow Table. Returns self for chaining.
Parameters:
- table: PyArrow Table to import
- table_name: Optional. If provided, auto-creates/selects the table before importing.
Example:
import pyarrow as pa
table = pa.table({"name": ["A", "B"], "age": [20, 30]})
client.from_pyarrow(table, table_name="users")

execute(sql: str, show_internal_id: bool = None) -> ResultView

Execute a SQL query and return the results.
Parameters:
- sql: SQL statement (SELECT, INSERT, etc.)
- show_internal_id: If True, include the _id column in results
Example:
# Basic query (use your table name in FROM clause)
results = client.execute("SELECT * FROM users WHERE age > 25")
# Aggregation
results = client.execute("SELECT COUNT(*), AVG(age) FROM users")
count = results.scalar()
# With ordering and limits
results = client.execute("SELECT name, age FROM users ORDER BY age DESC LIMIT 10")

query(
sql: str = None,
where_clause: str = None,
limit: int = None
) -> ResultView

Query with a WHERE expression (kept for backward compatibility).
Example:
# WHERE expression only
results = client.query("age > 25")
results = client.query("name LIKE 'A%'")
# With limit
results = client.query(where_clause="city = 'NYC'", limit=100)

retrieve(id_: int) -> Optional[dict]

Get a single record by its internal _id.
Example:
record = client.retrieve(0)
print(record)  # {'_id': 0, 'name': 'Alice', 'age': 30}

retrieve_many(ids: List[int]) -> ResultView

Get multiple records by their internal _ids.
Example:
results = client.retrieve_many([0, 1, 2, 5])
df = results.to_pandas()

retrieve_all() -> ResultView

Get all records from the current table.
Example:
results = client.retrieve_all()
print(len(results))  # Total row count

count_rows(table_name: str = None) -> int

Count rows in a table.
Example:
count = client.count_rows()
count = client.count_rows("users")  # Specific table

replace(id_: int, data: dict) -> bool

Replace a record by _id.
Example:
success = client.replace(0, {"name": "Alice", "age": 31})

batch_replace(data_dict: Dict[int, dict]) -> List[int]

Batch-replace multiple records.
Example:
updated = client.batch_replace({
0: {"name": "Alice", "age": 31},
1: {"name": "Bob", "age": 26}
})

delete(ids: Union[int, List[int]]) -> bool

Delete record(s) by _id.
Example:
# Single delete
client.delete(5)
# Batch delete
client.delete([1, 2, 3])

add_column(column_name: str, column_type: str) -> None

Add a new column to the current table.
Types: Int8, Int16, Int32, Int64, UInt8, UInt16, UInt32, UInt64, Float32, Float64, String, Bool
Example:
client.add_column("email", "String")
client.add_column("score", "Float64")

drop_column(column_name: str) -> None

Drop a column from the current table. The _id column cannot be dropped.
Example:
client.drop_column("temp_field")

rename_column(old_column_name: str, new_column_name: str) -> None

Rename a column. The _id column cannot be renamed.
Example:
client.rename_column("email", "email_address")

get_column_dtype(column_name: str) -> str

Get the data type of a column.
Example:
dtype = client.get_column_dtype("age")  # 'Int64'

list_fields() -> List[str]

List all column names in the current table.
Example:
fields = client.list_fields()
print(fields)  # ['_id', 'name', 'age', 'city']

FTS is implemented natively in Rust and available through all interfaces (Python API, PG Wire, Arrow Flight). The recommended way to manage and query FTS indexes is via SQL.
| Statement | Description |
|---|---|
| CREATE FTS INDEX ON table (col1, col2) | Create FTS index on specified columns |
| CREATE FTS INDEX ON table | Create FTS index on all string columns |
| CREATE FTS INDEX ON table WITH (opt=val) | Create with options |
| DROP FTS INDEX ON table | Drop index and delete files |
| ALTER FTS INDEX ON table DISABLE | Suspend indexing, keep files |
| ALTER FTS INDEX ON table ENABLE | Resume indexing and back-fill any missed rows |
| SHOW FTS INDEXES | List FTS-configured tables across all databases |
| WHERE MATCH('query') | Exact / ranked full-text search |
| WHERE FUZZY_MATCH('query') | Fuzzy / typo-tolerant search |
CREATE FTS INDEX
CREATE FTS INDEX ON table_name [(col1, col2, ...)] [WITH (lazy_load=bool, cache_size=N)]

- (col1, col2) — optional column list; omit to index all string columns
- lazy_load — defer loading the index into RAM until first search (default false)
- cache_size — LRU cache entries for the index (default 10000)
client.execute("CREATE FTS INDEX ON articles (title, content)")
client.execute("CREATE FTS INDEX ON logs WITH (lazy_load=true, cache_size=50000)")

DROP FTS INDEX
DROP FTS INDEX ON table_name

Removes the index entry and deletes the .nfts index files from disk.
ALTER FTS INDEX ... DISABLE
ALTER FTS INDEX ON table_name DISABLE

Suspends FTS write-sync (INSERT / DELETE no longer update the index). Index files are kept on disk. Use ALTER FTS INDEX ... ENABLE to resume.
ALTER FTS INDEX ... ENABLE
ALTER FTS INDEX ON table_name ENABLE

Resumes FTS write-sync and back-fills all rows currently in the table, including any rows inserted while FTS was disabled.
SHOW FTS INDEXES
SHOW FTS INDEXES

Returns a result set with columns: database, table, enabled, fields, lazy_load, cache_size. Lists indexes across the root directory and all named sub-databases.
result = client.execute("SHOW FTS INDEXES")
df = result.to_pandas()
# database table enabled fields lazy_load cache_size
# default   articles  True     title, content  False      10000

MATCH('query')
Used in WHERE clauses to filter rows whose indexed text contains all query terms.
# Simple search
client.execute("SELECT * FROM articles WHERE MATCH('python tutorial')")
# Combined with other conditions
client.execute("""
SELECT title, content FROM articles
WHERE MATCH('machine learning') AND year >= 2023
ORDER BY _id DESC
LIMIT 10
""")
# Aggregations
client.execute("SELECT COUNT(*) FROM articles WHERE MATCH('rust')")

FUZZY_MATCH('query')
Like MATCH() but tolerates typos and spelling variations.
client.execute("SELECT * FROM articles WHERE FUZZY_MATCH('pytohn')") # matches 'python'
client.execute("SELECT * FROM articles WHERE FUZZY_MATCH('databse')")  # matches 'database'

Note: MATCH() / FUZZY_MATCH() require an FTS index to exist for the queried table. Use CREATE FTS INDEX ON table first. The SQL interface works over all transports (Python, PG Wire, Arrow Flight).
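To make the MATCH() semantics concrete (a row qualifies only when its indexed text contains all query terms), here is a toy inverted index in plain Python. This is a conceptual sketch, not the Rust FTS implementation, and the tokenization (whitespace split) is a simplifying assumption:

```python
# Conceptual sketch of MATCH('q') AND-semantics with a toy inverted index.
from collections import defaultdict

docs = {
    0: "python tutorial for beginners",
    1: "rust systems programming",
    2: "machine learning with python",
}

index = defaultdict(set)          # term -> set of row _ids
for _id, text in docs.items():
    for term in text.split():
        index[term].add(_id)

def match(query: str) -> set:
    """Row _ids whose text contains every term in the query."""
    terms = query.split()
    ids = index[terms[0]].copy()
    for t in terms[1:]:
        ids &= index[t]           # intersect: ALL terms must appear
    return ids

print(sorted(match("python")))           # [0, 2]
print(sorted(match("python tutorial")))  # [0]
```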
init_fts(
table_name: str = None,
index_fields: Optional[List[str]] = None,
lazy_load: bool = False,
cache_size: int = 10000
) -> ApexClient

Initialize full-text search for a table.
Parameters:
- table_name: Table to index (default: current table)
- index_fields: Fields to index (None = all string fields)
- lazy_load: Load the index on first search
- cache_size: FTS cache size
Example:
client.init_fts(index_fields=["title", "content"])
client.init_fts(index_fields=["name"], lazy_load=True)

search_text(query: str, table_name: str = None) -> np.ndarray

Search for documents containing the query terms. Returns an array of _ids.
Example:
ids = client.search_text("database")
print(ids)  # array([0, 5, 10])

fuzzy_search_text(
query: str,
min_results: int = 1,
table_name: str = None
) -> np.ndarray

Fuzzy search tolerating typos. Returns an array of _ids.
Example:
ids = client.fuzzy_search_text("databse")  # Matches "database"

search_and_retrieve(
query: str,
table_name: str = None,
limit: Optional[int] = None,
offset: int = 0
) -> ResultView

Search and return full records.
Example:
results = client.search_and_retrieve("python")
results = client.search_and_retrieve("python", limit=10, offset=20)

search_and_retrieve_top(
query: str,
n: int = 100,
table_name: str = None
) -> ResultView

Return the top N search results.
Example:
results = client.search_and_retrieve_top("important", n=5)

get_fts_stats(table_name: str = None) -> Dict

Get FTS statistics.
Example:
stats = client.get_fts_stats()
print(stats)  # {'fts_enabled': True, 'doc_count': 1000, 'term_count': 5000}

disable_fts(table_name: str = None) -> ApexClient

Disable FTS (keeps index files).
Example:
client.disable_fts()

drop_fts(table_name: str = None) -> ApexClient

Disable FTS and delete the index files.
Example:
client.drop_fts()

flush() -> None

Flush all pending writes to disk.
Example:
client.flush()

set_auto_flush(rows: int = 0, bytes: int = 0) -> None

Set auto-flush thresholds.
Example:
client.set_auto_flush(rows=1000) # Flush every 1000 rows
client.set_auto_flush(bytes=1024*1024)  # Flush every 1MB

get_auto_flush() -> tuple

Get the current auto-flush configuration.
Example:
rows, bytes = client.get_auto_flush()

estimate_memory_bytes() -> int

Estimate current memory usage in bytes.
Example:
mem_bytes = client.estimate_memory_bytes()
print(f"Using {mem_bytes / 1024 / 1024:.1f} MB")

close() -> None

Close the client and release resources.
Example:
client.close()

Container for query results with multiple output formats.
to_pandas(zero_copy: bool = True) -> pd.DataFrame

Convert to a pandas DataFrame.
Parameters:
zero_copy: Use ArrowDtype for zero-copy (pandas 2.0+)
Example:
df = results.to_pandas()
df = results.to_pandas(zero_copy=False)  # Traditional NumPy types

to_polars() -> pl.DataFrame

Convert to a polars DataFrame.
Example:
df = results.to_polars()

to_arrow() -> pa.Table

Convert to a PyArrow Table.
Example:
table = results.to_arrow()

to_dict() -> List[dict]

Convert to a list of dictionaries.
Example:
records = results.to_dict()
for record in records:
    print(record["name"])

scalar() -> Any

Get a single scalar value (for aggregate queries).
Example:
count = client.execute("SELECT COUNT(*) FROM users").scalar()

first() -> Optional[dict]

Get the first row as a dictionary.
Example:
row = results.first()

get_ids(return_list: bool = False) -> Union[np.ndarray, List[int]]

Get the internal _ids from the results.
Example:
ids = results.get_ids() # numpy array (default)
ids = results.get_ids(return_list=True)  # Python list

shape: tuple  # (rows, columns)
columns: List[str]

# Length
len(results)
# Iteration
for row in results:
    print(row)
# Indexing
first = results[0]
second = results[1]

from apexbase import (
__version__, # Package version
FTS_AVAILABLE, # True (FTS always available)
ARROW_AVAILABLE, # True if pyarrow installed
POLARS_AVAILABLE, # True if polars installed
DurabilityLevel, # Type hint: Literal['fast', 'safe', 'max']
)

ApexBase supports standard SQL for querying:
SELECT * FROM table
SELECT col1, col2 FROM table
SELECT col1 AS alias FROM table
SELECT DISTINCT col1 FROM table
SELECT * FROM table WHERE condition
SELECT * FROM table ORDER BY col DESC
SELECT * FROM table LIMIT 100
SELECT * FROM table LIMIT 100 OFFSET 10
SELECT * FROM table ORDER BY col LIMIT 100

SELECT COUNT(*) FROM table
SELECT COUNT(DISTINCT col) FROM table
SELECT SUM(col), AVG(col), MAX(col), MIN(col) FROM table

WHERE col = value
WHERE col > value
WHERE col LIKE 'pattern%'
WHERE col IN (1, 2, 3)
WHERE col BETWEEN 10 AND 20
WHERE col IS NULL
WHERE col IS NOT NULL
WHERE condition1 AND condition2
WHERE condition1 OR condition2

SELECT category, COUNT(*), AVG(price)
FROM products
GROUP BY category

SELECT category, COUNT(*)
FROM products
GROUP BY category
HAVING COUNT(*) > 10

SELECT * FROM table1 JOIN table2 ON table1.id = table2.id

ApexBase provides SQL table functions that read external files directly in a FROM clause without importing the data into a table first. All three functions return a result compatible with the full SQL engine — you can apply WHERE, GROUP BY, ORDER BY, LIMIT, JOIN, and UNION on top of them.
SELECT * FROM read_csv('path/to/file.csv')
SELECT * FROM read_csv('path/to/file.csv', header=true, delimiter=',')

Parameters:
| Option | Default | Description |
|---|---|---|
| header | true | Whether the first row is a header. Set to false / 0 if the file has no header row. |
| delimiter / delim / sep | , | Field delimiter character. Use '\t' for TSV files. |
Schema inference: types are inferred automatically from the first 100 data rows (Int64, Float64, Bool, or String).
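The inference rule can be modelled in a few lines of Python. This is an illustrative sketch of the behaviour described above (sample the first rows, pick the narrowest matching type), not the actual Rust implementation:

```python
# Sketch of CSV schema inference: classify a column sample as
# Int64, Float64, Bool, or String (the fallback).
def infer_type(values):
    """Infer a column type from a sample of raw string values."""
    sample = values[:100]                      # first 100 data rows
    if all(v.lower() in ("true", "false") for v in sample):
        return "Bool"
    try:
        for v in sample:
            int(v)
        return "Int64"
    except ValueError:
        pass
    try:
        for v in sample:
            float(v)
        return "Float64"
    except ValueError:
        return "String"

print(infer_type(["1", "2", "3"]))      # Int64
print(infer_type(["1.5", "2"]))         # Float64
print(infer_type(["true", "false"]))    # Bool
print(infer_type(["a", "b"]))           # String
```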
Examples:
# Read a comma-delimited CSV with header row
result = client.execute("SELECT * FROM read_csv('/data/sales.csv')")
# Tab-separated values (TSV)
result = client.execute("SELECT * FROM read_csv('/data/data.tsv', delimiter='\t')")
# No header row
result = client.execute("SELECT * FROM read_csv('/data/raw.csv', header=false)")
# Full SQL on top of the file
result = client.execute("""
SELECT city, COUNT(*) AS cnt, AVG(price)
FROM read_csv('/data/orders.csv')
WHERE price > 100
GROUP BY city
ORDER BY cnt DESC
LIMIT 10
""")
# Convert directly to DataFrame
df = client.execute("SELECT * FROM read_csv('/data/large.csv')").to_pandas()

SELECT * FROM read_parquet('path/to/file.parquet')

No options — the schema is read directly from the Parquet file's metadata.
Examples:
# Read a Parquet file
result = client.execute("SELECT * FROM read_parquet('/data/events.parquet')")
# Projection and filter
result = client.execute("""
SELECT user_id, SUM(amount) AS total
FROM read_parquet('/data/transactions.parquet')
WHERE category = 'food'
GROUP BY user_id
""")
# Zero-copy to Arrow
table = client.execute("SELECT * FROM read_parquet('/data/wide.parquet')").to_arrow()

SELECT * FROM read_json('path/to/file.json')

Handles two formats automatically:
- NDJSON / JSON Lines — one JSON object per line (.json, .jsonl, .ndjson)
- pandas column-oriented JSON — output of df.to_json(orient='columns') or orient='split'

No options — the format is detected automatically.
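The detection heuristic can be sketched in plain Python. This is an assumption about how the two formats differ (NDJSON has one object per line; orient='columns' is a single object mapping each column to a row-index dict), not the engine's parser, and it omits the orient='split' variant for brevity:

```python
# Sketch: distinguish NDJSON from pandas column-oriented JSON and
# normalize both into a list of row dicts.
import json

def detect_and_read(text: str):
    """Return (format_name, list_of_row_dicts)."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    if len(lines) > 1:
        # One JSON object per line -> NDJSON / JSON Lines
        return "ndjson", [json.loads(ln) for ln in lines]
    obj = json.loads(lines[0])
    # Column-oriented: {"col": {"0": v0, "1": v1, ...}, ...}
    cols = {c: list(v.values()) for c, v in obj.items()}
    n = len(next(iter(cols.values())))
    return "columns", [{c: cols[c][i] for c in cols} for i in range(n)]

fmt, rows = detect_and_read('{"a": 1}\n{"a": 2}')
print(fmt, rows)  # ndjson [{'a': 1}, {'a': 2}]

fmt, rows = detect_and_read('{"a": {"0": 1, "1": 2}}')
print(fmt, rows)  # columns [{'a': 1}, {'a': 2}]
```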
Examples:
# Read NDJSON (one JSON object per line)
result = client.execute("SELECT * FROM read_json('/data/logs.ndjson')")
# Read pandas-exported JSON
result = client.execute("SELECT * FROM read_json('/data/export.json')")
# Apply aggregation
result = client.execute("""
SELECT status, COUNT(*) AS cnt
FROM read_json('/data/events.json')
GROUP BY status
ORDER BY cnt DESC
""")

# JOIN a file read with a stored table
result = client.execute("""
SELECT u.name, f.score
FROM users u
JOIN read_csv('/data/scores.csv') f ON u.id = f.user_id
WHERE f.score > 90
""")
# UNION a file with a table
result = client.execute("""
SELECT name, city FROM users
UNION ALL
SELECT name, city FROM read_csv('/data/new_users.csv')
""")

- All three functions use mmap for zero-copy file access.
- CSV and JSON files are parsed in parallel (one chunk per CPU core via Rayon).
- Parquet files use parallel column decoding with shared metadata (zero re-parse overhead).
- Benchmarked against Polars on 1M rows: CSV 0.95×, NDJSON 0.93×, Parquet 1.33× (Arrow output).
SQL set operations combine result sets from two or more SELECT statements. Both sides must return the same number of columns.
Combines rows from both sides and removes duplicate rows.
SELECT col FROM table_a
UNION
SELECT col FROM table_b

Combines rows from both sides and keeps all rows, including duplicates.
SELECT col FROM table_a
UNION ALL
SELECT col FROM table_b

Returns only the rows that appear in both result sets (deduplicated).
SELECT col FROM table_a
INTERSECT
SELECT col FROM table_b

Returns the rows from the left side that do not appear in the right side (deduplicated).
SELECT col FROM table_a
EXCEPT
SELECT col FROM table_b

Append ORDER BY, LIMIT, and OFFSET after the final set operation; they apply to the combined result:
SELECT val FROM a
UNION
SELECT val FROM b
ORDER BY val ASC
LIMIT 100

# UNION: unique names across two tables
client.execute("""
SELECT name FROM customers
UNION
SELECT name FROM leads
ORDER BY name
""")
# UNION ALL: all rows from both tables (duplicates kept)
client.execute("""
SELECT name, city FROM domestic_users
UNION ALL
SELECT name, city FROM international_users
""")
# INTERSECT: users who appear in both the orders and wishlist tables
client.execute("""
SELECT user_id FROM orders
INTERSECT
SELECT user_id FROM wishlist
""")
# EXCEPT: users who placed orders but have no open support tickets
client.execute("""
SELECT user_id FROM orders
EXCEPT
SELECT user_id FROM support_tickets WHERE status = 'open'
""")
# Works with read_csv too
client.execute("""
SELECT email FROM users
EXCEPT
SELECT email FROM read_csv('/data/unsubscribed.csv')
""")

| Operation | Duplicates | Rows returned |
|---|---|---|
| UNION | removed | left ∪ right (unique) |
| UNION ALL | kept | all rows from both sides |
| INTERSECT | removed | left ∩ right |
| EXCEPT | removed | left \ right |
ApexBase has a built-in vector similarity search engine implemented entirely in Rust with SIMD-accelerated distance kernels and an OS-level mmap scan buffer that is populated once and reused across queries. Both single-query and batch modes are available through dedicated Python methods and through a SQL extension syntax.
Vectors are stored as FixedList columns when inserted as numpy arrays (recommended) or as Binary columns when inserted as Python lists or tuples.
import numpy as np
from apexbase import ApexClient
client = ApexClient("./vecdb")
client.create_table("items")
# numpy arrays → FixedList column (optimal, zero-copy mmap scan)
client.store({
"label": ["a", "b", "c"],
"vec": [np.random.rand(128).astype(np.float32) for _ in range(3)],
})
# Python list/tuple → Binary column (also supported)
client.store({"label": "d", "vec": [0.1, 0.2, 0.3]})

ApexBase supports half-precision (float16) vector columns for memory-efficient embedding storage. Each element is stored as a 16-bit IEEE 754 half-precision float, halving the memory and I/O footprint compared to float32 vectors.
CREATE TABLE embeddings (id TEXT, vec FLOAT16_VECTOR)

Accepted type name aliases: FLOAT16_VECTOR, FLOAT16VECTOR, F16_VECTOR.
Use a batch store (two or more records in one call) for optimal encoding. Pass vectors as numpy arrays — any numeric dtype is accepted; the storage layer converts to float16 automatically.
import numpy as np
from apexbase import ApexClient
client = ApexClient("./vecdb")
client.execute("CREATE TABLE embeddings (label TEXT, vec FLOAT16_VECTOR)")
client.use_table("embeddings")
# float32 source data — auto-quantized to f16 on write
vecs = np.random.rand(1000, 128).astype(np.float32)
client.store([{"label": str(i), "vec": vecs[i]} for i in range(len(vecs))])
# float16 source data — stored directly
vecs_f16 = vecs.astype(np.float16)
client.store([{"label": str(i), "vec": vecs_f16[i]} for i in range(len(vecs_f16))])

Note: Always use batch store (len(data) > 1) for FLOAT16_VECTOR columns. The single-record path converts vectors to float32 bytes before storage, which produces incorrect distances on float16 columns. Batch the writes or use the columnar dict API:

# Columnar dict — also correct and fastest
client.store({"label": [str(i) for i in range(n)], "vec": [vecs[i] for i in range(n)]})
All topk_distance, batch_topk_distance, and SQL distance functions work transparently on FLOAT16_VECTOR columns. The query vector is always provided in float32/float64 — no special handling needed.
query = np.random.rand(128).astype(np.float32)
# TopK search
results = client.topk_distance("vec", query, k=10, metric="l2")
# SQL distance functions also work
client.execute("""
SELECT label, array_distance(vec, [0.1, 0.2, 0.3, 0.4]) AS dist
FROM embeddings
ORDER BY dist
LIMIT 5
""")

Float16 distance kernels are hardware-accelerated on supported CPUs:
| Architecture | Feature | Kernels |
|---|---|---|
| aarch64 (Apple M-series, AWS Graviton) | fp16 NEON | L2, Dot, Cosine, L1, L∞ — FCVTL/FCVTL2 |
| x86_64 | f16c + AVX2 | L2, Dot, Cosine, L1, L∞ — _cvtph_ps |
| All others | scalar fallback | all metrics |
CPU feature detection is automatic at runtime — no build flags or environment variables required. On Apple Silicon (M1/M2/M3/M4), f16 kernels are typically ≥2× faster than equivalent float32 kernels.
Float16 has ~3 decimal digits of precision (machine epsilon ≈ 9.77 × 10⁻⁴). For unit vectors or typical embedding ranges [−1, 1], the relative distance error is under 0.2%.
import numpy as np
def f16_quantize(v):
    return v.astype(np.float16).astype(np.float32)
vec = np.random.rand(128).astype(np.float32)
q = np.random.rand(128).astype(np.float32)
exact = float(np.sqrt(np.sum((vec - q) ** 2)))
f16_dist = float(np.sqrt(np.sum((f16_quantize(vec) - q) ** 2)))
print(f"relative error: {abs(exact - f16_dist) / exact:.2e}")  # typically < 2e-3

topk_distance(
col: str,
query,
k: int = 10,
metric: str = 'l2',
id_col: str = '_id',
dist_col: str = 'dist',
) -> ResultView

Heap-based nearest-neighbour search: O(n log k), significantly faster than ORDER BY distance LIMIT k on large tables.
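The O(n log k) heap selection can be modelled in a few lines of Python. This is a conceptual sketch of the algorithm (a brute-force scan with a bounded max-heap), not the SIMD Rust kernel:

```python
# Keep a size-k max-heap of (negated distance, row id) while scanning
# all rows; each row costs at most O(log k) heap work.
import heapq
import numpy as np

def topk_l2(vectors: np.ndarray, query: np.ndarray, k: int):
    """Return [(row_id, distance)] of the k nearest rows, nearest first."""
    heap = []  # max-heap via negated distance
    for i, v in enumerate(vectors):
        d = float(np.sqrt(np.sum((v - query) ** 2)))
        if len(heap) < k:
            heapq.heappush(heap, (-d, i))
        elif -heap[0][0] > d:              # current worst is farther away
            heapq.heapreplace(heap, (-d, i))
    return sorted(((i, -nd) for nd, i in heap), key=lambda t: t[1])

rng = np.random.default_rng(0)
vecs = rng.random((1000, 8)).astype(np.float32)
q = rng.random(8).astype(np.float32)
print(topk_l2(vecs, q, k=3))  # 3 (row_id, dist) pairs, ascending distance
```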
Parameters:
- col: Name of the vector column to search (FixedList or Binary).
- query: Query vector — list, tuple, or numpy array of floats.
- k: Number of nearest neighbours to return (default 10).
- metric: Distance metric (see table below).
- id_col: Column name for the returned row IDs (default '_id').
- dist_col: Column name for the returned distances (default 'dist').
Supported metrics:
| metric value | Aliases | Formula |
|---|---|---|
| 'l2' | 'euclidean' | √Σ(aᵢ−bᵢ)² |
| 'l2_squared' | — | Σ(aᵢ−bᵢ)² |
| 'l1' | 'manhattan' | Σ\|aᵢ−bᵢ\| |
| 'linf' | 'chebyshev' | max\|aᵢ−bᵢ\| |
| 'cosine' | 'cosine_distance' | 1 − (a·b)/(‖a‖‖b‖) |
| 'dot' | 'inner_product' | −(a·b) (negated for min-heap) |
Returns: ResultView with id_col (Int64) and dist_col (Float64) columns, sorted nearest first.
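For checking results, the metrics above have straightforward numpy reference implementations (the engine computes the same quantities in SIMD Rust; this sketch is for verification only):

```python
# Reference implementations of the supported distance metrics.
import numpy as np

def distance(a, b, metric="l2"):
    a, b = np.asarray(a, float), np.asarray(b, float)
    if metric == "l2":
        return float(np.sqrt(np.sum((a - b) ** 2)))
    if metric == "l2_squared":
        return float(np.sum((a - b) ** 2))
    if metric == "l1":
        return float(np.sum(np.abs(a - b)))
    if metric == "linf":
        return float(np.max(np.abs(a - b)))
    if metric == "cosine":
        return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    if metric == "dot":
        return -float(a @ b)       # negated so smaller = more similar
    raise ValueError(metric)

a, b = [1.0, 0.0], [0.0, 1.0]
print(distance(a, b, "l2"))      # 1.414...
print(distance(a, b, "cosine"))  # 1.0 (orthogonal vectors)
```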
Example:
import numpy as np
query = np.random.rand(128).astype(np.float32)
# L2 (default)
results = client.topk_distance('vec', query, k=10)
df = results.to_pandas()
# df columns: _id (int64), dist (float64)
# Cosine distance, custom column names
results = client.topk_distance('vec', query, k=5, metric='cosine',
id_col='item_id', dist_col='cosine_dist')
# Join back to the original table to retrieve full records
top_ids = results.get_ids() # numpy array of _id values
records = client.retrieve_many(top_ids.tolist())
# Or use a SQL subquery
client.execute("""
SELECT items.label, items.vec
FROM items
WHERE _id IN (
SELECT _id FROM (
SELECT explode_rename(topk_distance(vec, [0.1, 0.2, 0.3], 5, 'l2'), '_id', 'dist')
FROM items
)
)
""")

batch_topk_distance(
col: str,
queries,
k: int = 10,
metric: str = 'l2',
) -> numpy.ndarray

Batch nearest-neighbour search — N query vectors in a single Rust call.
Why use this instead of calling topk_distance N times:
- The mmap float buffer (scan_buf) is populated once regardless of N.
- All N queries run in parallel via Rayon (outer parallelism over queries).
- The _id column is read only once.
Parameters:
- col: Name of the vector column (FixedList or Binary).
- queries: (N, D) numpy array or array-like of query vectors (float32 or float64). A 1-D array is treated as a single query (N=1).
- k: Number of nearest neighbours per query (default 10).
- metric: Distance metric — same values accepted as topk_distance.
Returns: numpy.ndarray of shape (N, k, 2), dtype float64.
- result[i, j, 0] — _id of the j-th nearest neighbour for query i (cast to int64 as needed).
- result[i, j, 1] — the corresponding distance.
- Each row is sorted ascending by distance.
- Entries are padded with (-1, inf) when fewer than k neighbours exist.
Example:
import numpy as np
N, D = 100, 128
queries = np.random.rand(N, D).astype(np.float32)
result = client.batch_topk_distance('vec', queries, k=10)
# result.shape == (100, 10, 2)
ids = result[:, :, 0].astype(np.int64) # shape (100, 10)
dists = result[:, :, 1] # shape (100, 10)
# Nearest neighbour for each query
nearest_id = ids[:, 0] # shape (100,)
nearest_dist = dists[:, 0] # shape (100,)
# Cosine similarity batch search
result_cos = client.batch_topk_distance('vec', queries, k=5, metric='cosine')

Vector search is also available as a pure SQL expression. This is the form used internally by topk_distance() and is useful when composing larger SQL queries.
Syntax:
SELECT explode_rename(
topk_distance(col, [q1, q2, ..., qD], k, 'metric'),
'id_column_name',
'dist_column_name'
)
FROM table_name

- col — vector column name.
- [q1, q2, ..., qD] — query vector as an array literal (float values).
- k — integer number of results.
- 'metric' — distance metric string.
- The two string arguments to explode_rename name the output columns.
explode_rename "explodes" the TopK pairs returned by topk_distance into k rows with two named columns.
Examples:
# Basic: top 10 by L2 distance
results = client.execute("""
SELECT explode_rename(topk_distance(vec, [0.1, 0.2, 0.3], 10, 'l2'), '_id', 'dist')
FROM items
""")
df = results.to_pandas()
# df: _id (int64), dist (float64), 10 rows sorted nearest first
# Cosine distance with custom column names
results = client.execute("""
SELECT explode_rename(
topk_distance(vec, [1.0, 0.0, 0.0], 5, 'cosine'),
'item_id', 'cosine_dist'
)
FROM items
""")
# Dot product (inner product) search
results = client.execute("""
SELECT explode_rename(topk_distance(vec, [0.5, 0.5, 0.5], 20, 'dot'), '_id', 'score')
FROM embeddings
""")

Note: The SQL topk_distance / explode_rename syntax requires the array-literal [...] form for the query vector. To use dynamic vectors from Python, use the topk_distance() method instead, which handles the formatting automatically.
Benchmark: 1M rows × dim=128, k=10, release build, warm mmap scan buffer.
| Metric | ApexBase | DuckDB | Speedup |
|---|---|---|---|
| L2 | ~12ms | ~47ms | 3.8× faster |
| Cosine | ~13ms | ~42ms | 3.1× faster |
| Dot | ~13ms | ~36ms | 2.8× faster |
All three metrics use a single scan of the mmap float buffer; distance computation is SIMD-accelerated.
INSERT INTO table (col1, col2) VALUES (1, 'a')
INSERT INTO table VALUES (1, 'a', 3.14)
INSERT INTO table (col1, col2) VALUES (1, 'a'), (2, 'b'), (3, 'c')

ApexBase supports full SQL DDL operations.
CREATE TABLE table_name
CREATE TABLE IF NOT EXISTS table_name

-- Add column
ALTER TABLE table_name ADD COLUMN column_name DATA_TYPE
-- Rename column
ALTER TABLE table_name RENAME COLUMN old_name TO new_name
-- Drop column
ALTER TABLE table_name DROP COLUMN column_name

DROP TABLE table_name
DROP TABLE IF EXISTS table_name

| Type | Aliases | Description |
|---|---|---|
| STRING | VARCHAR, TEXT | String/text data |
| INT | INTEGER, INT32, INT64 | Integer numbers |
| FLOAT | DOUBLE, FLOAT64 | Floating-point numbers |
| BOOL | BOOLEAN | Boolean values |
| FLOAT16_VECTOR | FLOAT16VECTOR, F16_VECTOR | Half-precision float vector (SIMD-accelerated TopK) |
# Create table via SQL
client.execute("CREATE TABLE IF NOT EXISTS users")
# Add columns via SQL
client.execute("ALTER TABLE users ADD COLUMN name STRING")
client.execute("ALTER TABLE users ADD COLUMN age INT")
# Insert data via SQL
client.execute("INSERT INTO users (name, age) VALUES ('Alice', 30)")
results = client.execute("SELECT * FROM users WHERE age > 25")
# Modify schema via SQL
client.execute("ALTER TABLE users RENAME COLUMN name TO full_name")
client.execute("ALTER TABLE users DROP COLUMN age")
# Drop table via SQL
client.execute("DROP TABLE IF EXISTS users")

You can execute multiple SQL statements in a single call by separating them with semicolons:
# Execute multiple statements at once
client.execute("""
CREATE TABLE IF NOT EXISTS products;
ALTER TABLE products ADD COLUMN name STRING;
ALTER TABLE products ADD COLUMN price FLOAT;
INSERT INTO products (name, price) VALUES ('Laptop', 999.99)
""")
# Multiple INSERT statements
client.execute("""
INSERT INTO products (name, price) VALUES ('Mouse', 29.99);
INSERT INTO products (name, price) VALUES ('Keyboard', 79.99);
INSERT INTO products (name, price) VALUES ('Monitor', 299.99)
""")
# The result of the last statement is returned
results = client.execute("""
CREATE TABLE IF NOT EXISTS temp;
INSERT INTO temp (name) VALUES ('test');
SELECT * FROM temp
""")

Multi-Statement SQL Rules:
- Statements are separated by semicolons (;)
- Semicolons inside string literals are handled correctly
- Statements execute sequentially, in order
- The result of the last SELECT statement is returned
- DDL and DML statements (CREATE, ALTER, DROP, INSERT) return empty results
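The splitting rule above ("semicolons inside string literals are handled correctly") can be illustrated with a minimal splitter. This is a conceptual sketch, not the engine's parser, and it assumes simple single-quoted literals (doubled-quote escapes like 'it''s' are not modelled):

```python
# Sketch: split a multi-statement SQL string on semicolons, ignoring
# semicolons that fall inside single-quoted string literals.
def split_statements(sql: str):
    stmts, buf, in_string = [], [], False
    for ch in sql:
        if ch == "'":
            in_string = not in_string       # toggle on quote boundaries
        if ch == ";" and not in_string:
            stmts.append("".join(buf).strip())
            buf = []
        else:
            buf.append(ch)
    tail = "".join(buf).strip()
    if tail:
        stmts.append(tail)
    return [s for s in stmts if s]

sql = "INSERT INTO t (msg) VALUES ('a; b'); SELECT * FROM t"
print(split_statements(sql))
# ["INSERT INTO t (msg) VALUES ('a; b')", 'SELECT * FROM t']
```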