-
Notifications
You must be signed in to change notification settings - Fork 97
Support per-row text_config via column reference for multilingual tables #299
Description
Problem Statement
When a table contains documents in multiple languages with a column indicating each row's language (a common pattern), there is currently no way to associate text_config with that column. The text_config index option only accepts a static configuration name, meaning every row in the index is tokenized with the same configuration. In native Postgres FTS, this is a well-established pattern using a regconfig column:
-- Native Postgres: per-row language from a column
CREATE INDEX ON documents USING gin(to_tsvector(language, content));Or via the built-in trigger tsvector_update_trigger_column, which reads the config name from a designated column per row.
With pg_textsearch, the only option today is a single fixed config per index. This means multilingual tables require workarounds that add complexity and operational overhead.
Proposed Solution
Allow text_config to reference a column name instead of (or in addition to) a static configuration string. The index build would then read the text search configuration from that column for each row, applying the appropriate stemming and stop word rules per document.
-- Proposed: text_config references a column containing the regconfig/language name
ALTER TABLE documents ADD COLUMN language text DEFAULT 'english';
CREATE INDEX ON documents USING bm25(content) WITH (text_config_column='language');
-- Each row is tokenized according to its own language value
INSERT INTO documents (content, language) VALUES
('PostgreSQL ist ein leistungsstarkes Datenbanksystem', 'german'),
('PostgreSQL est un système de base de données puissant', 'french'),
('PostgreSQL is a powerful database system', 'english');Query-side, the language would also need to be resolved per row for scoring. One option would be to infer it from the index metadata; another would be to require the query to specify the config to use for parsing the search terms (which is already effectively what happens today — the query is parsed with the index's text_config). A pragmatic first step could be to parse the query with a user-specified or default config, since query-side stemming differences between languages are often less impactful than index-side tokenization.
Alternatives Considered
- Partition by language with separate indexes per partition.
- Multiple BM25 indexes on the same table.
- Using text_config='simple'
All of them are somewhat suboptimal.