fix: Fix parent child retrieval issues #12206

Merged 24 commits on Jan 2, 2025
Commits
229e7a1
feat: add invalid document handling and improve modal close functiona…
WTW0313 Dec 26, 2024
9214cc8
fix: disabled retrieval setting in document settings page
douxc Dec 27, 2024
12b67df
fix: retrieval test trunk preview style update
douxc Dec 27, 2024
beb96db
fix: update rerank model error messages for clarity and consistency
WTW0313 Dec 27, 2024
5b150a0
fix: update Score component, return null if score is unavailable
douxc Dec 27, 2024
c09b46d
Update build-push.yml
douxc Dec 27, 2024
b0e302c
test:format code, trigger build
douxc Dec 27, 2024
9969e1e
Merge branch 'fix/parent-child-retrieval' of https://github.com/langg…
WTW0313 Dec 27, 2024
d1bb8ea
feat: add parent-child related parameters to dataset API doc
nite-knite Dec 28, 2024
91e814d
refactor: modify retrieval model configuration in settings modal
WTW0313 Dec 30, 2024
d876773
fix: enhance rerank model validation and configuration
WTW0313 Dec 30, 2024
b163b71
Merge branch 'main' into fix/parent-child-retrieval
WTW0313 Dec 30, 2024
14d34e6
fix: simplify click handler in ResultItem component
WTW0313 Dec 30, 2024
5c45ff7
fix: fix layout responsiveness issue with flex box
WTW0313 Dec 30, 2024
8382948
Merge branch 'main' into fix/parent-child-retrieval
WTW0313 Dec 30, 2024
6f1a7fd
fix: update translation for rerank model requirement and clean up SWR…
WTW0313 Dec 31, 2024
796a327
fix: remove unused rerank model variables and improve retrieval confi…
WTW0313 Jan 2, 2025
e2becb9
Merge branch 'main' into fix/parent-child-retrieval
WTW0313 Jan 2, 2025
fa60a3d
fix: simplify model selector logic in retrieval parameter configuration
WTW0313 Jan 2, 2025
3432c87
fix: update reranking model structure for consistency in retrieval co…
WTW0313 Jan 2, 2025
f01d0c2
fix: enhance document detail handling with invalidation logic for seg…
WTW0313 Jan 2, 2025
f0d66f9
Merge branch 'main' into fix/parent-child-retrieval
WTW0313 Jan 2, 2025
3443fec
fix: remove 'fix/parent-child-retrieval' branch from workflow triggers
WTW0313 Jan 2, 2025
89a8866
fix: update reranking model structure for improved clarity and consis…
WTW0313 Jan 2, 2025
35 changes: 34 additions & 1 deletion web/app/(commonLayout)/datasets/template/template.en.mdx
@@ -52,6 +52,15 @@ import { Row, Col, Properties, Property, Heading, SubProperty, Paragraph } from
- <code>high_quality</code> High quality: embedding using embedding model, built as vector database index
- <code>economy</code> Economy: Build using inverted index of keyword table index
</Property>
<Property name='doc_form' type='string' key='doc_form'>
Format of indexed content
- <code>text_model</code> Text documents are directly embedded; `economy` mode defaults to using this form
- <code>hierarchical_model</code> Parent-child mode
- <code>qa_model</code> Q&A Mode: Generates Q&A pairs for segmented documents and then embeds the questions
</Property>
<Property name='doc_language' type='string' key='doc_language'>
In Q&A mode, specify the language of the document, for example: <code>English</code>, <code>Chinese</code>
</Property>
<Property name='process_rule' type='object' key='process_rule'>
Processing rules
- <code>mode</code> (string) Cleaning, segmentation mode, automatic / custom
@@ -65,6 +74,10 @@ import { Row, Col, Properties, Property, Heading, SubProperty, Paragraph } from
- <code>segmentation</code> (object) Segmentation rules
- <code>separator</code> Custom segment identifier, currently only allows one delimiter to be set. Default is \n
- <code>max_tokens</code> Maximum length (token) defaults to 1000
- <code>parent_mode</code> Retrieval mode of parent chunks: <code>full-doc</code> full text retrieval / <code>paragraph</code> paragraph retrieval
- <code>subchunk_segmentation</code> (object) Child chunk rules
- <code>separator</code> Segmentation identifier. Currently, only one delimiter is allowed. The default is <code>***</code>
- <code>max_tokens</code> The maximum length (tokens) must be validated to be shorter than the length of the parent chunk
</Property>
</Properties>
</Col>
@@ -155,6 +168,13 @@ import { Row, Col, Properties, Property, Heading, SubProperty, Paragraph } from
- <code>high_quality</code> High quality: embedding using embedding model, built as vector database index
- <code>economy</code> Economy: Build using inverted index of keyword table index

- <code>doc_form</code> Format of indexed content
- <code>text_model</code> Text documents are directly embedded; `economy` mode defaults to using this form
- <code>hierarchical_model</code> Parent-child mode
- <code>qa_model</code> Q&A Mode: Generates Q&A pairs for segmented documents and then embeds the questions

- <code>doc_language</code> In Q&A mode, specify the language of the document, for example: <code>English</code>, <code>Chinese</code>

- <code>process_rule</code> Processing rules
- <code>mode</code> (string) Cleaning, segmentation mode, automatic / custom
- <code>rules</code> (object) Custom rules (in automatic mode, this field is empty)
@@ -167,6 +187,10 @@ import { Row, Col, Properties, Property, Heading, SubProperty, Paragraph } from
- <code>segmentation</code> (object) Segmentation rules
- <code>separator</code> Custom segment identifier, currently only allows one delimiter to be set. Default is \n
- <code>max_tokens</code> Maximum length (token) defaults to 1000
- <code>parent_mode</code> Retrieval mode of parent chunks: <code>full-doc</code> full text retrieval / <code>paragraph</code> paragraph retrieval
- <code>subchunk_segmentation</code> (object) Child chunk rules
- <code>separator</code> Segmentation identifier. Currently, only one delimiter is allowed. The default is <code>***</code>
- <code>max_tokens</code> The maximum length (tokens) must be validated to be shorter than the length of the parent chunk
</Property>
<Property name='file' type='multipart/form-data' key='file'>
Files that need to be uploaded.
@@ -449,6 +473,10 @@ import { Row, Col, Properties, Property, Heading, SubProperty, Paragraph } from
- <code>segmentation</code> (object) Segmentation rules
- <code>separator</code> Custom segment identifier, currently only allows one delimiter to be set. Default is \n
- <code>max_tokens</code> Maximum length (token) defaults to 1000
- <code>parent_mode</code> Retrieval mode of parent chunks: <code>full-doc</code> full text retrieval / <code>paragraph</code> paragraph retrieval
- <code>subchunk_segmentation</code> (object) Child chunk rules
- <code>separator</code> Segmentation identifier. Currently, only one delimiter is allowed. The default is <code>***</code>
- <code>max_tokens</code> The maximum length (tokens) must be validated to be shorter than the length of the parent chunk
</Property>
</Properties>
</Col>
@@ -546,6 +574,10 @@ import { Row, Col, Properties, Property, Heading, SubProperty, Paragraph } from
- <code>segmentation</code> (object) Segmentation rules
- <code>separator</code> Custom segment identifier, currently only allows one delimiter to be set. Default is \n
- <code>max_tokens</code> Maximum length (token) defaults to 1000
- <code>parent_mode</code> Retrieval mode of parent chunks: <code>full-doc</code> full text retrieval / <code>paragraph</code> paragraph retrieval
- <code>subchunk_segmentation</code> (object) Child chunk rules
- <code>separator</code> Segmentation identifier. Currently, only one delimiter is allowed. The default is <code>***</code>
- <code>max_tokens</code> The maximum length (tokens) must be validated to be shorter than the length of the parent chunk
</Property>
</Properties>
</Col>
@@ -984,7 +1016,7 @@ import { Row, Col, Properties, Property, Heading, SubProperty, Paragraph } from
<Heading
url='/datasets/{dataset_id}/documents/{document_id}/segments/{segment_id}'
method='POST'
title='Update a Chunk in a Document '
title='Update a Chunk in a Document'
name='#update_segment'
/>
<Row>
@@ -1009,6 +1041,7 @@ import { Row, Col, Properties, Property, Heading, SubProperty, Paragraph } from
- <code>answer</code> (text) Answer content, passed if the knowledge is in Q&A mode (optional)
- <code>keywords</code> (list) Keyword (optional)
- <code>enabled</code> (bool) False / true (optional)
- <code>regenerate_child_chunks</code> (bool) Whether to regenerate child chunks (optional)
</Property>
</Properties>
</Col>
Expand Down
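The parent-child parameters documented in this diff can be combined into one set of request fields. The following is a minimal Python sketch, not the project's implementation. The function name `build_parent_child_settings`, the default child length of 200 tokens, the choice of `high_quality` indexing, and the nesting of `parent_mode` and `subchunk_segmentation` inside `rules` are all assumptions, since the scraped diff has lost its indentation. Only the field names, the allowed values, and the defaults for the separators and the parent `max_tokens` come from the documentation above.

```python
from typing import Any, Dict

# Allowed values for parent_mode, as listed in the API doc.
PARENT_MODES = ("full-doc", "paragraph")


def build_parent_child_settings(
    parent_mode: str = "paragraph",
    parent_max_tokens: int = 1000,   # documented default for segmentation
    child_max_tokens: int = 200,     # illustrative value, not a documented default
    parent_separator: str = "\n",    # documented default for segmentation
    child_separator: str = "***",    # documented default for subchunk_segmentation
) -> Dict[str, Any]:
    """Build indexing fields for a parent-child (hierarchical_model) document.

    Hypothetical helper: the nesting under "rules" is an assumption.
    """
    if parent_mode not in PARENT_MODES:
        raise ValueError(
            f"parent_mode must be one of {PARENT_MODES}, got {parent_mode!r}"
        )
    # The doc states the child max_tokens must be shorter than the parent chunk.
    if child_max_tokens >= parent_max_tokens:
        raise ValueError(
            "subchunk_segmentation.max_tokens must be smaller than "
            "segmentation.max_tokens"
        )
    return {
        "indexing_technique": "high_quality",
        "doc_form": "hierarchical_model",
        "process_rule": {
            "mode": "custom",
            "rules": {
                "segmentation": {
                    "separator": parent_separator,
                    "max_tokens": parent_max_tokens,
                },
                "parent_mode": parent_mode,
                "subchunk_segmentation": {
                    "separator": child_separator,
                    "max_tokens": child_max_tokens,
                },
            },
        },
    }


if __name__ == "__main__":
    import json

    print(json.dumps(build_parent_child_settings(), indent=2))
```

Checking the child length on the client side mirrors the documented constraint and catches an invalid combination before a request is sent. Whether the server rejects such a request with an error is not stated in this diff.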
39 changes: 36 additions & 3 deletions web/app/(commonLayout)/datasets/template/template.zh.mdx
@@ -52,6 +52,15 @@ import { Row, Col, Properties, Property, Heading, SubProperty, Paragraph } from
- <code>high_quality</code> 高质量:使用 embedding 模型进行嵌入,构建为向量数据库索引
- <code>economy</code> 经济:使用 keyword table index 的倒排索引进行构建
</Property>
<Property name='doc_form' type='string' key='doc_form'>
索引内容的形式
- <code>text_model</code> text 文档直接 embedding,经济模式默认为该模式
- <code>hierarchical_model</code> parent-child 模式
- <code>qa_model</code> Q&A 模式:为分片文档生成 Q&A 对,然后对问题进行 embedding
</Property>
<Property name='doc_language' type='string' key='doc_language'>
在 Q&A 模式下,指定文档的语言,例如:<code>English</code>、<code>Chinese</code>
</Property>
<Property name='process_rule' type='object' key='process_rule'>
处理规则
- <code>mode</code> (string) 清洗、分段模式 ,automatic 自动 / custom 自定义
@@ -63,8 +72,12 @@ import { Row, Col, Properties, Property, Heading, SubProperty, Paragraph } from
- <code>remove_urls_emails</code> 删除 URL、电子邮件地址
- <code>enabled</code> (bool) 是否选中该规则,不传入文档 ID 时代表默认值
- <code>segmentation</code> (object) 分段规则
- <code>separator</code> 自定义分段标识符,目前仅允许设置一个分隔符。默认为 \n
- <code>separator</code> 自定义分段标识符,目前仅允许设置一个分隔符。默认为 <code>\n</code>
- <code>max_tokens</code> 最大长度(token)默认为 1000
- <code>parent_mode</code> 父分段的召回模式 <code>full-doc</code> 全文召回 / <code>paragraph</code> 段落召回
- <code>subchunk_segmentation</code> (object) 子分段规则
- <code>separator</code> 分段标识符,目前仅允许设置一个分隔符。默认为 <code>***</code>
- <code>max_tokens</code> 最大长度 (token) 需要校验小于父级的长度
</Property>
</Properties>
</Col>
@@ -155,6 +168,13 @@ import { Row, Col, Properties, Property, Heading, SubProperty, Paragraph } from
- <code>high_quality</code> 高质量:使用 embedding 模型进行嵌入,构建为向量数据库索引
- <code>economy</code> 经济:使用 keyword table index 的倒排索引进行构建

- <code>doc_form</code> 索引内容的形式
- <code>text_model</code> text 文档直接 embedding,经济模式默认为该模式
- <code>hierarchical_model</code> parent-child 模式
- <code>qa_model</code> Q&A 模式:为分片文档生成 Q&A 对,然后对问题进行 embedding

- <code>doc_language</code> 在 Q&A 模式下,指定文档的语言,例如:<code>English</code>、<code>Chinese</code>

- <code>process_rule</code> 处理规则
- <code>mode</code> (string) 清洗、分段模式 ,automatic 自动 / custom 自定义
- <code>rules</code> (object) 自定义规则(自动模式下,该字段为空)
@@ -167,6 +187,10 @@ import { Row, Col, Properties, Property, Heading, SubProperty, Paragraph } from
- <code>segmentation</code> (object) 分段规则
- <code>separator</code> 自定义分段标识符,目前仅允许设置一个分隔符。默认为 \n
- <code>max_tokens</code> 最大长度(token)默认为 1000
- <code>parent_mode</code> 父分段的召回模式 <code>full-doc</code> 全文召回 / <code>paragraph</code> 段落召回
- <code>subchunk_segmentation</code> (object) 子分段规则
- <code>separator</code> 分段标识符,目前仅允许设置一个分隔符。默认为 <code>***</code>
- <code>max_tokens</code> 最大长度 (token) 需要校验小于父级的长度
</Property>
<Property name='file' type='multipart/form-data' key='file'>
需要上传的文件。
@@ -411,7 +435,7 @@ import { Row, Col, Properties, Property, Heading, SubProperty, Paragraph } from
<Heading
url='/datasets/{dataset_id}/documents/{document_id}/update-by-text'
method='POST'
title='通过文本更新文档 '
title='通过文本更新文档'
name='#update-by-text'
/>
<Row>
@@ -449,6 +473,10 @@ import { Row, Col, Properties, Property, Heading, SubProperty, Paragraph } from
- <code>segmentation</code> (object) 分段规则
- <code>separator</code> 自定义分段标识符,目前仅允许设置一个分隔符。默认为 \n
- <code>max_tokens</code> 最大长度(token)默认为 1000
- <code>parent_mode</code> 父分段的召回模式 <code>full-doc</code> 全文召回 / <code>paragraph</code> 段落召回
- <code>subchunk_segmentation</code> (object) 子分段规则
- <code>separator</code> 分段标识符,目前仅允许设置一个分隔符。默认为 <code>***</code>
- <code>max_tokens</code> 最大长度 (token) 需要校验小于父级的长度
</Property>
</Properties>
</Col>
@@ -508,7 +536,7 @@ import { Row, Col, Properties, Property, Heading, SubProperty, Paragraph } from
<Heading
url='/datasets/{dataset_id}/documents/{document_id}/update-by-file'
method='POST'
title='通过文件更新文档 '
title='通过文件更新文档'
name='#update-by-file'
/>
<Row>
@@ -546,6 +574,10 @@ import { Row, Col, Properties, Property, Heading, SubProperty, Paragraph } from
- <code>segmentation</code> (object) 分段规则
- <code>separator</code> 自定义分段标识符,目前仅允许设置一个分隔符。默认为 \n
- <code>max_tokens</code> 最大长度(token)默认为 1000
- <code>parent_mode</code> 父分段的召回模式 <code>full-doc</code> 全文召回 / <code>paragraph</code> 段落召回
- <code>subchunk_segmentation</code> (object) 子分段规则
- <code>separator</code> 分段标识符,目前仅允许设置一个分隔符。默认为 <code>***</code>
- <code>max_tokens</code> 最大长度 (token) 需要校验小于父级的长度
</Property>
</Properties>
</Col>
@@ -1009,6 +1041,7 @@ import { Row, Col, Properties, Property, Heading, SubProperty, Paragraph } from
- <code>answer</code> (text) 答案内容,非必填,如果知识库的模式为 Q&A 模式则传值
- <code>keywords</code> (list) 关键字,非必填
- <code>enabled</code> (bool) false/true,非必填
- <code>regenerate_child_chunks</code> (bool) 是否重新生成子分段,非必填
</Property>
</Properties>
</Col>
Expand Down
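Both templates add an optional `regenerate_child_chunks` flag to the chunk-update fields. As a rough illustration, the Python sketch below collects the optional fields visible in this hunk and leaves out any the caller did not set. The helper name `build_chunk_update_fields` is hypothetical. The enclosing property and any required fields sit outside the visible hunk, so they are deliberately not modelled here.

```python
from typing import Any, Dict, List, Optional


def build_chunk_update_fields(
    answer: Optional[str] = None,
    keywords: Optional[List[str]] = None,
    enabled: Optional[bool] = None,
    regenerate_child_chunks: Optional[bool] = None,
) -> Dict[str, Any]:
    """Collect the optional chunk-update fields shown in the diff.

    Hypothetical helper: fields left as None are omitted from the result,
    so an explicit False is still sent.
    """
    candidates: Dict[str, Any] = {
        "answer": answer,                    # only relevant in Q&A mode
        "keywords": keywords,
        "enabled": enabled,
        "regenerate_child_chunks": regenerate_child_chunks,
    }
    return {key: value for key, value in candidates.items() if value is not None}


if __name__ == "__main__":
    print(build_chunk_update_fields(enabled=True, regenerate_child_chunks=True))
```

Filtering on `None` rather than on truthiness keeps `enabled=False` and `regenerate_child_chunks=False` in the output, which matters because both are meaningful values.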