Open
Labels
documentation: Improvements or additions to documentation
Description
1. Affected Chapter
Chapter 7.0
2. Problem Description
1. In the get_chunk function, when the overlap content is added:
prev_chunk = chunk_text[-1]
cover_part = prev_chunk[-cover_content:] if len(prev_chunk) > cover_content else prev_chunk
chunk_part = cover_part + chunk_part
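len(prev_chunk) counts the characters in the chunk text, whereas cover_content is the number of tokens to overlap. Isn't it incorrect to compare the two directly? The slice prev_chunk[-cover_content:] has the same problem: it takes the last cover_content characters, not the last cover_content tokens.
2. Also in get_chunk, line 124 of the code computes curr_len = len(enc.encode(cover_part)) + 1 + line_len. Since token_len has already had the overlap token count subtracted, shouldn't curr_len be computed without adding the overlap token count len(enc.encode(cover_part))?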
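To make the first point concrete, here is a minimal sketch of the mismatch. It uses a hypothetical whitespace tokenizer (`ToyEncoder`) as a stand-in for `enc`, so the token boundaries are illustrative only and will differ with a real tokenizer.

```python
from typing import List


class ToyEncoder:
    """Hypothetical stand-in for `enc`: one token per whitespace-separated word."""

    def encode(self, text: str) -> List[str]:
        return text.split()

    def decode(self, tokens: List[str]) -> str:
        return " ".join(tokens)


enc = ToyEncoder()
cover_content = 3  # intended overlap, measured in tokens
prev_chunk = "retrieval augmented generation splits documents into chunks"

# Character-based slice (the code quoted in the issue): keeps the last 3 characters.
char_cover = prev_chunk[-cover_content:] if len(prev_chunk) > cover_content else prev_chunk

# Token-based slice (the suggested change): keeps the last 3 tokens.
tokens = enc.encode(prev_chunk)
cover_tokens = tokens[-cover_content:] if len(tokens) > cover_content else tokens
token_cover = enc.decode(cover_tokens)

print(repr(char_cover))   # 'nks'
print(repr(token_cover))  # 'documents into chunks'
```

With the character-based slice the overlap is a three-character fragment, while the token-based slice carries over the three tokens that cover_content asks for.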
3. Reproduction Materials
1.
prev_chunk = chunk_text[-1]
cover_part = prev_chunk[-cover_content:] if len(prev_chunk) > cover_content else prev_chunk
chunk_part = cover_part + chunk_part
Suggested change:
prev_chunk = chunk_text[-1]
prev_chunk_tokens = enc.encode(prev_chunk)
cover_part_tokens = prev_chunk_tokens[-cover_content:] if len(prev_chunk_tokens) > cover_content else prev_chunk_tokens
cover_part = enc.decode(cover_part_tokens)
chunk_part = cover_part + chunk_part
2.
curr_len = len(enc.encode(cover_part)) + 1 + line_len
Suggested change:
curr_len = 1 + line_len
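The second suggestion can be checked with a small arithmetic sketch. It assumes, as the description states, that token_len = max_token_len - cover_content, and additionally assumes that curr_len is compared against token_len when deciding whether the next line still fits. The concrete numbers and the helper `fits` are hypothetical and not taken from the original code.

```python
max_token_len = 600
cover_content = 150
token_len = max_token_len - cover_content  # 450: budget left after reserving the overlap


def fits(curr_len: int, line_len: int, budget: int) -> bool:
    """Hypothetical check: does another line fit into the current chunk?"""
    return curr_len + line_len <= budget


cover_tokens = cover_content  # overlap carried over from the previous chunk
line_len = 200                # tokens in the first line of the new chunk
next_line_len = 200           # tokens in the following line

# Accounting quoted in the issue: the overlap is counted again inside curr_len.
curr_len_with_cover = cover_tokens + 1 + line_len  # 351
# Suggested accounting: the overlap is already reserved by token_len.
curr_len_without_cover = 1 + line_len              # 201

print(fits(curr_len_with_cover, next_line_len, token_len))     # False (551 > 450)
print(fits(curr_len_without_cover, next_line_len, token_len))  # True (401 <= 450)

# Real chunk size under the suggested accounting, overlap included.
total_tokens = cover_tokens + curr_len_without_cover + next_line_len
print(total_tokens)  # 551
```

Under these assumptions, counting the overlap in curr_len closes the chunk early, while the suggested accounting still keeps the full chunk (551 tokens here) within max_token_len. If the original code instead compares curr_len against max_token_len, the existing formula would be the consistent one, so the comparison target is worth confirming before applying the change.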
Verification
- This issue hasn't been reported before