[Issue] Chapter 7.2: document chunking code gen_chunk #128

@casm1

Description

1. Affected Chapter

Chapter 7.0

2. Problem Description

1. In the get_chunk function, when the overlap content is added:

prev_chunk = chunk_text[-1]
cover_part = prev_chunk[-cover_content:] if len(prev_chunk) > cover_content else prev_chunk
chunk_part = cover_part + chunk_part

len(prev_chunk) counts the characters in the chunk text, whereas cover_content is the number of tokens to overlap. Isn't it incorrect to compare the two directly?
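To make the mismatch concrete, here is a minimal sketch. `ToyEncoder` is a hypothetical stand-in for `enc` (one token per whitespace-delimited word) and the sample text and `cover_content = 3` are made up for illustration; a real tokenizer will give different counts, but the unit confusion is the same.

```python
import re


class ToyEncoder:
    """Hypothetical stand-in for `enc`: one token per whitespace-delimited word."""

    def __init__(self):
        self._ids = {}     # piece -> token id
        self._pieces = []  # token id -> piece

    def encode(self, text):
        ids = []
        # Each piece is a word plus its trailing whitespace, so that
        # concatenating the pieces reproduces the text exactly.
        for piece in re.findall(r"\S+\s*|\s+", text):
            if piece not in self._ids:
                self._ids[piece] = len(self._pieces)
                self._pieces.append(piece)
            ids.append(self._ids[piece])
        return ids

    def decode(self, ids):
        return "".join(self._pieces[i] for i in ids)


enc = ToyEncoder()
prev_chunk = "tokenization splits text into variable length pieces"
cover_content = 3  # intended overlap, measured in tokens

# Character-based slice, as in the current code: keeps 3 characters.
char_cover = prev_chunk[-cover_content:] if len(prev_chunk) > cover_content else prev_chunk

# Token-based slice, as in the suggested change: keeps 3 tokens.
tokens = enc.encode(prev_chunk)
token_cover = enc.decode(tokens[-cover_content:] if len(tokens) > cover_content else tokens)

print(repr(char_cover), len(enc.encode(char_cover)))    # 'ces' 1
print(repr(token_cover), len(enc.encode(token_cover)))  # 'variable length pieces' 3
```

With the character-based slice the overlap is only a fragment of the last word, far short of the three tokens that were requested.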

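2. Also in the get_chunk function, line 124 of the code computes curr_len = len(enc.encode(cover_part)) + 1 + line_len. Since token_len already has the overlap token count subtracted, shouldn't curr_len be computed without adding the overlap token count len(enc.encode(cover_part))?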

3. Reproduction Materials

1.

prev_chunk = chunk_text[-1]
cover_part = prev_chunk[-cover_content:] if len(prev_chunk) > cover_content else prev_chunk
chunk_part = cover_part + chunk_part

Suggested change:

prev_chunk = chunk_text[-1]
prev_chunk_tokens = enc.encode(prev_chunk)
cover_part_tokens = prev_chunk_tokens[-cover_content:] if len(prev_chunk_tokens) > cover_content else prev_chunk_tokens
cover_part = enc.decode(cover_part_tokens)
chunk_part = cover_part + chunk_part
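Two edge cases of the token-based slice may be worth checking before adopting it. The sketch below wraps the suggested change in a hypothetical helper, `token_overlap`, and uses raw UTF-8 bytes as a stand-in tokenizer (`byte_encode` and `byte_decode` are assumptions, not the project's `enc`). It shows that `tokens[-0:]` returns the whole list rather than an empty one, and that a slice boundary can fall inside a multi-byte character. Byte-level BPE tokenizers may behave similarly on Chinese text, but that should be verified against the actual `enc`.

```python
def byte_encode(text):
    """Stand-in tokenizer (assumption): one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


def byte_decode(ids):
    """Inverse of byte_encode; incomplete characters become U+FFFD."""
    return bytes(ids).decode("utf-8", errors="replace")


def token_overlap(prev_chunk, cover_content, encode, decode):
    """Return the last `cover_content` tokens of `prev_chunk`, decoded to text."""
    if cover_content <= 0:
        # tokens[-0:] is the same as tokens[0:], i.e. the whole list,
        # so a zero overlap has to be handled explicitly.
        return ""
    tokens = encode(prev_chunk)
    # Slicing past the start of a list is safe in Python, so no length
    # check is needed when the chunk is shorter than the overlap.
    return decode(tokens[-cover_content:])


print(repr(token_overlap("abc", 10, byte_encode, byte_decode)))   # 'abc'
print(repr(token_overlap("abc", 0, byte_encode, byte_decode)))    # ''
# The last 4 bytes cover all of "块" plus the final byte of "分".
print(repr(token_overlap("abc分块", 4, byte_encode, byte_decode)))  # '\ufffd块'
```

If the replacement character matters for retrieval quality, one option is to trim the decoded overlap to the nearest clean character or line boundary.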

2.
curr_len = len(enc.encode(cover_part)) + 1 + line_len
Suggested change:
curr_len = 1 + line_len
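The effect of the two formulas can be checked with simple arithmetic. The sketch below assumes that curr_len is compared against token_len, that token_len equals the maximum chunk size minus cover_content, and that the overlap is exactly cover_content tokens. The values 600 and 150 and the names `max_token_len` and `room_for_new_lines` are hypothetical, chosen only for illustration.

```python
max_token_len = 600  # hypothetical maximum chunk size, in tokens
cover_content = 150  # hypothetical overlap size, in tokens
token_len = max_token_len - cover_content  # budget for new content: 450


def room_for_new_lines(count_overlap):
    """Tokens of new content that fit in a chunk that starts with an overlap."""
    overlap_cost = cover_content if count_overlap else 0
    return token_len - (overlap_cost + 1)  # +1 for the newline, as in curr_len


current = room_for_new_lines(count_overlap=True)     # overlap counted in curr_len
suggested = room_for_new_lines(count_overlap=False)  # overlap not counted

print(current, suggested)             # 299 449
# Worst-case size of the stored chunk once the overlap is prepended.
print(cover_content + 1 + current)    # 450
print(cover_content + 1 + suggested)  # 600
```

Under these assumptions, counting the overlap in curr_len reserves room for it twice and caps chunks at 450 tokens, while the suggested formula lets them reach the full 600 without exceeding it.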

Verification

  • This issue hasn't been reported before
