[Issue] Chapter 5: Pretraining model code and tokenizer training code #115

@Yapeng-Gao

Description

1. Affected Chapter

chapter5.2 chapter5.3

2. Problem Description

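Code issues
1. The logging is too sparse.
2. I have optimized the train_tokenizer.py code.
3. During pretraining, the vocab_size in kmodel is not aligned with the tokenizer. The tokenizer I trained has a vocabulary of 8192, so kmodel needs to use the same value.
4. ddp_pretrain.py has no warmup; I have optimized the code.
5. Training can be optimized with DeepSpeed. I used stage 2 + warmup and have complete code.
Content issues
Training in a conda virtual environment requires g++ to be installed.

I have optimized both train_tokenizer.py and ddp_pretrain.py. Could I be given permission to open a PR?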
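
For reference, the kind of warmup meant in item 4 can be sketched as below. This is only an illustration, not the code from ddp_pretrain.py or from my changes: it assumes linear warmup followed by cosine decay, and the function name `lr_at_step` and the `min_lr_ratio` parameter are hypothetical.

```python
import math


def lr_at_step(step, total_steps, base_lr, warmup_steps, min_lr_ratio=0.1):
    """Learning rate for a 0-indexed step: linear warmup, then cosine decay.

    Hypothetical helper for illustration; the decay floor is
    base_lr * min_lr_ratio.
    """
    # Linear warmup: ramp from base_lr / warmup_steps up to base_lr.
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps

    # Cosine decay from base_lr down to min_lr over the remaining steps.
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    min_lr = base_lr * min_lr_ratio
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Print a few sample points of the schedule.
    for s in (0, 50, 99, 100, 550, 1000):
        print(s, lr_at_step(s, total_steps=1000, base_lr=1e-3, warmup_steps=100))
```

In a training loop the returned value would be written into each optimizer parameter group before the optimizer step.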

3. Reproduction Materials

gcc: fatal error: cannot execute ‘cc1plus’: execvp: No such file or directory

g++ needs to be installed in the virtual environment.
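
A quick pre-flight check for this could look like the sketch below. It assumes the build step needs `gcc` and `g++` on the PATH of the active environment; the helper name `missing_tools` is hypothetical. Finding `g++` on PATH is only a proxy for a working C++ toolchain.

```python
import shutil


def missing_tools(names, which=shutil.which):
    """Return the tools from `names` that cannot be found on PATH.

    `which` is injectable so the check can be exercised without
    depending on the local machine.
    """
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    missing = missing_tools(["gcc", "g++"])
    if missing:
        print("Missing compilers:", ", ".join(missing))
    else:
        print("gcc and g++ are both available")
```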

vectorized_gather_kernel: Assertion ind >=0 && ind < ind_dim_size failed
This is an index-out-of-bounds error caused by vocab_size=6144 in kmodel. Aligning it with the tokenizer (8192) fixes it.
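
The mismatch can be caught before it reaches the GPU kernel. The sketch below is a plain-Python illustration: the function name `find_out_of_range_ids` is hypothetical, and in practice the token ids would come from the trained tokenizer and the vocabulary size from the model config.

```python
def find_out_of_range_ids(token_ids, model_vocab_size):
    """Return the token ids that an embedding table of the given size cannot index.

    Valid indices are 0 .. model_vocab_size - 1; anything else would
    trigger the gather assertion during the embedding lookup.
    """
    return [t for t in token_ids if t < 0 or t >= model_vocab_size]


if __name__ == "__main__":
    # Ids a tokenizer with an 8192-entry vocabulary could produce.
    sample_ids = [0, 6143, 6144, 8191]

    # Model built with vocab_size=6144: the last two ids are out of range.
    print(find_out_of_range_ids(sample_ids, 6144))

    # Model built with vocab_size=8192: every id is valid.
    print(find_out_of_range_ids(sample_ids, 8192))
```

Running a check like this on a batch before training makes the failure show up as a readable Python error rather than a device-side assertion.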

Verification

  • This issue hasn't been reported before

Labels: documentation (Improvements or additions to documentation)
