Which version should training a vertical-domain LLM be based on? #177
We have collected vertical-domain pre-training data and instruction data (mixed with general-purpose data). Should we do the second-stage development on tigerbot-base or tigerbot-chat? Most people seem to do a second round of pre-training (PT) and SFT on base, but we don't want to waste the data that went into training the chat version. Which version trains to better results?

Comments

For domains where the data volume is not large, we start from chat.

Mr. Chen, why start from chat when the data volume is not large?
For example, TigerBot's pre-training used 2.5 TB of token data. If the new data is on the order of 1% of that, it is not large; if it is 20% or more, it is large. The advantage of starting from chat is that the model keeps its general question-answering and instruction-following abilities; if you start from base, the earlier chat tuning would have to be redone. What unit is 10B — 10 billion tokens?
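A minimal sketch of the rule of thumb described in this comment, assuming the 1% and 20% thresholds stated above; the function name and the handling of the in-between range are illustrative, not anything from the TigerBot codebase:

```python
def choose_checkpoint(new_tokens: float, pretrain_tokens: float) -> str:
    """Rule of thumb from this thread: compare the new-domain token count
    to the tokens used in the original (continued) pre-training run.

    ~1% of the original scale   -> "not large", start from the chat model
    >=20% of the original scale -> "large", start from the base model
    In between it is a judgment call; closer to 20% favors base.
    """
    ratio = new_tokens / pretrain_tokens
    if ratio >= 0.20:
        return "base"   # enough data to redo the chat tuning afterwards
    return "chat"       # keep the existing QA / instruction-following ability

# The numbers discussed in this issue: 10B new-domain tokens vs the 300B
# tokens of TigerBot-13B-base's continued pre-training on Llama-2-13B.
print(choose_checkpoint(10e9, 300e9))  # ratio ~3.3% -> "chat"
```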
Yes, 10B means 10 billion tokens. I saw in the wiki that "TigerBot-13B-base: continued pre-training of 300B tokens on top of Llama-2-13B". 10/300 < 20%, so we will try starting from chat.
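For concreteness, a minimal sketch of what continued pre-training from the chat checkpoint could look like with Hugging Face transformers, assuming a small learning rate to preserve the chat behavior. The model ID ("TigerResearch/tigerbot-13b-chat"), the corpus file, and all hyperparameters are illustrative assumptions, not TigerBot's official training recipe:

```python
# Sketch: continued pre-training (causal LM loss) from a chat checkpoint.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_id = "TigerResearch/tigerbot-13b-chat"  # assumed Hugging Face ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:               # Llama tokenizers lack a pad token
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)

# Plain-text domain corpus; mixing in some general data (as the issue
# describes) helps limit catastrophic forgetting.
dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = dataset["train"].map(tokenize, batched=True,
                                 remove_columns=["text"])

args = TrainingArguments(
    output_dir="tigerbot-domain-pt",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=32,
    learning_rate=1e-5,          # small LR to avoid washing out chat tuning
    num_train_epochs=1,
    bf16=True,
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Note that a 13B model will not fit on a single typical GPU with a full-parameter setup like this; in practice you would add DeepSpeed/FSDP sharding or a parameter-efficient method such as LoRA. The sketch only shows the data flow and the choice of starting checkpoint.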