A Brief Look at the MoE Model Family

By crabboss, August 19, 2024

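Sparsity in MoE:

1 In a traditional dense model, every parameter takes part in every forward pass;
2 Sparsity lets only a subset of the system's parameters take part in the computation for a given token;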

Benefits of MoE:

1 Pre-training is faster;
2 Fewer parameters are active at inference time, so inference is faster;

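Drawbacks of MoE:

1 It needs more GPU memory, because every expert has to be loaded even though only a few run per token;
2 Pre-training and fine-tuning are harder to get to converge;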
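
A rough way to see both the speed benefit and the memory cost is to count parameters. The sketch below uses made-up sizes (8 experts of 100M parameters each, 2 active per token); `moe_ffn_params` is a hypothetical helper for illustration, not part of any library.

```python
def moe_ffn_params(num_experts: int, active_experts: int, params_per_expert: int):
    """Return (stored, active) FFN parameter counts for a single MoE layer."""
    stored = num_experts * params_per_expert      # must all sit in GPU memory
    active = active_experts * params_per_expert   # actually used for one token
    return stored, active

# Hypothetical sizes, chosen only to make the arithmetic easy to follow.
stored, active = moe_ffn_params(num_experts=8, active_experts=2,
                                params_per_expert=100_000_000)
print(f"stored: {stored:,}  used per token: {active:,}")
```

Memory scales with the stored count, while per-token compute scales with the active count, which is why an MoE can be fast at inference and still need a lot of memory.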

Key components of MoE:

1 The gating network
2 The number of experts
3 Load balancing

1 Mixtral 8x7B

1 The model has 8 experts and activates 2 of them for each token;
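
To make "8 experts, 2 active" concrete, here is a minimal pure-Python sketch of top-2 routing. It assumes the common formulation in which the softmax is taken over the selected experts' router scores only. The experts are toy stand-ins (expert i just scales its input), the router scores are random, and all function names are hypothetical.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax over just those scores."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([router_logits[i] for i in top])
    return top, weights

def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the k selected experts' outputs; the others never run."""
    top, weights = route_top_k(router_logits, k)
    out = [0.0] * len(x)
    for idx, w in zip(top, weights):
        y = experts[idx](x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, top, weights

# Toy stand-ins for the expert FFNs: expert i scales its input by (i + 1).
experts = [lambda v, s=i + 1: [s * vi for vi in v] for i in range(8)]

random.seed(0)
x = [1.0, -2.0, 0.5]
router_logits = [random.gauss(0.0, 1.0) for _ in range(8)]  # hypothetical scores
y, chosen, weights = moe_forward(x, experts, router_logits, k=2)
print("experts used:", chosen, "weights:", [round(w, 3) for w in weights])
```

In a real model the router scores come from a learned linear layer applied to the token's hidden state, and routing is done independently for every token in every MoE layer.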

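2 DeepSeek-MoE

1 DeepSeek-MoE 16B-2.7B performs roughly on par with LLaMA2 7B and DeepSeek 7B;
2 fine-grained expert segmentation: with the activated parameter count held constant, split the experts into smaller ones and activate more of them, e.g. 2 of 16 becomes 8 of 64, which raises the number of possible expert combinations from 120 to 4,426,165,368;
3 shared experts isolation: set aside some shared experts that are activated on every forward pass;
4 64 experts, of which 2 are shared experts and 6 more are selected by the router;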
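
The combination counts in the fine-grained segmentation example can be checked directly with binomial coefficients. This small sketch assumes each original expert is split into 4 equal parts, which is what keeps the activated parameter count unchanged.

```python
import math

# Ways to choose the activated experts before and after splitting.
coarse = math.comb(16, 2)   # 16 experts, 2 active
fine = math.comb(64, 8)     # each expert split in 4: 64 experts, 8 active
print(coarse, fine)         # 120 4426165368

# Activated capacity is unchanged: 2 full experts == 8 quarter-size experts.
assert 2 * 1.0 == 8 * 0.25
```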

3 Qwen1.5-MoE

1 fine-grained experts: 64 experts;
2 shared experts isolation: 4 shared experts, plus 4 activated out of the other 60 experts, for 8 active experts in total;
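
Shared experts isolation can be sketched as follows: the shared experts bypass the router entirely, and only the remaining experts compete for the top-k slots. This is a simplified illustration using the 4 shared + 4 of 60 layout above. The toy experts, the router scores, and the choice to renormalize weights over the selected experts are all assumptions, not the actual Qwen or DeepSeek implementation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def shared_plus_routed(x, shared, routed, router_logits, k):
    """Run every shared expert, then add the top-k routed experts."""
    out = [0.0] * len(x)
    for expert in shared:                       # always active, no routing
        out = [o + yi for o, yi in zip(out, expert(x))]
    order = sorted(range(len(routed)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([router_logits[i] for i in top])
    for idx, w in zip(top, weights):            # only k routed experts run
        out = [o + w * yi for o, yi in zip(out, routed[idx](x))]
    return out, top

# Toy experts: shared ones are identity maps, routed expert i scales by i.
shared = [lambda v: list(v) for _ in range(4)]
routed = [lambda v, s=i: [s * vi for vi in v] for i in range(60)]

# Hypothetical router scores that favour experts 10, 20, 30 and 40.
router_logits = [0.0] * 60
for i in (10, 20, 30, 40):
    router_logits[i] = 3.0

k = 4
y, top = shared_plus_routed([1.0], shared, routed, router_logits, k)
print("routed experts:", top, "active per token:", len(shared) + k)
```

Because the shared experts see every token, they can absorb common knowledge, leaving the routed experts free to specialize.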

4 DeepSeek V2

1 fine-grained experts: 160 experts;
2 shared experts isolation: 2 shared experts, plus 6 activated out of the other 160 experts, for 8 active experts in total;
3 MLA (Multi-head Latent Attention);
4 Decoupled RoPE;

5 Mixtral 8x22B

Same architecture as Mixtral 8x7B;

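6 Load balancing

The load-balancing losses from GShard and Switch Transformer are the ones used most often.

1 Keep every expert roughly equally likely to be selected;
2 Keep the number of tokens handled by each expert roughly the same;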
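
As a sketch of how those two goals turn into a loss, the function below follows the Switch Transformer style auxiliary loss, alpha * N * sum(f_i * P_i), where f_i is the fraction of tokens whose top-1 expert is i (goal 2) and P_i is the mean router probability assigned to expert i (goal 1). The function name and example inputs are hypothetical, and alpha = 0.01 is only a typical setting.

```python
def load_balancing_loss(router_probs, alpha=0.01):
    """router_probs: one list of expert probabilities per token (each sums to 1)."""
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])

    # f[i]: fraction of tokens whose highest-probability expert is i.
    f = [0.0] * num_experts
    for probs in router_probs:
        best = max(range(num_experts), key=lambda i: probs[i])
        f[best] += 1.0 / num_tokens

    # p[i]: average probability the router gives to expert i.
    p = [sum(probs[i] for probs in router_probs) / num_tokens
         for i in range(num_experts)]

    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Balanced routing: each of 4 tokens prefers a different expert.
balanced = [[0.7, 0.1, 0.1, 0.1],
            [0.1, 0.7, 0.1, 0.1],
            [0.1, 0.1, 0.7, 0.1],
            [0.1, 0.1, 0.1, 0.7]]
# Collapsed routing: every token prefers expert 0.
collapsed = [[0.7, 0.1, 0.1, 0.1] for _ in range(4)]

loss_balanced = load_balancing_loss(balanced)
loss_collapsed = load_balancing_loss(collapsed)
print(loss_balanced < loss_collapsed)  # True
```

The loss is smallest when both the token counts and the router probabilities are spread evenly, so adding it to the training objective pushes the router away from sending everything to a few experts.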

