A Mixture of Experts (MoE) model is a neural network architecture that introduces a routing network and expert sub-networks in place of the original dense network.
Mixture of Experts
(混合专家模型)
Authors: Xin Zhao (赵鑫), Yiwen Hu (胡译文), Zhipeng Chen (陈志朋), Ji-Rong Wen (文继荣)
InfoBox:
Chinese name: 混合专家模型
English name: Mixture of Experts
Discipline: Large Language Models
Definition: A Mixture of Experts (MoE) model is a neural network architecture that introduces a routing network and expert sub-networks in place of the original dense network. During inference, the MoE architecture uses the routing network to select which expert sub-networks to activate each time, so that only a subset of experts is activated to complete a given task. Owing to this sparse activation mechanism, an MoE model greatly reduces the computational cost of training and inference compared with a dense model of comparable performance, making it possible to scale up model size under a given compute budget.
English definition: A Mixture of Experts (MoE) model is a type of neural network architecture that introduces a routing network and multiple expert sub-networks to replace the traditional dense network. During inference, the MoE architecture uses the routing network to select which expert sub-networks to activate for each input, enabling only a subset of experts to process the task. Because of this sparse activation mechanism, MoE models significantly reduce the computational cost of training and inference compared to dense models with similar performance, making it possible to scale up model size under a given compute budget.
Chinese keywords: 神经网络、模型架构、高效训练、高效推理
English keywords: Neural Network, Network Architecture, Efficient Training, Efficient Inference
Basic Definition:
The idea of mixture of experts predates the Transformer. In 1991, researchers [1] proposed an architecture that divides a task among different sub-networks, aiming to reduce interference between different tasks. Later work [2] introduced the mixture-of-experts structure into LSTM networks, exploring its use in recurrent neural networks. With the rise of the Transformer architecture, mixture of experts has served as an efficient way to scale computation, supporting the growth of model parameter counts, and has gradually developed into a mainstream architecture for large language models [2,3,10,46,44].
Figure 1. An illustration of a mixture-of-experts model [10]
A Transformer is typically composed of alternating stacks of attention layers and feed-forward layers. In sequence modeling, the attention layer is mainly responsible for mixing information along the time dimension (time-mixing) and is the key component that enables in-context learning, while the feed-forward layer focuses on integrating information along the channel dimension (channel-mixing) and carries the core functions of feature extraction and memory formation [4]. Although both exhibit strong sparsity, the forms of that sparsity differ markedly because the two layers act along different dimensions. Specifically, the attention layer is sparse along the time dimension: a single token usually only needs to attend to a small number of relevant tokens in its context [5]. The feed-forward layer, by contrast, is sparse in the parameter dimension: it can be viewed as a combination of activated key-value memories, of which only a few memory slots are significantly activated [6,47]. This activation pattern leaves room for improvement in computational efficiency and memory utilization. Accordingly, existing work [3] exploits the sparse activation of feed-forward networks by expanding the traditional dense feed-forward network into a much larger structure and partitioning it into multiple selectively activated sub-networks, thereby constructing a sparse mixture-of-experts module. By controlling the number of activated experts, such a model can substantially increase its parameter count while keeping the training and inference computation roughly unchanged. Building on this technique, mixture-of-experts models such as Mixtral-8x7B [7] and DeepSeek-V3 [8] have achieved considerable success, and large models built on the mixture-of-experts architecture continue to emerge [24,25].
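To make the key-value memory view of the feed-forward layer concrete, the following minimal PyTorch sketch exposes the intermediate "memory coefficients" of a two-layer feed-forward block and measures what fraction of slots a token activates. The layer sizes and the ReLU activation are illustrative assumptions, not taken from any particular model.

```python
import torch
import torch.nn as nn

class KeyValueFFN(nn.Module):
    """A feed-forward layer viewed as key-value memory (cf. [6])."""
    def __init__(self, d_model=512, d_hidden=2048):
        super().__init__()
        self.keys = nn.Linear(d_model, d_hidden)    # each hidden unit acts as a "key"
        self.values = nn.Linear(d_hidden, d_model)  # its output weights act as the "value"
        self.act = nn.ReLU()

    def forward(self, x):
        coeffs = self.act(self.keys(x))   # memory coefficients for each of the d_hidden slots
        return self.values(coeffs), coeffs

ffn = KeyValueFFN()
x = torch.randn(4, 16, 512)               # (batch, sequence, d_model)
_, coeffs = ffn(x)
# Fraction of memory slots with non-zero coefficients; in trained models this fraction
# is typically small [6,47], which is what motivates sparse expert modules.
print(f"active slots: {(coeffs > 0).float().mean().item():.1%}")
```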
Definition
In the context of large models, a mixture-of-experts model specifically refers to a Transformer architecture in which the traditional feed-forward layers are replaced by mixture-of-experts modules. Such a module usually consists of two core components, a routing network (router) and expert networks (experts). The router is typically a linear layer combined with a non-linear mapping (e.g., a Softmax or Sigmoid function followed by a Top-K operation) and determines the expert assignment of each computation unit (e.g., each token); each expert is an independent feed-forward layer, and some studies use gated linear units (GLU) instead. Beyond the regular experts, existing work [9] also introduces a shared-expert mechanism: these designated experts process all inputs and are unaffected by routing decisions. Over the course of the evolution of mixture-of-experts models, researchers have proposed a variety of improvements, such as routing strategies based on expert choice [11], expert load limits based on a capacity factor [10], and parallelization schemes built on module-specific parallel strategies and communication compression [21].
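As a concrete illustration of the components described above, the sketch below implements a minimal top-k routed MoE layer with an optional shared expert in PyTorch. All sizes, the softmax router, and the gate renormalization are illustrative assumptions rather than the design of any particular model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """A single expert: an ordinary two-layer feed-forward network."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                                 nn.Linear(d_hidden, d_model))

    def forward(self, x):
        return self.net(x)

class MoELayer(nn.Module):
    """Minimal top-k routed mixture-of-experts layer with an optional shared expert."""
    def __init__(self, d_model=512, d_hidden=1024, n_experts=8, top_k=2, shared=True):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)          # routing network
        self.experts = nn.ModuleList([Expert(d_model, d_hidden) for _ in range(n_experts)])
        self.shared = Expert(d_model, d_hidden) if shared else None
        self.top_k = top_k

    def forward(self, x):                                     # x: (n_tokens, d_model)
        logits = self.router(x)                               # (n_tokens, n_experts)
        probs = F.softmax(logits, dim=-1)
        weights, idx = probs.topk(self.top_k, dim=-1)         # per-token expert selection
        weights = weights / weights.sum(dim=-1, keepdim=True) # renormalize gate weights
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                         # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        if self.shared is not None:                           # shared expert sees every token
            out = out + self.shared(x)
        return out

tokens = torch.randn(32, 512)
print(MoELayer()(tokens).shape)    # torch.Size([32, 512])
```

The double loop over experts is written for clarity; production implementations instead dispatch tokens to experts with batched gather/scatter or All-to-All operations, as discussed in the infrastructure part below.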
Notably, mixture-of-experts modules are not limited to replacing the feed-forward layers of a Transformer: some studies [38,39] extend the concept to attention layers, and in non-Transformer architectures (e.g., hybrid linear-attention models) the same enhancement can be obtained by replacing the feed-forward layers [24,25,40,41], demonstrating the generality and flexibility of the technique.
Development
Current research on mixture-of-experts models focuses mainly on two directions: regulating expert load and optimizing training and inference infrastructure. During training, an unbalanced assignment of tokens across experts significantly degrades both effectiveness and efficiency, so achieving a balanced training load across experts has become a central challenge in training MoE models. On the other hand, as parameter counts continue to grow, the sparse-activation property of MoE models offers theoretical advantages for parallel computation, yet many practical difficulties remain in adapting training and inference infrastructure. The mainstream technical progress in these two areas is reviewed below.
To address expert load imbalance, GShard [3] and Switch Transformers [10] discard the expert importance auxiliary loss proposed in [2] and keep only the expert balancing auxiliary loss. This loss penalizes the deviation of per-expert load within a micro-batch from an ideal uniform assignment, encouraging a more balanced distribution of expert activations. Because the Top-K function used to compute the expert load distribution is non-differentiable, optimization typically relies on an approximate differentiable surrogate, namely a formulation based on the normalized routing logits [12]. To further improve load balance, GShard also introduces mechanisms such as random routing and expert capacity to handle corner cases. In addition, ST-MoE [13] proposes the router z-loss, which constrains the numerical range of the routing logits and thereby mitigates imbalance at its source. Like other auxiliary losses, however, the z-loss brings limited improvement when its weight is small, while an overly large weight may interfere with the language modeling objective. To resolve this weight-tuning problem, [14] proposes an auxiliary-loss-free regularization method: an updatable bias term derived from expert-load statistics is added to the routing logits, and this term is not updated through back-propagation; [8] further validates the effectiveness of this approach on larger models. Beyond global load balancing, [8] also proposes a group-wise local load-balancing strategy to match the practical requirements of multi-node distributed inference. Recent work shows that the limited effectiveness of auxiliary losses stems not from the loss design itself but from the scope over which balance is enforced [15]. Existing methods usually compute the balancing loss within a micro-batch, whose limited amount of data may deviate substantially from the actual training distribution. [15] shows that simply extending the computation scope of the auxiliary loss from the micro-batch to the global batch (all samples updated in one back-propagation step) achieves results comparable to, or better than, auxiliary-loss-free methods.
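As an illustration of the expert balancing auxiliary loss discussed above, the sketch below implements the Switch Transformer [10] formulation under assumed shapes: the number of experts times the dot product between the fraction of tokens dispatched to each expert and the mean routing probability of each expert. The hard dispatch fractions carry no gradient, so optimization acts only through the softmax probabilities, which corresponds to the differentiable surrogate mentioned in the text.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, top_k: int = 2) -> torch.Tensor:
    """Switch-Transformer-style balancing loss over one (micro- or global) batch.

    router_logits: (n_tokens, n_experts) raw routing scores.
    Returns n_experts * sum_i f_i * P_i, minimized (value 1.0) when tokens
    are spread uniformly across experts.
    """
    n_experts = router_logits.size(-1)
    probs = F.softmax(router_logits, dim=-1)                   # P: mean routing probability
    top_idx = probs.topk(top_k, dim=-1).indices                # actual (hard) dispatch
    dispatch = F.one_hot(top_idx, n_experts).sum(dim=1).float()
    f = dispatch.mean(dim=0) / top_k                           # fraction of tokens per expert
    P = probs.mean(dim=0)
    return n_experts * torch.sum(f.detach() * P)               # gradients flow only through P

# Example: a skewed router yields a larger loss than a roughly uniform one.
skewed = torch.randn(1024, 8) + torch.tensor([3.0, 0, 0, 0, 0, 0, 0, 0])
uniform = torch.randn(1024, 8)
print(load_balancing_loss(skewed).item(), load_balancing_loss(uniform).item())
```

The auxiliary-loss-free strategy of [14] can be sketched in the same setting by adding a per-expert bias to the logits before the Top-K selection and nudging that bias up or down according to each expert's measured load, without back-propagating through it.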
On the training and inference infrastructure side, as MoE models keep growing, training them efficiently on large device clusters has become a key challenge. GShard designed a lightweight annotation API and XLA compiler extensions for expert parallelism and successfully trained a 600B-parameter MoE model on 2048 TPU v3 chips [3]. This parallel framework later evolved into the more general GSPMD [26], which supports efficient parallel training for a wide range of models. Subsequent work such as MegaBlocks [42] and FastMoE provides dedicated computation and communication kernels for MoE models. Modern large-model training frameworks (e.g., Megatron-LM and OpenMoE) offer out-of-the-box expert-parallel implementations [33-34, 43], greatly reducing the engineering effort of deployment. In addition, some studies focus on optimizing the communication efficiency of MoE models to further accelerate training and inference. [21] proposes a communication-efficient parallelization scheme that combines module-specific parallel strategies with communication compression, significantly improving compute utilization on clusters with thousands of accelerators. DeepEP [29] is a communication library for expert parallelism that supports efficient token-level All-to-All communication and is compatible with group-limited gating. [28] further designs a fine-grained communication-computation overlapping mechanism that reduces compute bubbles caused by routing communication and improves training efficiency. To address the inefficiency caused by the large number of kernel launches in conventional implementations, [27] proposes the FlashDMoE fused kernel, which fuses the end-to-end forward computation into a single kernel launch and markedly improves compute utilization at inference time. Other common inference-acceleration techniques include expert pruning [30,32] and expert-parallel load-balancing optimization [8].
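The core data movement behind expert parallelism is an exchange in which each device sends every token to the device hosting its assigned expert and later receives the result back. The single-process sketch below is purely illustrative: it mimics the dispatch/combine step by sorting tokens by expert id, processing each contiguous group, and scattering the outputs back to their original positions; real libraries such as DeepEP [29] implement this with dedicated All-to-All communication kernels.

```python
import torch

def dispatch_and_combine(x, expert_idx, experts):
    """Group tokens by assigned expert, run each group, then restore the original order.

    x: (n_tokens, d_model); expert_idx: (n_tokens,) top-1 assignment;
    experts: list of callables, one per expert.
    """
    order = torch.argsort(expert_idx)                 # dispatch: sort tokens by expert id
    counts = torch.bincount(expert_idx, minlength=len(experts)).tolist()
    grouped = x[order]
    outputs, start = [], 0
    for e, count in enumerate(counts):                # each expert sees one contiguous block
        if count:
            outputs.append(experts[e](grouped[start:start + count]))
        start += count
    combined = torch.cat(outputs) if outputs else grouped
    restored = torch.empty_like(combined)
    restored[order] = combined                        # combine: undo the permutation
    return restored

experts = [torch.nn.Linear(16, 16) for _ in range(4)]
x = torch.randn(10, 16)
idx = torch.randint(0, 4, (10,))
print(dispatch_and_combine(x, idx, experts).shape)   # torch.Size([10, 16])
```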
Beyond load balancing and infrastructure optimization, recent years have also seen many innovations in routing mechanisms and expert design. On the routing side, researchers have explored a variety of new strategies and architectures. [11] inverts the conventional routing paradigm: instead of tokens choosing experts, each expert actively selects the set of tokens it will process. This strategy, however, may leave some tokens unprocessed and carries a potential risk of causal leakage [14]. Other improvements include a layer-wise GRU-based dynamic routing mechanism [16] and designs with an adaptive number of experts [17-19], which further improve flexibility and efficiency. On the expert-design side, important progress has also been made: [31] observes widespread representational redundancy among experts; [9,22,23] further explore fine-grained expert partitioning, which, combined with the load-balancing and shared-expert schemes of [8,15], improves model performance while alleviating expert redundancy and load imbalance.
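A minimal sketch of the expert-choice selection step [11] follows, under assumed shapes and a simple capacity rule in which every expert picks the same fixed number of tokens; it illustrates only the selection, not a full training setup.

```python
import torch
import torch.nn.functional as F

def expert_choice_routing(x, router_weight, capacity):
    """Each expert selects its top-`capacity` tokens (expert choice, cf. [11]).

    x: (n_tokens, d_model); router_weight: (d_model, n_experts).
    Returns, per expert, the indices and gate weights of the chosen tokens.
    """
    scores = F.softmax(x @ router_weight, dim=-1)          # (n_tokens, n_experts)
    # Transpose so each expert ranks all tokens and keeps its top `capacity`.
    gates, token_idx = scores.t().topk(capacity, dim=-1)   # both (n_experts, capacity)
    return token_idx, gates

x = torch.randn(64, 32)
w = torch.randn(32, 8)
token_idx, gates = expert_choice_routing(x, w, capacity=16)
# A token may be picked by several experts or by none; selecting over a whole
# sequence is also the source of the causal-leakage concern noted in [14].
print(token_idx.shape, gates.shape)   # torch.Size([8, 16]) torch.Size([8, 16])
```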
Future Outlook
As the mixture-of-experts architecture continues to evolve, the research focus is gradually shifting from basic structural design and load balancing toward four key directions: fine-grained sparsity mechanisms, cross-module co-optimization, hardware-software co-design, and inference acceleration. For fine-grained sparse activation, the central challenge is to reduce the number of actually activated parameters while continuing to scale up the total parameter count. To this end, architectures such as UltraMem [35] and Memory Layers at Scale [36] introduce sparsely activated memory-layer modules that markedly reduce computational redundancy and memory consumption. Furthermore, cross-module co-optimization is trending toward fusing sparsity across multiple dimensions; a representative approach introduces sparse activation along the hidden dimension of the feed-forward network [37], another effective route to sparsifying the feed-forward layer. In hardware-software co-design, a new generation of super-nodes (SuperPods), such as the NVIDIA GB200 NVL72 and the Ascend CloudMatrix 384, offers high-speed interconnects and large memory capacity, opening new possibilities for architectural innovation. In addition, although open-source inference frameworks such as vLLM [48], SGLang [49], and LightLLM [20] keep iterating and improving [8,50], their practical efficiency still lags behind the proprietary systems developed by leading model companies such as OpenAI and DeepSeek, indicating that substantial room remains for optimizing large-scale inference in open-source frameworks.
References:
[1] R. A. Jacobs et al., “Adaptive Mixtures of Local Experts,” in Neural Computation, Mar. 1991, pp. 79–87.
[2] N. Shazeer et al., “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer,” in International Conference on Learning Representations, Feb. 2017.
[3] D. Lepikhin et al., “GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding,” in International Conference on Learning Representations, Oct. 2020.
[4] W. Yu et al., “MetaFormer is Actually What You Need for Vision,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2022, pp. 10809–10819.
[5] R. Child et al., “Generating Long Sequences with Sparse Transformers,” Apr. 2019, arXiv:1904.10509.
[6] M. Geva et al., “Transformer Feed-Forward Layers Are Key-Value Memories,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Nov. 2021, pp. 5484–5495.
[7] A. Q. Jiang et al., “Mixtral of Experts,” Jan. 2024, arXiv:2401.04088.
[8] DeepSeek-AI et al., “DeepSeek-V3 Technical Report,” Feb. 2025, arXiv:2412.19437.
[9] D. Dai et al., “DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models,” Jan. 2024, arXiv:2401.06066.
[10] W. Fedus et al., “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity,” in The Journal of Machine Learning Research, Jan. 2022, pp. 5232–5270.
[11] Y. Zhou et al., “Mixture-of-Experts with Expert Choice Routing,” in Proceedings of the 36th International Conference on Neural Information Processing Systems, Nov. 2022, pp. 7103–7114.
[12] J. Su, “MoE环游记:2、不患寡而患不均,” in Scientific Spaces, Feb. 2025.
[13] B. Zoph et al., “ST-MoE: Designing Stable and Transferable Sparse Expert Models,” Apr. 2022, arXiv:2202.08906.
[14] L. Wang et al., “Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts,” Oct. 2024, arXiv:2408.15664.
[15] Z. Qiu et al., “Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models,” Feb. 2025, arXiv:2501.11873.
[16] Z. Qiu et al., “Layerwise Recurrent Router for Mixture-of-Experts,” in The Thirteenth International Conference on Learning Representations, Oct. 2024.
[17] Z. Zeng et al., “AdaMoE: Token-Adaptive Routing with Null Experts for Mixture-of-Experts Language Models,” in Findings of the Association for Computational Linguistics: EMNLP 2024, Nov. 2024, pp. 6223–6235.
[18] Z. Wang et al., “ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing,” in The Thirteenth International Conference on Learning Representations, Oct. 2024.
[19] P. Jin et al., “MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts,” in The Thirteenth International Conference on Learning Representations, Oct. 2024.
[20] R. Gong et al., “Past-Future Scheduler for LLM Serving under SLA Guarantees,” in Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Mar. 2025, pp. 798–813.
[21] C. Jin et al., “MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production,” May 2025, arXiv:2505.11432.
[22] J. Li et al., “CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts,” in The Thirty-eighth Annual Conference on Neural Information Processing Systems, Nov. 2024, pp. 131224–131246.
[23] J. Ludziejewski et al., “Scaling Laws for Fine-Grained Mixture of Experts,” in Proceedings of the 41st International Conference on Machine Learning, Jul. 2024, pp. 33270–33288.
[24] MiniMax, “MiniMax-01: Scaling Foundation Models with Lightning Attention,” Jan. 2025, arXiv:2501.08313.
[25] Tencent Hunyuan Team, “Hunyuan-TurboS: Advancing Large Language Models through Mamba-Transformer Synergy and Adaptive Chain-of-Thought,” Jul. 2025, arXiv:2505.15431.
[26] Y. Xu et al., “GSPMD: General and Scalable Parallelization for ML Computation Graphs,” Dec. 2021, arXiv:2105.04663.
[27] O. J. Aimuyo et al., “FlashDMoE: Fast Distributed MoE in a Single Kernel,” Jun. 2025, arXiv:2506.04667.
[28] S. Zhang et al., “Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts,” Mar. 2025, arXiv:2502.19811.
[29] DeepSeek-AI, “DeepEP: an efficient expert-parallel communication library,” in GitHub, Feb. 2025.
[30] Z. Dong et al., “Domain-Specific Pruning of Large Mixture-of-Experts Models with Few-shot Demonstrations,” May 2025, arXiv:2504.06792.
[31] Z.-F. Gao et al., “Parameter-Efficient Mixture-of-Experts Architecture for Pre-trained Language Models,” in Proceedings of the 29th International Conference on Computational Linguistics, Oct. 2022, pp. 3263–3273.
[32] X. Lu et al., “Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, Aug. 2024, pp. 6159–6172.
[33] M. Shoeybi et al., “Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism,” Sep. 2019, arXiv:1909.08053.
[34] D. Narayanan et al., “Efficient large-scale language model training on GPU clusters using megatron-LM,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Nov. 2021, pp. 1–15.
[35] Z. Huang et al., “Ultra-Sparse Memory Network,” Feb. 2025, arXiv:2411.12364.
[36] V.-P. Berges et al., “Memory Layers at Scale,” Dec. 2024, arXiv:2412.09764.
[37] Y. Chen et al., “Mixture of Hidden-Dimensions Transformer,” Dec. 2024, arXiv:2412.05644.
[38] R. Csordás et al., “SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention,” in The Thirty-eighth Annual Conference on Neural Information Processing Systems, Nov. 2024, pp. 74411–74438.
[39] P. Jin et al., “MoH: Multi-Head Attention as Mixture-of-Head Attention,” Oct. 2024, arXiv:2410.11842.
[40] W. Sun et al., “Linear-MoE: Linear Sequence Modeling Meets Mixture-of-Experts,” in First Workshop on Scalable Optimization for Efficient and Adaptive Foundation Models, Mar. 2025.
[41] J. Du et al., “MoM: Linear Sequence Modeling with Mixture-of-Memories,” Feb. 2025, arXiv:2502.13685.
[42] T. Gale et al., “MegaBlocks: Efficient Sparse Training with Mixture-of-Experts,” in Proceedings of Machine Learning and Systems, Mar. 2023, pp. 288–304.
[43] F. Xue et al., “OpenMoE: an early effort on open mixture-of-experts language models,” in Proceedings of the 41st International Conference on Machine Learning, Jul. 2024, pp. 55625–55655.
[44] N. Du et al., “GLaM: Efficient Scaling of Language Models with Mixture-of-Experts,” in Proceedings of the 39th International Conference on Machine Learning, Jul. 2022, pp. 5547–5569.
[45] DeepSeek-AI, “Day 6: One More Thing, DeepSeek-V3/R1 Inference System Overview,” in GitHub, Mar. 2025.
[46] W. X. Zhao et al., “A Survey of Large Language Models,” Mar. 2025, arXiv:2303.18223.
[47] Z. Zhang et al., “ReLU² Wins: Discovering Efficient Activation Functions for Sparse LLMs,” Feb. 2024, arXiv:2402.03804.
[48] W. Kwon et al., “Efficient Memory Management for Large Language Model Serving with PagedAttention,” in Proceedings of the 29th Symposium on Operating Systems Principles, Oct. 2023, pp. 611–626.
[49] L. Zheng et al., “SGLang: efficient execution of structured language model programs,” in Proceedings of the 38th International Conference on Neural Information Processing Systems, Jun. 2025, pp. 62557–62583.
[50] DeepSeek-AI, “DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling,” in GitHub, Feb. 2025.
About the Authors
Xin Zhao (赵鑫)
Professor, Renmin University of China
Yiwen Hu (胡译文)
Ph.D. student, Renmin University of China
Zhipeng Chen (陈志朋)
Ph.D. student, Renmin University of China
Ji-Rong Wen (文继荣)
Professor, Renmin University of China
About the Committee on Terminology and the Terminology Platform:
The main functions of the CCF Committee on Terminology are to collect, translate, define, review, and recommend new computing terms, and to publicize them on CCF platforms. This work is important for clarifying the structure of the discipline, supporting scientific research, and disseminating science and knowledge throughout society. The construction and continuous improvement of the crowdsourced terminology platform CCFpedia effectively advances the collection, review, standardization, and dissemination of Chinese computing terminology, while also promoting standardized terminology across different fields. The new version of the CCFpedia terminology platform (http://term.ccf.org.cn) integrates term editing and browsing, removes the cumbersome cross-platform steps of the old version, and upgrades the interface so that users can look up terminology easily. In addition, the new platform organizes all terminology data with a knowledge graph, upgrading term browsing through multi-level graph associations.
