DeepSeek V3 is an advanced open-weight large language model (LLM) from China that, thanks to its Mixture of Experts (MoE) 🏭 architecture, is remarkably efficient and cost-conscious. Although it contains a total of 671 billion parameters, only 37 billion of these are active during processing. This results in an excellent balance between computing power and resource savings.
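To make the "only a fraction of parameters is active" idea concrete, here is a minimal toy sketch of MoE-style top-k routing. The expert count, affinity scores, and `route_token` helper are illustrative assumptions, not DeepSeek V3's actual configuration:

```python
# Hypothetical sketch of Mixture-of-Experts top-k routing: per token, only
# a small subset of experts (and thus parameters) is activated.
# All sizes here are toy values, not DeepSeek V3's real configuration.

def route_token(scores, k=2):
    """Return the indices of the k experts with the highest affinity scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

# Toy example: 8 experts, activate 2 per token.
affinity = [0.1, 0.8, 0.05, 0.3, 0.9, 0.2, 0.02, 0.4]
active = route_token(affinity, k=2)
active_fraction = len(active) / len(affinity)  # here 25% of experts fire
```

In the real model the ratio is far more extreme (37B of 671B parameters), but the mechanism is the same: a router scores experts per token and only the top-scoring ones run.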
Technical innovations such as Multi-Head Latent Attention (MLA) 🧠, FP8 mixed precision ⚡ and multi-token prediction further strengthen the model. Here are some highlights:
- Multi-Head Latent Attention (MLA) 🧩
DeepSeek V3 introduces MLA to optimize the attention mechanism. By compressing the attention keys and values to a lower dimension via down-projection and up-projection matrices, memory usage during inference is significantly reduced, while performance remains comparable to standard Multi-Head Attention. In addition, MLA applies Rotary Positional Embedding (RoPE) to encode positional information. In its Feed-Forward Networks (FFNs), DeepSeek V3 uses the DeepSeekMoE architecture, which selects experts based on token-to-expert affinity scores, promoting a balanced expert distribution without auxiliary loss functions.
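The low-rank compression idea behind MLA can be sketched in a few lines. The matrices and dimensions below are toy assumptions (not the model's actual weights); the point is that only the small latent vector needs to be cached during inference:

```python
# Minimal sketch of MLA-style low-rank key-value compression: a hidden
# vector is down-projected to a small latent vector (which is what gets
# cached), then up-projected when keys/values are needed.
# Toy dimensions and hand-picked weights, purely for illustration.

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

d_model, d_latent = 4, 2  # in practice d_latent is much smaller than d_model

# Toy down-projection (d_latent x d_model) and up-projection (d_model x d_latent).
W_down = [[0.5, 0.0, 0.5, 0.0],
          [0.0, 0.5, 0.0, 0.5]]
W_up = [[1.0, 0.0],
        [0.0, 1.0],
        [1.0, 0.0],
        [0.0, 1.0]]

hidden = [1.0, 2.0, 3.0, 4.0]
latent = matvec(W_down, hidden)    # only these 2 numbers go into the KV cache
restored = matvec(W_up, latent)    # keys/values reconstructed on demand
```

The memory win comes from the cache holding `d_latent` values per token instead of full-size keys and values.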
- FP8 Mixed Precision ⚙️
Enables the model to be trained with 8-bit floating-point precision, increasing efficiency. The DeepSeek team developed innovative load-balancing strategies and algorithmic improvements to overcome the computational limitations of H800 GPUs.
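A rough illustration of the low-precision principle: values are stored in a compact 8-bit representation with a shared scale factor and dequantized for computation. Real FP8 training uses e4m3/e5m2 floating-point formats with fine-grained scaling; this integer-based sketch is an assumption-laden stand-in that only shows the quantize/dequantize round trip and its small error:

```python
# Simplified quantize/dequantize round trip to illustrate low-precision
# storage. NOTE: real FP8 is an 8-bit *floating-point* format (e4m3/e5m2);
# this sketch uses signed 8-bit integers with one shared scale instead.

def quantize(values, max_code=127):
    """Map values onto signed 8-bit codes with a shared scale factor."""
    scale = max(abs(v) for v in values) / max_code
    codes = [round(v / scale) for v in values]  # each fits in 8 bits (signed)
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate full-precision values from the 8-bit codes."""
    return [c * scale for c in codes]

weights = [0.02, -0.5, 0.25, 1.27]
codes, scale = quantize(weights)
approx = dequantize(codes, scale)
max_error = max(abs(a - w) for a, w in zip(approx, weights))
```

The trade-off is exactly the one the section describes: each stored value takes a quarter of the memory of a 32-bit float, at the cost of a small, bounded precision loss.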
- Multi-Token Prediction 🔗
Improves coherence and contextual relevance when generating longer texts and complex output.
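The efficiency angle of multi-token prediction can be sketched with a toy generator that proposes several future tokens per step instead of one. The "model" below is a hypothetical stand-in that just continues a counting sequence; it only illustrates how predicting k tokens at once shortens the generation loop:

```python
# Toy sketch of multi-token prediction: each step proposes k future tokens
# instead of one. The predictor here is a dummy that continues a count;
# a real model would produce k token distributions per forward pass.

def predict_next_tokens(context, k=2):
    """Hypothetical stand-in model: predict the next k tokens of a count."""
    last = context[-1]
    return [last + i for i in range(1, k + 1)]

def generate(context, length, k=2):
    """Extend the context to the target length, k tokens per step."""
    steps = 0
    while len(context) < length:
        context = context + predict_next_tokens(context, k)
        steps += 1
    return context[:length], steps

tokens, steps = generate([1, 2, 3], length=9, k=2)
# With k=2, the 6 new tokens take 3 steps instead of 6.
```

During training, the same multi-token objective gives the model a signal about several upcoming tokens at once, which is where the coherence benefit comes from.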
- Post-Training Enhancements
DeepSeek V3 additionally uses knowledge distilled from the DeepSeek R1 model, which is known for its strong reasoning ability. Training on synthetic data from R1 improves the reasoning quality of DeepSeek V3. Thus, DeepSeek V3 benefits from the advantages of advanced reasoning models without being a pure reasoning model itself.
DeepSeek V3 has shown strong results in benchmarks such as MMLU-Pro, MATH 500 and Codeforces, even outperforming models such as GPT-4o. In addition, the model offers very competitive API pricing 💰, making it accessible for a wide range of applications.
This model looks promising and the increasing competition in the AI market is encouraging companies to further innovate and be more cost efficient. The hope is that the new DeepSeek model will also comply with GDPR legislation, allowing organizations within the EU to use it safely and responsibly.
Want to know more about DeepSeek V3? Read the article by my colleague Phylicia van Wieringen, DeepSeek puts the AI world on edge, or check out deepseek.com to explore the functionalities and discover how this technology is driving further innovation and development within AI.