DeepSeek-V3

Released in December 2024, the model has 671 billion total parameters and was trained on 14.8 trillion tokens over approximately 55 days, at a reported cost of around $5.58 million (about 2.79 million H800 GPU-hours). Benchmark results indicate that DeepSeek-V3 outperforms open models such as Llama 3.1 405B and Qwen 2.5 72B, while matching the capabilities of closed models like GPT-4o and Claude 3.5 Sonnet. Architecturally, it is a Mixture-of-Experts (MoE) Transformer that uses Multi-head Latent Attention (MLA); each MoE layer contains 256 routed experts plus one shared expert, and only 37 billion parameters are activated per token, with 8 routed experts selected for each token.
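To make the routing idea concrete, here is a minimal PyTorch sketch of an MoE layer with one always-active shared expert and top-k routing over a pool of routed experts. This is an illustration of the general pattern, not DeepSeek's implementation: the hidden sizes are toy values, the per-token Python loop stands in for the batched expert dispatch a real system would use, MLA is omitted entirely, and the class name `ToyMoELayer` is hypothetical. Only the 256-routed / 1-shared / 8-active layout mirrors the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Illustrative MoE layer: one shared expert runs on every token,
    while a router picks top_k of n_routed experts per token.
    Sizes are toy values, far smaller than DeepSeek-V3's."""
    def __init__(self, d_model=64, d_ff=128, n_routed=256, top_k=8):
        super().__init__()
        def ffn():  # a small feed-forward block used for every expert
            return nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))
        self.shared = ffn()                          # processes every token
        self.experts = nn.ModuleList(ffn() for _ in range(n_routed))
        self.router = nn.Linear(d_model, n_routed)   # per-token gating scores
        self.top_k = top_k

    def forward(self, x):                            # x: (n_tokens, d_model)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)         # normalize over chosen experts
        routed = torch.zeros_like(x)
        for t in range(x.size(0)):                   # naive per-token dispatch
            for w, e in zip(weights[t], idx[t]):
                routed[t] += w * self.experts[int(e)](x[t])
        # Only the shared expert and top_k routed experts run per token:
        # this is how a huge total parameter count (671B in DeepSeek-V3)
        # yields a much smaller activated count (37B) per token.
        return self.shared(x) + routed

layer = ToyMoELayer()
print(layer(torch.randn(4, 64)).shape)  # torch.Size([4, 64])
```

The design trade-off the sketch highlights is that adding routed experts grows total capacity without increasing per-token compute, since each token still touches only the shared expert plus its top_k selections.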
