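DeepSeek Blackwell Optimization

A next-generation AI architecture designed for large language models, delivering unprecedented performance and efficiency for the DeepSeek-R1-FP4 model.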

API Usage Example

Use this short Python script to deploy DeepSeek-R1-FP4 with TensorRT-LLM:

from tensorrt_llm import SamplingParams
from tensorrt_llm._torch import LLM

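def main():
    # Example prompts to complete.
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    # Cap each completion at 32 generated tokens.
    sampling_params = SamplingParams(max_tokens=32)

    # Load the FP4 checkpoint with tensor parallelism across 8 GPUs.
    llm = LLM(model="nvidia/DeepSeek-R1-FP4", tensor_parallel_size=8, enable_attention_dp=True)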

    outputs = llm.generate(prompts, sampling_params)

    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


# The entry point of the program needs to be protected for spawning processes.
if __name__ == '__main__':
    main()

Note: This example requires 8 B200 GPUs, and TensorRT-LLM must be built from the latest main branch.
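Because the example only runs with 8 B200 GPUs, a quick pre-flight check can save a failed launch. The following is a minimal sketch, not part of TensorRT-LLM: the helper names (`parse_gpu_names`, `has_enough_gpus`, `query_gpu_names`) are hypothetical, and it assumes that `nvidia-smi` is on the PATH and that the reported device name contains the substring "B200".

```python
import shutil
import subprocess

REQUIRED_GPUS = 8       # GPU count stated in the note above
EXPECTED_NAME = "B200"  # assumption: substring of the name nvidia-smi reports


def parse_gpu_names(smi_output: str) -> list[str]:
    """Split `nvidia-smi --query-gpu=name --format=csv,noheader` output into names."""
    return [line.strip() for line in smi_output.splitlines() if line.strip()]


def has_enough_gpus(names: list[str], required: int = REQUIRED_GPUS,
                    expected: str = EXPECTED_NAME) -> bool:
    """True when at least `required` GPUs whose name contains `expected` are present."""
    return sum(expected in name for name in names) >= required


def query_gpu_names() -> list[str]:
    """Ask nvidia-smi for GPU names; returns [] if the tool is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    return parse_gpu_names(result.stdout) if result.returncode == 0 else []


if __name__ == "__main__":
    names = query_gpu_names()
    print(f"Found {len(names)} GPU(s): {names}")
    print("Ready for the example" if has_enough_gpus(names) else "Not enough B200 GPUs")
```

Keeping the parsing separate from the `nvidia-smi` call makes the check easy to test on a machine without GPUs.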

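Key Features

DeepSeek-R1-FP4 Model: A quantized version of DeepSeek AI's R1 model optimized for the NVIDIA Blackwell architecture, reducing the number of bits per parameter from 8 to 4 while maintaining performance.
TensorRT-LLM Optimization: Uses NVIDIA's TensorRT-LLM for high-performance inference, enabling efficient deployment on Blackwell GPUs with lower memory requirements.
128K Context Length: Supports an extended context length of up to 128K tokens, enabling coherent analysis of long documents and conversations.
1.6x Memory Reduction: Compared with the 8-bit model, FP4 quantization reduces disk size and GPU memory requirements by approximately 1.6x, enabling more efficient deployment.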
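Halving every parameter from 8 to 4 bits would give a 2x reduction, so the quoted figure of about 1.6x implies that the saving is not applied uniformly. The text does not say why. One simplified way to reason about it, assuming that only a fraction of the parameters is quantized and ignoring overhead such as scaling factors, is sketched below. The function name and the fraction values are illustrative assumptions, not figures from the model's documentation.

```python
def effective_reduction(frac_quantized: float,
                        bits_before: int = 8,
                        bits_after: int = 4) -> float:
    """Overall size reduction when only part of the parameters is quantized.

    frac_quantized is the share of parameters (0.0 to 1.0) that drops from
    bits_before to bits_after; the rest keeps its original width.
    """
    if not 0.0 <= frac_quantized <= 1.0:
        raise ValueError("frac_quantized must be between 0 and 1")
    # Relative size after quantization: untouched part + shrunken part.
    remaining = (1.0 - frac_quantized) + frac_quantized * bits_after / bits_before
    return 1.0 / remaining


if __name__ == "__main__":
    for frac in (1.0, 0.75, 0.5):  # illustrative fractions only
        print(f"{frac:.0%} quantized -> {effective_reduction(frac):.2f}x smaller")
```

Under this simplified model, quantizing every parameter yields 2x, while a quantized share of 75% yields 1.6x.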

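FAQ

What is NVIDIA Blackwell? NVIDIA Blackwell is a next-generation AI architecture designed to deliver unprecedented performance and efficiency for large language models and other AI workloads. It is the hardware platform that supports the DeepSeek-R1-FP4 model.
What is DeepSeek-R1-FP4? DeepSeek-R1-FP4 is a quantized version of DeepSeek AI's R1 model optimized for the NVIDIA Blackwell architecture. It uses FP4 quantization to reduce memory requirements while maintaining high performance on inference tasks.
Why use FP4 quantization? FP4 quantization reduces the number of bits per parameter from 8 to 4, which cuts disk size and GPU memory requirements by approximately 1.6x. This makes deploying large language models more efficient without significantly degrading performance.
How do I deploy the DeepSeek-R1-FP4 model? The model can be deployed on NVIDIA Blackwell GPUs using TensorRT-LLM. Sample deployment code is available on the Hugging Face model page. You need 8 B200 GPUs, and TensorRT-LLM must be built from the latest main branch.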