z-lab dflash


DFlash: Block Diffusion for Flash Speculative Decoding [Paper](link) | [Blog](link) | [Models](link)

DFlash is a lightweight block diffusion model designed for speculative decoding. It enables efficient, high-quality parallel drafting.
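For readers new to speculative decoding, the draft-and-verify loop can be sketched with toy stand-ins. This is an illustration under stated assumptions, not DFlash's implementation: `draft_block` and `target_next` are hypothetical callables, verification is greedy, and the target is queried token by token for clarity (a real engine would score the whole block in one forward pass).

```python
from typing import Callable, List


def speculative_step(
    prefix: List[int],
    draft_block: Callable[[List[int], int], List[int]],  # hypothetical: proposes k tokens at once
    target_next: Callable[[List[int]], int],             # hypothetical: target's greedy next token
    block_size: int,
) -> List[int]:
    """Return the tokens accepted in one draft-and-verify step (greedy case)."""
    proposal = draft_block(prefix, block_size)
    accepted: List[int] = []
    for tok in proposal:
        expected = target_next(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the target's own token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    # Every drafted token matched, so the target contributes one bonus token.
    accepted.append(target_next(prefix + accepted))
    return accepted


# Toy models: the target counts upward; the draft is right twice, then wrong.
def toy_target(seq: List[int]) -> int:
    return seq[-1] + 1


def toy_draft(seq: List[int], k: int) -> List[int]:
    return [seq[-1] + 1, seq[-1] + 2] + [0] * (k - 2)


step_tokens = speculative_step([1, 2, 3], toy_draft, toy_target, block_size=4)
print(step_tokens)  # [4, 5, 6]
```

In this toy run, two drafted tokens are accepted and the target supplies the third, so one step yields three tokens. The output always equals what the target alone would have produced under greedy decoding.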


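Supported Models

Feel free to open a GitHub issue to request support for additional models. We will also open-source the training recipe soon, so you can train your own DFlash draft model to accelerate any LLM.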

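📦 Installation

Use a separate virtual environment for each backend to avoid conflicts.

vLLM: vLLM v0.20.1+ includes core DFlash support. For most models, use the standard installation: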

uv pip install -e ".[vllm]"

Gemma4: DFlash currently requires our temporary vLLM build for Gemma4. Docker is recommended:

docker pull ghcr.io/z-lab/vllm-openai:gemma4-dflash-cu130

Source-install fallback for Gemma4:

uv pip install -U --torch-backend=auto \
  "vllm @ git+https://github.com/vllm-project/vllm.git@refs/pull/41703/head"

Newer non-Gemma4 draft models that use sliding-window attention (SWA) require the SWA support branch:

uv pip install -U --torch-backend=auto \
  "vllm @ git+https://github.com/vllm-project/vllm.git@refs/pull/40898/head"

🚀 Quick Start

vLLM

Gemma4 with Docker:

docker run --rm -it \
  --gpus all \
  --ipc=host \
  --shm-size=16g \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  ghcr.io/z-lab/vllm-openai:gemma4-dflash-cu130 \
  google/gemma-4-26B-A4B-it \
  --host 0.0.0.0 \
  --port 8000 \
  --speculative-config '{"method": "dflash", "model": "z-lab/gemma-4-26B-A4B-it-DFlash", "num_speculative_tokens": 15, "attention_backend": "flash_attn"}' \
  --attention-backend triton_attn \
  --max-num-batched-tokens 32768 \
  --trust-remote-code

Non-Gemma4 models:

vllm serve Qwen/Qwen3.5-27B \
  --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.5-27B-DFlash", "num_speculative_tokens": 15}' \
  --attention-backend flash_attn \
  --max-num-batched-tokens 32768
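Once the server is up, it can be queried through vLLM's OpenAI-compatible chat completions route. The sketch below uses only the Python standard library; the helper names `build_chat_request` and `send` are hypothetical, and the base URL assumes the server listens on `127.0.0.1:8000` as in the commands above.

```python
import json
import urllib.request


def build_chat_request(
    prompt: str,
    model: str = "Qwen/Qwen3.5-27B",
    base_url: str = "http://127.0.0.1:8000",  # assumption: local server on port 8000
    max_tokens: int = 256,
) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 120.0) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


req = build_chat_request("How many positive whole-number divisors does 196 have?")
print(req.full_url)
# With the server running: print(send(req))
```

Speculative decoding is transparent to the client, so the request is the same as for a server launched without `--speculative-config`.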

SGLang

export SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1
# Optional: enable schedule overlapping (experimental, may be unstable)
# export SGLANG_ENABLE_SPEC_V2=1
# export SGLANG_ENABLE_DFLASH_SPEC_V2=1
# export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1

python -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-35B-A3B \
  --speculative-algorithm DFLASH \
  --speculative-draft-model-path z-lab/Qwen3.5-35B-A3B-DFlash \
  --speculative-num-draft-tokens 16 \
  --tp-size 1 \
  --attention-backend trtllm_mha \
  --speculative-draft-attention-backend fa4 \
  --mem-fraction-static 0.75 \
  --mamba-scheduler-strategy extra_buffer \
  --trust-remote-code

Transformers

Only Qwen3 and LLaMA-3.1 models support the Transformers backend.

from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

draft = AutoModel.from_pretrained("z-lab/Qwen3-8B-DFlash-b16", trust_remote_code=True, dtype="auto", device_map="cuda:0").eval()
target = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", dtype="auto", device_map="cuda:0").eval()
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

messages = [{"role": "user", "content": "How many positive whole-number divisors does 196 have?"}]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True, enable_thinking=False).to(draft.device)

output = draft.spec_generate(
    input_ids=input_ids,
    max_new_tokens=2048,
    temperature=0.0,
    target=target,
    stop_token_ids=[tokenizer.eos_token_id]
)

print(tokenizer.decode(output[0], skip_special_tokens=False))

MLX (Apple Silicon)

The community has already produced many excellent DFlash implementations on MLX; we provide a simple and efficient one here, tested on an Apple M5 Pro with Qwen3, Qwen3.5, and Gemma-4 models.

from dflash.model_mlx import load, load_draft, stream_generate

model, tokenizer = load("Qwen/Qwen3.5-4B")
draft = load_draft("z-lab/Qwen3.5-4B-DFlash")

messages = [{"role": "user", "content": "How many positive whole-number divisors does 196 have?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=True)

tps = 0.0
for r in stream_generate(model, draft, tokenizer, prompt, block_size=16, max_tokens=2048, temperature=0.6):
    print(r.text, end="", flush=True)
    tps = r.generation_tps

print(f"\n Throughput: {tps:.2f} tok/s")

📊 Evaluation

All benchmarks share the same datasets (gsm8k, math500, humaneval, mbpp, mt-bench). Datasets are downloaded automatically on the first run and cached as JSONL files under cache/.
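To inspect a cached dataset, a JSONL file can be read one record per line. The following is a minimal sketch: `read_jsonl` is a hypothetical helper, and the file name and the `prompt` field in the demo are assumptions, since the actual cache layout and record schema are not specified here.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Iterator


def read_jsonl(path: Path) -> Iterator[Dict]:
    """Yield one parsed record per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Self-contained demo: write a two-record file to a temp dir, then read it back.
# The file name and the "prompt" field are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "example.jsonl"
    demo_path.write_text('{"prompt": "1+1?"}\n{"prompt": "2+2?"}\n', encoding="utf-8")
    records = list(read_jsonl(demo_path))

print(len(records), sorted(records[0].keys()))
```

For a real cached file, point `read_jsonl` at the corresponding path under cache/ and print the keys of the first record to discover the schema.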

vLLM:

python -m dflash.benchmark --backend vllm \
  --base-url http://127.0.0.1:8000 --model Qwen/Qwen3.5-27B \
  --dataset gsm8k --num-prompts 128 --concurrency 1 --enable-thinking

SGLang:

python -m dflash.benchmark --backend sglang \
  --base-url http://127.0.0.1:30000 --model Qwen/Qwen3.5-35B-A3B \
  --dataset gsm8k --num-prompts 128 --concurrency 1 --enable-thinking

Transformers (Qwen3 and LLaMA only):