Yineng Zhang

me [at] zhyncs.com


About Me

I'm a Principal AI Researcher at Together AI and the creator and lead of TGL, the company's proprietary inference engine. My journey with SGLang has evolved from being one of its first core developers, to leading its inference optimization efforts, to eventually taking on a builder role to support its next phase of growth. I have led major releases and technical blog posts, including those on Llama 3, DeepSeek V3, large-scale EP, and GB200 NVL72. I am a selected member of LMSYS Org, a committer to FlashInfer, and a co-author of the FlashInfer paper (MLSys 2025 Best Paper Award). Previously, I was a Lead Software Engineer at Baseten, where I co-authored the DeepSeek V3 and Qwen 3 launch blogs. Earlier, I worked at Meituan on CTR GPU inference and vector retrieval systems.

Projects

SGLang: A fast serving framework for large language models and vision-language models, adopted by AMD and xAI.
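
For a flavor of what serving with SGLang looks like, here is a minimal offline-inference sketch in Python. This is an illustration, not an excerpt from the project: it assumes the sglang package is installed and the checkpoint path (here a placeholder Llama model) is available locally or from Hugging Face.

    # Minimal SGLang offline-inference sketch.
    # Assumes `pip install sglang` and access to the model checkpoint below.
    import sglang as sgl

    if __name__ == "__main__":
        # Spin up an in-process engine; any HF-style checkpoint path works.
        llm = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")

        prompts = ["The capital of France is"]
        sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 32}

        # Batched generation; each output dict carries the generated text.
        outputs = llm.generate(prompts, sampling_params)
        for prompt, output in zip(prompts, outputs):
            print(prompt, "->", output["text"])

        llm.shutdown()

The same engine can also be exposed as a standalone HTTP server via python -m sglang.launch_server.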

FlashInfer: A library and kernel generator providing high-performance GPU kernels for LLM inference, adopted by SGLang, vLLM, and MLC LLM.
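
As a sense of the API, a minimal single-request decode-attention call might look like the following sketch, assuming the FlashInfer Python package is installed and a CUDA GPU is available; the head counts and lengths are arbitrary example values.

    # Minimal FlashInfer decode-attention sketch.
    # Assumes the flashinfer package is installed and CUDA is available.
    import torch
    import flashinfer

    num_qo_heads, num_kv_heads, head_dim, kv_len = 32, 8, 128, 1024

    # One new query token attending over the cached KV entries.
    q = torch.randn(num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
    k = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")
    v = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")

    # Fused decode attention; grouped-query heads are handled internally.
    o = flashinfer.single_decode_with_kv_cache(q, k, v)
    print(o.shape)  # (num_qo_heads, head_dim)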

Interviews

The New York Times: DeepSeek’s Rise: How a Chinese Start-Up Went From Stock Trader to A.I. Star:
“Most of the team graduated from the top universities in China,” said Yineng Zhang, a lead software engineer at Baseten in San Francisco who works on SGLang, a project not part of DeepSeek that helps people build on top of DeepSeek’s system. “They are very smart and very young.”

The New York Times: How Chinese A.I. Start-Up DeepSeek Is Competing With Silicon Valley Giants:
While employees at big Chinese technology companies are limited to collaborating with colleagues, “if you work on open source, you work with talent around the world,” said Yineng Zhang, lead software engineer at Baseten in San Francisco who works on the open source SGLang project. He helps other people and companies build products using DeepSeek’s system.

Latent Space: Everything you need to run Mission Critical Inference (ft. DeepSeek v3 + SGLang): Baseten's Amir Haghighat and Yineng Zhang on DeepSeek V3, quantization, pricing strategies, SGLang, open source AI, and the three pillars of Mission Critical Inference.

Talks

Introduction to LLM serving with SGLang: A hands-on session delivered at the AI Engineer World's Fair 2025.

CUDA Tech Briefing at NVIDIA GTC 2025: A technical talk on SGLang, focusing on DeepSeek V3 optimization and the importance of CUDA JIT compilation.

SGLang v0.4 Optimization: A technical talk on SGLang delivered at the CAMEL-AI Hackathon: Mastering Multi-Agent Systems.

SGLang Performance Optimization: A technical talk on SGLang delivered at GPU MODE, the world's largest GPU developer community.

Technical Blogs

  1. Deploying DeepSeek on GB200 NVL72 with PD and Large Scale EP (Part I): 2.7x Higher Decoding Throughput
    Led the GB200 NVL72 project

  2. Deploying DeepSeek with PD Disaggregation and Large-scale Expert Parallelism on 96 H100 GPUs
    Co-led the optimization of DeepSeek V3/R1 on SGLang

  3. Day zero benchmarks for Qwen 3 with SGLang on Baseten
    Yineng Zhang, Michael Feil, Philip Kiely

  4. Private, secure DeepSeek-R1 in production in US & EU data centers
    Philip Kiely, Amir Haghighat, Yineng Zhang

  5. SGLang v0.4: Zero-Overhead Batch Scheduler, Cache-Aware Load Balancer, Faster Structured Outputs
    Byron Hsu, Ke Bao, Lianmin Zheng, Yineng Zhang, Ziyi Xu

  6. SGLang: Fast Serving Framework for Large Language and Vision-Language Models on AMD Instinct GPUs
    Michael Zhang, Hai Xiao, Hui Liu, Yineng Zhang

  7. SGLang v0.3 Release: 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision
    Ke Bao, Yineng Zhang, Liangsheng Yin, Kaichen Zhang, Bo Li, Ying Sheng

  8. Achieving Faster Open-Source Llama3 Serving with SGLang Runtime (vs. TensorRT-LLM, vLLM)
    Liangsheng Yin, Yineng Zhang, Ying Sheng

  9. Meituan Waimai's Practice of a GPU-Based Vector Retrieval System (originally published in Chinese as 美团外卖基于 GPU 的向量检索系统实践)
    Daojia R&D Platform, Infrastructure R&D Platform
    Yineng Zhang served as the project lead.

Experience

Together AI
Principal AI Researcher
Lead of the TGL team
July 2025 - present

Baseten
Lead Software Engineer
Model Performance Team
September 2024 - June 2025

LMSYS Org
Team Member, Inference Lead for SGLang
July 2024 - present

Meituan
Senior Software Engineer
Machine Learning Engine Group
August 2021 - July 2024

Baidu
Senior Software Engineer
Baidu Speech
June 2020 - August 2021

Stealth Startup
Software Engineer
July 2019 - June 2020

Education

Jiangnan University
Bachelor of Engineering
September 2015 - June 2019

Publications

  1. The Measure of All Measures: Quantifying LLM Benchmark Quality
    Jihan Yao, Peter Jin, Ke Bao, Qiaolin Yu, Khushi Bhardwaj, Chang Su, Jialei Wang, Yikai Zhu, Sugam Devare, Damon Mosk-Aoyama, Zhen Dong, Venkat Krishna Srinivasan, Yineng Zhang, Oleksii Kuchaiev, Jiantao Jiao, Banghua Zhu
    NeurIPS'25 LLM Evaluations Workshop
    Oral
  2. Locality-aware Fair Scheduling in LLM Serving
    Shiyi Cao*, Yichuan Wang*, Ziming Mao, Pin-Lun Hsu, Liangsheng Yin, Tian Xia, Dacheng Li, Shu Liu, Yineng Zhang, Yang Zhou, Ying Sheng, Joseph Gonzalez, Ion Stoica
    *indicates equal contribution
  3. FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
    Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, Luis Ceze
    MLSys 2025 Best Paper Award
    FlashInfer has been adopted by SGLang, vLLM, and MLC LLM.
  4. QQQ: Quality Quattuor-Bit Quantization for Large Language Models
    Ying Zhang, Peng Zhang, Mincong Huang, Jingyang Xiang, Yujie Wang, Chao Wang, Yineng Zhang, Lei Yu, Chuan Liu, Wei Lin
    ICLR 2025 Workshop SCI-FM
    QQQ has been adopted by vLLM and torchao.

News

  1. May 8, 2025: FlashInfer has been selected for the Best Paper Award at MLSys 2025!
  2. Feb 11, 2025: FlashInfer has been accepted at MLSys 2025.
  3. Dec 26, 2024: The SGLang and DeepSeek teams worked together to get DeepSeek V3 FP8 running on NVIDIA and AMD GPUs from day one.
  4. Nov 25, 2024: SGLang has become the dominant large language model inference engine at AMD.
  5. Aug 24, 2024: SGLang has been adopted by xAI to power inference for the Grok-2 model.