Yineng Zhang

me [at] zhyncs.com


About Me

I'm a Principal AI Researcher at Together AI and the creator and lead of TGL, the company's proprietary inference engine. My journey with SGLang has evolved from being one of its first core developers, to leading its inference optimization efforts, to eventually taking on a builder role to support its next phase of growth. I have led major releases and technical blog posts, including those on Llama 3, DeepSeek V3, large-scale EP, and GB200 NVL72. I am a selected member of LMSYS Org, a committer to FlashInfer, and a co-author of the FlashInfer paper (MLSys 2025 Best Paper Award). Previously, I was a Lead Software Engineer at Baseten, where I co-authored the DeepSeek V3 and Qwen 3 launch blogs. Earlier, I worked at Meituan on CTR GPU inference and vector retrieval systems.

Projects

SGLang: A fast serving framework for large language models and vision-language models, adopted by AMD and xAI.
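
For a flavor of what serving with SGLang looks like, here is a minimal offline-inference sketch in Python. This is an illustration, not an excerpt from the project: it assumes the sglang package is installed and the checkpoint path (here a placeholder Llama model) is available locally or from Hugging Face.

    # Minimal SGLang offline-inference sketch.
    # Assumes `pip install sglang` and access to the model checkpoint below.
    import sglang as sgl

    if __name__ == "__main__":
        # Spin up an in-process engine; any HF-style checkpoint path works.
        llm = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")

        prompts = ["The capital of France is"]
        sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 32}

        # Batched generation; each output dict carries the generated text.
        outputs = llm.generate(prompts, sampling_params)
        for prompt, output in zip(prompts, outputs):
            print(prompt, "->", output["text"])

        llm.shutdown()

The same engine can also be exposed as a standalone HTTP server via python -m sglang.launch_server.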

FlashInfer: A library and kernel generator providing high-performance GPU kernels for LLM inference, adopted by SGLang, vLLM, and MLC LLM.
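
As a sense of the API, a minimal single-request decode-attention call might look like the following sketch, assuming the FlashInfer Python package is installed and a CUDA GPU is available; the head counts and lengths are arbitrary example values.

    # Minimal FlashInfer decode-attention sketch.
    # Assumes the flashinfer package is installed and CUDA is available.
    import torch
    import flashinfer

    num_qo_heads, num_kv_heads, head_dim, kv_len = 32, 8, 128, 1024

    # One new query token attending over the cached KV entries.
    q = torch.randn(num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
    k = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")
    v = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")

    # Fused decode attention; grouped-query heads are handled internally.
    o = flashinfer.single_decode_with_kv_cache(q, k, v)
    print(o.shape)  # (num_qo_heads, head_dim)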

Interviews

The New York Times: DeepSeek’s Rise: How a Chinese Start-Up Went From Stock Trader to A.I. Star:
“Most of the team graduated from the top universities in China,” said Yineng Zhang, a lead software engineer at Baseten in San Francisco who works on SGLang, a project not part of DeepSeek that helps people build on top of DeepSeek’s system. “They are very smart and very young.”

The New York Times: How Chinese A.I. Start-Up DeepSeek Is Competing With Silicon Valley Giants:
While employees at big Chinese technology companies are limited to collaborating with colleagues, “if you work on open source, you work with talent around the world,” said Yineng Zhang, lead software engineer at Baseten in San Francisco who works on the open source SGLang project. He helps other people and companies build products using DeepSeek’s system.

Latent Space: Everything you need to run Mission Critical Inference (ft. DeepSeek v3 + SGLang): Baseten's Amir Haghighat and Yineng Zhang on DeepSeek V3, quantization, pricing strategies, SGLang, open source AI, and the three pillars of Mission Critical Inference.

Talks

Introduction to LLM serving with SGLang: A hands-on session delivered at the AI Engineer World's Fair 2025.

CUDA Tech Briefing at NVIDIA GTC 2025: A technical talk on SGLang, focusing on DeepSeek V3 optimization and the importance of CUDA JIT compilation.

SGLang v0.4 Optimization: A technical talk on SGLang delivered at the CAMEL-AI Hackathon: Mastering Multi-Agent Systems.

SGLang Performance Optimization: A technical talk on SGLang delivered at GPU MODE, the world's largest GPU developer community.

Technical Blogs

  1. Deploying DeepSeek on GB200 NVL72 with PD and Large Scale EP (Part I): 2.7x Higher Decoding Throughput
    Led the GB200 NVL72 project

  2. Deploying DeepSeek with PD Disaggregation and Large-scale Expert Parallelism on 96 H100 GPUs
    Co-led the optimization of DeepSeek V3/R1 on SGLang

  3. Day zero benchmarks for Qwen 3 with SGLang on Baseten
    Yineng Zhang, Michael Feil, Philip Kiely

  4. Private, secure DeepSeek-R1 in production in US & EU data centers
    Philip Kiely, Amir Haghighat, Yineng Zhang

  5. SGLang v0.4: Zero-Overhead Batch Scheduler, Cache-Aware Load Balancer, Faster Structured Outputs
    Byron Hsu, Ke Bao, Lianmin Zheng, Yineng Zhang, Ziyi Xu

  6. SGLang: Fast Serving Framework for Large Language and Vision-Language Models on AMD Instinct GPUs
    Michael Zhang, Hai Xiao, Hui Liu, Yineng Zhang

  7. SGLang v0.3 Release: 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision
    Ke Bao, Yineng Zhang, Liangsheng Yin, Kaichen Zhang, Bo Li, Ying Sheng

  8. Achieving Faster Open-Source Llama3 Serving with SGLang Runtime (vs. TensorRT-LLM, vLLM)
    Liangsheng Yin, Yineng Zhang, Ying Sheng

  9. Meituan Waimai's Practice of a GPU-Based Vector Retrieval System (originally published in Chinese as 美团外卖基于 GPU 的向量检索系统实践)
    Daojia R&D Platform, Infrastructure R&D Platform
    Yineng Zhang served as the project lead.

Experience

Together AI
Principal AI Researcher
Lead of the TGL team
July 2025 - present

Baseten
Lead Software Engineer
Model Performance Team
September 2024 - June 2025

LMSYS Org
Team Member, Inference Lead for SGLang
July 2024 - present

Meituan
Senior Software Engineer
Machine Learning Engine Group
August 2021 - July 2024

Baidu
Senior Software Engineer
Baidu Speech
June 2020 - August 2021

Stealth Startup
Software Engineer
July 2019 - June 2020

Education

Jiangnan University
Bachelor of Engineering
September 2015 - June 2019

Publications

  1. The Measure of All Measures: Quantifying LLM Benchmark Quality
    Jihan Yao, Peter Jin, Ke Bao, Qiaolin Yu, Khushi Bhardwaj, Chang Su, Jialei Wang, Yikai Zhu, Sugam Devare, Damon Mosk-Aoyama, Zhen Dong, Venkat Krishna Srinivasan, Yineng Zhang, Oleksii Kuchaiev, Jiantao Jiao, Banghua Zhu
    NeurIPS'25 LLM Evaluations Workshop
    Oral
  2. Locality-aware Fair Scheduling in LLM Serving
    Shiyi Cao*, Yichuan Wang*, Ziming Mao, Pin-Lun Hsu, Liangsheng Yin, Tian Xia, Dacheng Li, Shu Liu, Yineng Zhang, Yang Zhou, Ying Sheng, Joseph Gonzalez, Ion Stoica
    *indicates equal contribution
  3. FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
    Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, Luis Ceze
    MLSys 2025 Best Paper Award
    FlashInfer has been adopted by SGLang, vLLM, and MLC LLM.
  4. QQQ: Quality Quattuor-Bit Quantization for Large Language Models
    Ying Zhang, Peng Zhang, Mincong Huang, Jingyang Xiang, Yujie Wang, Chao Wang, Yineng Zhang, Lei Yu, Chuan Liu, Wei Lin
    ICLR 2025 Workshop SCI-FM
    QQQ has been adopted by vLLM and torchao.

News

  1. May 8, 2025: FlashInfer has been selected for the Best Paper Award at MLSys 2025!
  2. Feb 11, 2025: FlashInfer has been accepted at MLSys 2025.
  3. Dec 26, 2024: The SGLang and DeepSeek teams worked together to get DeepSeek V3 FP8 running on NVIDIA and AMD GPUs from day one.
  4. Nov 25, 2024: SGLang has become the dominant large language model inference engine at AMD.
  5. Aug 24, 2024: SGLang has been adopted by xAI to power inference for the Grok-2 model.