LLM Inference Servers
- GitHub - sgl-project/sglang: SGLang is a fast serving framework for large language models and vision language models.
- GitHub - vllm-project/vllm: A high-throughput and memory-efficient inference and serving engine for LLMs (see the usage sketch after this list)
- GitHub - NVIDIA/TensorRT-LLM: TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
- GitHub - flashinfer-ai/flashinfer: FlashInfer: Kernel Library for LLM Serving
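The engines above are typically driven from a Python entry point. As one concrete illustration of the pattern, here is a minimal offline-inference sketch using vLLM's `LLM` / `SamplingParams` API; the model name, prompt, and sampling settings are placeholders, not recommendations.

```python
# Minimal offline-inference sketch with vLLM (assumes `pip install vllm` and a GPU).
# Model name, prompt, and sampling settings are illustrative placeholders.
from vllm import LLM, SamplingParams

prompts = ["Summarize what an LLM inference server does."]
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=64)

llm = LLM(model="facebook/opt-125m")  # any Hugging Face causal-LM id works here
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```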
GitHub - triton-inference-server/server: The Triton Inference Server provides an optimized cloud and edge inferencing solution.
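Unlike the in-process engines above, Triton is usually driven over HTTP or gRPC. The sketch below uses the `tritonclient` HTTP API against a locally running server; the model name and tensor names (`my_model`, `INPUT0`, `OUTPUT0`) are placeholders that depend on the model repository's `config.pbtxt`.

```python
# Minimal Triton HTTP client sketch (assumes `pip install tritonclient[http]` and a
# server already running on localhost:8000). Model and tensor names are placeholders.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

data = np.random.rand(1, 16).astype(np.float32)                 # shape must match config.pbtxt
inp = httpclient.InferInput("INPUT0", list(data.shape), "FP32")
inp.set_data_from_numpy(data)

result = client.infer(model_name="my_model", inputs=[inp])
print(result.as_numpy("OUTPUT0"))
```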
ONNX Runtime
- https://github.com/microsoft/onnxruntime-inference-examples
- https://github.com/microsoft/DeepSpeed-MII
- https://github.com/microsoft/onnx-script
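For the ONNX Runtime links above, the core inference API is an `InferenceSession`. A minimal sketch, assuming an already-exported model; the `model.onnx` path and the input shape are placeholders.

```python
# Minimal ONNX Runtime sketch (assumes `pip install onnxruntime` and an exported model.onnx).
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

input_name = session.get_inputs()[0].name
x = np.random.rand(1, 3, 224, 224).astype(np.float32)  # placeholder shape; must match the model

outputs = session.run(None, {input_name: x})            # None -> return all model outputs
print(outputs[0].shape)
```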
GitHub - open-mmlab/mmdeploy: OpenMMLab Model Deployment Framework