LLM Inference Servers
- GitHub - sgl-project/sglang: SGLang is a fast serving framework for large language models and vision language models.
- GitHub - vllm-project/vllm: A high-throughput and memory-efficient inference and serving engine for LLMs (see the usage sketch after this list)
- GitHub - NVIDIA/TensorRT-LLM: TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
- GitHub - flashinfer-ai/flashinfer: FlashInfer: Kernel Library for LLM Serving
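The engines above are typically driven from a Python entry point. As one concrete illustration of the pattern, here is a minimal offline-inference sketch using vLLM's `LLM` / `SamplingParams` API; the model name, prompt, and sampling settings are placeholders, not recommendations.

```python
# Minimal offline-inference sketch with vLLM (assumes `pip install vllm` and a GPU).
# Model name, prompt, and sampling settings are illustrative placeholders.
from vllm import LLM, SamplingParams

prompts = ["Summarize what an LLM inference server does."]
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=64)

llm = LLM(model="facebook/opt-125m")  # any Hugging Face causal-LM id works here
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```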
GitHub - triton-inference-server/server: The Triton Inference Server provides an optimized cloud and edge inferencing solution.
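Unlike the in-process engines above, Triton is usually driven over HTTP or gRPC. The sketch below uses the `tritonclient` HTTP API against a locally running server; the model name and tensor names (`my_model`, `INPUT0`, `OUTPUT0`) are placeholders that depend on the model repository's `config.pbtxt`.

```python
# Minimal Triton HTTP client sketch (assumes `pip install tritonclient[http]` and a
# server already running on localhost:8000). Model and tensor names are placeholders.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

data = np.random.rand(1, 16).astype(np.float32)                 # shape must match config.pbtxt
inp = httpclient.InferInput("INPUT0", list(data.shape), "FP32")
inp.set_data_from_numpy(data)

result = client.infer(model_name="my_model", inputs=[inp])
print(result.as_numpy("OUTPUT0"))
```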
ONNX Runtime
- https://github.com/microsoft/onnxruntime-inference-examples
- https://github.com/microsoft/DeepSpeed-MII
- https://github.com/microsoft/onnx-script
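For the ONNX Runtime links above, the core inference API is an `InferenceSession`. A minimal sketch, assuming an already-exported model; the `model.onnx` path and the input shape are placeholders.

```python
# Minimal ONNX Runtime sketch (assumes `pip install onnxruntime` and an exported model.onnx).
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

input_name = session.get_inputs()[0].name
x = np.random.rand(1, 3, 224, 224).astype(np.float32)  # placeholder shape; must match the model

outputs = session.run(None, {input_name: x})            # None -> return all model outputs
print(outputs[0].shape)
```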
GitHub - open-mmlab/mmdeploy: OpenMMLab Model Deployment Framework