We are currently writing a survey on Efficient LLM Agent Serving and welcome comments on this list!
This repository maintains a curated list of papers on Large Language Model based agents (LLM agents), with a particular focus on efficient serving methods for LLM agents.
The list covers the main aspects of efficient serving for LLM agents. Table of contents:
- Efficient-LLMAgent-Survey
- RelayAttention for Efficient Large Language Model Serving with Long System Prompts
- ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching
- Splitwise: Efficient generative LLM inference using phase splitting | ISCA'24
- MECLA: Memory-Compute-Efficient LLM Accelerator with Scaling Sub-matrix Partition
- A Hardware Evaluation Framework for Large Language Model Inference | ISCA'24
- Efficient LLM Inference with KCache
- HeteGen: Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices | MLSys'24
- CacheGen: KV Cache Compression and Streaming for Fast Language Model Serving | SIGCOMM'24
- Chatterbox: Robust Transport for LLM Token Streaming under Unstable Network
- dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving | OSDI'24
- NetLLM: Adapting Large Language Models for Networking | SIGCOMM'24
- DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving | OSDI'24
- Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs
- Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems
- POLCA: Characterizing Power Management Opportunities for LLMs in the Cloud | ASPLOS'24
- ScaleLLM: Unlocking Llama2-13B LLM Inference on Consumer GPU RTX 4090, powered by FEDML Nexus AI
- Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction
- SpotServe: Serving Generative Large Language Models on Preemptible Instances | ASPLOS'24
- PreAct: Predicting Future in ReAct Enhances Agent's Planning Ability
- An LLM Compiler for Parallel Function Calling (see the toy parallel tool-call sketch after this list)
- Dynamic Planning with a LLM
- Automatic and Efficient Customization of Neural Networks for ML Applications | OSDI'24
- ToolChain*: Efficient Action Space Navigation in Large Language Models with A* Search | ICLR'24
- Middleware for LLMs: Tools Are Instrumental for Language Agents in Complex Environments
- Efficient Tool Use with Chain-of-Abstraction Reasoning
- ToolNet: Connecting Large Language Models with Massive Tools via Tool Graph
- Budget-Constrained Tool Learning with Planning
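The LLMCompiler entry above motivates issuing independent tool calls concurrently rather than one at a time. Below is a toy Python sketch of that idea using `asyncio`; it is our own illustration, not code from the paper, and the tool names and latencies are made up.

```python
# Toy sketch of parallel function calling: independent tool calls are
# awaited together, so end-to-end latency approaches the slowest call
# rather than the sum of all calls. (Illustration only, not LLMCompiler.)
import asyncio

async def call_tool(name: str, delay: float) -> str:
    await asyncio.sleep(delay)  # stands in for tool/network latency
    return f"{name}: done"

async def main() -> None:
    results = await asyncio.gather(
        call_tool("search", 1.0),
        call_tool("calculator", 0.5),
        call_tool("weather", 0.8),
    )
    print(results)  # finishes in ~1.0s instead of ~2.3s

asyncio.run(main())
```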
- ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models | OSDI'24
- FaaSMem: Improving Memory Efficiency of Serverless Computing with Memory Pool Architecture | ASPLOS'24
- Retrieval-Augmented Generation with Knowledge Graphs for Customer Service Question Answering
- RET-LLM: Towards a General Read-Write Memory for Large Language Models
- Controlling the Extraction of Memorized Data from Large Language Models via Prompt-Tuning | ACL'23
- Memory Sandbox: Transparent and Interactive Memory Management for Conversational Agents
- RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation
The following papers focus on improving the efficiency of data exchange and data transmission within AI agents (a toy result-sharing sketch follows this list): — hongqiu
- LLM Multi-Agent Systems: Challenges and Open Problems
- Multi-Agent Collaboration: Harnessing the Power of Intelligent LLM Agents
- Dynamic LLM-Agent Network: An LLM-agent Collaboration Framework with Agent Team Optimization
- AIOS: LLM Agent Operating System
- AgentLite: A Lightweight Library for Building and Advancing Task-Oriented LLM Agent System
- Enabling Intelligent Interactions between an Agent and an LLM: A Reinforcement Learning Approach
- Gorilla: Large Language Model Connected with Massive APIs
- Small LLMs Are Weak Tool Learners: A Multi-LLM Agent
- A Unified Debugging Approach via LLM-Based Multi-Agent Synergy
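As a concrete, deliberately simplified illustration of the data-exchange theme flagged above: cooperating agents can avoid redundant LLM calls and message traffic by sharing results through a common store. The `SharedBlackboard` below is a hypothetical sketch of ours, not a mechanism from any listed paper.

```python
# Hypothetical sketch: a shared result cache lets cooperating agents
# reuse each other's answers instead of recomputing or re-sending them.
import hashlib

class SharedBlackboard:
    """In-memory store that deduplicates results exchanged between agents."""
    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    @staticmethod
    def _key(task: str) -> str:
        return hashlib.sha256(task.encode()).hexdigest()

    def publish(self, task: str, result: str) -> None:
        self._store[self._key(task)] = result

    def lookup(self, task: str) -> str | None:
        return self._store.get(self._key(task))

board = SharedBlackboard()
board.publish("summarize log", "3 errors, 1 warning")
# A second agent checks the board before issuing a duplicate LLM call.
print(board.lookup("summarize log"))
```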
- Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing | ICLR'24
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces
- Optimal Caching and Model Multiplexing for Large Model Inference | NeurIPS'23
- Optimising Calls to Large Language Models with Uncertainty Based Two-Tier Selection
- Octopus: On-device language model for function calling of software APIs
- Octopus v2: On-device language model for super agent | Stanford
- Octopus v3: Technical Report for On-device Sub-billion Multimodal AI Agent| Stanford
- Octopus v4: Graph of language models
The table below compares popular LLM training and serving frameworks along three axes; a minimal vLLM usage sketch follows the table.
Framework | Efficient Training | Efficient Inference | Efficient Fine-Tuning |
---|---|---|---|
DeepSpeed [Code] | ✅ | ✅ | ✅ |
Megatron [Code] | ✅ | ✅ | ✅ |
Alpa [Code] | ✅ | ✅ | ✅ |
ColossalAI [Code] | ✅ | ✅ | ✅ |
FairScale [Code] | ✅ | ✅ | ✅ |
Pax [Code] | ✅ | ✅ | ✅ |
Composer [Code] | ✅ | ✅ | ✅ |
vLLM [Code] | ❌ | ✅ | ❌ |
TensorRT-LLM [Code] | ❌ | ✅ | ❌ |
LightLLM [Code] | ❌ | ✅ | ❌ |
OpenLLM [Code] | ❌ | ✅ | ✅ |
Ray-LLM [Code] | ❌ | ✅ | ❌ |
MLC-LLM [Code] | ❌ | ✅ | ❌ |
Sax [Code] | ❌ | ✅ | ❌ |
Mosec [Code] | ❌ | ✅ | ❌ |
LLM-Foundry [Code] | ✅ | ✅ | ❌ |
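As a quick orientation for the table above, here is a minimal offline-inference sketch with vLLM, one of the inference-only frameworks listed; the model name and sampling settings are illustrative, not recommendations.

```python
# Minimal vLLM offline batch inference (model name is illustrative).
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")  # any HF causal LM
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

outputs = llm.generate(
    ["Why is serving LLM agents harder than serving single prompts?"],
    params,
)
for out in outputs:
    print(out.outputs[0].text)
```

vLLM's continuous batching and PagedAttention are what earn it the "Efficient Inference" check in the table above.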
- Auto-GPT
- LangChain
- AutoGen (see the minimal two-agent sketch after this list)
- Camel
- HuggingGPT
- GPT Engineer
- BabyAGI
- AI Town
- GPTeam
- ChatArena
- AgentVerse
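Of the frameworks listed above, AutoGen is a representative example of how agent loops are typically wired up. The sketch below uses the classic `pyautogen`-style API; the model name and API key are placeholders, and exact signatures may differ across AutoGen versions.

```python
# Hedged sketch of a two-agent AutoGen loop (pyautogen-style API;
# model and key are placeholders, and versions may differ).
from autogen import AssistantAgent, UserProxyAgent

llm_config = {"config_list": [{"model": "gpt-4", "api_key": "sk-..."}]}

assistant = AssistantAgent("assistant", llm_config=llm_config)
user = UserProxyAgent(
    "user",
    human_input_mode="NEVER",      # fully automated loop
    code_execution_config=False,   # disable local code execution
    max_consecutive_auto_reply=2,
)

user.initiate_chat(
    assistant,
    message="Summarize why KV-cache reuse matters for agent serving.",
)
```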
- BurstGPT: Towards Efficient and Reliable LLM Serving: A Real-World Workload Study
- MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases | Meta
- BBox-Adapter: Lightweight Adapting for Black-Box Large Language Models
- LLM as a System Service on Mobile Devices | https://arxiv.org/abs/2403.11805
- A Survey on Effective Invocation Methods of Massive LLM Services
- Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security
- LLM-Based Multi-Agent Systems for Software Engineering: Vision and the Road Ahead
- CASIT: Collective Intelligent Agent System for Internet of Things
- Understanding the Weakness of Large Language Model Agents within a Complex Android Environment
- The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling: A Survey
- Awesome MobileLLM
- Understanding the Planning of LLM Agents: A Survey
- Awesome-LLM-Inference