Use continuous batching by default #882

Open: wants to merge 80 commits into base: master
Changes shown from 16 of 80 commits

Commits
ec5f305
Use continuous batching by default
Wovchena Sep 19, 2024
dd7a5cf
Merge branch 'master' into use-continuos-batching-by-default
andrei-kochin Sep 19, 2024
229b7c5
Merge branch 'master' into use-continuos-batching-by-default
andrei-kochin Sep 20, 2024
41d1fe7
Update src/cpp/src/llm_pipeline.cpp
andrei-kochin Sep 20, 2024
36150c4
Update src/cpp/src/llm_pipeline.cpp
andrei-kochin Sep 20, 2024
4a4a09e
Update src/cpp/src/llm_pipeline.cpp
andrei-kochin Sep 20, 2024
1a58b5e
Update src/cpp/src/llm_pipeline.cpp
andrei-kochin Sep 20, 2024
90d81e6
Reorder cout
Wovchena Sep 20, 2024
6dc43a3
Update src/cpp/src/llm_pipeline.cpp
andrei-kochin Sep 20, 2024
03e2f32
Update src/cpp/src/llm_pipeline.cpp
andrei-kochin Sep 20, 2024
e561e93
Update src/cpp/src/llm_pipeline.cpp
andrei-kochin Sep 20, 2024
37ea2ad
Update src/cpp/src/llm_pipeline.cpp
andrei-kochin Sep 20, 2024
b62aee9
Update src/cpp/src/llm_pipeline.cpp
andrei-kochin Sep 23, 2024
07505b3
Update src/cpp/src/llm_pipeline.cpp
andrei-kochin Sep 23, 2024
e078818
Update src/cpp/src/llm_pipeline.cpp
andrei-kochin Sep 23, 2024
001d3a0
Update src/cpp/src/llm_pipeline.cpp
andrei-kochin Sep 23, 2024
a0a964f
Update src/cpp/src/llm_pipeline.cpp
andrei-kochin Sep 25, 2024
3cb2105
Update src/cpp/src/llm_pipeline.cpp
andrei-kochin Sep 25, 2024
40ea516
Limit max new tokens.
popovaan Sep 25, 2024
193df7e
Fixed error
popovaan Sep 25, 2024
1704548
Clean up
Wovchena Sep 30, 2024
086c7b8
Default destructors
Wovchena Sep 30, 2024
607d90d
Merge branch 'master' into use-continuos-batching-by-default
Wovchena Sep 30, 2024
741c13b
Default ~PerfTime
Wovchena Sep 30, 2024
06d1b1e
Merge branch 'master' into use-continuos-batching-by-default
ilya-lavrenov Oct 10, 2024
8d7d39d
Update src/cpp/src/llm_pipeline.cpp
andrei-kochin Oct 11, 2024
c4e8e05
Update src/cpp/src/llm_pipeline.cpp
andrei-kochin Oct 11, 2024
8116342
Update src/cpp/src/llm_pipeline.cpp
andrei-kochin Oct 11, 2024
b87d0f6
Update src/cpp/src/llm_pipeline.cpp
andrei-kochin Oct 11, 2024
1806fa0
CB: fix deadlock (#71)
Wovchena Oct 11, 2024
c9dc107
Merge branch 'master' into use-continuos-batching-by-default
ilya-lavrenov Oct 12, 2024
4bbcd0e
Increase timeouts for tests
ilya-lavrenov Oct 12, 2024
743e018
Update causal_lm_cpp.yml
ilya-lavrenov Oct 12, 2024
cfccefa
Use split_core_complile_config for CB
ilya-lavrenov Oct 12, 2024
03965d6
Update causal_lm_cpp.yml
ilya-lavrenov Oct 12, 2024
784c331
Drop request if it's aborted by streamer
ilya-lavrenov Oct 13, 2024
93b8c38
Update src/cpp/src/continuous_batching_impl.cpp
ilya-lavrenov Oct 13, 2024
043d842
Drop request in case of exceptions, etc
ilya-lavrenov Oct 14, 2024
fdad63c
Turned off prefix caching
ilya-lavrenov Oct 14, 2024
a21f725
Apply suggestions from code review
ilya-lavrenov Oct 14, 2024
a66be9e
Apply suggestions from code review
ilya-lavrenov Oct 14, 2024
82fceb5
Update continuous_batching_impl.cpp
ilya-lavrenov Oct 14, 2024
a246c1c
Merge branch 'master' into use-continuos-batching-by-default
ilya-lavrenov Oct 14, 2024
4ee8f12
Merge branch 'master' into use-continuos-batching-by-default
ilya-lavrenov Oct 14, 2024
73a8872
Apply suggestions from code review
ilya-lavrenov Oct 14, 2024
4019678
Apply suggestions from code review
ilya-lavrenov Oct 14, 2024
ed7668e
Merge branch 'master' into use-continuos-batching-by-default
ilya-lavrenov Oct 14, 2024
feae546
Update causal_lm_cpp.yml
ilya-lavrenov Oct 14, 2024
5bdf779
Apply suggestions from code review
ilya-lavrenov Oct 14, 2024
e3f2949
Merge branch 'master' into use-continuos-batching-by-default
ilya-lavrenov Oct 16, 2024
f1a9ab5
Merge branch 'master' into use-continuos-batching-by-default
andrei-kochin Oct 17, 2024
debbdd4
Merge branch 'master' into use-continuos-batching-by-default
ilya-lavrenov Oct 18, 2024
7827199
Apply suggestions from code review
ilya-lavrenov Oct 21, 2024
3de57d3
Merge branch 'master' into use-continuos-batching-by-default
ilya-lavrenov Oct 21, 2024
42d26df
Merge branch 'master' into use-continuos-batching-by-default
ilya-lavrenov Oct 22, 2024
467ab86
Apply suggestions from code review
ilya-lavrenov Oct 22, 2024
5b7f94a
Merge branch 'master' into use-continuos-batching-by-default
andrei-kochin Oct 24, 2024
5a391a8
Merge branch 'master' into use-continuos-batching-by-default
andrei-kochin Oct 30, 2024
b4a4174
Merge branch 'master' into use-continuos-batching-by-default
andrei-kochin Nov 6, 2024
c3d55eb
Merge branch 'master' into use-continuos-batching-by-default
andrei-kochin Nov 7, 2024
9fad1d2
Update linux.yml
ilya-lavrenov Nov 11, 2024
35f4ff2
Update windows.yml
ilya-lavrenov Nov 11, 2024
ad78839
Update mac.yml
ilya-lavrenov Nov 11, 2024
c5201e4
Update linux.yml
ilya-lavrenov Nov 11, 2024
d8d397a
Merge branch 'master' into use-continuos-batching-by-default
ilya-lavrenov Nov 11, 2024
dc2673c
Merge branch 'master' into use-continuos-batching-by-default
ilya-lavrenov Nov 11, 2024
4dd053c
Update llm_pipeline.cpp
ilya-lavrenov Nov 11, 2024
31ea070
Merge branch 'master' into use-continuos-batching-by-default
ilya-lavrenov Nov 12, 2024
7847060
Update llm_pipeline.cpp
ilya-lavrenov Nov 12, 2024
d11db7e
Apply suggestions from code review
ilya-lavrenov Nov 12, 2024
9acf368
Update llm_pipeline.cpp
ilya-lavrenov Nov 12, 2024
3c835af
Update causal_lm_cpp.yml
ilya-lavrenov Nov 12, 2024
046c017
Merge branch 'master' into use-continuos-batching-by-default
ilya-lavrenov Nov 12, 2024
badaa5b
Merge branch 'master' into use-continuos-batching-by-default
andrei-kochin Nov 21, 2024
c807011
Fix validation
Wovchena Nov 22, 2024
7097164
Update linux.yml
ilya-lavrenov Nov 25, 2024
3538bbe
Update windows.yml
ilya-lavrenov Nov 25, 2024
9eb4176
Merge branch 'master' into use-continuos-batching-by-default
andrei-kochin Nov 26, 2024
a510e77
Update linux.yml
ilya-lavrenov Nov 26, 2024
eb0b0f4
Update windows.yml
ilya-lavrenov Nov 26, 2024
8 changes: 1 addition & 7 deletions src/cpp/src/continuous_batching_pipeline.cpp
@@ -41,13 +41,7 @@ class ContinuousBatchingPipeline::Impl {
         float m_matmul_time_ms = 0.0f;
         float m_infer_total_ms = 0.0f;
 
-        ~PerfTime() {
-            std::cout << "Inference requests aggregated statistic: " << std::endl;
-            std::cout << "Paged attention % of inference execution: " << (m_paged_attention_time_ms / m_infer_total_ms) * 100 << std::endl;
-            std::cout << "MatMul % of inference execution: " << (m_matmul_time_ms / m_infer_total_ms) * 100 << std::endl;
-            std::cout << "Total inference execution secs: " << m_infer_total_ms / 1000. << std::endl;
-            std::cout << std::endl;
-        }
+        ~PerfTime() {}
     } m_perf;
 
     // current requests to process
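The change above silences PerfTime's destructor, so library teardown no longer writes aggregated statistics to stdout. A minimal standalone sketch of the resulting struct follows; the `paged_attention_pct()` helper and `report()` method are hypothetical illustrations of printing the same statistics on demand instead, and are not part of the diff:

```cpp
#include <iostream>
#include <ostream>

// Sketch of the PerfTime aggregate after this PR: the destructor is quiet.
struct PerfTime {
    float m_paged_attention_time_ms = 0.0f;
    float m_matmul_time_ms = 0.0f;
    float m_infer_total_ms = 0.0f;

    ~PerfTime() = default;  // no longer prints, unlike the removed cout-based version

    // Hypothetical helper: share of total inference time spent in paged attention.
    float paged_attention_pct() const {
        return m_infer_total_ms > 0.0f
            ? m_paged_attention_time_ms / m_infer_total_ms * 100.0f
            : 0.0f;  // guard against division by zero
    }

    // Hypothetical on-demand replacement for the removed destructor output.
    void report(std::ostream& os) const {
        os << "Paged attention % of inference execution: "
           << paged_attention_pct() << std::endl;
        os << "Total inference execution secs: "
           << m_infer_total_ms / 1000.0 << std::endl;
    }
};
```

Moving reporting into an explicit method keeps the statistics available while making output opt-in, which matters once the pipeline is the default backend for library users.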
26 changes: 26 additions & 0 deletions src/cpp/src/llm_pipeline.cpp
@@ -515,6 +515,7 @@ ov::genai::LLMPipeline::LLMPipeline(
     const ov::genai::Tokenizer& tokenizer,
     OptionalGenerationConfig generation_config
 ) {
+    OPENVINO_THROW("Not supported");
     auto start_time = std::chrono::steady_clock::now();
     m_pimpl = std::make_unique<StatefulLLMPipeline>(request, tokenizer, generation_config);
     auto stop_time = std::chrono::steady_clock::now();
@@ -527,12 +528,25 @@ ov::genai::LLMPipeline::LLMPipeline(
     const std::string& device,
     const ov::AnyMap& plugin_config
 ){
+    // std::cout << "Using continuous batching backend.\n";
     auto start_time = std::chrono::steady_clock::now();
     if (plugin_config.find(ov::genai::scheduler_config.name()) != plugin_config.end()) {
         auto config_without_scheduler_config = plugin_config;
         config_without_scheduler_config.erase(ov::genai::scheduler_config.name());
         auto& scheduler_config = plugin_config.at(ov::genai::scheduler_config.name()).as<SchedulerConfig>();
         m_pimpl = std::make_unique<ContinuousBatchingAdapter>(model_path, tokenizer, scheduler_config, device, config_without_scheduler_config);
+        // std::cout << "Found custom SchedulerConfig.\n";
+    } else if (true) {
+        SchedulerConfig scheduler_config;
+        scheduler_config.num_kv_blocks = 64;
+        scheduler_config.enable_prefix_caching = true;
+        m_pimpl = std::make_unique<ContinuousBatchingAdapter>(
+            model_path,
+            tokenizer,
+            scheduler_config,
+            device,
+            plugin_config
+        );
     } else if ("NPU" == device) {
         m_pimpl = std::make_unique<StaticLLMPipeline>(model_path, tokenizer, device, plugin_config);
     } else {
@@ -547,12 +561,24 @@ ov::genai::LLMPipeline::LLMPipeline(
     const std::string& device,
     const ov::AnyMap& config
 ){
+    // std::cout << "Using continuous batching backend.\n";
     auto start_time = std::chrono::steady_clock::now();
     if (config.find(ov::genai::scheduler_config.name()) != config.end()) {
         auto config_without_scheduler_config = config;
         config_without_scheduler_config.erase(ov::genai::scheduler_config.name());
         auto& scheduler_config = config.at(ov::genai::scheduler_config.name()).as<SchedulerConfig>();
         m_pimpl = std::make_unique<ContinuousBatchingAdapter>(path, scheduler_config, device, config_without_scheduler_config);
+        // std::cout << "Found custom SchedulerConfig.\n";
+    } else if (true) {
+        SchedulerConfig scheduler_config;
+        scheduler_config.num_kv_blocks = 64;
+        scheduler_config.enable_prefix_caching = true;
+        m_pimpl = std::make_unique<ContinuousBatchingAdapter>(
+            path,
+            scheduler_config,
+            device,
+            config
+        );
     } else if ("NPU" == device) {
         m_pimpl = std::make_unique<StaticLLMPipeline>(path, device, config);
     } else {
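Both constructor overloads above follow the same pattern: if the caller supplied a `SchedulerConfig` in the options map, it is extracted and its key erased before the remaining options are forwarded; otherwise the defaults this PR introduces are used (64 KV-cache blocks, prefix caching enabled). A minimal standalone sketch of that extract-and-erase pattern follows; the `Options` map and the `SchedulerConfig` struct here are simplified stand-ins, not the real `ov::AnyMap` and `ov::genai::SchedulerConfig`:

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Simplified stand-in for ov::genai::SchedulerConfig.
struct SchedulerConfig {
    std::size_t num_kv_blocks = 0;
    bool enable_prefix_caching = false;
};

// Simplified stand-in for ov::AnyMap.
using Options = std::map<std::string, std::size_t>;

// Mirrors the branch logic in the diff: if "scheduler_config" is present,
// take it and erase the key before forwarding the remaining options;
// otherwise fall back to the PR's defaults (64 KV blocks, prefix caching on).
std::pair<SchedulerConfig, Options> split_scheduler_config(Options options) {
    SchedulerConfig scheduler;
    auto it = options.find("scheduler_config");
    if (it != options.end()) {
        scheduler.num_kv_blocks = it->second;  // simplified: value encodes the config
        options.erase(it);
    } else {
        scheduler.num_kv_blocks = 64;
        scheduler.enable_prefix_caching = true;
    }
    return {scheduler, options};
}
```

Erasing the key before forwarding matters because the remaining map is handed to the plugin as compile properties, and an unrecognized `scheduler_config` entry would be rejected there.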
4 changes: 1 addition & 3 deletions src/cpp/src/timer.hpp
@@ -26,7 +26,5 @@ class ManualTimer {
         m_total += std::chrono::duration<double, std::milli>(m_end - m_start).count();
     }
 
-    ~ManualTimer() {
-        std::cout << m_title << ": " << m_total / 1000. << " secs" << std::endl;
-    }
+    ~ManualTimer() {}
 };
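For reference, a self-contained sketch of the resulting ManualTimer behavior: elapsed time is accumulated across `start()`/`end()` pairs and the destructor is now silent. This is an illustrative reconstruction, not the exact class (the real one also stores a title string), and the `total_ms()` accessor is an assumption added here for observability:

```cpp
#include <chrono>

// Accumulating timer in the spirit of timer.hpp after this PR:
// repeated start()/end() pairs add to a running total in milliseconds,
// and destruction no longer prints anything.
class ManualTimer {
    double m_total = 0.0;  // accumulated milliseconds
    std::chrono::steady_clock::time_point m_start;
public:
    void start() { m_start = std::chrono::steady_clock::now(); }
    void end() {
        auto m_end = std::chrono::steady_clock::now();
        m_total += std::chrono::duration<double, std::milli>(m_end - m_start).count();
    }
    double total_ms() const { return m_total; }  // hypothetical accessor
    ~ManualTimer() = default;  // quiet, unlike the removed printing destructor
};
```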
7 changes: 6 additions & 1 deletion src/python/py_generate_pipeline.cpp
@@ -10,6 +10,7 @@
 #include "openvino/genai/llm_pipeline.hpp"
 #include <openvino/runtime/auto/properties.hpp>
 #include "../cpp/src/tokenizers_path.hpp"
+#include <pybind11/iostream.h>
 
 #include "./utils.hpp"
@@ -433,14 +434,16 @@ PYBIND11_MODULE(py_generate_pipeline, m) {
     m.doc() = "Pybind11 binding for LLM Pipeline";
 
     py::class_<LLMPipeline>(m, "LLMPipeline", "This class is used for generation with LLMs")
-        .def(py::init([](
+        .def(py::init([&](
             const std::string& model_path,
             const std::string& device,
             const std::map<std::string, py::object>& config
         ) {
             ScopedVar env_manager(utils::ov_tokenizers_module_path());
             return std::make_unique<LLMPipeline>(model_path, device, utils::properties_to_any_map(config));
         }),
+        py::call_guard<py::scoped_ostream_redirect,
+                       py::scoped_estream_redirect>(),
         py::arg("model_path"), "folder with openvino_model.xml and openvino_tokenizer[detokenizer].xml files",
         py::arg("device") = "CPU", "device on which inference will be done",
         py::arg("config") = ov::AnyMap({}), "openvino.properties map",
@@ -460,6 +463,8 @@ PYBIND11_MODULE(py_generate_pipeline, m) {
             ScopedVar env_manager(utils::ov_tokenizers_module_path());
             return std::make_unique<LLMPipeline>(model_path, tokenizer, device, utils::properties_to_any_map(config));
         }),
+        py::call_guard<py::scoped_ostream_redirect,
+                       py::scoped_estream_redirect>(),
         py::arg("model_path"),
         py::arg("tokenizer"),
         py::arg("device") = "CPU",