[HETERO] support LLM and split model per available memory size #21764
Conversation
This PR will be closed in a week because of 2 weeks of no activity.
Force-pushed from c90c0e7 to 899896e
Co-authored-by: Wanglei Shen <[email protected]>
src/plugins/hetero/src/plugin.cpp (outdated)
```cpp
if (device_mem_map[device_name] >= 1.2 * total_ops_size) {
    device_config[ov::query_model_ratio.name()] = 1.0f;
} else if (device_mem_map["all_gpu_left"] >= 1.2 * total_ops_size ||
           device_mem_map.find("CPU") != device_mem_map.end()) {
```
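The decision this excerpt encodes can be illustrated standalone. A minimal sketch, assuming the logic is: a device with ~20% headroom (the diff's 1.2 factor) over the total op size takes the whole model, otherwise it takes a proportional share when other devices or a CPU can absorb the rest. `pick_query_ratio` and its parameters are hypothetical names, not the PR's actual API:

```cpp
#include <cassert>

// Hypothetical sketch of the ratio decision in the excerpt above.
// device_mem:      memory available on the device being considered
// mem_left_after:  memory available on the remaining devices in priority order
// total_ops_size:  byte size of all ops in the model
// cpu_present:     CPU is in the device list (assumed to always fit the rest)
float pick_query_ratio(double device_mem, double mem_left_after, double total_ops_size, bool cpu_present) {
    if (device_mem >= 1.2 * total_ops_size)
        return 1.0f;  // device fits the whole model with headroom
    if (mem_left_after >= 1.2 * total_ops_size || cpu_present)
        return static_cast<float>(device_mem / (1.2 * total_ops_size));  // take a proportional share
    return 1.0f;      // last resort: no fallback device, claim everything anyway
}
```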
Why do we have GPU hardcodes in common logic? Can it be generalized to any common discrete device?
Updated, please review.
src/plugins/hetero/src/plugin.cpp (outdated)
```diff
@@ -77,23 +81,101 @@ ov::hetero::Plugin::DeviceProperties ov::hetero::Plugin::get_properties_per_devi
     return device_properties;
 }

 bool ov::hetero::Plugin::get_device_memory_map(const std::vector<std::string>& device_names,
```
Please add a TODO comment: there is no unified API to get device memory, so this feature gets the memory of each specific device with a device-specific method.
Updated, thanks.
src/plugins/hetero/src/plugin.cpp (outdated)
```cpp
Configuration full_config{properties, m_cfg};
DeviceProperties properties_per_device =
    get_properties_per_device(full_config.device_priorities, full_config.get_device_properties());

// WARNING: these are the devices with user-set priority
auto device_names = ov::DeviceIDParser::get_hetero_devices(full_config.device_priorities);
bool hetero_query_model_by_device = false;
if (full_config.modelDistributionPolicy.count(ov::hint::ModelDistributionPolicy::PIPELINE_PARALLEL) != 0) {
    hetero_query_model_by_device = get_device_memory_map(device_names, available_device_mem_map);
```
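The skip-on-failure behaviour requested for `get_device_memory_map` can be sketched standalone. Everything here is a hypothetical illustration: `query_mem` stands in for the real per-device query (which in the PR is device-specific), and only devices that answer successfully enter the map:

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch: build the device-memory map, silently skipping any
// device whose memory size cannot be queried, so only queryable devices
// take part in the pipeline-parallel split.
std::map<std::string, double> build_memory_map(
    const std::vector<std::string>& device_names,
    const std::function<bool(const std::string&, double&)>& query_mem) {
    std::map<std::string, double> mem_map;
    for (const auto& name : device_names) {
        double size = 0.0;
        if (query_mem(name, size))
            mem_map[name] = size;  // devices that fail the query are simply left out
    }
    return mem_map;
}
```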
Please just skip any device whose memory size cannot be queried here.
Updated, thanks.
Force-pushed from 3bfe543 to a3840b9
```cpp
if (device_name.find("CPU") != std::string::npos) {
    // Assuming the CPU has enough memory
    available_device_mem_map["CPU"] = -1;
} else if (device_name.find("GPU") != std::string::npos) {
```
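The per-device dispatch above can be mirrored in a self-contained sketch. The `-1` sentinel for CPU comes from the diff; `add_device_memory` and the `gpu_total_mem` parameter are hypothetical stand-ins (in the PR the GPU value would come from a query such as `ov::intel_gpu::device_total_mem_size`):

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical sketch of get_device_memory_map's per-device dispatch.
// There is no unified API for device memory (a stated TODO in the PR),
// so each device type needs its own query; unknown devices are skipped.
void add_device_memory(std::map<std::string, double>& mem_map,
                       const std::string& device_name,
                       double gpu_total_mem) {
    if (device_name.find("CPU") != std::string::npos) {
        mem_map["CPU"] = -1;  // sentinel: assume CPU memory is always sufficient
    } else if (device_name.find("GPU") != std::string::npos) {
        mem_map[device_name] = gpu_total_mem;  // e.g. ov::intel_gpu::device_total_mem_size
    }
    // other device types: no known memory query yet, so they are skipped
}
```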
Suggested change:
```diff
-} else if (device_name.find("GPU") != std::string::npos) {
+} else {
```
If there is a device that is not a GPU, querying its memory size with `ov::intel_gpu::device_total_mem_size` seems strange?
It returns statistics of the GPU memory allocated by the engine for each allocation type; we can combine it with `ov::intel_gpu::device_total_mem_size` to obtain a more accurate available device memory.
Based on our discussion, we will add this in the following optimization.
```cpp
 * @ingroup ov_runtime_cpp_prop_api
 */
static constexpr Property<float, PropertyMutability::RW> query_model_ratio{"QUERY_MODEL_RATIO"};
```
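One way a plugin could honour such a ratio during query_model is to report only a leading fraction of the model's topologically ordered ops as supported, leaving the tail for the next device. A minimal sketch under that assumption; `supported_prefix` is a hypothetical name, and the real implementation would work on `ov::Model` nodes rather than strings:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch: clamp the ratio to [0, 1] and return the leading
// fraction of the ordered op list that this device claims as supported.
std::vector<std::string> supported_prefix(const std::vector<std::string>& ordered_ops, float ratio) {
    if (ratio < 0.f) ratio = 0.f;
    if (ratio > 1.f) ratio = 1.f;
    const auto n = static_cast<std::ptrdiff_t>(ordered_ops.size() * ratio);
    return std::vector<std::string>(ordered_ops.begin(), ordered_ops.begin() + n);
}
```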
About this property: it seems it should be an internal property, or do you want to let users manipulate this ratio?
It's an internal property now.
Moved this property to an internal property, please review.
```cpp
        }
    }
}
if ((total_len < last_total_len || last_total_len == 0) && !cancel_split) {
```
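The condition in this excerpt reads like a running-minimum search over candidate split points, keeping the one whose cut tensors carry the fewest bytes between devices. A standalone sketch under that reading; `best_split` and `cut_sizes` are hypothetical names, not the PR's actual code:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of a least-data-transmission search: cut_sizes[i] is
// the total byte size of tensors crossing candidate split point i; keep the
// index with the smallest cut. The inner condition mirrors the diff's
// `total_len < last_total_len || last_total_len == 0`.
std::size_t best_split(const std::vector<std::size_t>& cut_sizes) {
    std::size_t best = 0, last_total_len = 0;
    for (std::size_t i = 0; i < cut_sizes.size(); ++i) {
        const std::size_t total_len = cut_sizes[i];
        if (total_len < last_total_len || last_total_len == 0) {
            last_total_len = total_len;  // new minimum cut found
            best = i;
        }
    }
    return best;
}
```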
Can you paste perf data here on the benefit of this least-data-transmission search?
…inotoolkit#21764)

### Details:

- Support LLM inference on:

  |device|
  |--|
  |HETERO:CPU|
  |HETERO:GPU|
  |HETERO:CPU,GPU|
  |HETERO:GPU,CPU|
  |HETERO:GPU.0,GPU.1|
  |HETERO:GPU.0,GPU.1,CPU|
  |HETERO:GPU.0,GPU.1,GPU.2|

- Use the `ov::hint::model_distribution_policy` property in [PR23077](openvinotoolkit#23077)
- Use host memory for input/output data exchange between subgraphs
- Mask supported and unsupported nodes into Subgraphs in the graph and query the model per subgraph, keeping the model in query_model the same as in compile
- Add a property `ov::query_model_ratio` to set the percentage of the model that can be queried during query_model
- Improve performance on some LLMs with large parameter counts by splitting the model across devices; the number of split models should be small, to reduce communication between the devices

### Tickets:

- *CVS-133258*

Co-authored-by: Nadezhda <[email protected]>
Co-authored-by: Shen, Wanglei <[email protected]>
Co-authored-by: yanlan song <[email protected]>
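The PR's high-level idea, splitting a model's layers across devices in proportion to each device's available memory so that there are few, large subgraphs and little inter-device traffic, can be sketched standalone. `split_by_memory` is a hypothetical illustration, not the PR's actual splitting code:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: assign consecutive layers to devices in proportion
// to each device's available memory; any rounding remainder goes to the
// last device so every layer is placed exactly once.
std::vector<std::size_t> split_by_memory(std::size_t num_layers, const std::vector<double>& device_mem) {
    double total = 0.0;
    for (double m : device_mem) total += m;
    std::vector<std::size_t> counts(device_mem.size(), 0);
    std::size_t assigned = 0;
    for (std::size_t i = 0; i + 1 < device_mem.size(); ++i) {
        counts[i] = static_cast<std::size_t>(num_layers * device_mem[i] / total);
        assigned += counts[i];
    }
    if (!counts.empty())
        counts.back() = num_layers - assigned;  // remainder lands on the last device
    return counts;
}
```

For example, two GPUs with equal memory would each receive half of a 32-layer model, which matches the PR's goal of producing as few subgraphs as there are devices.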