
[HETERO] support LLM and split model per available memory size #21764

Merged

Conversation

@WeldonWangwang (Contributor) commented Dec 19, 2023

Details:

  • Support LLM inference with the following device configurations:

    HETERO:CPU
    HETERO:GPU
    HETERO:CPU,GPU
    HETERO:GPU,CPU
    HETERO:GPU.0,GPU.1
    HETERO:GPU.0,GPU.1,CPU
    HETERO:GPU.0,GPU.1,GPU.2
  • Use the ov::hint::model_distribution_policy property from PR23077 (openvinotoolkit#23077); see the usage sketch after this list

  • Use host memory for input/output data exchange between subgraphs

  • Mark supported and unsupported nodes as subgraphs in the graph and run query_model on those subgraphs, keeping the model seen by query_model the same as the one used at compile time

  • Add a property ov::query_model_ratio to set the percentage of the model that can be queried during query_model

  • Improve performance on some LLMs with large parameter counts by splitting the model across devices; the number of split models should be kept small to reduce communication overhead between devices
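
A minimal usage sketch of the feature (assuming the OpenVINO 2024 C++ API; the model path and device list are illustrative placeholders):

#include <openvino/openvino.hpp>
#include <set>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");  // placeholder path

    // Ask HETERO to split the model across the listed devices by available memory
    std::set<ov::hint::ModelDistributionPolicy> policy = {
        ov::hint::ModelDistributionPolicy::PIPELINE_PARALLEL};
    auto compiled = core.compile_model(model,
                                       "HETERO:GPU.0,GPU.1",
                                       ov::hint::model_distribution_policy(policy));

    auto request = compiled.create_infer_request();
    return 0;
}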

Tickets:

  • CVS-133258

@github-actions bot added labels: category: inference (OpenVINO Runtime library - Inference), category: GPU (OpenVINO GPU plugin), category: HETERO (OpenVINO HETERO plugin), category: CPP API (OpenVINO CPP API bindings) on Dec 19, 2023
@github-actions bot added the label category: transformations (OpenVINO Runtime library - Transformations) on Dec 20, 2023
@WeldonWangwang WeldonWangwang marked this pull request as ready for review December 20, 2023 07:52
@WeldonWangwang WeldonWangwang requested review from a team as code owners December 20, 2023 07:52
@peterchen-intel peterchen-intel changed the title Wangwang/hetero split llm [WIP] hetero split llm Dec 25, 2023
This PR will be closed in a week because of 2 weeks of no activity.

@github-actions github-actions bot added the Stale label Jan 20, 2024
@WeldonWangwang WeldonWangwang force-pushed the wangwang/hetero_split_llm branch from c90c0e7 to 899896e Compare January 25, 2024 09:17
@WeldonWangwang WeldonWangwang requested a review from a team as a code owner January 25, 2024 09:17
// The device fits the whole model with ~20% headroom: let it query everything
if (device_mem_map[device_name] >= 1.2 * total_ops_size) {
    device_config[ov::query_model_ratio.name()] = 1.0f;
// Otherwise check whether the remaining GPUs, or a CPU fallback, can take the rest
} else if (device_mem_map["all_gpu_left"] >= 1.2 * total_ops_size ||
           device_mem_map.find("CPU") != device_mem_map.end()) {
Contributor:

Why do we have GPU hardcodes in common logic? Can it be generalized to a common discrete device?

Contributor Author:

Updated, please review.

@@ -77,23 +81,101 @@ ov::hetero::Plugin::DeviceProperties ov::hetero::Plugin::get_properties_per_devi
return device_properties;
}

bool ov::hetero::Plugin::get_device_memory_map(const std::vector<std::string>& device_names,
Contributor:

Please add a TODO comment: there is no unified API to get device memory, so this feature gets the memory of each device with a device-specific method.

Contributor Author:

Updated, thanks.
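
To make the per-device lookup concrete, a minimal sketch (assuming CPU host memory is treated as unlimited and GPUs expose ov::intel_gpu::device_total_mem_size; not the exact PR code):

#include <openvino/runtime/core.hpp>
#include <openvino/runtime/intel_gpu/properties.hpp>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// TODO: there is no unified API to get device memory, so each device type
// needs its own query method; devices that cannot be queried are skipped.
std::map<std::string, int64_t> get_device_memory_map(ov::Core& core,
                                                     const std::vector<std::string>& device_names) {
    std::map<std::string, int64_t> mem_map;
    for (const auto& name : device_names) {
        if (name.find("CPU") != std::string::npos) {
            mem_map[name] = -1;  // sentinel: assume the CPU has enough host memory
        } else if (name.find("GPU") != std::string::npos) {
            mem_map[name] = static_cast<int64_t>(
                core.get_property(name, ov::intel_gpu::device_total_mem_size));
        }
    }
    return mem_map;
}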

Configuration full_config{properties, m_cfg};
DeviceProperties properties_per_device =
get_properties_per_device(full_config.device_priorities, full_config.get_device_properties());

// WARNING: these are the devices with user-set priority
auto device_names = ov::DeviceIDParser::get_hetero_devices(full_config.device_priorities);
bool hetero_query_model_by_device = false;
if (full_config.modelDistributionPolicy.count(ov::hint::ModelDistributionPolicy::PIPELINE_PARALLEL) != 0) {
hetero_query_model_by_device = get_device_memory_map(device_names, available_device_mem_map);
Contributor:

Please just skip devices whose memory size cannot be queried here.

Contributor Author:

Updated, thanks.

@WeldonWangwang WeldonWangwang force-pushed the wangwang/hetero_split_llm branch from 3bfe543 to a3840b9 Compare March 26, 2024 04:48
@wangleis wangleis requested a review from ilya-lavrenov March 26, 2024 07:25
if (device_name.find("CPU") != std::string::npos) {
// Assuming the CPU has enough memory
available_device_mem_map["CPU"] = -1;
} else if (device_name.find("GPU") != std::string::npos) {
Contributor:

Suggested change
} else if (device_name.find("GPU") != std::string::npos) {
} else {

Contributor Author:

If there were a device other than GPU, querying its memory size with ov::intel_gpu::device_total_mem_size would seem strange, wouldn't it?

Contributor:

OK, maybe we can keep it as is for now.
By the way, what if the GPU already has other workloads? Can device_total_mem_size still be used for the estimate in that case, or should we leverage the GPU memory statistics?
[screenshot attached]

Contributor Author:

The statistics property returns the GPU memory allocated by the engine for each allocation type; we can combine it with ov::intel_gpu::device_total_mem_size to check whether we can obtain a more accurate estimate of the available device memory.
Based on our discussion, we will add this in a follow-up optimization; a rough sketch of the idea is below.
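
A rough sketch of that follow-up idea (not part of this PR; it assumes the documented intel_gpu properties and omits error handling):

#include <openvino/runtime/core.hpp>
#include <openvino/runtime/intel_gpu/properties.hpp>
#include <cstdint>
#include <string>

// Estimate free GPU memory as the total minus what the engine already allocated.
uint64_t estimate_available_gpu_mem(ov::Core& core, const std::string& device) {
    const uint64_t total = core.get_property(device, ov::intel_gpu::device_total_mem_size);
    uint64_t allocated = 0;
    // memory_statistics maps allocation type (e.g. "usm_device") to bytes in use
    for (const auto& stat : core.get_property(device, ov::intel_gpu::memory_statistics)) {
        allocated += stat.second;
    }
    return allocated < total ? total - allocated : 0;
}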

* @ingroup ov_runtime_cpp_prop_api
*/
static constexpr Property<float, PropertyMutability::RW> query_model_ratio{"QUERY_MODEL_RATIO"};

Contributor:

About this property: it seems like it should be an internal property, or do you want to let users manipulate this ratio?

Contributor Author:

It's an internal property now.

Contributor Author:

Moved this property to an internal property, please review.
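
For context, a concept sketch of what the ratio controls (illustrative code with a hypothetical OpInfo type, not the PR implementation): with ratio r, query_model only considers ops until their cumulative size reaches r times the total, leaving the remainder for the next device.

#include <cstddef>
#include <vector>

struct OpInfo {              // hypothetical: one entry per op with its weight size
    std::size_t byte_size;
};

// Return how many leading ops fit within ratio * total size.
std::size_t ops_to_query(const std::vector<OpInfo>& ops, float ratio) {
    std::size_t total = 0;
    for (const auto& op : ops) total += op.byte_size;
    const auto budget = static_cast<std::size_t>(ratio * static_cast<float>(total));
    std::size_t acc = 0, count = 0;
    for (const auto& op : ops) {
        if (acc + op.byte_size > budget) break;
        acc += op.byte_size;
        ++count;
    }
    return count;
}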

}
}
}
if ((total_len < last_total_len || last_total_len == 0) && !cancel_split) {
Contributor:

Can you paste perf data here showing the benefit of this least-data-transmission search?

Contributor Author:

[perf data screenshot attached]
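
For readers without the screenshot, the condition above implements a minimum search over candidate split points; a hedged sketch of the idea (the bytes_crossing input is hypothetical, not the PR code):

#include <cstddef>
#include <vector>

// bytes_crossing[i] = total size of the tensors that would flow across a split
// placed after op i (hypothetical input for illustration)
std::size_t best_split(const std::vector<std::size_t>& bytes_crossing) {
    std::size_t best_idx = 0, best_len = 0;
    bool found = false;
    for (std::size_t i = 0; i < bytes_crossing.size(); ++i) {
        // analogous to `total_len < last_total_len || last_total_len == 0`
        // above, with a flag instead of the zero sentinel
        if (!found || bytes_crossing[i] < best_len) {
            found = true;
            best_len = bytes_crossing[i];
            best_idx = i;
        }
    }
    return best_idx;  // split with the least cross-device data transmission
}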

@wangleis wangleis added this pull request to the merge queue Mar 29, 2024
Merged via the queue into openvinotoolkit:master with commit 3e114be Mar 29, 2024
109 checks passed
bbielawx pushed a commit to bbielawx/openvino that referenced this pull request Apr 12, 2024
Co-authored-by: Nadezhda <[email protected]>
Co-authored-by: Shen, Wanglei <[email protected]>
Co-authored-by: yanlan song <[email protected]>
alvoron pushed a commit to alvoron/openvino that referenced this pull request Apr 29, 2024