
[HETERO] support LLM and split model per available memory size #21764

Merged

Conversation

@WeldonWangwang (Contributor) commented Dec 19, 2023

Details:

  • Support LLM inference with the following device configurations:

    HETERO:CPU
    HETERO:GPU
    HETERO:CPU,GPU
    HETERO:GPU,CPU
    HETERO:GPU.0,GPU.1
    HETERO:GPU.0,GPU.1,CPU
    HETERO:GPU.0,GPU.1,GPU.2
  • Use the ov::hint::model_distribution_policy property from PR23077 (openvinotoolkit#23077); see the usage sketch after this list

  • Use host memory for input/output data exchange between subgraphs

  • Mark supported and unsupported nodes as subgraphs in the graph and run query_model on those subgraphs, keeping the model seen by query_model the same as the one used at compile time

  • Add a property ov::query_model_ratio to set the percentage of the model that can be queried during query_model

  • Improve performance on some LLMs with large parameter counts by splitting the model across devices; the number of split models should be kept small to reduce communication overhead between devices
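
A minimal usage sketch of the feature (assuming the OpenVINO 2024 C++ API; the model path and device list are illustrative placeholders):

#include <openvino/openvino.hpp>
#include <set>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");  // placeholder path

    // Ask HETERO to split the model across the listed devices by available memory
    std::set<ov::hint::ModelDistributionPolicy> policy = {
        ov::hint::ModelDistributionPolicy::PIPELINE_PARALLEL};
    auto compiled = core.compile_model(model,
                                       "HETERO:GPU.0,GPU.1",
                                       ov::hint::model_distribution_policy(policy));

    auto request = compiled.create_infer_request();
    return 0;
}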

Tickets:

  • CVS-133258

@github-actions bot added labels: category: inference (OpenVINO Runtime library - Inference), category: GPU (OpenVINO GPU plugin), category: HETERO (OpenVINO HETERO plugin), category: CPP API (OpenVINO CPP API bindings) on Dec 19, 2023
@github-actions bot added the label category: transformations (OpenVINO Runtime library - Transformations) on Dec 20, 2023
@WeldonWangwang WeldonWangwang marked this pull request as ready for review December 20, 2023 07:52
@WeldonWangwang WeldonWangwang requested review from a team as code owners December 20, 2023 07:52
@peterchen-intel peterchen-intel changed the title Wangwang/hetero split llm [WIP] hetero split llm Dec 25, 2023
This PR will be closed in a week because of 2 weeks of no activity.

@github-actions github-actions bot added the Stale label Jan 20, 2024
@WeldonWangwang WeldonWangwang force-pushed the wangwang/hetero_split_llm branch from c90c0e7 to 899896e Compare January 25, 2024 09:17
@WeldonWangwang WeldonWangwang requested a review from a team as a code owner January 25, 2024 09:17
// The device fits the whole model with ~20% headroom: let it query everything
if (device_mem_map[device_name] >= 1.2 * total_ops_size) {
    device_config[ov::query_model_ratio.name()] = 1.0f;
// Otherwise check whether the remaining GPUs, or a CPU fallback, can take the rest
} else if (device_mem_map["all_gpu_left"] >= 1.2 * total_ops_size ||
           device_mem_map.find("CPU") != device_mem_map.end()) {
Contributor:

Why do we have GPU hardcodes in common logic? Can it be generalized to a common discrete device?

Contributor Author:

Updated, please review.

@@ -77,23 +81,101 @@ ov::hetero::Plugin::DeviceProperties ov::hetero::Plugin::get_properties_per_devi
return device_properties;
}

bool ov::hetero::Plugin::get_device_memory_map(const std::vector<std::string>& device_names,
Contributor:

Please add a TODO comment: there is no unified API to get device memory, so this feature gets the memory of each device with a device-specific method.

Contributor Author:

Updated, thanks.
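
To make the per-device lookup concrete, a minimal sketch (assuming CPU host memory is treated as unlimited and GPUs expose ov::intel_gpu::device_total_mem_size; not the exact PR code):

#include <openvino/runtime/core.hpp>
#include <openvino/runtime/intel_gpu/properties.hpp>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// TODO: there is no unified API to get device memory, so each device type
// needs its own query method; devices that cannot be queried are skipped.
std::map<std::string, int64_t> get_device_memory_map(ov::Core& core,
                                                     const std::vector<std::string>& device_names) {
    std::map<std::string, int64_t> mem_map;
    for (const auto& name : device_names) {
        if (name.find("CPU") != std::string::npos) {
            mem_map[name] = -1;  // sentinel: assume the CPU has enough host memory
        } else if (name.find("GPU") != std::string::npos) {
            mem_map[name] = static_cast<int64_t>(
                core.get_property(name, ov::intel_gpu::device_total_mem_size));
        }
    }
    return mem_map;
}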

Configuration full_config{properties, m_cfg};
DeviceProperties properties_per_device =
get_properties_per_device(full_config.device_priorities, full_config.get_device_properties());

// WARNING: these are the devices with user-set priority
auto device_names = ov::DeviceIDParser::get_hetero_devices(full_config.device_priorities);
bool hetero_query_model_by_device = false;
if (full_config.modelDistributionPolicy.count(ov::hint::ModelDistributionPolicy::PIPELINE_PARALLEL) != 0) {
hetero_query_model_by_device = get_device_memory_map(device_names, available_device_mem_map);
Contributor:

Please just skip devices whose memory size cannot be queried here.

Contributor Author:

Updated, thanks.

@WeldonWangwang WeldonWangwang force-pushed the wangwang/hetero_split_llm branch from 3bfe543 to a3840b9 Compare March 26, 2024 04:48
@wangleis wangleis requested a review from ilya-lavrenov March 26, 2024 07:25
if (device_name.find("CPU") != std::string::npos) {
// Assuming the CPU has enough memory
available_device_mem_map["CPU"] = -1;
} else if (device_name.find("GPU") != std::string::npos) {
Contributor:

Suggested change
} else if (device_name.find("GPU") != std::string::npos) {
} else {

Contributor Author:

If there were a device other than GPU, querying its memory size with ov::intel_gpu::device_total_mem_size would seem strange, wouldn't it?

Contributor:

OK, maybe we can keep it as is for now.
By the way, what if the GPU already has other workloads? Can device_total_mem_size still be used for the estimate in that case, or should we leverage the GPU memory statistics?
[screenshot attached]

Contributor Author:

The statistics property returns the GPU memory allocated by the engine for each allocation type; we can combine it with ov::intel_gpu::device_total_mem_size to check whether we can obtain a more accurate estimate of the available device memory.
Based on our discussion, we will add this in a follow-up optimization; a rough sketch of the idea is below.
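
A rough sketch of that follow-up idea (not part of this PR; it assumes the documented intel_gpu properties and omits error handling):

#include <openvino/runtime/core.hpp>
#include <openvino/runtime/intel_gpu/properties.hpp>
#include <cstdint>
#include <string>

// Estimate free GPU memory as the total minus what the engine already allocated.
uint64_t estimate_available_gpu_mem(ov::Core& core, const std::string& device) {
    const uint64_t total = core.get_property(device, ov::intel_gpu::device_total_mem_size);
    uint64_t allocated = 0;
    // memory_statistics maps allocation type (e.g. "usm_device") to bytes in use
    for (const auto& stat : core.get_property(device, ov::intel_gpu::memory_statistics)) {
        allocated += stat.second;
    }
    return allocated < total ? total - allocated : 0;
}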

* @ingroup ov_runtime_cpp_prop_api
*/
static constexpr Property<float, PropertyMutability::RW> query_model_ratio{"QUERY_MODEL_RATIO"};

Contributor:

About this property: it seems like it should be an internal property, or do you want to let users manipulate this ratio?

Contributor Author:

It's an internal property now.

Contributor Author:

Moved this property to an internal property, please review.
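
For context, a concept sketch of what the ratio controls (illustrative code with a hypothetical OpInfo type, not the PR implementation): with ratio r, query_model only considers ops until their cumulative size reaches r times the total, leaving the remainder for the next device.

#include <cstddef>
#include <vector>

struct OpInfo {              // hypothetical: one entry per op with its weight size
    std::size_t byte_size;
};

// Return how many leading ops fit within ratio * total size.
std::size_t ops_to_query(const std::vector<OpInfo>& ops, float ratio) {
    std::size_t total = 0;
    for (const auto& op : ops) total += op.byte_size;
    const auto budget = static_cast<std::size_t>(ratio * static_cast<float>(total));
    std::size_t acc = 0, count = 0;
    for (const auto& op : ops) {
        if (acc + op.byte_size > budget) break;
        acc += op.byte_size;
        ++count;
    }
    return count;
}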

}
}
}
if ((total_len < last_total_len || last_total_len == 0) && !cancel_split) {
Contributor:

Can you paste perf data here showing the benefit of this least-data-transmission search?

Contributor Author:

[perf data screenshot attached]
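
For readers without the screenshot, the condition above implements a minimum search over candidate split points; a hedged sketch of the idea (the bytes_crossing input is hypothetical, not the PR code):

#include <cstddef>
#include <vector>

// bytes_crossing[i] = total size of the tensors that would flow across a split
// placed after op i (hypothetical input for illustration)
std::size_t best_split(const std::vector<std::size_t>& bytes_crossing) {
    std::size_t best_idx = 0, best_len = 0;
    bool found = false;
    for (std::size_t i = 0; i < bytes_crossing.size(); ++i) {
        // analogous to `total_len < last_total_len || last_total_len == 0`
        // above, with a flag instead of the zero sentinel
        if (!found || bytes_crossing[i] < best_len) {
            found = true;
            best_len = bytes_crossing[i];
            best_idx = i;
        }
    }
    return best_idx;  // split with the least cross-device data transmission
}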

@wangleis wangleis added this pull request to the merge queue Mar 29, 2024
Merged via the queue into openvinotoolkit:master with commit 3e114be Mar 29, 2024
109 checks passed
bbielawx pushed a commit to bbielawx/openvino that referenced this pull request Apr 12, 2024
Co-authored-by: Nadezhda <[email protected]>
Co-authored-by: Shen, Wanglei <[email protected]>
Co-authored-by: yanlan song <[email protected]>
alvoron pushed a commit to alvoron/openvino that referenced this pull request Apr 29, 2024