Enable fullyconnect parallel for per node core allocation #23593
Conversation
…vino into fc_parallel_for_test
fix comments and clean some codes
fix windows warning
src/plugins/intel_cpu/src/transformations/cpu_opset/common/pass/split_fc.hpp
src/plugins/intel_cpu/tests/unit/transformations/split_fc_test.cpp
src/plugins/intel_cpu/src/transformations/cpu_opset/common/pass/split_fc.cpp
```cpp
// needn't to split fc when the dim is 0.
if (split_dim_range <= 1 || ov::shape_size(wgt_shape) < 6600000) {
    return false;
}
```
- We can easily move this check to the common part to avoid duplication on L172 (weights decompression ops don't affect the weights output shape, so we can just use `wgt_item->get_shape()` in the checks).
- Could you please explain what the magic number `6600000` is? And please also add an explanatory comment.
- Updated. `660000` is a threshold that was tested and summarized on existing KPI models. Some LLMs' FC layers are small, and splitting them causes a regression, so `660000` is a trade-off value.
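To make the trade-off discussed above concrete, here is a minimal standalone sketch of the size gate with the explanatory comment the reviewer asked for. The function and constant names are illustrative, not the actual OpenVINO code:

```cpp
#include <cstddef>

// Empirically tuned threshold (in weight elements) from the PR discussion:
// below it, the split/concat overhead of tensor-parallel FC outweighs the
// gain from running the matmul on multiple sockets.
constexpr std::size_t kFcSplitThreshold = 6600000;

// Hypothetical helper: split only when the split dimension is divisible into
// more than one part and the weights are large enough to amortize the overhead.
bool should_split_fc(std::size_t split_dim_range, std::size_t weight_elems) {
    if (split_dim_range <= 1 || weight_elems < kFcSplitThreshold) {
        return false;  // small FC: splitting would be a regression
    }
    return true;
}
```

In the real pass, `weight_elems` would come from `ov::shape_size(wgt_item->get_shape())` as suggested in the review.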
Please add an explanatory comment in the code about this trade-off value in a follow-up PR.
sure. I will do it.
src/plugins/intel_cpu/src/transformations/cpu_opset/common/pass/split_fc.cpp
src/plugins/intel_cpu/src/transformations/cpu_opset/common/pass/split_fc.cpp
```cpp
};

// TODO: support transpose
if (ov::is_type<ov::op::v1::Transpose>(fc_weight_node)) {
```
The current implementation has the following limitations:
- Transpose on weights is not supported.
- Decompression without Subtract is not supported.
- Decompression with Subtract without Convert (with an f32 constant) is not supported.
- This transformation breaks postops/bias fusions for the FullyConnected node.

All these limitations "pop up" when the MatmulWeightsDecompression subgraph tests are run on a multi-socket platform: the corresponding tests fail.
For each limitation we need to decide whether it must be eliminated within this PR or whether the corresponding improvements can be implemented in follow-up PRs. If some of the limitations are not eliminated in this PR, we need to decide what to do with the failing tests: we should probably skip them.
```cpp
const auto& fc_node = pattern_map.at(fc_m).get_node_shared_ptr();
auto& rt_info = fc_node->get_rt_info();
if (rt_info.count("parallelDomain")) {
```
Not critical: it is better to create a class for this rt_info instead of just using `std::string`. This will allow us to override `ov::RuntimeAttribute` methods:
- `bool is_copyable()`: if I understand correctly, this rt info can be set only in this transformation and only for the split FullyConnected nodes, so this method must always return false, because we need to avoid propagating this info to other nodes.
- `std::string to_string()`: this will help us get more info in case of debug serialization.

This plugin rt info can be used as an example.
I see, your advice really makes sense. To aid debugging, this attribute should include the info of the op split path and then be dumped via the `to_string` API. At this milestone, it may not be possible to implement and test this feature in time; I plan to implement it in a follow-up PR.
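The reviewer's suggestion can be sketched roughly as follows. This is a hypothetical illustration: in real code the class would derive from `ov::RuntimeAttribute`, but a stub base is used here so the sketch compiles standalone, and the class and string format are assumptions, not the actual implementation:

```cpp
#include <string>
#include <utility>

// Stand-in for ov::RuntimeAttribute so the sketch is self-contained.
struct RuntimeAttributeStub {
    virtual ~RuntimeAttributeStub() = default;
    virtual bool is_copyable() const { return true; }
    virtual std::string to_string() const { return {}; }
};

// Hypothetical class replacing the bare "parallelDomain" std::string rt_info.
class ParallelDomain : public RuntimeAttributeStub {
public:
    explicit ParallelDomain(std::string domain) : m_domain(std::move(domain)) {}

    // The marker is set only by the SplitFC transformation on the split
    // FullyConnected nodes, so it must not propagate to other nodes on copy.
    bool is_copyable() const override { return false; }

    // Richer text helps when serializing the graph for debugging.
    std::string to_string() const override {
        return "parallelDomain{" + m_domain + "}";
    }

private:
    std::string m_domain;
};
```

As the reply above notes, the attribute could later also carry the op split path so that `to_string()` dumps it during debug serialization.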
src/plugins/intel_cpu/src/transformations/cpu_opset/common/pass/split_fc.cpp
@maxnick @v-Golubev Guys, thanks for the review!
Assuming this is the initial implementation of the TENSOR_PARALLEL feature, the state seems good enough for the merge.
However, we need to consider some follow-up steps to make the solution more scalable and effective. Two major points:
- Generalize the transformation part so it supports arbitrary operations, and separate the tensor split and affinity markup stages. The transformation itself should be shareable between plugins.
- Introduce a subgraph abstraction level on the plugin side to separate the compilation/execution context of each parallel branch. Each subgraph should be processed within a separate thread pool to ensure correct memory allocation/placement.
openvinotoolkit#23593
### Details:
- Integrated [PR19801](openvinotoolkit#19801), [PR23007](openvinotoolkit#23007) and [PR23127](openvinotoolkit#23127)
- Enable sub streams for per node core allocation
- Update class ModelDistributionPolicy and class SubStreamsMode
- Refactor get_model_prefer_threads() with class ModelDistributionPolicy
- Remove get_default_latency_streams() since it is always 1 now
- Add sub streams to the executor for per node core allocation
- Improve the performance of the FullyConnected layer on 2-socket Xeon systems
### Tickets:
- *123078, 129972, 132954*

Co-authored-by: Shen, Wanglei <[email protected]>
Co-authored-by: Xiuchuan Zhai <[email protected]>
Co-authored-by: Vladislav Golubev <[email protected]>

openvinotoolkit#23849
### Details:
- Shape inference can result in the appearance of zero-shape memory on node inputs. Such nodes are considered non-executable, so prepareParams is not called for them and no executors are created. It is therefore necessary to avoid setting the numa id for non-executable nodes.
- Currently a seg fault happens when a node gets zero-shape memory on its inputs. The issue was introduced in the scope of openvinotoolkit#23593.
### Tickets:
- *CVS-137763*