
Enable fullyconnect parallel for per node core allocation #23593

Merged

Conversation

@sunxiaoxia2022 (Contributor) commented Mar 21, 2024

Details:

  • integrated PR19801, PR23007 and PR23127
  • enable sub streams for per node core allocation
  • update class ModelDistributionPolicy and class SubStreamsMode
  • refactor get_model_prefer_threads() with class ModelDistributionPolicy
  • remove get_default_latency_streams() since it is always 1 now
  • add sub streams to executor for per node core allocation
  • Improve the performance of the FullyConnected layer on 2-socket Xeon systems.

Tickets:

  • 123078, 129972, 132954
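
For context, a minimal sketch of how tensor-parallel distribution is requested on the CPU plugin; the property and enum names follow the ov::hint API as exposed in OpenVINO releases after this PR, so treat this as an illustrative sketch rather than the exact surface added here:

```cpp
#include <openvino/openvino.hpp>

#include <set>

int main() {
    ov::Core core;

    // Request tensor-parallel model distribution so that large FullyConnected
    // layers can be split across sockets. Names assumed from later releases.
    auto compiled = core.compile_model(
        "model.xml",
        "CPU",
        ov::hint::model_distribution_policy(
            std::set<ov::hint::ModelDistributionPolicy>{
                ov::hint::ModelDistributionPolicy::TENSOR_PARALLEL}));
    return 0;
}
```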

Comment on lines 97 to 100
// No need to split the FC when the split dim range is <= 1 or the weights are small.
if (split_dim_range <= 1 || ov::shape_size(wgt_shape) < 6600000) {
return false;
}
A reviewer commented:

  1. We can easily move this check to the common part to avoid duplication on L172 (weights decompression ops don't affect the weights output shape, so we can just use wgt_item->get_shape() in the checks).
  2. Could you please explain what the magic number 6600000 is? Please also add an explanatory comment.

@xczhai replied on Mar 29, 2024:
  1. Updated.
  2. 6600000 is a threshold that was tested against and summarized from existing KPI models. Some LLMs have small FC layers for which the split operation causes degradation, so 6600000 is a trade-off value.

The reviewer followed up:

Please add an explanatory comment about this trade-off value to the code in a follow-up PR.

In reply: Sure, I will do it.
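
A sketch of what such an explanatory comment could look like for the check quoted above; the constant name is hypothetical, and the wording paraphrases the reply:

```cpp
// Empirical threshold (number of weight elements) below which splitting a
// FullyConnected node across sockets hurts performance. Tuned against
// existing KPI models: FC layers in some LLMs are small, and splitting them
// causes degradation, so 6600000 is a trade-off value.
constexpr size_t FC_SPLIT_WEIGHTS_THRESHOLD = 6600000;

// No need to split the FC when the split dimension cannot be divided
// or when the weights are below the threshold.
if (split_dim_range <= 1 || ov::shape_size(wgt_shape) < FC_SPLIT_WEIGHTS_THRESHOLD) {
    return false;
}
```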

};

// TODO: support transpose
if (ov::is_type<ov::op::v1::Transpose>(fc_weight_node)) {
A reviewer commented:

The current implementation has the following limitations:

  1. Transpose on weights is not supported.
  2. Decompression without Subtract is not supported.
  3. Decompression with Subtract without Convert (with an f32 constant) is not supported.
  4. This transformation breaks postops/bias fusions for the FullyConnected node.

All these limitations "pop up" when the MatmulWeightsDecompression subgraph tests run on a multi-socket platform: the corresponding tests fail.
For each limitation we need to decide whether it must be eliminated within this PR or whether the corresponding improvements can be implemented in follow-up PRs. If some of the limitations are not eliminated in this PR, we need to decide what to do with the failing tests; we should probably skip them.

CC @dmitry-gorokhov


const auto& fc_node = pattern_map.at(fc_m).get_node_shared_ptr();
auto& rt_info = fc_node->get_rt_info();
if (rt_info.count("parallelDomain")) {
A reviewer commented:

Not critical: it is better to create a class for this rt_info instead of using a plain std::string. This will allow us to override ov::RuntimeAttribute methods:

  1. bool is_copyable(): if I understand correctly, this rt_info can be set only in this transformation and only for the split FullyConnected nodes, so this method must always return false, because we need to avoid propagating this info to other nodes.
  2. std::string to_string(): this will help us get more info in case of debug serialization.

This plugin rt_info can be used as an example.

In reply:
I see, your advice really makes sense. To support better debugging, this attribute should include info about the op split path and then be dumped via the to_string API. At this milestone it may not be possible to implement and test this feature in time, so I plan to implement it in a follow-up PR.
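
A minimal sketch of what such an rt_info class might look like; the ParallelDomain name and stored field are hypothetical, while the overridden methods follow the ov::RuntimeAttribute interface described in the review comment above:

```cpp
#include <openvino/core/runtime_attribute.hpp>

#include <string>

// Hypothetical rt_info attribute replacing the plain "parallelDomain" string.
class ParallelDomain : public ov::RuntimeAttribute {
public:
    OPENVINO_RTTI("parallel_domain", "0", ov::RuntimeAttribute);

    ParallelDomain() = default;
    explicit ParallelDomain(std::string domain) : m_domain(std::move(domain)) {}

    // The attribute is set only for split FullyConnected nodes inside this
    // transformation, so it must never be propagated to other nodes.
    bool is_copyable() const override {
        return false;
    }

    // Expose the domain in debug serialization dumps.
    std::string to_string() const override {
        return m_domain;
    }

private:
    std::string m_domain;
};
```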

@xczhai xczhai requested a review from v-Golubev March 29, 2024 08:31
@dmitry-gorokhov (Contributor) left a comment:

@maxnick @v-Golubev Guys, thanks for the review!

Assuming this is the initial implementation of the TENSOR_PARALLEL feature, the state seems good enough for the merge.
However, we need to consider some follow-up steps to make the solution more scalable and effective. Two major points:

  • Generalize the transformation part: it should support arbitrary operations and separate the tensor-split and affinity-markup stages. The transformation itself should be shareable between plugins.
  • Introduce a subgraph abstraction level on the plugin side to separate the compilation/execution context of each parallel branch. Each subgraph should be processed within a separate thread pool to ensure correct memory allocation/placement.

@wangleis wangleis enabled auto-merge April 1, 2024 13:40
@wangleis wangleis added this pull request to the merge queue Apr 1, 2024
Merged via the queue into openvinotoolkit:master with commit e2c6ae9 Apr 1, 2024
108 checks passed
@wangleis wangleis deleted the xiaoxia/fc_parallel_all branch April 1, 2024 19:05
github-merge-queue bot pushed a commit that referenced this pull request Apr 4, 2024
### Details:
- Shape inference can result in zero-shape memory appearing on node inputs. Such nodes are considered non-executable, so prepareParams is not called for them and no executors are created. It is therefore necessary to avoid setting a NUMA id for non-executable nodes.
- Currently a seg fault happens when a node gets zero-shape memory on its inputs.

Issue introduced in scope of:
- #23593

### Tickets:
 - *CVS-137763*
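
A minimal sketch of the kind of guard this fix describes; the helper and the executor-setup names are hypothetical, and only the zero-shape reasoning comes from the commit message above:

```cpp
#include <openvino/core/shape.hpp>

#include <vector>

// Hypothetical helper: a node is non-executable when any input memory has a
// zero dimension (ov::shape_size of such a shape is 0). prepareParams is
// never called for these nodes and no executor is created, so NUMA binding
// must be skipped for them.
bool is_executable(const std::vector<ov::Shape>& input_shapes) {
    for (const auto& shape : input_shapes) {
        if (ov::shape_size(shape) == 0)
            return false;
    }
    return true;
}

// Somewhere in the executor setup (names are illustrative):
//     if (is_executable(node_input_shapes))
//         executor->set_numa_id(numa_id);
```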
bbielawx pushed a commit to bbielawx/openvino that referenced this pull request Apr 12, 2024 (openvinotoolkit#23593; same details and tickets as this PR's description above)

Co-authored-by: Shen, Wanglei <[email protected]>
Co-authored-by: Xiuchuan Zhai <[email protected]>
Co-authored-by: Vladislav Golubev <[email protected]>

bbielawx pushed a commit to bbielawx/openvino that referenced this pull request Apr 12, 2024 (openvinotoolkit#23849; same details as the fix commit above)

alvoron pushed a commit to alvoron/openvino that referenced this pull request Apr 29, 2024 (openvinotoolkit#23593)

alvoron pushed a commit to alvoron/openvino that referenced this pull request Apr 29, 2024 (openvinotoolkit#23849)
Labels: category: CPU (OpenVINO CPU plugin), category: inference (OpenVINO Runtime library - Inference), Code Freeze