Enable fullyconnect parallel for per node core allocation #23593
Conversation
…vino into fc_parallel_for_test
fix comments and clean some codes
fix windows warning
src/plugins/intel_cpu/src/transformations/cpu_opset/common/pass/split_fc.hpp
src/plugins/intel_cpu/tests/unit/transformations/split_fc_test.cpp
src/plugins/intel_cpu/src/transformations/cpu_opset/common/pass/split_fc.cpp
```cpp
// needn't to split fc when the dim is 0.
if (split_dim_range <= 1 || ov::shape_size(wgt_shape) < 6600000) {
    return false;
}
```
- We can easily move this check to the common part to avoid duplication on L172 (weights decompression ops don't affect the weights output shape, so we can just use `wgt_item->get_shape()` in the checks).
- Could you please explain what the magic number `6600000` is? And please also add an explanatory comment.
- Updated. `660000` is a threshold that was tested and summarized on existing KPI models. Some LLMs' FC layers are small, and splitting them causes a regression, so `660000` is a trade-off value.
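To make the trade-off discussed above concrete, here is a minimal standalone sketch of the size gate with the explanatory comment the reviewer asked for. The function and constant names are illustrative, not the actual OpenVINO code:

```cpp
#include <cstddef>

// Empirically tuned threshold (in weight elements) from the PR discussion:
// below it, the split/concat overhead of tensor-parallel FC outweighs the
// gain from running the matmul on multiple sockets.
constexpr std::size_t kFcSplitThreshold = 6600000;

// Hypothetical helper: split only when the split dimension is divisible into
// more than one part and the weights are large enough to amortize the overhead.
bool should_split_fc(std::size_t split_dim_range, std::size_t weight_elems) {
    if (split_dim_range <= 1 || weight_elems < kFcSplitThreshold) {
        return false;  // small FC: splitting would be a regression
    }
    return true;
}
```

In the real pass, `weight_elems` would come from `ov::shape_size(wgt_item->get_shape())` as suggested in the review.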
Please add an explanatory comment in the code about this trade-off value in a follow-up PR.
sure. I will do it.
src/plugins/intel_cpu/src/transformations/cpu_opset/common/pass/split_fc.cpp
src/plugins/intel_cpu/src/transformations/cpu_opset/common/pass/split_fc.cpp
```cpp
};

// TODO: support transpose
if (ov::is_type<ov::op::v1::Transpose>(fc_weight_node)) {
```
The current implementation has the following limitations:
- Transpose on weights is not supported.
- Decompression without Subtract is not supported.
- Decompression with Subtract without Convert (with an f32 constant) is not supported.
- This transformation breaks postops/bias fusions for the FullyConnected node.

All these limitations "pop up" when the MatmulWeightsDecompression subgraph tests are run on a multi-socket platform: the corresponding tests fail.
For each limitation we need to decide whether it must be eliminated within this PR or whether the corresponding improvements can be implemented in follow-up PRs. If some of the limitations are not eliminated in this PR, we need to decide what to do with the failing tests: we should probably skip them.
```cpp
const auto& fc_node = pattern_map.at(fc_m).get_node_shared_ptr();
auto& rt_info = fc_node->get_rt_info();
if (rt_info.count("parallelDomain")) {
```
Not critical: it is better to create a class for this rt_info instead of just using `std::string`. This will allow us to override `ov::RuntimeAttribute` methods:
- `bool is_copyable()`: if I understand correctly, this rt info can be set only in this transformation and only for the split FullyConnected nodes, so this method must always return false, because we need to avoid propagating this info to other nodes.
- `std::string to_string()`: this will help us get more info in case of debug serialization.

This plugin rt info can be used as an example.
I see, your advice really makes sense. To aid debugging, this attribute should include the info of the op split path and then be dumped via the `to_string` API. At this milestone, it may not be possible to implement and test this feature in time; I plan to implement it in a follow-up PR.
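The reviewer's suggestion can be sketched roughly as follows. This is a hypothetical illustration: in real code the class would derive from `ov::RuntimeAttribute`, but a stub base is used here so the sketch compiles standalone, and the class and string format are assumptions, not the actual implementation:

```cpp
#include <string>
#include <utility>

// Stand-in for ov::RuntimeAttribute so the sketch is self-contained.
struct RuntimeAttributeStub {
    virtual ~RuntimeAttributeStub() = default;
    virtual bool is_copyable() const { return true; }
    virtual std::string to_string() const { return {}; }
};

// Hypothetical class replacing the bare "parallelDomain" std::string rt_info.
class ParallelDomain : public RuntimeAttributeStub {
public:
    explicit ParallelDomain(std::string domain) : m_domain(std::move(domain)) {}

    // The marker is set only by the SplitFC transformation on the split
    // FullyConnected nodes, so it must not propagate to other nodes on copy.
    bool is_copyable() const override { return false; }

    // Richer text helps when serializing the graph for debugging.
    std::string to_string() const override {
        return "parallelDomain{" + m_domain + "}";
    }

private:
    std::string m_domain;
};
```

As the reply above notes, the attribute could later also carry the op split path so that `to_string()` dumps it during debug serialization.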
src/plugins/intel_cpu/src/transformations/cpu_opset/common/pass/split_fc.cpp
@maxnick @v-Golubev Guys, thanks for the review!
Assuming this is the initial implementation of the TENSOR_PARALLEL feature, the state seems good enough for the merge.
However, we need to consider some follow-up steps to make the solution more scalable and effective. Two major points:
- Generalize the transformation part so it supports arbitrary operations, and separate the tensor split and affinity markup stages. The transformation itself should be shareable between plugins.
- Introduce a subgraph abstraction level on the plugin side to separate the compilation/execution context of each parallel branch. Each subgraph should be processed within a separate thread pool to ensure correct memory allocation/placement.
openvinotoolkit#23593
### Details:
- Integrated [PR19801](openvinotoolkit#19801), [PR23007](openvinotoolkit#23007) and [PR23127](openvinotoolkit#23127)
- Enable sub streams for per node core allocation
- Update class ModelDistributionPolicy and class SubStreamsMode
- Refactor get_model_prefer_threads() with class ModelDistributionPolicy
- Remove get_default_latency_streams() since it is always 1 now
- Add sub streams to the executor for per node core allocation
- Improve the performance of the FullyConnected layer on 2-socket Xeon systems
### Tickets:
- *123078, 129972, 132954*

Co-authored-by: Shen, Wanglei <[email protected]>
Co-authored-by: Xiuchuan Zhai <[email protected]>
Co-authored-by: Vladislav Golubev <[email protected]>

openvinotoolkit#23849
### Details:
- Shape inference can result in the appearance of zero-shape memory on node inputs. Such nodes are considered non-executable, so prepareParams is not called for them and no executors are created. It is therefore necessary to avoid setting the numa id for non-executable nodes.
- Currently a seg fault happens when a node gets zero-shape memory on its inputs. The issue was introduced in the scope of openvinotoolkit#23593.
### Tickets:
- *CVS-137763*