CpuGemmConv2d optimization affects performance on Apple M2/M2 Pro #1092

alvoron · 2024-02-15T09:43:14Z

PR https://review.mlplatform.org/c/ml/ComputeLibrary/+/10526 makes CpuGemmConv2d slower on Apple M2 / M2 Pro.

The numbers below were collected on M2 Pro.
On mobilenet-v2-1.0-224, CpuGemmConv2d takes 3.18 ms before the PR and 4.12 ms after the PR was merged.
On resnet-50-pytorch, it takes 16.37 ms before the PR and 19.67 ms after the PR.

So we see a 20-30% increase in CpuGemmConv2d execution time on CNNs.
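For reference, the 20-30% range follows directly from the timings quoted above. A minimal sketch of the arithmetic (the helper name `slowdown_percent` is only illustrative):

```python
def slowdown_percent(before_ms: float, after_ms: float) -> float:
    """Relative increase in execution time, as a percentage of the 'before' value."""
    return (after_ms - before_ms) / before_ms * 100.0


# CpuGemmConv2d timings on M2 Pro, as reported in this issue (milliseconds).
timings = {
    "mobilenet-v2-1.0-224": (3.18, 4.12),
    "resnet-50-pytorch": (16.37, 19.67),
}

for model, (before, after) in timings.items():
    print(f"{model}: +{slowdown_percent(before, after):.1f}%")
```

This gives roughly 30% for mobilenet-v2 and roughly 20% for resnet-50.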

@sicong-li-arm @gunes-arm @aniraj01

morgolock · 2024-02-15T10:14:20Z

Hi @alvoron

Thanks for reporting this.

Would you please let us know how many inferences/iterations you are running?

alvoron · 2024-02-15T17:21:53Z

I run each model for 30 seconds and calculate the average execution time of each operation type.
So, I have 7319 iterations of mobilenet-v2-1.0-224 and 1536 iterations of resnet-50-pytorch.
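The measurement approach described above can be sketched roughly as follows. This is not OpenVINO's profiling code; `average_op_times` and the `run_once` callback are hypothetical names, and the callback is assumed to return per-operation timings for a single inference.

```python
import time
from collections import defaultdict


def average_op_times(run_once, duration_s=30.0):
    """Call run_once() repeatedly for about duration_s seconds.

    run_once() is assumed to return a dict mapping operation type to the
    time (in ms) spent in that operation type during one iteration.
    Returns (iterations, average ms per operation type).
    """
    totals = defaultdict(float)
    iterations = 0
    deadline = time.perf_counter() + duration_s
    # Always run at least once, then keep going until the deadline passes.
    while iterations == 0 or time.perf_counter() < deadline:
        for op_type, ms in run_once().items():
            totals[op_type] += ms
        iterations += 1
    return iterations, {op: total / iterations for op, total in totals.items()}


# Stand-in workload with made-up, constant timings.
n, averages = average_op_times(lambda: {"CpuGemmConv2d": 2.0, "Pooling": 0.5},
                               duration_s=0.01)
print(n, averages)
```

With thousands of iterations, a one-off cost in the first iteration has almost no effect on such an average.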

gunes-arm · 2024-02-15T17:55:53Z

Hi @alvoron

The mentioned patch should affect only the start-up time, i.e. the first iteration. I wonder whether your runs call configure() on every iteration, or call configure() only in the first iteration and run() in the remaining ones.
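To illustrate why this distinction matters, here is a toy cost model. All numbers are made up and the function name is hypothetical; it does not use or describe ACL's API, it only shows how a configure-time cost is amortized.

```python
def average_iteration_cost(configure_ms, run_ms, iterations, configure_every_iteration):
    """Average per-iteration cost under a toy model (all inputs hypothetical)."""
    configure_calls = iterations if configure_every_iteration else 1
    return (configure_calls * configure_ms + iterations * run_ms) / iterations


# Made-up costs: 1 ms of extra work in configure(), 3 ms per run().
once = average_iteration_cost(1.0, 3.0, iterations=1000, configure_every_iteration=False)
every = average_iteration_cost(1.0, 3.0, iterations=1000, configure_every_iteration=True)
print(f"configure once: {once:.3f} ms/iter, configure every time: {every:.3f} ms/iter")
```

If configure() runs only once, its cost disappears in the average; if it runs on every iteration, the full cost shows up in every measurement.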

alvoron · 2024-02-16T10:46:32Z

OpenVINO uses oneDNN to call ACL's convolution. It seems oneDNN calls configure() once, via the acl_gemm_convolution_fwd_t::create_resource() method:
https://github.com/openvinotoolkit/oneDNN/blob/f82148befdbdc9576ec721c9d500155ee4de8060/src/cpu/acl/acl_gemm_convolution.hpp#L44

morgolock · 2024-02-21T13:57:47Z

Hi @alvoron

I ran ACL's benchmark_graph_mobilenet_v2 on a device with an M2, but I could not see a significant performance degradation.

Below is the execution including the patch that you mentioned:

% ./build/tests/benchmark_graph_mobilenet_v2 --iterations=1000  --example_args='--threads=1,--target=NEON,--type=F32'
Version = arm_compute_version=v0.0-unreleased Build options: {'neon': '1', 'opencl': '0', 'benchmark_tests': '0', 'examples': '0', 'benchmark_examples': '1', 'os': 'macos', 'arch': 'armv8a', 'multi_isa': '0', 'logging': '0', 'asserts': '0', 'standalone': '0', 'validation_tests': '0', 'build': 'native'} Git hash=b'c5ab4df0c11dc66db47f2070edc719923af3367e'
CommandLine = ./build/tests/benchmark_graph_mobilenet_v2 --iterations=1000 --example_args=--threads=1,--target=NEON,--type=F32 
Iterations = 1000
Running [0] 'Examples/benchmark_graph_mobilenet_v2'
Threads : 1
Target : Neon
Data type : F32
Data layout : NHWC
Tuner enabled? : false
Cache enabled? : false
Tuner mode : Normal
Tuner file : 
MLGO file : 
Fast math enabled? : false

  Wall clock/Wall clock time:    AVG=6620.1732 us, STDDEV=2.62 %, MIN=6594.0000 us, MAX=10888.0000 us, MEDIAN=6608.0000 us
Executed 1 test(s) (1 passed, 0 expected failures, 0 failed, 0 crashed, 0 disabled) in 6 second(s)

And this is without the patch:

ComputeLibrary % ./build/tests/benchmark_graph_mobilenet_v2_reverted --iterations=1000  --example_args='--threads=1,--target=NEON,--type=F32' 
Version = arm_compute_version=v0.0-unreleased Build options: {'neon': '1', 'opencl': '0', 'benchmark_tests': '0', 'examples': '0', 'benchmark_examples': '1', 'os': 'macos', 'arch': 'armv8a', 'multi_isa': '0', 'logging': '0', 'asserts': '0', 'standalone': '0', 'validation_tests': '0', 'build': 'native'} Git hash=b'4a9dbedfbfa66c2612c7461e60cd867b8aea825b'
CommandLine = ./build/tests/benchmark_graph_mobilenet_v2_reverted --iterations=1000 --example_args=--threads=1,--target=NEON,--type=F32 
Iterations = 1000
Running [0] 'Examples/benchmark_graph_mobilenet_v2_reverted'
Threads : 1
Target : Neon
Data type : F32
Data layout : NHWC
Tuner enabled? : false
Cache enabled? : false
Tuner mode : Normal
Tuner file : 
MLGO file : 
Fast math enabled? : false

  Wall clock/Wall clock time:    AVG=6600.4505 us, STDDEV=0.88 %, MIN=6581.0000 us, MAX=8123.0000 us, MEDIAN=6596.0000 us
Executed 1 test(s) (1 passed, 0 expected failures, 0 failed, 0 crashed, 0 disabled) in 6 second(s)

Difference in AVG: 6620.1732 us - 6600.4505 us = 19.7227 us
Relative difference: 19.7227 us / 6620.1732 us = 0.003, i.e. about 0.3%
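The same comparison as a short script, using only the AVG and STDDEV values from the two runs above. It also converts the reported STDDEV of the patched run into microseconds to show that the difference sits well inside the run-to-run noise.

```python
# Values copied from the two benchmark_graph_mobilenet_v2 runs above.
with_patch_avg_us = 6620.1732
without_patch_avg_us = 6600.4505
with_patch_stddev_pct = 2.62  # STDDEV of the patched run, in percent of AVG

diff_us = with_patch_avg_us - without_patch_avg_us
relative = diff_us / with_patch_avg_us
stddev_us = with_patch_avg_us * with_patch_stddev_pct / 100.0

print(f"difference: {diff_us:.4f} us ({relative:.2%} of the patched average)")
print(f"one standard deviation of the patched run: {stddev_us:.1f} us")
```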

Would you please confirm if you experience the problem on other devices?
Can you please share the models you are running? Are there tflite files?

jondea · 2024-02-22T11:54:35Z

> OpenVINO uses oneDNN to call ACL's convolution. It seems oneDNN calls configure once via acl_gemm_convolution_fwd_t::create_resource() method: https://github.com/openvinotoolkit/oneDNN/blob/f82148befdbdc9576ec721c9d500155ee4de8060/src/cpu/acl/acl_gemm_convolution.hpp#L44

With DNNL_VERBOSE enabled, is OpenVINO recreating the resource, or is it getting oneDNN cache hits? Some frameworks have their own caching mechanisms.
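One way to answer this from a captured log is to count primitive-creation events. The sketch below assumes the verbose output marks them with `create:cache_hit` and `create:cache_miss`, as oneDNN's verbose mode does in the versions I am aware of; the exact line layout varies between versions, and the sample lines here are made up and abbreviated.

```python
from collections import Counter


def count_primitive_cache_events(verbose_lines):
    """Count primitive cache hits and misses in oneDNN verbose output lines."""
    counts = Counter()
    for line in verbose_lines:
        if "create:cache_hit" in line:
            counts["hit"] += 1
        elif "create:cache_miss" in line:
            counts["miss"] += 1
    return counts


# Illustrative, abbreviated lines (not taken from a real run).
sample_log = [
    "onednn_verbose,primitive,create:cache_miss,cpu,convolution,<rest elided>",
    "onednn_verbose,primitive,exec,cpu,convolution,<rest elided>",
    "onednn_verbose,primitive,create:cache_hit,cpu,convolution,<rest elided>",
]
print(count_primitive_cache_events(sample_log))
```

Many misses for the same convolution shape would suggest that the primitive, and with it the ACL resource, is being recreated.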

alvoron · 2024-02-22T16:57:14Z

It seems the issue could be reproduced via benchdnn without OpenVINO.

ACL build command:
scons neon=1 opencl=0 openmp=0 cppthreads=1 os=macos data_layout_support=all arch=arm64-v8.2-a build=native --jobs=8 os=macos build=native compiler_cache=ccache compiler_prefix="/Library/Developer/CommandLineTools/usr/bin/" --silent fixed_format_kernels=True

oneDNN configure command (run in the oneDNN root directory):
ACL_ROOT_DIR=$PWD/../ComputeLibrary cmake -B build -DCMAKE_BUILD_TYPE=Release -DDNNL_USE_ACL=ON -DCMAKE_RULE_MESSAGES=OFF -DACL_LIBRARY=$PWD/../ComputeLibrary/build/libarm_compute.dylib -DACL_CORE_LIBRARY=$PWD/../ComputeLibrary/build/libarm_compute_core.dylib -DACL_GRAPH_LIBRARY=$PWD/../ComputeLibrary/build/libarm_compute_graph.dylib

benchdnn build command:
cmake --build build --target benchdnn --parallel 7

The reproducer:
DYLD_LIBRARY_PATH=$PWD/../ComputeLibrary/build ./build/tests/benchdnn/benchdnn --mode=P --conv --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_B --alg=direct --dt=f32:f32:f32 --stag=acdb --wtag=any --dtag=acdb --attr-scratchpad=user mb1_ic1280oc1001_ih1oh1kh1sh1dh0ph0_iw1ow1kw1sw1dw0pw0
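For readers unfamiliar with benchdnn problem descriptors, the shape in the reproducer can be decoded with a small helper. `parse_conv_descriptor` is a hypothetical name, and the parser only handles the simple `<key><number>` form used here.

```python
import re


def parse_conv_descriptor(desc):
    """Split a benchdnn-style conv descriptor into {key: int} pairs."""
    return {key: int(value) for key, value in re.findall(r"([a-z]+)(\d+)", desc)}


shape = parse_conv_descriptor("mb1_ic1280oc1001_ih1oh1kh1sh1dh0ph0_iw1ow1kw1sw1dw0pw0")

# Multiply-accumulate operations for one forward pass of this convolution.
macs = (shape["mb"] * shape["oc"] * shape["oh"] * shape["ow"]
        * shape["ic"] * shape["kh"] * shape["kw"])

print(shape)
print(f"MACs per forward pass: {macs}")
```

This is a 1x1 kernel applied to a 1x1 input with 1280 input channels and 1001 output channels, so it behaves like a 1280-to-1001 fully connected layer with batch size 1.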

On M2 Pro I got min(ms):0.255333 avg(ms):0.357945 on ACL SHA c5ab4df0c11dc66db47f2070edc719923af3367e (with the patch) and min(ms):0.042875 avg(ms):0.0624329 on SHA 4a9dbedfbfa66c2612c7461e60cd867b8aea825b (without the patch).

@morgolock could you please try to repeat these steps?

UPD:
A couple of comments:

Please take the oneDNN fork that is used by OpenVINO: https://github.com/openvinotoolkit/oneDNN (SHA - 4e29b771fcdfab5bdb219a495e694d6206e52b67)
You need to apply 2 small changes to oneDNN to adopt the new version of ACL: openvinotoolkit/oneDNN@19bb9f2...d76046a
I also reproduced the issue using benchdnn on an M1 Mac mini: total perf: min(ms):0.273542 avg(ms):0.309104 on c5ab4df0c11dc66db47f2070edc719923af3367e and total perf: min(ms):0.0366251 avg(ms):0.0638425 on 4a9dbedfbfa66c2612c7461e60cd867b8aea825b
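To put the benchdnn numbers side by side, here is a quick calculation of the slowdown factor between the two ACL commits, using only the min/avg values reported in this comment:

```python
# (min_ms, avg_ms) as reported above; "patched" is c5ab4df0, "reverted" is 4a9dbedf.
results = {
    "M2 Pro": {"patched": (0.255333, 0.357945), "reverted": (0.042875, 0.0624329)},
    "M1": {"patched": (0.273542, 0.309104), "reverted": (0.0366251, 0.0638425)},
}

factors = {}
for device, runs in results.items():
    min_factor = runs["patched"][0] / runs["reverted"][0]
    avg_factor = runs["patched"][1] / runs["reverted"][1]
    factors[device] = (min_factor, avg_factor)
    print(f"{device}: min x{min_factor:.2f}, avg x{avg_factor:.2f}")
```

For this single layer the patched commit is several times slower on both devices, which is a much larger gap than the 20-30% seen across whole models.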

morgolock · 2024-03-06T14:23:58Z

Hi @alvoron

Thanks for reporting this performance regression and providing so much detail.

We have merged a patch fixing the problem into the main development branch, and we will do a patch release of 24.02 that includes this fix.

Hope this helps.

morgolock · 2024-04-02T10:30:36Z

Hi @alvoron

Closing this as it was fixed in 24.02.1.

Please reopen if you require further assistance.

alvoron mentioned this issue Feb 15, 2024

[CPU][ARM] Upgrade to ACL v24.02.1 openvinotoolkit/openvino#22598

Merged

morgolock added the Performance label Feb 15, 2024

morgolock added this to the v24.02.1 milestone Mar 6, 2024

morgolock added the Bug label Mar 6, 2024

morgolock closed this as completed Apr 2, 2024
