
[WIP]backend: Integrating QNN (Qualcomm AI Engine Direct) as a dedicated backend for Qualcomm NPUs #12063

Draft: wants to merge 169 commits into master
Conversation

@chraac commented Feb 25, 2025

Warning: This is an early draft of my fork and will continue to be updated to meet the requirements in the contributing guidelines.

Summary

This fork is based on zhouwg's initial PR and carries out further refactoring and improvements to introduce Qualcomm QNN backend support in GGML.

This backend is organized into three distinct integration layers:

```mermaid
graph TB
    subgraph GGML Adaptation Layer
        A1[Graph Caching, Mapping, and Execution]
        A2[Tensor Binding and Execution Flow]
    end

    subgraph QNN Object Layer
        B1[QNN System and Instance Management]
        B2[Dynamic Resource Handling]
    end

    subgraph Utility Layer
        C1[Dynamic Library Loading & Search Path Management]
        C2[General Utilities]
    end

    %% Relations to illustrate stack dependency
    A1 -->|Uses| B1
    A2 -->|Uses| B1
    B1 -->|Relies on| C1
```
  1. GGML Adaptation Layer

    • Graph Caching, Mapping, and Execution:

      • Provides a robust mechanism to map a GGML computation graph into a corresponding QNN graph, allowing efficient offloading of operations to the QNN accelerator.
      • Implements graph caching strategies (in backend-ops.cpp) to minimize redundant graph creation and boost execution performance.
      • Seamlessly translates GGML operations into corresponding QNN op objects using specialized op constructors and configuration functions (configured in op-config-caps.cpp and op-config-impl.cpp).
    • Tensor Binding and Execution Flow:

      • Adapts GGML tensor objects to the QNN backend (see tensor.hpp and graph.hpp), managing both host and RPC memory via buffer interfaces like qnn_buffer_interface.
      • Ensures proper data flow between GGML graphs and QNN execution contexts through carefully handled tensor binding/unbinding procedures.
  2. QNN Object Layer

    • QNN System and Instance Management:

      • Encapsulates the QNN system via the qnn_system_interface class, originally derived from executorch, to create and free the QNN system context.
      • Manages QNN instance creation and initialization via the qnn_instance class.
      • Implements backend loading routines (e.g., load_backend() and load_system()) that retrieve provider lists and choose valid QNN interfaces based on API version checks.
      • Uses caching mechanisms for loaded backends and tracks library handles to guarantee proper cleanup during finalization.
    • Dynamic Resource Handling:

      • Integrates fallback mechanisms in load_lib_with_fallback() to reliably load both the system and RPC libraries.
      • Manages RPC memory allocation and deallocation via function pointer resolution from the loaded RPC library.
  3. Utility Layer

    • Dynamic Library Loading & Search Path Management:

      • Implements functions in qnn-lib.cpp to manage dynamic library loading with fallbacks.
      • Uses helper routines such as insert_path() and set_qnn_lib_search_path() to configure environment variables (like LD_LIBRARY_PATH on Linux and ADSP_LIBRARY_PATH on Android) based on a custom library search path (a sketch of these helpers follows this list).
    • General Utilities:

      • Provides detailed error and debug logging through QNN logging macros.
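
To make this concrete, here is a minimal sketch of the search-path helpers referenced in the Utility Layer above. The names insert_path() and set_qnn_lib_search_path() come from the PR, but the bodies here are assumptions, not the actual implementation:

```cpp
// Hypothetical sketch of the search-path helpers; POSIX/Android only.
#include <cstdlib>
#include <string>

// Prepend `path` to the colon-separated environment variable `env_name`.
// (Assumed behavior of the PR's insert_path().)
static void insert_path(const char * env_name, const std::string & path) {
    std::string value = path;
    if (const char * old_value = std::getenv(env_name)) {
        value += ':';
        value += old_value;
    }
    setenv(env_name, value.c_str(), /*overwrite=*/1);
}

// Point both the dynamic linker and the Hexagon DSP loader at a custom
// QNN library directory. (Assumed behavior of set_qnn_lib_search_path().)
static void set_qnn_lib_search_path(const std::string & lib_dir) {
    insert_path("LD_LIBRARY_PATH", lib_dir);
#if defined(__ANDROID__)
    // ADSP_LIBRARY_PATH tells the DSP runtime where to find the skel libraries.
    insert_path("ADSP_LIBRARY_PATH", lib_dir);
#endif
}
```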

Key Features and Improvements

  • Graph Mapping Mechanism:

    • Efficient mapping of GGML graphs into QNN graphs is a standout feature, enabling the offloading and execution of computation graphs on hardware accelerators (see graph.hpp and backend-ops.cpp).
    • Graph caching strategies help reuse QNN graphs to reduce redundancy and enhance performance (a sketch of such a cache follows this list).
    • The translation of GGML operations into corresponding QNN ops supports various data types and parameter configurations.
  • Backend Context and Device Management:

    • Comprehensive QNN instance initialization supports API negotiation, enhanced error handling, and detailed device property logging.
    • Detailed logs (chipset description, HTP architecture, VTCM memory size) facilitate debugging and performance tuning.
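
As an illustration of the caching strategy, here is a minimal sketch of a graph cache keyed by the generated graph-key string. The qnn_graph name echoes the PR's graph.hpp, but the cache shape and the get_or_build() helper are assumptions:

```cpp
// Hypothetical sketch of a QNN graph cache keyed by a graph-key string.
#include <memory>
#include <string>
#include <unordered_map>

struct qnn_graph;  // finalized QNN graph wrapper (see graph.hpp in the PR)
using qnn_graph_ptr = std::shared_ptr<qnn_graph>;

class qnn_graph_cache {
  public:
    // Return the cached graph for `key`, building and caching it on a miss.
    template <typename BuildFn>
    qnn_graph_ptr get_or_build(const std::string & key, BuildFn && build) {
        auto it = _graphs.find(key);
        if (it != _graphs.end()) {
            return it->second;  // hit: reuse the already-finalized QNN graph
        }
        qnn_graph_ptr graph = build();  // miss: map the ggml graph to QNN once
        if (graph) {
            _graphs.emplace(key, graph);
        }
        return graph;
    }

  private:
    std::unordered_map<std::string, qnn_graph_ptr> _graphs;
};
```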

Testing

  • Basic functionality of the QNN backend has been verified on Android, Linux, and Windows using test-backend-ops; these runs are integrated into the CI pipeline for each commit of the dev-refactoring branch.

    | Platform | test-backend-ops | Full console output |
    | --- | --- | --- |
    | Android | (screenshot: 2ac8fce111ee0047a5a8b43808047ff2) | test-backend-ops_all_android_ff033e1.log |
    | Linux | (screenshot) | test-backend-ops_all_linux_ff033e1.log |
  • Proper graph creation and execution paths are confirmed through detailed log messages.

  • Memory registration and cleanup within tensor binding functions have been thoroughly checked.

  • TODO: add benchmark result here

Current state

  • The test-backend-ops suite passes on all platforms, including support for both qnn-npu and qnn-gpu devices.
  • Testing with llama3.2-1b/3b-f16/32 models yields expected results.
  • Quantized matrix multiplication is under development; for quantized modules, the CPU backend may be used as a fallback (a sketch of the fallback check follows this list).
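
For illustration, such a fallback is typically expressed through the backend's supports_op hook; the sketch below shows the general pattern and is an assumption about this PR's behavior, not its actual code:

```cpp
// Hypothetical sketch: reject quantized mul_mat so that ggml's scheduler
// falls back to the CPU backend for those nodes.
#include "ggml.h"
#include "ggml-backend.h"

static bool qnn_device_supports_op_sketch(ggml_backend_dev_t /*dev*/, const ggml_tensor * op) {
    switch (op->op) {
        case GGML_OP_MUL_MAT:
            // Quantized matmul is still under development: report it as
            // unsupported so the scheduler assigns the node to the CPU backend.
            if (ggml_is_quantized(op->src[0]->type)) {
                return false;
            }
            return true;
        default:
            return false;  // conservative default for this sketch
    }
}
```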

Future development

  • Further feature support and device-specific optimizations are planned (see also the project backlog).
  • Future iterations will add support for quantization data types, with efforts underway to map GGML's block quantization structure into QNN (the block layout in question is sketched below).
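
For reference, the simplest of GGML's quantization blocks looks like this (layout as in ggml's quantization format; ggml_half is ggml's 16-bit float type, spelled out here so the snippet stands alone). Expressing structures like this in QNN's tensor types is the open problem mentioned above:

```cpp
#include <cstdint>

typedef uint16_t ggml_half;  // fp16 bit pattern, as in ggml-common.h

// GGML's Q4_0 block: 32 weights stored as 4-bit quants plus one fp16 scale.
// Dequantization: x[i] = d * (q[i] - 8), where q[i] is a 4-bit value in [0, 15].
#define QK4_0 32

typedef struct {
    ggml_half d;              // fp16 scale shared by the 32 weights
    uint8_t   qs[QK4_0 / 2];  // 32 x 4-bit quants, packed two per byte
} block_q4_0;
```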

zhouwg and others added 30 commits April 24, 2024 16:28
# Conflicts:
#	ggml/src/ggml-backend-reg.cpp
* more log

* split graph implementation into cpp file

* rename: ggml_qnn_graph -> qnn_graph

* add input/output tensor to graph

* fix assert

* wip

* add _ggml_tensor field in qnn tensor

* add comments

* add set_data_buffer with raw memory buffer

* use set_data_buffer

* op param buffer use qnn_buffer_ptr

* add qnn_mem_buffer_slice

* use qnn_buffer_ptr as tensor buffer

* use new set_data_buffer to reduce copy

* ggml_qnn_op_config: add function to set input/output tensor before init node

* remove ggml_qnn_connectable_op_config and use ggml_qnn_single_op_config instead

* wip

* add initialize_op_nodes without tensor params

* wip

* add op caps table

* merge kGgmlOpToQnnOp and kOpCaps tables

* wip

* add cache parameter to create_tensors

* add init_from_ggml_graph

* disable gelu for all backend

* wip

* move op index calc to op config module

* use the ggml_tensor as parameter of build_graph

* add log

* use create_operation_from_op_tensor in old build_graph function

* remove unused constructors

* fix parameter count

* remove unused member func/var

* make init_from_ggml_graph as a class member: build_graph_from_ggml_graph

* move graph finalize into member function `finalize()`

* get graph key from ggml op tensor directly

* append output type

* reduce tensor key length

* add function to generate key from ggml_cgraph

* simplify graph cache insert and delete

* remove template param at get_qnn_graph_from_cache

* wip

* merge kQnnUnaryOpsTable and kQnnBinaryOpsTable

* refactor device_supports_op

* add log

* wip

* use framework function to check same shape

* wip

* extract some logic into separated function

* wip

* add execution function that runs graph

* add function to create qnn graph from ggml_cgraph with cache

* execute graph directly

* return null graph key for empty graph

* add more qualcomm chipset enums

* add cap for reshape

* disable some ops

* try to skip GGML_OP_VIEW

* more log for view tensor

* append param tensor into intermediate tensor key

* use 'ordered' set

* fix warning in release

* wip
# Conflicts:
#	ggml/CMakeLists.txt
#	src/llama.cpp
* disable rpc buffer for npu

* append input/output tensor size into unsupported op log

* log dimensions for unsupported tensor

* wip

* split op config classes into separated file

* fix reshape

* wip

* add op_constructor_with_type_param

* set parameter for op_constructor_with_type_param func
* move qnn_instance function implementation into cpp

* wip

* wip

* move dl related function into separated file

* use cast op for gpu

* Revert "use cast op for gpu"

This reverts commit 05df736.

* Reapply "use cast op for gpu"

This reverts commit 2520e59.

* fix compiling error in win

* fix align_alloc in win

* fix compiling error

* add get sys free/total mem for win

* wip

* suppress warning in win

* add missing chrono header

* set the correct qnn lib name for windows

* add flag to control cpu backend

* wip

* wip

* Revert "Reapply "use cast op for gpu""

This reverts commit f56519c.

* fix compiling error for linux build

* fix cdsprpc dynamic library name

* wip

* skip rpc load fail

* fix page_align_alloc

* suppress some warning in gcc

* wip

* reuse align to function

* more log

* add log and fix warning

* wip

* fix asan errors and memory leaks

* fix the get_io_tensors_from_graph

* improve comment

* print GGML_QNN_DEFAULT_LIB_SEARCH_PATH

* revert some unused changes

* move library search path setter into qnn module

* fix android library loading

* skip qnn_device_get_platform_info for npu emulator
@chraac marked this pull request as draft February 25, 2025 07:20
@github-actions bot added the build (Compilation issues) and ggml (changes relating to the ggml tensor library for machine learning) labels Feb 25, 2025
@chraac changed the title from "[WIP][QNN] Integrating QNN (Qualcomm AI Engine Direct) as a dedicated backend for Qualcomm NPUs" to "[WIP]backend: Integrating QNN (Qualcomm AI Engine Direct) as a dedicated backend for Qualcomm NPUs" Feb 25, 2025
@zhouwg (Contributor) commented Feb 25, 2025

I don't know this Chinese programmer, and I'm not a member of his team. I'd like to see his team's success in this great community. Thanks.

@chraac (Author) commented Feb 25, 2025

> I don't know this Chinese programmer, and I'm not a member of his team. I'd like to see his team's success in this great community. Thanks.

Yeah, just to clarify, @zhouwg is not affiliated with us, but we appreciate his support! Anyone interested in discussing QNN-related topics is very welcome to join the conversation.

@zhouwg (Contributor) commented Feb 25, 2025

> I don't know this Chinese programmer, and I'm not a member of his team. I'd like to see his team's success in this great community. Thanks.

> Yeah, just to clarify, @zhouwg is not affiliated with us, but we appreciate his support! Anyone interested in discussing QNN-related topics is very welcome to join the conversation.

I didn't provide any support to @chraac and his team. As I said before: I don't know this guy and his team, and I'd like to see their success in this community. Thanks so much.

```cpp
}

bool qnn_graph::build_graph_from_ggml_graph(const ggml_cgraph *cgraph) {
    QNN_LOG_DEBUG("[%s][%s]build start", get_backend_name(_device), _graph_name.c_str());
```

@chraac (Author): Here's how we map a ggml_cgraph into a QNN graph.

```cpp
    return reinterpret_cast<Fn>(dl_sym(handle, function_name));
}

} // namespace qnn
```

@chraac (Author): TODO: this dl_loader can be removed if upstream provides a unified dynamic-load mechanism.

```cpp
static dl_handle * dl_load_library(const std::wstring & path) {
```
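
For context, a minimal cross-platform loader in the spirit of this dl_loader could look like the sketch below. The wrapper names mirror the PR's dl_load_library()/dl_sym(), but the bodies are a sketch under the assumption of plain dlopen/LoadLibraryW semantics, not the PR's code:

```cpp
// Hypothetical sketch of a cross-platform dynamic-library loader.
#ifdef _WIN32
#include <windows.h>
using dl_handle = HMODULE;

static dl_handle dl_load_library(const wchar_t * path) {
    return LoadLibraryW(path);  // Windows uses wide-character paths
}

static void * dl_sym_raw(dl_handle handle, const char * name) {
    return reinterpret_cast<void *>(GetProcAddress(handle, name));
}
#else
#include <dlfcn.h>
using dl_handle = void *;

static dl_handle dl_load_library(const char * path) {
    return dlopen(path, RTLD_NOW | RTLD_LOCAL);
}

static void * dl_sym_raw(dl_handle handle, const char * name) {
    return dlsym(handle, name);
}
#endif

// Typed convenience wrapper, matching the reinterpret_cast pattern above.
template <typename Fn>
Fn dl_sym(dl_handle handle, const char * name) {
    return reinterpret_cast<Fn>(dl_sym_raw(handle, name));
}
```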

@chraac (Author) commented Feb 25, 2025

> I didn't provide any support to @chraac and his team. As I said before: I don't know this guy and his team, and I'd like to see their success in this community. Thanks so much.

I'd like to rephrase my previous statement. I appreciate your earlier work, as my fork is based on your initial PR.

```cpp
}

if (_rpc_buffer) {
    memcpy(_rpc_buffer->get_buffer(), _buffer->get_buffer(), _buffer->get_size());
```

A reviewer commented: Great effort! According to the QNN Shared Memory documentation, the _rpc_buffer on HTP can be accessed directly by the CPU, so a zero-copy implementation may be possible.

@chraac (Author) commented Feb 25, 2025

Yeah, thank you for the reminder! Currently the RPC buffer is disabled:

```cpp
    bool should_use_mem_handle() const {
        // TODO: figure out how to set rpc mem to multiple tensor
        return false;
    }
```

We thought we could reuse the RPC buffer to back ggml tensors in the future, but for now it is disabled by default.
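
For background on the mechanism under discussion: the RPC (shared) memory allocator lives in libcdsprpc and is resolved via function pointers at runtime, as the PR description notes. A hedged sketch, assuming the standard rpcmem_alloc/rpcmem_to_fd/rpcmem_free entry points and the commonly used heap-id/flag values:

```cpp
// Hypothetical sketch: resolve the rpcmem allocator from libcdsprpc at runtime.
#include <dlfcn.h>
#include <cstdio>

// Conventional values used with the cdsprpc allocator (assumed here).
static constexpr int RPCMEM_HEAP_ID_SYSTEM = 25;
static constexpr int RPCMEM_DEFAULT_FLAGS  = 1;

using rpcmem_alloc_fn = void * (*)(int heapid, unsigned int flags, int size);
using rpcmem_free_fn  = void (*)(void * po);
using rpcmem_to_fd_fn = int (*)(void * po);

int main() {
    void * lib = dlopen("libcdsprpc.so", RTLD_NOW | RTLD_LOCAL);
    if (!lib) {
        std::fprintf(stderr, "cdsprpc not available: %s\n", dlerror());
        return 1;  // the PR "skips rpc load fail" and falls back similarly
    }
    auto alloc = reinterpret_cast<rpcmem_alloc_fn>(dlsym(lib, "rpcmem_alloc"));
    auto free_ = reinterpret_cast<rpcmem_free_fn>(dlsym(lib, "rpcmem_free"));
    auto to_fd = reinterpret_cast<rpcmem_to_fd_fn>(dlsym(lib, "rpcmem_to_fd"));
    if (!alloc || !free_ || !to_fd) return 1;

    void * buf = alloc(RPCMEM_HEAP_ID_SYSTEM, RPCMEM_DEFAULT_FLAGS, 4096);
    if (buf) {
        int fd = to_fd(buf);  // this fd is what gets registered with QNN
        std::printf("rpc buffer fd = %d\n", fd);
        free_(buf);
    }
    dlclose(lib);
    return 0;
}
```

Registering the resulting fd with the QNN context (via the SDK's memory-registration API, QnnMem_register) is what would let the CPU and HTP share the buffer without the memcpy shown in this diff; making that registration work for multiple tensors per buffer is exactly the open TODO above.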

@chraac requested a review from oreomaker February 25, 2025 10:27
```cpp
    return true;
}

bool ggml_qnn_matmul_op_config::create_mat_mul_nodes(QNNBackend device, Qnn_GraphHandle_t graph_handle, const int rank,
```

@chraac (Author) commented Feb 25, 2025

Here's how we create the corresponding mat_mul op; the resulting op looks like this:

(image: diagram of the generated MatMul op)

This follows ggml's guideline: https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md

(image: excerpt from the contributing guidelines)

```cpp
    output += ')';
}

void get_graph_key_from_cgraph(const ggml_cgraph *cgraph, std::string &output) {
```

@chraac (Author): Generates a unique key for a given ggml_cgraph. The key is constructed by concatenating the descriptions of the operations and their associated tensor dimensions within the graph.

Example key format: MUL_MATf32_256x16x10f32_256x1x10f32#LOG#ADD#ADDf32_16x1x10f32

Some refactoring may be needed here to handle more complex graph structures and edge cases.
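
A simplified sketch of this kind of key generation, using ggml's public ggml_op_name()/ggml_type_name() helpers and assuming access to the cgraph internals (n_nodes/nodes, available to backend code); the PR's exact formatting differs, so treat this as an illustration:

```cpp
// Hypothetical sketch: build a cache key from the ops and shapes of a cgraph.
#include <string>
#include "ggml.h"

static void append_tensor_desc(const ggml_tensor * t, std::string & out) {
    out += ggml_type_name(t->type);  // e.g. "f32"
    out += '_';
    // stops at the first unit dimension; a simplification for this sketch
    for (int i = 0; i < GGML_MAX_DIMS && t->ne[i] > 1; ++i) {
        if (i > 0) out += 'x';
        out += std::to_string(t->ne[i]);
    }
}

static void get_graph_key_sketch(const ggml_cgraph * cgraph, std::string & out) {
    for (int i = 0; i < cgraph->n_nodes; ++i) {
        const ggml_tensor * node = cgraph->nodes[i];
        if (i > 0) out += '#';           // op separator, as in the example key
        out += ggml_op_name(node->op);   // e.g. "MUL_MAT"
        for (int j = 0; j < GGML_MAX_SRC && node->src[j]; ++j) {
            append_tensor_desc(node->src[j], out);
        }
    }
}
```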

* fix warning

* wip

* add todo for graph key generate

* rename some file to meet upstream guideline

* remove local .clang-format

* extend supported/unsupported counter to all ops

* append device name to log

* port to ggml logger

* fix warning after adapt to ggml logger

* append \n to all log

* use cast op instead of convert

* Revert "use cast op instead of convert"

This reverts commit e662fc2.

* fix op that needs same shape

* opt kQnnOpsTable

* refresh params name field when getting op config

* opt npu log print

* remove unused functions