
[WIP]backend: Integrating QNN (Qualcomm AI Engine Direct) as a dedicated backend for Qualcomm NPUs #12063

Draft: wants to merge 169 commits into master
Conversation

@chraac commented Feb 25, 2025

Warning: This is an early draft of my fork and will continue to be updated to meet the requirements in the contributing guidelines.

Summary

This fork is based on zhouwg's initial PR and carries out further refactoring and improvements to introduce Qualcomm QNN backend support in GGML.

This backend is organized into three distinct integration layers:

```mermaid
graph TB
    subgraph GGML Adaptation Layer
        A1[Graph Caching, Mapping, and Execution]
        A2[Tensor Binding and Execution Flow]
    end

    subgraph QNN Object Layer
        B1[QNN System and Instance Management]
        B2[Dynamic Resource Handling]
    end

    subgraph Utility Layer
        C1[Dynamic Library Loading & Search Path Management]
        C2[General Utilities]
    end

    %% Relations to illustrate stack dependency
    A1 -->|Uses| B1
    A2 -->|Uses| B1
    B1 -->|Relies on| C1
```
  1. GGML Adaptation Layer

    • Graph Caching, Mapping, and Execution:

      • Provides a robust mechanism to map a GGML computation graph into a corresponding QNN graph, allowing efficient offloading of operations to the QNN accelerator.
      • Implements graph caching strategies (in backend-ops.cpp) to minimize redundant graph creation and boost execution performance.
      • Seamlessly translates GGML operations into corresponding QNN op objects using specialized op constructors and configuration functions (configured in op-config-caps.cpp and op-config-impl.cpp).
    • Tensor Binding and Execution Flow:

      • Adapts GGML tensor objects to the QNN backend (see tensor.hpp and graph.hpp), managing both host and RPC memory via buffer interfaces like qnn_buffer_interface.
      • Ensures proper data flow between GGML graphs and QNN execution contexts through carefully handled tensor binding/unbinding procedures.
  2. QNN Object Layer

    • QNN System and Instance Management:

      • Encapsulates the QNN system via the qnn_system_interface class, originally derived from executorch, to create and free the QNN system context.
      • Manages QNN instance creation and initialization via the qnn_instance class.
      • Implements backend loading routines (e.g., load_backend() and load_system()) that retrieve provider lists and choose valid QNN interfaces based on API version checks.
      • Uses caching mechanisms for loaded backends and tracks library handles to guarantee proper cleanup during finalization.
    • Dynamic Resource Handling:

      • Integrates fallback mechanisms in load_lib_with_fallback() to reliably load both the system and RPC libraries.
      • Manages RPC memory allocation and deallocation via function pointer resolution from the loaded RPC library.
  3. Utility Layer

    • Dynamic Library Loading & Search Path Management:

      • Implements functions in qnn-lib.cpp to manage dynamic library loading with fallbacks.
      • Uses helper routines such as insert_path() and set_qnn_lib_search_path() to configure environment variables (like LD_LIBRARY_PATH on Linux and ADSP_LIBRARY_PATH on Android) based on a custom library search path (a sketch of these helpers follows this list).
    • General Utilities:

      • Provides detailed error and debug logging through QNN logging macros.
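
To make this concrete, here is a minimal sketch of the search-path helpers referenced in the Utility Layer above. The names insert_path() and set_qnn_lib_search_path() come from the PR, but the bodies here are assumptions, not the actual implementation:

```cpp
// Hypothetical sketch of the search-path helpers; POSIX/Android only.
#include <cstdlib>
#include <string>

// Prepend `path` to the colon-separated environment variable `env_name`.
// (Assumed behavior of the PR's insert_path().)
static void insert_path(const char * env_name, const std::string & path) {
    std::string value = path;
    if (const char * old_value = std::getenv(env_name)) {
        value += ':';
        value += old_value;
    }
    setenv(env_name, value.c_str(), /*overwrite=*/1);
}

// Point both the dynamic linker and the Hexagon DSP loader at a custom
// QNN library directory. (Assumed behavior of set_qnn_lib_search_path().)
static void set_qnn_lib_search_path(const std::string & lib_dir) {
    insert_path("LD_LIBRARY_PATH", lib_dir);
#if defined(__ANDROID__)
    // ADSP_LIBRARY_PATH tells the DSP runtime where to find the skel libraries.
    insert_path("ADSP_LIBRARY_PATH", lib_dir);
#endif
}
```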

Key Features and Improvements

  • Graph Mapping Mechanism:

    • Efficient mapping of GGML graphs into QNN graphs is a standout feature, enabling the offloading and execution of computation graphs on hardware accelerators (see graph.hpp and backend-ops.cpp).
    • Graph caching strategies help reuse QNN graphs to reduce redundancy and enhance performance (a sketch of such a cache follows this list).
    • The translation of GGML operations into corresponding QNN ops supports various data types and parameter configurations.
  • Backend Context and Device Management:

    • Comprehensive QNN instance initialization supports API negotiation, enhanced error handling, and detailed device property logging.
    • Detailed logs (chipset description, HTP architecture, VTCM memory size) facilitate debugging and performance tuning.
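
As an illustration of the caching strategy, here is a minimal sketch of a graph cache keyed by the generated graph-key string. The qnn_graph name echoes the PR's graph.hpp, but the cache shape and the get_or_build() helper are assumptions:

```cpp
// Hypothetical sketch of a QNN graph cache keyed by a graph-key string.
#include <memory>
#include <string>
#include <unordered_map>

struct qnn_graph;  // finalized QNN graph wrapper (see graph.hpp in the PR)
using qnn_graph_ptr = std::shared_ptr<qnn_graph>;

class qnn_graph_cache {
  public:
    // Return the cached graph for `key`, building and caching it on a miss.
    template <typename BuildFn>
    qnn_graph_ptr get_or_build(const std::string & key, BuildFn && build) {
        auto it = _graphs.find(key);
        if (it != _graphs.end()) {
            return it->second;  // hit: reuse the already-finalized QNN graph
        }
        qnn_graph_ptr graph = build();  // miss: map the ggml graph to QNN once
        if (graph) {
            _graphs.emplace(key, graph);
        }
        return graph;
    }

  private:
    std::unordered_map<std::string, qnn_graph_ptr> _graphs;
};
```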

Testing

  • Basic functionality of the QNN backend has been verified on Android, Linux, and Windows using test-backend-ops; these runs are integrated into the CI pipeline for each commit of the dev-refactoring branch.

    | Platform | test-backend-ops | Full console output |
    | --- | --- | --- |
    | Android | (screenshot: 2ac8fce111ee0047a5a8b43808047ff2) | test-backend-ops_all_android_ff033e1.log |
    | Linux | (screenshot) | test-backend-ops_all_linux_ff033e1.log |
  • Proper graph creation and execution paths are confirmed through detailed log messages.

  • Memory registration and cleanup within tensor binding functions have been thoroughly checked.

  • TODO: add benchmark result here

Current state

  • The test-backend-ops suite passes on all platforms, including support for both qnn-npu and qnn-gpu devices.
  • Testing with llama3.2-1b/3b-f16/32 models yields expected results.
  • Quantized matrix multiplication is under development; for quantized modules, the CPU backend may be used as a fallback (a sketch of the fallback check follows this list).
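
For illustration, such a fallback is typically expressed through the backend's supports_op hook; the sketch below shows the general pattern and is an assumption about this PR's behavior, not its actual code:

```cpp
// Hypothetical sketch: reject quantized mul_mat so that ggml's scheduler
// falls back to the CPU backend for those nodes.
#include "ggml.h"
#include "ggml-backend.h"

static bool qnn_device_supports_op_sketch(ggml_backend_dev_t /*dev*/, const ggml_tensor * op) {
    switch (op->op) {
        case GGML_OP_MUL_MAT:
            // Quantized matmul is still under development: report it as
            // unsupported so the scheduler assigns the node to the CPU backend.
            if (ggml_is_quantized(op->src[0]->type)) {
                return false;
            }
            return true;
        default:
            return false;  // conservative default for this sketch
    }
}
```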

Future development

  • Further feature support and device-specific optimizations are planned (see also the project backlog).
  • Future iterations will add support for quantization data types, with efforts underway to map GGML's block quantization structure into QNN (the block layout in question is sketched below).
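
For reference, the simplest of GGML's quantization blocks looks like this (layout as in ggml's quantization format; ggml_half is ggml's 16-bit float type, spelled out here so the snippet stands alone). Expressing structures like this in QNN's tensor types is the open problem mentioned above:

```cpp
#include <cstdint>

typedef uint16_t ggml_half;  // fp16 bit pattern, as in ggml-common.h

// GGML's Q4_0 block: 32 weights stored as 4-bit quants plus one fp16 scale.
// Dequantization: x[i] = d * (q[i] - 8), where q[i] is a 4-bit value in [0, 15].
#define QK4_0 32

typedef struct {
    ggml_half d;              // fp16 scale shared by the 32 weights
    uint8_t   qs[QK4_0 / 2];  // 32 x 4-bit quants, packed two per byte
} block_q4_0;
```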

zhouwg and others added 30 commits April 24, 2024 16:28
# Conflicts:
#	ggml/src/ggml-backend-reg.cpp
* more log

* split graph implementation into cpp file

* rename: ggml_qnn_graph -> qnn_graph

* add input/output tensor to graph

* fix assert

* wip

* add _ggml_tensor field in qnn tensor

* add comments

* add set_data_buffer with raw memory buffer

* use set_data_buffer

* op param buffer use qnn_buffer_ptr

* add qnn_mem_buffer_slice

* use qnn_buffer_ptr as tensor buffer

* use new set_data_buffer to reduce copy

* ggml_qnn_op_config: add function to set input/output tensor before init node

* remove ggml_qnn_connectable_op_config and use ggml_qnn_single_op_config instead

* wip

* add initialize_op_nodes without tensor params

* wip

* add op caps table

* merge kGgmlOpToQnnOp and kOpCaps tables

* wip

* add cache parameter to create_tensors

* add init_from_ggml_graph

* disable gelu for all backend

* wip

* move op index calc to op config module

* use the ggml_tensor as parameter of build_graph

* add log

* use create_operation_from_op_tensor in old build_graph function

* remove unused constructors

* fix parameter count

* remove unused member func/var

* make init_from_ggml_graph as a class member: build_graph_from_ggml_graph

* move graph finalize into member function `finalize()`

* get graph key from ggml op tensor directly

* append output type

* reduce tensor key length

* add function to generate key from ggml_cgraph

* simplify graph cache insert and delete

* remove template param at get_qnn_graph_from_cache

* wip

* merge kQnnUnaryOpsTable and kQnnBinaryOpsTable

* refactor device_supports_op

* add log

* wip

* use framework function to check same shape

* wip

* extract some logic into separated function

* wip

* add execution function that runs graph

* add function to create qnn graph from ggml_cgraph with cache

* execute graph directly

* return null graph key for empty graph

* add more qualcomm chipset enums

* add cap for reshape

* disable some ops

* try to skip GGML_OP_VIEW

* more log for view tensor

* append param tensor into intermediate tensor key

* use 'ordered' set

* fix warning in release

* wip
# Conflicts:
#	ggml/CMakeLists.txt
#	src/llama.cpp
* disable rpc buffer for npu

* append input/output tensor size into unsupported op log

* log dimensions for unsupported tensor

* wip

* split op config classes into separated file

* fix reshape

* wip

* add op_constructor_with_type_param

* set parameter for op_constructor_with_type_param func
* move qnn_instance function implementation into cpp

* wip

* wip

* move dl related function into separated file

* use cast op for gpu

* Revert "use cast op for gpu"

This reverts commit 05df736.

* Reapply "use cast op for gpu"

This reverts commit 2520e59.

* fix compiling error in win

* fix align_alloc in win

* fix compiling error

* add get sys free/total mem for win

* wip

* suppress warning in win

* add missing chrono header

* set the correct qnn lib name for windows

* add flag to control cpu backend

* wip

* wip

* Revert "Reapply "use cast op for gpu""

This reverts commit f56519c.

* fix compiling error for linux build

* fix cdsprpc dynamic library name

* wip

* skip rpc load fail

* fix page_align_alloc

* suppress some warning in gcc

* wip

* reuse align to function

* more log

* add log and fix warning

* wip

* fix asan errors and memory leaks

* fix the get_io_tensors_from_graph

* improve comment

* print GGML_QNN_DEFAULT_LIB_SEARCH_PATH

* revert some unused changes

* move library search path setter into qnn module

* fix android library loading

* skip qnn_device_get_platform_info for npu emulator
@chraac marked this pull request as draft February 25, 2025 07:20
@github-actions bot added the build (Compilation issues) and ggml (changes relating to the ggml tensor library for machine learning) labels Feb 25, 2025
@chraac changed the title from "[WIP][QNN] Integrating QNN (Qualcomm AI Engine Direct) as a dedicated backend for Qualcomm NPUs" to "[WIP]backend: Integrating QNN (Qualcomm AI Engine Direct) as a dedicated backend for Qualcomm NPUs" Feb 25, 2025
@zhouwg (Contributor) commented Feb 25, 2025

I don't know this Chinese programmer, and I'm not a member of his team. I'd like to see his team's success in this great community. Thanks.

@chraac (Author) commented Feb 25, 2025

> I don't know this Chinese programmer, and I'm not a member of his team. I'd like to see his team's success in this great community. Thanks.

Yeah, just to clarify, @zhouwg is not affiliated with us, but we appreciate his support! Anyone interested in discussing QNN-related topics is very welcome to join the conversation.

@zhouwg (Contributor) commented Feb 25, 2025

> I don't know this Chinese programmer, and I'm not a member of his team. I'd like to see his team's success in this great community. Thanks.

> Yeah, just to clarify, @zhouwg is not affiliated with us, but we appreciate his support! Anyone interested in discussing QNN-related topics is very welcome to join the conversation.

I didn't provide any support to @chraac and his team. As I said before: I don't know this guy and his team, and I'd like to see their success in this community. Thanks so much.

```cpp
}

bool qnn_graph::build_graph_from_ggml_graph(const ggml_cgraph *cgraph) {
    QNN_LOG_DEBUG("[%s][%s]build start", get_backend_name(_device), _graph_name.c_str());
```

@chraac (Author): Here's how we map a ggml_cgraph into a QNN graph.

```cpp
    return reinterpret_cast<Fn>(dl_sym(handle, function_name));
}

} // namespace qnn
```

@chraac (Author): TODO: this dl_loader can be removed if upstream provides a unified dynamic-load mechanism.

```cpp
static dl_handle * dl_load_library(const std::wstring & path) {
```
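
For context, a minimal cross-platform loader in the spirit of this dl_loader could look like the sketch below. The wrapper names mirror the PR's dl_load_library()/dl_sym(), but the bodies are a sketch under the assumption of plain dlopen/LoadLibraryW semantics, not the PR's code:

```cpp
// Hypothetical sketch of a cross-platform dynamic-library loader.
#ifdef _WIN32
#include <windows.h>
using dl_handle = HMODULE;

static dl_handle dl_load_library(const wchar_t * path) {
    return LoadLibraryW(path);  // Windows uses wide-character paths
}

static void * dl_sym_raw(dl_handle handle, const char * name) {
    return reinterpret_cast<void *>(GetProcAddress(handle, name));
}
#else
#include <dlfcn.h>
using dl_handle = void *;

static dl_handle dl_load_library(const char * path) {
    return dlopen(path, RTLD_NOW | RTLD_LOCAL);
}

static void * dl_sym_raw(dl_handle handle, const char * name) {
    return dlsym(handle, name);
}
#endif

// Typed convenience wrapper, matching the reinterpret_cast pattern above.
template <typename Fn>
Fn dl_sym(dl_handle handle, const char * name) {
    return reinterpret_cast<Fn>(dl_sym_raw(handle, name));
}
```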

@chraac (Author) commented Feb 25, 2025

> I didn't provide any support to @chraac and his team. As I said before: I don't know this guy and his team, and I'd like to see their success in this community. Thanks so much.

I'd like to rephrase my previous statement. I appreciate your earlier work, as my fork is based on your initial PR.

```cpp
}

if (_rpc_buffer) {
    memcpy(_rpc_buffer->get_buffer(), _buffer->get_buffer(), _buffer->get_size());
```

A reviewer commented: Great effort! According to the QNN Shared Memory documentation, the _rpc_buffer on HTP can be accessed directly by the CPU, so a zero-copy implementation may be possible.

@chraac (Author) commented Feb 25, 2025

Yeah, thank you for the reminder! Currently the RPC buffer is disabled:

```cpp
    bool should_use_mem_handle() const {
        // TODO: figure out how to set rpc mem to multiple tensor
        return false;
    }
```

We thought we could reuse the RPC buffer to back ggml tensors in the future, but for now it is disabled by default.
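
For background on the mechanism under discussion: the RPC (shared) memory allocator lives in libcdsprpc and is resolved via function pointers at runtime, as the PR description notes. A hedged sketch, assuming the standard rpcmem_alloc/rpcmem_to_fd/rpcmem_free entry points and the commonly used heap-id/flag values:

```cpp
// Hypothetical sketch: resolve the rpcmem allocator from libcdsprpc at runtime.
#include <dlfcn.h>
#include <cstdio>

// Conventional values used with the cdsprpc allocator (assumed here).
static constexpr int RPCMEM_HEAP_ID_SYSTEM = 25;
static constexpr int RPCMEM_DEFAULT_FLAGS  = 1;

using rpcmem_alloc_fn = void * (*)(int heapid, unsigned int flags, int size);
using rpcmem_free_fn  = void (*)(void * po);
using rpcmem_to_fd_fn = int (*)(void * po);

int main() {
    void * lib = dlopen("libcdsprpc.so", RTLD_NOW | RTLD_LOCAL);
    if (!lib) {
        std::fprintf(stderr, "cdsprpc not available: %s\n", dlerror());
        return 1;  // the PR "skips rpc load fail" and falls back similarly
    }
    auto alloc = reinterpret_cast<rpcmem_alloc_fn>(dlsym(lib, "rpcmem_alloc"));
    auto free_ = reinterpret_cast<rpcmem_free_fn>(dlsym(lib, "rpcmem_free"));
    auto to_fd = reinterpret_cast<rpcmem_to_fd_fn>(dlsym(lib, "rpcmem_to_fd"));
    if (!alloc || !free_ || !to_fd) return 1;

    void * buf = alloc(RPCMEM_HEAP_ID_SYSTEM, RPCMEM_DEFAULT_FLAGS, 4096);
    if (buf) {
        int fd = to_fd(buf);  // this fd is what gets registered with QNN
        std::printf("rpc buffer fd = %d\n", fd);
        free_(buf);
    }
    dlclose(lib);
    return 0;
}
```

Registering the resulting fd with the QNN context (via the SDK's memory-registration API, QnnMem_register) is what would let the CPU and HTP share the buffer without the memcpy shown in this diff; making that registration work for multiple tensors per buffer is exactly the open TODO above.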

@chraac requested a review from oreomaker February 25, 2025 10:27
```cpp
    return true;
}

bool ggml_qnn_matmul_op_config::create_mat_mul_nodes(QNNBackend device, Qnn_GraphHandle_t graph_handle, const int rank,
```

@chraac (Author) commented Feb 25, 2025

Here's how we create the corresponding mat_mul op; the resulting op looks like this:

(image: diagram of the generated MatMul op)

This follows ggml's guideline: https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md

(image: excerpt from the contributing guidelines)

```cpp
    output += ')';
}

void get_graph_key_from_cgraph(const ggml_cgraph *cgraph, std::string &output) {
```

@chraac (Author): Generates a unique key for a given ggml_cgraph. The key is constructed by concatenating the descriptions of the operations and their associated tensor dimensions within the graph.

Example key format: MUL_MATf32_256x16x10f32_256x1x10f32#LOG#ADD#ADDf32_16x1x10f32

Some refactoring may be needed here to handle more complex graph structures and edge cases.
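
A simplified sketch of this kind of key generation, using ggml's public ggml_op_name()/ggml_type_name() helpers and assuming access to the cgraph internals (n_nodes/nodes, available to backend code); the PR's exact formatting differs, so treat this as an illustration:

```cpp
// Hypothetical sketch: build a cache key from the ops and shapes of a cgraph.
#include <string>
#include "ggml.h"

static void append_tensor_desc(const ggml_tensor * t, std::string & out) {
    out += ggml_type_name(t->type);  // e.g. "f32"
    out += '_';
    // stops at the first unit dimension; a simplification for this sketch
    for (int i = 0; i < GGML_MAX_DIMS && t->ne[i] > 1; ++i) {
        if (i > 0) out += 'x';
        out += std::to_string(t->ne[i]);
    }
}

static void get_graph_key_sketch(const ggml_cgraph * cgraph, std::string & out) {
    for (int i = 0; i < cgraph->n_nodes; ++i) {
        const ggml_tensor * node = cgraph->nodes[i];
        if (i > 0) out += '#';           // op separator, as in the example key
        out += ggml_op_name(node->op);   // e.g. "MUL_MAT"
        for (int j = 0; j < GGML_MAX_SRC && node->src[j]; ++j) {
            append_tensor_desc(node->src[j], out);
        }
    }
}
```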

* fix warning

* wip

* add todo for graph key generate

* rename some file to meet upstream guideline

* remove local .clang-format

* extend supported/unsupported counter to all ops

* append device name to log

* port to ggml logger

* fix warning after adapt to ggml logger

* append \n to all log

* use cast op instead of convert

* Revert "use cast op instead of convert"

This reverts commit e662fc2.

* fix op that needs same shape

* opt kQnnOpsTable

* refresh params name field when getting op config

* opt npu log print

* remove unused functions