[WIP] backend: Integrating QNN (Qualcomm AI Engine Direct) as a dedicated backend for Qualcomm NPUs #12063
base: master
Conversation
…ously and thread safe
…ing to review comments
# Conflicts:
#	ggml/src/ggml-backend-reg.cpp
* more log
* split graph implementation into cpp file
* rename: ggml_qnn_graph -> qnn_graph
* add imput/output tensor to graph
* fix assert
* wip
* add _ggml_tensor field in qnn tensor
* add comments
* add set_data_buffer with raw memory buffer
* use set_data_buffer
* op param buffer use qnn_buffer_ptr
* add qnn_mem_buffer_slice
* use qnn_buffer_ptr as tensor buffer
* use new set_data_buffer to reduce copy
* ggml_qnn_op_config: add function to set input/output tensor before init node
* remove ggml_qnn_connectable_op_config and use ggml_qnn_single_op_config instead
* wip
* add initialize_op_nodes without tensor params
* wip
* add op caps table
* merge kGgmlOpToQnnOp and kOpCaps tables
* wip
* add cache parameter to create_tensors
* add init_from_ggml_graph
* disable gelu for all backend
* wip
* move op index calc to op config module
* use the ggml_tensor as parameter of build_graph
* add log
* use create_operation_from_op_tensor in old build_graph function
* remove unused constructors
* fix parameter count
* remove unused member func/var
* make init_from_ggml_graph as a class member: build_graph_from_ggml_graph
* move graph finalize into member function `finalize()`
* get graph key from ggml op tensor directly
* append output type
* reduce tensor key length
* add function to generate key from ggml_cgraph
* simplify graph cache insert and delete
* remove template param at get_qnn_graph_from_cache
* wip
* merge kQnnUnaryOpsTable and kQnnBinaryOpsTable
* refactor device_supports_op
* add log
* wip
* use framework function to check same shape
* wip
* extract some logic into separated function
* wip
* add execution function that runs graph
* add function to create qnn graph from ggml_cgraph with cache
* execute graph directly
* return null graph key for empty graph
* add more qualcomm chipset enums
* add cap for reshape
* disable some ops
* try to skip GGML_OP_VIEW
* moew log for view tensor
* append param tensor into intermedia tensor key
* use 'ordered' set
* fix warning in release
* wip
# Conflicts:
#	ggml/CMakeLists.txt
#	src/llama.cpp
* disable rpc buffer for npu
* append input/output tensor size into unsupported op log
* log dimensions for unsupported tensor
* wip
* split op config classes into separated file
* fix reshape
* wip
* add op_constructor_with_type_param
* set parameter for op_constructor_with_type_param func
* move qnn_instance function implementation into cpp
* wip
* wip
* move dl related function into separated file
* use cast op for gpu
* Revert "use cast op for gpu" This reverts commit 05df736.
* Reapply "use cast op for gpu" This reverts commit 2520e59.
* fix compiling error in win
* fix align_alloc in win
* fix compiling error
* add get sys free/total mem for win
* wip
* suppress warning in win
* add missing chrono header
* set the correct qnn lib name for windows
* add flag to control cpu backend
* wip
* wip
* Revert "Reapply "use cast op for gpu"" This reverts commit f56519c.
* fix compiling error for linux build
* fix cdsprpc dynamic library name
* wip
* skip rpc load fail
* fix page_align_alloc
* suppress some warning in gcc
* wip
* reuse align to function
* more log
* add log and fix warning
* wip
* fix asan errors and memory leaks
* fix the get_io_tensors_from_graph
* improve comment
* print GGML_QNN_DEFAULT_LIB_SEARCH_PATH
* revert some unused changes
* move library search path setter into qnn module
* fix android library loading
* skip qnn_device_get_platform_info for npu emulator
I don't know this Chinese programmer and I'm not a member of his team, and I'd like to see his team's success in this great community. Thanks.
Yeah, just to clarify, @zhouwg is not affiliated with us, but we appreciate his support! Anyone interested in discussing QNN-related topics is very welcome to join the conversation.
I didn't provide any support to @chraac and his team. As I said before: I don't know this guy and his team, and I'd like to see their success in this community. Thanks so much.
ggml/src/ggml-qnn/graph.cpp
Outdated
}

bool qnn_graph::build_graph_from_ggml_graph(const ggml_cgraph *cgraph) {
    QNN_LOG_DEBUG("[%s][%s]build start", get_backend_name(_device), _graph_name.c_str());
Here's how we map a `ggml_cgraph` into a QNN graph.
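For context, a minimal sketch of what that walk looks like, assuming it runs inside the llama.cpp tree (so `ggml.h` is available); `map_node_to_qnn_op` is a hypothetical stand-in for this PR's op-config machinery, not its actual API:

```cpp
#include "ggml.h"

// Sketch only: each compute node becomes one QNN op; pure view/no-op nodes
// are skipped (the commit log above mentions skipping GGML_OP_VIEW).
static bool map_node_to_qnn_op(ggml_tensor * node);  // hypothetical helper, not shown

static bool map_cgraph_to_qnn(ggml_cgraph * cgraph) {
    for (int i = 0; i < ggml_graph_n_nodes(cgraph); ++i) {
        ggml_tensor * node = ggml_graph_node(cgraph, i);
        if (node->op == GGML_OP_NONE || node->op == GGML_OP_VIEW) {
            continue;  // nothing to materialize in the QNN graph
        }
        if (!map_node_to_qnn_op(node)) {
            return false;  // unsupported op: let another backend handle the graph
        }
    }
    return true;  // caller finalizes the QNN graph once afterwards
}
```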
ggml/src/ggml-qnn/dl_loader.hpp
Outdated
    return reinterpret_cast<Fn>(dl_sym(handle, function_name));
}

} // namespace qnn
TODO: this dl_loader can be removed if upstream provides a unified dynamic-load mechanism:
llama.cpp/ggml/src/ggml-backend-reg.cpp
Line 99 in 34a846b
static dl_handle * dl_load_library(const std::wstring & path) {
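For reference, a rough sketch of the shape of such a wrapper, assuming the usual platform split (`dlopen`/`dlsym` on POSIX, `LoadLibrary`/`GetProcAddress` on Windows); names loosely mirror `dl_loader.hpp`, but the exact signatures here are assumptions:

```cpp
#include <string>

#ifdef _WIN32
#    include <windows.h>
namespace qnn {
using dl_handle = HMODULE;
inline dl_handle dl_load(const std::string & path) { return LoadLibraryA(path.c_str()); }
inline void * dl_sym_raw(dl_handle handle, const std::string & name) {
    return reinterpret_cast<void *>(GetProcAddress(handle, name.c_str()));
}
}  // namespace qnn
#else
#    include <dlfcn.h>
namespace qnn {
using dl_handle = void *;
inline dl_handle dl_load(const std::string & path) {
    return dlopen(path.c_str(), RTLD_NOW | RTLD_LOCAL);
}
inline void * dl_sym_raw(dl_handle handle, const std::string & name) {
    return dlsym(handle, name.c_str());
}
}  // namespace qnn
#endif
```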
I'd like to rephrase my previous statement. I appreciate your earlier work, as my fork is based on your initial PR.
}

if (_rpc_buffer) {
    memcpy(_rpc_buffer->get_buffer(), _buffer->get_buffer(), _buffer->get_size());
Great effort! According to the QNN Shared Memory Doc, the _rpc_buffer in HTP can be directly accessed by the CPU. Maybe there can be a zero-copy implementation.
Yeah, thank you for the reminder! Currently the RPC buffer is disabled:
bool should_use_mem_handle() const {
// TODO: figure out how to set rpc mem to multiple tensor
return false;
}
We thought we could reuse the RPC buffer to back ggml tensors in the future, but for now it's disabled by default.
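If the zero-copy idea is picked up later, the binding could look roughly like this. Every name below is a placeholder rather than this PR's API, and it assumes the tensor's backing buffer was allocated from rpcmem (CPU- and HTP-visible per the QNN shared memory docs) in the first place:

```cpp
// Hypothetical zero-copy path: register the shared buffer as a QNN mem
// handle instead of memcpy'ing into a separate RPC buffer.
bool bind_tensor_zero_copy(qnn_tensor & tensor, qnn_buffer_ptr buffer) {
    if (!buffer_is_rpc_mem(buffer)) {                  // placeholder predicate
        return bind_tensor_with_copy(tensor, buffer);  // today's memcpy path
    }
    // Registration lets HTP read the buffer in place; no host-side copy.
    auto mem_handle = register_rpc_mem(buffer->get_buffer(), buffer->get_size());
    if (mem_handle == nullptr) {
        return false;
    }
    tensor.set_mem_handle(mem_handle);  // should_use_mem_handle() would then be true
    return true;
}
```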
I have an item in my project backlog here: https://github.com/users/chraac/projects/2/views/3?pane=issue&itemId=86454650
ggml/src/ggml-qnn/op-config-impl.cpp
Outdated
    return true;
}

bool ggml_qnn_matmul_op_config::create_mat_mul_nodes(QNNBackend device, Qnn_GraphHandle_t graph_handle, const int rank,
Here's how we create the corresponding mat_mul op; the op will look like the attached diagram (omitted here), which follows ggml's guideline:
https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md
ggml/src/ggml-qnn/backend-ops.cpp
Outdated
    output += ')';
}

void get_graph_key_from_cgraph(const ggml_cgraph *cgraph, std::string &output) {
Generates a unique key for a given `ggml_cgraph`. The key is constructed by concatenating the descriptions of the operations and their associated tensor dimensions within the graph.
Example key format: MUL_MATf32_256x16x10f32_256x1x10f32#LOG#ADD#ADDf32_16x1x10f32
May need some refactoring here to handle more complex graph structures and edge cases.
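A simplified sketch of that scheme, using the standard ggml accessors; the real `get_graph_key_from_cgraph` also omits descriptions for intermediate tensors (which is why `LOG` and the first `ADD` appear bare in the example key), so treat this as illustrative only:

```cpp
#include <string>

#include "ggml.h"

// Appends e.g. "f32_256x16x10" for a tensor: element type, then dimensions.
static void append_tensor_desc(const ggml_tensor * t, std::string & out) {
    out += ggml_type_name(t->type);  // e.g. "f32"
    out += '_';
    const int n_dims = ggml_n_dims(t);  // trailing dims of size 1 are dropped
    for (int i = 0; i < n_dims; ++i) {
        if (i > 0) {
            out += 'x';
        }
        out += std::to_string(t->ne[i]);
    }
}

// Joins one entry per node with '#', as in the example key above.
static std::string make_graph_key(ggml_cgraph * cgraph) {
    std::string key;
    for (int i = 0; i < ggml_graph_n_nodes(cgraph); ++i) {
        const ggml_tensor * node = ggml_graph_node(cgraph, i);
        if (!key.empty()) {
            key += '#';
        }
        key += ggml_op_name(node->op);  // e.g. "MUL_MAT"
        for (int s = 0; s < GGML_MAX_SRC && node->src[s] != nullptr; ++s) {
            append_tensor_desc(node->src[s], key);
        }
    }
    return key;
}
```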
* fix warning
* wip
* add todo for graph key generate
* rename some file to meet upstream guideline
* remove local .clang-format
* expend supported/unsupported counter to all ops
* append device name to log
* port to ggml logger
* fix warning after adapt to ggml logger
* append \n to all log
* use case op instead of convert
* Revert "use case op instead of convert" This reverts commit e662fc2.
* fix op that needs same shape
* opt kQnnOpsTable
* refresh params name field when getting op config
* opt npu log print
* remove unused functions
Warning: This is an early draft of my fork and will continue to be updated to meet the requirements in the contributing guidelines.
Summary
This fork is based on zhouwg's initial PR and performs further refactoring and improvements to introduce support for the Qualcomm QNN backend to GGML.
This backend is organized into three distinct integration layers:

**GGML Adaptation Layer**

* *Graph Caching, Mapping, and Execution:* maps a ggml computation graph into a QNN graph and caches the result (`backend-ops.cpp`) to minimize redundant graph creation and boost execution performance; the per-op configuration is split across `op-config-caps.cpp` and `op-config-impl.cpp`.
* *Tensor Binding and Execution Flow:* covers tensor binding and execution (`tensor.hpp` and `graph.hpp`), managing both host and RPC memory via buffer interfaces like `qnn_buffer_interface` (a sketch of that interface follows below).
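The buffer abstraction referenced above might look roughly like this; the shape is inferred from the `get_buffer()`/`get_size()` calls visible elsewhere in this PR, so treat it as an assumption rather than the actual `tensor.hpp` interface:

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>

namespace qnn {

// Assumed common interface over host buffers and rpcmem-backed RPC buffers.
class qnn_buffer_interface {
  public:
    virtual ~qnn_buffer_interface() = default;
    virtual bool is_valid() const = 0;    // allocation can fail (e.g. rpcmem)
    virtual uint8_t * get_buffer() = 0;   // CPU-visible address
    virtual size_t get_size() const = 0;  // capacity in bytes
};

using qnn_buffer_ptr = std::shared_ptr<qnn_buffer_interface>;

}  // namespace qnn
```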
**QNN Object Layer**

* *QNN System and Instance Management:* wraps the QNN system in the `qnn_system_interface` class, originally derived from executorch, to create and free the QNN system context; the `qnn_instance` class provides `load_backend()` and `load_system()`, which retrieve provider lists and choose valid QNN interfaces based on API version checks.
* *Dynamic Resource Handling:* uses `load_lib_with_fallback()` to reliably load both the system and RPC libraries.

**Utility Layer**
* *Dynamic Library Loading & Search Path Management:* implemented in `qnn-lib.cpp` to manage dynamic library loading with fallbacks; helpers like `insert_path()` and `set_qnn_lib_search_path()` configure environment variables (`LD_LIBRARY_PATH` on Linux and `ADSP_LIBRARY_PATH` on Android) based on a custom library search path (see the sketch after this list).
* *General Utilities:*
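A sketch of the environment-variable plumbing referenced in the list above, assuming POSIX `setenv`/`getenv`; the real implementation in `qnn-lib.cpp` also handles Windows and platform-specific separator rules:

```cpp
#include <cstdlib>
#include <string>

// Prepends `path` to the given environment variable, keeping any existing
// value. Separator handling is simplified: ADSP_LIBRARY_PATH has its own
// rules that the real code would need to respect.
static bool insert_path(const std::string & var, const std::string & path) {
    std::string value = path;
    if (const char * old = std::getenv(var.c_str())) {
        value += ':';
        value += old;
    }
    return setenv(var.c_str(), value.c_str(), /*overwrite=*/1) == 0;
}

static bool set_qnn_lib_search_path(const std::string & custom_path) {
#if defined(__ANDROID__)
    // The DSP-side loader consults ADSP_LIBRARY_PATH for skel libraries.
    if (!insert_path("ADSP_LIBRARY_PATH", custom_path)) {
        return false;
    }
#endif
    return insert_path("LD_LIBRARY_PATH", custom_path);
}
```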
Key Features and Improvements
* *Graph Mapping Mechanism:*
* *Backend Context and Device Management:*
Testing
* Basic functionality of the QNN backend has been verified on Android, Linux, and Windows platforms using `test-backend-ops`; this is integrated into the pipeline for each commit node of the `dev-refactoring` branch.
* Proper graph creation and execution paths are confirmed through detailed log messages.
* Memory registration and cleanup within tensor binding functions have been thoroughly checked.
TODO: add benchmark result here
Current state
Future development