SYCL Graph emulation mode #56

julianmi · 2022-12-13T18:23:58Z

This PR proposes an emulation mode for SYCL Graph. This provides the APIs of the sycl-graph-poc-v2 branch. The execution of the graph is however emulated by submitting each graph node when executing the graph (repeatedly). This addresses #20.

This provides the first step of merging the SYCL Graph POC into the mainline. The proposed process can be found here: intel#7627 (comment)

EwanC · 2022-12-14T10:02:16Z

sycl/plugins/level_zero/pi_level_zero.cpp

@@ -1307,44 +1307,6 @@ pi_result _pi_context::getAvailableCommandList(
    pi_queue Queue, pi_command_list_ptr_t &CommandList, bool UseCopyEngine,
    bool AllowBatching, ze_command_queue_handle_t *ForcedCmdQueue) {

-  // This is a hack. TODO: Proper CommandList allocation per Executable Graph.


For this emulation mode to be most useful, in my mind it wouldn't live in a separate branch that we would need to keep up-to-date with the PoC branch, but live in the PoC branch itself.

Rather than removing this code, have you considered making emulation mode an environment variable users can set to ON?

Thanks, Ewan. The idea of using a separate branch and removing the PI implementation was to ease the review and merging of the code into the mainline. My understanding of the proposed three-step merge process is to provide the Graph API in the emulation mode and extend it with plugin implementations. Thus, SYCL Graph codes could be compiled for all runtimes.

Ahh, that makes sense as the first part of the 3 stage approach. I'm a bit wary of us trying to maintain a lot of branches, so I think we should sync up on our call later about how soon we want to open that stage 1 PR. If it's ASAP then using sycl-graph-poc-emulation to do that I think is good. However, if it's more delayed and more changes start getting merged into the PoC branch, then maintaining sycl-graph-poc-emulation will be an overhead, and it might just be better to have the path in the PoC branch guarded by a preprocessor macro. Then, once we want to open a DPC++ PR with just stage 1, we can easily remove the macro and just keep the emulation path.

Thanks, Ewan. The idea of using a separate branch and removing the PI implementation was to ease the review and merging of the code into the mainline. My understanding of the proposed three-step merge process is to provide the Graph API in the emulation mode and extend it with plugin implementations. Thus, SYCL Graph codes could be compiled for all runtimes.

I think the 3-step merge process referred to in the comment on the POC PR is for a future final implementation and not for the POC (which this code is based on). Unless we are planning to close that open PR and resubmit this as the first step, I agree with the idea of having this be optional functionality in the POC guarded by a macro.

I will close this in favor of a version that is guarded by the SYCL Graph macro as discussed. I will further split the testing part of this PR into a separate PR.

…callback The `TypeSystemMap::m_mutex` guards against concurrent modifications of members of `TypeSystemMap`. In particular, `m_map`. `TypeSystemMap::ForEach` iterates through the entire `m_map` calling a user-specified callback for each entry. This is all done while `m_mutex` is locked. However, there's nothing that guarantees that the callback itself won't call back into `TypeSystemMap` APIs on the same thread. This lead to double-locking `m_mutex`, which is undefined behaviour. We've seen this cause a deadlock in the swift plugin with following backtrace: ``` int main() { std::unique_ptr<int> up = std::make_unique<int>(5); volatile int val = *up; return val; } clang++ -std=c++2a -g -O1 main.cpp ./bin/lldb -o “br se -p return” -o run -o “v *up” -o “expr *up” -b ``` ``` frame #4: std::lock_guard<std::mutex>::lock_guard frame #5: lldb_private::TypeSystemMap::GetTypeSystemForLanguage <<<< Lock #2 frame #6: lldb_private::TypeSystemMap::GetTypeSystemForLanguage frame #7: lldb_private::Target::GetScratchTypeSystemForLanguage ... frame #26: lldb_private::SwiftASTContext::LoadLibraryUsingPaths frame #27: lldb_private::SwiftASTContext::LoadModule frame #30: swift::ModuleDecl::collectLinkLibraries frame #31: lldb_private::SwiftASTContext::LoadModule frame #34: lldb_private::SwiftASTContext::GetCompileUnitImportsImpl frame #35: lldb_private::SwiftASTContext::PerformCompileUnitImports frame #36: lldb_private::TypeSystemSwiftTypeRefForExpressions::GetSwiftASTContext frame #37: lldb_private::TypeSystemSwiftTypeRefForExpressions::GetPersistentExpressionState frame #38: lldb_private::Target::GetPersistentSymbol frame #41: lldb_private::TypeSystemMap::ForEach <<<< Lock #1 frame #42: lldb_private::Target::GetPersistentSymbol frame #43: lldb_private::IRExecutionUnit::FindInUserDefinedSymbols frame #44: lldb_private::IRExecutionUnit::FindSymbol frame #45: lldb_private::IRExecutionUnit::MemoryManager::GetSymbolAddressAndPresence frame #46: lldb_private::IRExecutionUnit::MemoryManager::findSymbol frame #47: non-virtual thunk to lldb_private::IRExecutionUnit::MemoryManager::findSymbol frame #48: llvm::LinkingSymbolResolver::findSymbol frame #49: llvm::LegacyJITSymbolResolver::lookup frame #50: llvm::RuntimeDyldImpl::resolveExternalSymbols frame #51: llvm::RuntimeDyldImpl::resolveRelocations frame #52: llvm::MCJIT::finalizeLoadedModules frame #53: llvm::MCJIT::finalizeObject frame #54: lldb_private::IRExecutionUnit::ReportAllocations frame #55: lldb_private::IRExecutionUnit::GetRunnableInfo frame #56: lldb_private::ClangExpressionParser::PrepareForExecution frame #57: lldb_private::ClangUserExpression::TryParse frame #58: lldb_private::ClangUserExpression::Parse ``` Our solution is to simply iterate over a local copy of `m_map`. **Testing** * Confirmed on manual reproducer (would reproduce 100% of the time before the patch) Differential Revision: https://reviews.llvm.org/D149949

Currently, process of replacing bitwise operations consisting of `LSR`/`LSL` with `And` is performed by `DAGCombiner`. However, in certain cases, the `AND` generated by this process can be removed. Consider following case: ``` lsr x8, x8, #56 and x8, x8, #0xfc ldr w0, [x2, x8] ret ``` In this case, we can remove the `AND` by changing the target of `LDR` to `[X2, X8, LSL #2]` and right-shifting amount change to 56 to 58. after changed: ``` lsr x8, x8, #58 ldr w0, [x2, x8, lsl #2] ret ``` This patch checks to see if the `SHIFTING` + `AND` operation on load target can be optimized and optimizes it if it can.

julianmi added 2 commits December 13, 2022 19:14

Emulate graph execution

f4e64cb

Extend testing

f0a4476

julianmi added the Graph Implementation Related to DPC++ implementation and testing label Dec 13, 2022

julianmi requested review from EwanC, reble and Bensuo December 13, 2022 18:23

EwanC reviewed Dec 14, 2022

View reviewed changes

julianmi closed this Jan 20, 2023

julianmi mentioned this pull request Jan 20, 2023

Guard SYCL Graph implementation and fallback emulation #71

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SYCL Graph emulation mode #56

SYCL Graph emulation mode #56

julianmi commented Dec 13, 2022

EwanC Dec 14, 2022

julianmi Dec 14, 2022

EwanC Dec 14, 2022

Bensuo Dec 15, 2022

julianmi Jan 20, 2023

SYCL Graph emulation mode #56

SYCL Graph emulation mode #56

Conversation

julianmi commented Dec 13, 2022

EwanC Dec 14, 2022

Choose a reason for hiding this comment

julianmi Dec 14, 2022

Choose a reason for hiding this comment

EwanC Dec 14, 2022

Choose a reason for hiding this comment

Bensuo Dec 15, 2022

Choose a reason for hiding this comment

julianmi Jan 20, 2023

Choose a reason for hiding this comment