[SYCL] Fix reductions not working inside graph #50

Bensuo · 2022-12-08T15:39:15Z

Graph submission now properly creates a host visible event on the command list allowing auxilliary resources to be cleaned up.
Code is taken from _pi_queue::executeCommandList, copy pasting a decent amount of code like that is not ideal but there isn't an immediately obvious solution to let us actually use that function when flushing.

Updated to remove code duplication, instead relying on the OKtoBatchCommand parameter to executeCommandList to let us determine when to actually submit the command list and when to block execution.

This will at least make the examples that use reductions run, though it doesn't solve related problems with buffers and other regular operations on the queue.

Closes #24

- Graph submission now properly creates a host visible event on the command list allowing auxilliary resources to be cleaned up

reble · 2022-12-08T18:05:29Z

@Bensuo Great! Thanks for fixing that. We should just add a note with a TODO to remove the code duplication.

Now I'm wondering why we should not just call executeCommandList(CommandList,false,false) and make sure that we opt in the event creation when in lazy mode? Like:
if( !Device->useImmediateCommandLists() || Queue->Properties & PI_EXT_ONEAPI_QUEUE_LAZY_EXECUTION ) {

The idea was to decouple closing and executing the CommandList. But that is not what is happening now.

EwanC

Nice work figuring this out ⭐
Think we should merge this to fix current issues, and then try to rework as Pablo suggests as a follow-up

sycl/plugins/level_zero/pi_level_zero.cpp

Bensuo · 2022-12-09T14:47:23Z

@Bensuo Great! Thanks for fixing that. We should just add a note with a TODO to remove the code duplication.

Now I'm wondering why we should not just call executeCommandList(CommandList,false,false) and make sure that we opt in the event creation when in lazy mode? Like: if( !Device->useImmediateCommandLists() || Queue->Properties & PI_EXT_ONEAPI_QUEUE_LAZY_EXECUTION ) {

The idea was to decouple closing and executing the CommandList. But that is not what is happening now.

So I've been looking into this idea and I think I have something along those lines. I think we only need to modify the initial check in executeCommandList to also check if it's allowed to batch. We can somewhat assume that kernels will always be allowed to batch (and thus blocked from executing immediately) so if we're not allowed to batch we allow execution (so it works when called from the flush). It doesn't address any of the additional issues other than reductions and relying on kernels always being allowed to batch could be problematic if that changes in the future. It would remove all the code duplication from flush though.

I'll push it as an update to this PR but can always roll back to the already approved duplication approach if need be.

- Remove code duplication in piQueueFlush in favour of calling executeCommandList - executeCommandList slightly modified to block execution only for command lists not allowed to be batched.

…callback The `TypeSystemMap::m_mutex` guards against concurrent modifications of members of `TypeSystemMap`. In particular, `m_map`. `TypeSystemMap::ForEach` iterates through the entire `m_map` calling a user-specified callback for each entry. This is all done while `m_mutex` is locked. However, there's nothing that guarantees that the callback itself won't call back into `TypeSystemMap` APIs on the same thread. This lead to double-locking `m_mutex`, which is undefined behaviour. We've seen this cause a deadlock in the swift plugin with following backtrace: ``` int main() { std::unique_ptr<int> up = std::make_unique<int>(5); volatile int val = *up; return val; } clang++ -std=c++2a -g -O1 main.cpp ./bin/lldb -o “br se -p return” -o run -o “v *up” -o “expr *up” -b ``` ``` frame #4: std::lock_guard<std::mutex>::lock_guard frame #5: lldb_private::TypeSystemMap::GetTypeSystemForLanguage <<<< Lock #2 frame #6: lldb_private::TypeSystemMap::GetTypeSystemForLanguage frame #7: lldb_private::Target::GetScratchTypeSystemForLanguage ... frame #26: lldb_private::SwiftASTContext::LoadLibraryUsingPaths frame #27: lldb_private::SwiftASTContext::LoadModule frame #30: swift::ModuleDecl::collectLinkLibraries frame #31: lldb_private::SwiftASTContext::LoadModule frame #34: lldb_private::SwiftASTContext::GetCompileUnitImportsImpl frame #35: lldb_private::SwiftASTContext::PerformCompileUnitImports frame #36: lldb_private::TypeSystemSwiftTypeRefForExpressions::GetSwiftASTContext frame #37: lldb_private::TypeSystemSwiftTypeRefForExpressions::GetPersistentExpressionState frame #38: lldb_private::Target::GetPersistentSymbol frame #41: lldb_private::TypeSystemMap::ForEach <<<< Lock #1 frame #42: lldb_private::Target::GetPersistentSymbol frame #43: lldb_private::IRExecutionUnit::FindInUserDefinedSymbols frame #44: lldb_private::IRExecutionUnit::FindSymbol frame #45: lldb_private::IRExecutionUnit::MemoryManager::GetSymbolAddressAndPresence frame #46: lldb_private::IRExecutionUnit::MemoryManager::findSymbol frame #47: non-virtual thunk to lldb_private::IRExecutionUnit::MemoryManager::findSymbol frame #48: llvm::LinkingSymbolResolver::findSymbol frame #49: llvm::LegacyJITSymbolResolver::lookup frame #50: llvm::RuntimeDyldImpl::resolveExternalSymbols frame #51: llvm::RuntimeDyldImpl::resolveRelocations frame #52: llvm::MCJIT::finalizeLoadedModules frame #53: llvm::MCJIT::finalizeObject frame #54: lldb_private::IRExecutionUnit::ReportAllocations frame #55: lldb_private::IRExecutionUnit::GetRunnableInfo frame #56: lldb_private::ClangExpressionParser::PrepareForExecution frame #57: lldb_private::ClangUserExpression::TryParse frame #58: lldb_private::ClangUserExpression::Parse ``` Our solution is to simply iterate over a local copy of `m_map`. **Testing** * Confirmed on manual reproducer (would reproduce 100% of the time before the patch) Differential Revision: https://reviews.llvm.org/D149949

[SYCL] Fix reductions not working inside graph

f195d51

- Graph submission now properly creates a host visible event on the command list allowing auxilliary resources to be cleaned up

Bensuo requested review from EwanC, reble and julianmi December 8, 2022 15:39

reble approved these changes Dec 8, 2022

View reviewed changes

EwanC approved these changes Dec 9, 2022

View reviewed changes

sycl/plugins/level_zero/pi_level_zero.cpp Outdated Show resolved Hide resolved

[SYCL] Add todo comment to remove code duplication in piQueueFlush

ca55c3d

EwanC added the Graph Implementation Related to DPC++ implementation and testing label Dec 9, 2022

[SYCL] Remove code duplication in piQueueFlush

fd0d114

- Remove code duplication in piQueueFlush in favour of calling executeCommandList - executeCommandList slightly modified to block execution only for command lists not allowed to be batched.

EwanC approved these changes Dec 12, 2022

View reviewed changes

Bensuo mentioned this pull request Dec 12, 2022

[SYCL] Enable submitting sub-graphs #55

Merged

Bensuo requested a review from reble December 13, 2022 10:55

EwanC merged commit c99bdca into sycl-graph-poc-v2 Dec 13, 2022

Bensuo deleted the ben/reduction-fix branch August 7, 2023 16:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SYCL] Fix reductions not working inside graph #50

[SYCL] Fix reductions not working inside graph #50

Bensuo commented Dec 8, 2022 •

edited

Loading

reble commented Dec 8, 2022

EwanC left a comment

Bensuo commented Dec 9, 2022 •

edited

Loading

[SYCL] Fix reductions not working inside graph #50

[SYCL] Fix reductions not working inside graph #50

Conversation

Bensuo commented Dec 8, 2022 • edited Loading

reble commented Dec 8, 2022

EwanC left a comment

Choose a reason for hiding this comment

Bensuo commented Dec 9, 2022 • edited Loading

Bensuo commented Dec 8, 2022 •

edited

Loading

Bensuo commented Dec 9, 2022 •

edited

Loading