Nodes captured or added to a graph can fail due to captured variables going out of scope #49

Bensuo · 2022-12-07T15:35:40Z

Because of the delayed submission of the lambda functions passed to the sycl::queue in either mode of graph construction it is possible for a situation to arise where the lambda captures variables which are alive at time of submission to the queue but out of scope at time of graph execution, which causes undefined behaviour and errors when executing a graph. For example:

void run_some_kernel(sycl::queue q, int* data){
  // data is captured by ref here but will have gone out of scope when the 
  // CGF is later run when the graph is executed.
  q.submit([&](sycl::handler& h){
    // Call some kernel which uses data here
  });
}
int main(){

sycl::queue q;

int* someUsmAllocatedPtr;
q.begin_recording();
run_some_kernel(q, someUsmAllocatedPtr);
q.end_recording();
}

This scenario is potentially more common with record and replay but not improbable with the explicit API.

The text was updated successfully, but these errors were encountered:

Bensuo · 2023-01-11T17:47:27Z

When investigating using record and replay with oneDNN this problem can also be seen. Where kernels are actually submitted within oneDNN, in this code, many variables are captured and used in the CGF which will go out of scope before graph execution time.

In some cases this can be avoided by capturing everything by value but for example in this case the kernel_arg_list_t type explicitly cannot be copied so even this doesn't help. This is mostly mentioned as an interesting note because obviously it is not desirable to change the lambda capture to by value in most cases.

Introduce a test case which fails before this commit and passes afterwards. Based on #49

EwanC · 2023-03-24T10:50:43Z

Closing as fixed in #87

…callback The `TypeSystemMap::m_mutex` guards against concurrent modifications of members of `TypeSystemMap`. In particular, `m_map`. `TypeSystemMap::ForEach` iterates through the entire `m_map` calling a user-specified callback for each entry. This is all done while `m_mutex` is locked. However, there's nothing that guarantees that the callback itself won't call back into `TypeSystemMap` APIs on the same thread. This lead to double-locking `m_mutex`, which is undefined behaviour. We've seen this cause a deadlock in the swift plugin with following backtrace: ``` int main() { std::unique_ptr<int> up = std::make_unique<int>(5); volatile int val = *up; return val; } clang++ -std=c++2a -g -O1 main.cpp ./bin/lldb -o “br se -p return” -o run -o “v *up” -o “expr *up” -b ``` ``` frame #4: std::lock_guard<std::mutex>::lock_guard frame #5: lldb_private::TypeSystemMap::GetTypeSystemForLanguage <<<< Lock #2 frame #6: lldb_private::TypeSystemMap::GetTypeSystemForLanguage frame #7: lldb_private::Target::GetScratchTypeSystemForLanguage ... frame #26: lldb_private::SwiftASTContext::LoadLibraryUsingPaths frame #27: lldb_private::SwiftASTContext::LoadModule frame #30: swift::ModuleDecl::collectLinkLibraries frame #31: lldb_private::SwiftASTContext::LoadModule frame #34: lldb_private::SwiftASTContext::GetCompileUnitImportsImpl frame #35: lldb_private::SwiftASTContext::PerformCompileUnitImports frame #36: lldb_private::TypeSystemSwiftTypeRefForExpressions::GetSwiftASTContext frame #37: lldb_private::TypeSystemSwiftTypeRefForExpressions::GetPersistentExpressionState frame #38: lldb_private::Target::GetPersistentSymbol frame #41: lldb_private::TypeSystemMap::ForEach <<<< Lock #1 frame #42: lldb_private::Target::GetPersistentSymbol frame #43: lldb_private::IRExecutionUnit::FindInUserDefinedSymbols frame #44: lldb_private::IRExecutionUnit::FindSymbol frame #45: lldb_private::IRExecutionUnit::MemoryManager::GetSymbolAddressAndPresence frame #46: lldb_private::IRExecutionUnit::MemoryManager::findSymbol frame #47: non-virtual thunk to lldb_private::IRExecutionUnit::MemoryManager::findSymbol frame #48: llvm::LinkingSymbolResolver::findSymbol frame #49: llvm::LegacyJITSymbolResolver::lookup frame #50: llvm::RuntimeDyldImpl::resolveExternalSymbols frame #51: llvm::RuntimeDyldImpl::resolveRelocations frame #52: llvm::MCJIT::finalizeLoadedModules frame #53: llvm::MCJIT::finalizeObject frame #54: lldb_private::IRExecutionUnit::ReportAllocations frame #55: lldb_private::IRExecutionUnit::GetRunnableInfo frame #56: lldb_private::ClangExpressionParser::PrepareForExecution frame #57: lldb_private::ClangUserExpression::TryParse frame #58: lldb_private::ClangUserExpression::Parse ``` Our solution is to simply iterate over a local copy of `m_map`. **Testing** * Confirmed on manual reproducer (would reproduce 100% of the time before the patch) Differential Revision: https://reviews.llvm.org/D149949

Bensuo added the bug Something isn't working label Dec 7, 2022

EwanC added the Graph Implementation Related to DPC++ implementation and testing label Dec 8, 2022

EwanC assigned Bensuo Feb 16, 2023

EwanC mentioned this issue Mar 7, 2023

[SYCL] Remove CGF reuse in graph nodes #87

Merged

EwanC added a commit that referenced this issue Mar 17, 2023

Add test case

e638c37

Introduce a test case which fails before this commit and passes afterwards. Based on #49

EwanC closed this as completed Mar 24, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Nodes captured or added to a graph can fail due to captured variables going out of scope #49

Nodes captured or added to a graph can fail due to captured variables going out of scope #49

Bensuo commented Dec 7, 2022

Bensuo commented Jan 11, 2023 •

edited

Loading

EwanC commented Mar 24, 2023

Nodes captured or added to a graph can fail due to captured variables going out of scope #49

Nodes captured or added to a graph can fail due to captured variables going out of scope #49

Comments

Bensuo commented Dec 7, 2022

Bensuo commented Jan 11, 2023 • edited Loading

EwanC commented Mar 24, 2023

Bensuo commented Jan 11, 2023 •

edited

Loading