Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nodes captured or added to a graph can fail due to captured variables going out of scope #49

Closed
Bensuo opened this issue Dec 7, 2022 · 2 comments
Assignees
Labels
bug Something isn't working Graph Implementation Related to DPC++ implementation and testing

Comments

@Bensuo
Copy link
Collaborator

Bensuo commented Dec 7, 2022

Because of the delayed submission of the lambda functions passed to the sycl::queue in either mode of graph construction it is possible for a situation to arise where the lambda captures variables which are alive at time of submission to the queue but out of scope at time of graph execution, which causes undefined behaviour and errors when executing a graph. For example:

void run_some_kernel(sycl::queue q, int* data){
  // data is captured by ref here but will have gone out of scope when the 
  // CGF is later run when the graph is executed.
  q.submit([&](sycl::handler& h){
    // Call some kernel which uses data here
  });
}
int main(){

sycl::queue q;

int* someUsmAllocatedPtr;
q.begin_recording();
run_some_kernel(q, someUsmAllocatedPtr);
q.end_recording();
}

This scenario is potentially more common with record and replay but not improbable with the explicit API.

@Bensuo Bensuo added the bug Something isn't working label Dec 7, 2022
@EwanC EwanC added the Graph Implementation Related to DPC++ implementation and testing label Dec 8, 2022
@Bensuo
Copy link
Collaborator Author

Bensuo commented Jan 11, 2023

When investigating using record and replay with oneDNN this problem can also be seen. Where kernels are actually submitted within oneDNN, in this code, many variables are captured and used in the CGF which will go out of scope before graph execution time.

In some cases this can be avoided by capturing everything by value but for example in this case the kernel_arg_list_t type explicitly cannot be copied so even this doesn't help. This is mostly mentioned as an interesting note because obviously it is not desirable to change the lambda capture to by value in most cases.

EwanC added a commit that referenced this issue Mar 17, 2023
Introduce a test case which fails before this commit
and passes afterwards. Based on #49
@EwanC
Copy link
Collaborator

EwanC commented Mar 24, 2023

Closing as fixed in #87

@EwanC EwanC closed this as completed Mar 24, 2023
reble pushed a commit that referenced this issue May 15, 2023
…callback

The `TypeSystemMap::m_mutex` guards against concurrent modifications
of members of `TypeSystemMap`. In particular, `m_map`.

`TypeSystemMap::ForEach` iterates through the entire `m_map` calling
a user-specified callback for each entry. This is all done while
`m_mutex` is locked. However, there's nothing that guarantees that
the callback itself won't call back into `TypeSystemMap` APIs on the
same thread. This lead to double-locking `m_mutex`, which is undefined
behaviour. We've seen this cause a deadlock in the swift plugin with
following backtrace:

```

int main() {
    std::unique_ptr<int> up = std::make_unique<int>(5);

    volatile int val = *up;
    return val;
}

clang++ -std=c++2a -g -O1 main.cpp

./bin/lldb -o “br se -p return” -o run -o “v *up” -o “expr *up” -b
```

```
frame #4: std::lock_guard<std::mutex>::lock_guard
frame #5: lldb_private::TypeSystemMap::GetTypeSystemForLanguage <<<< Lock #2
frame #6: lldb_private::TypeSystemMap::GetTypeSystemForLanguage
frame #7: lldb_private::Target::GetScratchTypeSystemForLanguage
...
frame #26: lldb_private::SwiftASTContext::LoadLibraryUsingPaths
frame #27: lldb_private::SwiftASTContext::LoadModule
frame #30: swift::ModuleDecl::collectLinkLibraries
frame #31: lldb_private::SwiftASTContext::LoadModule
frame #34: lldb_private::SwiftASTContext::GetCompileUnitImportsImpl
frame #35: lldb_private::SwiftASTContext::PerformCompileUnitImports
frame #36: lldb_private::TypeSystemSwiftTypeRefForExpressions::GetSwiftASTContext
frame #37: lldb_private::TypeSystemSwiftTypeRefForExpressions::GetPersistentExpressionState
frame #38: lldb_private::Target::GetPersistentSymbol
frame #41: lldb_private::TypeSystemMap::ForEach                 <<<< Lock #1
frame #42: lldb_private::Target::GetPersistentSymbol
frame #43: lldb_private::IRExecutionUnit::FindInUserDefinedSymbols
frame #44: lldb_private::IRExecutionUnit::FindSymbol
frame #45: lldb_private::IRExecutionUnit::MemoryManager::GetSymbolAddressAndPresence
frame #46: lldb_private::IRExecutionUnit::MemoryManager::findSymbol
frame #47: non-virtual thunk to lldb_private::IRExecutionUnit::MemoryManager::findSymbol
frame #48: llvm::LinkingSymbolResolver::findSymbol
frame #49: llvm::LegacyJITSymbolResolver::lookup
frame #50: llvm::RuntimeDyldImpl::resolveExternalSymbols
frame #51: llvm::RuntimeDyldImpl::resolveRelocations
frame #52: llvm::MCJIT::finalizeLoadedModules
frame #53: llvm::MCJIT::finalizeObject
frame #54: lldb_private::IRExecutionUnit::ReportAllocations
frame #55: lldb_private::IRExecutionUnit::GetRunnableInfo
frame #56: lldb_private::ClangExpressionParser::PrepareForExecution
frame #57: lldb_private::ClangUserExpression::TryParse
frame #58: lldb_private::ClangUserExpression::Parse
```

Our solution is to simply iterate over a local copy of `m_map`.

**Testing**

* Confirmed on manual reproducer (would reproduce 100% of the time
  before the patch)

Differential Revision: https://reviews.llvm.org/D149949
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Graph Implementation Related to DPC++ implementation and testing
Projects
None yet
Development

No branches or pull requests

2 participants