-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SYCL Graph emulation mode #56
Conversation
@@ -1307,44 +1307,6 @@ pi_result _pi_context::getAvailableCommandList( | |||
pi_queue Queue, pi_command_list_ptr_t &CommandList, bool UseCopyEngine, | |||
bool AllowBatching, ze_command_queue_handle_t *ForcedCmdQueue) { | |||
|
|||
// This is a hack. TODO: Proper CommandList allocation per Executable Graph. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this emulation mode to be most useful, in my mind it wouldn't live in a separate branch that we would need to keep up-to-date with the PoC branch, but live in the PoC branch itself.
Rather than removing this code, have you considered making emulation mode an environment variable users can set to ON?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, Ewan. The idea of using a separate branch and removing the PI implementation was to ease the review and merging of the code into the mainline. My understanding of the proposed three-step merge process is to provide the Graph API in the emulation mode and extend it with plugin implementations. Thus, SYCL Graph codes could be compiled for all runtimes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh, that makes sense as the first part of the 3 stage approach. I'm a bit wary of us trying to maintain a lot of branches, so I think we should sync up on our call later about how soon we want to open that stage 1 PR. If it's ASAP then using sycl-graph-poc-emulation
to do that I think is good. However, if it's more delayed and more changes start getting merged into the PoC branch, then maintaining sycl-graph-poc-emulation
will be an overhead, and it might just be better to have the path in the PoC branch guarded by a preprocessor macro. Then, once we want to open a DPC++ PR with just stage 1, we can easily remove the macro and just keep the emulation path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, Ewan. The idea of using a separate branch and removing the PI implementation was to ease the review and merging of the code into the mainline. My understanding of the proposed three-step merge process is to provide the Graph API in the emulation mode and extend it with plugin implementations. Thus, SYCL Graph codes could be compiled for all runtimes.
I think the 3-step merge process referred to in the comment on the POC PR is for a future final implementation and not for the POC (which this code is based on). Unless we are planning to close that open PR and resubmit this as the first step, I agree with the idea of having this be optional functionality in the POC guarded by a macro.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will close this in favor of a version that is guarded by the SYCL Graph macro as discussed. I will further split the testing part of this PR into a separate PR.
…callback The `TypeSystemMap::m_mutex` guards against concurrent modifications of members of `TypeSystemMap`. In particular, `m_map`. `TypeSystemMap::ForEach` iterates through the entire `m_map` calling a user-specified callback for each entry. This is all done while `m_mutex` is locked. However, there's nothing that guarantees that the callback itself won't call back into `TypeSystemMap` APIs on the same thread. This lead to double-locking `m_mutex`, which is undefined behaviour. We've seen this cause a deadlock in the swift plugin with following backtrace: ``` int main() { std::unique_ptr<int> up = std::make_unique<int>(5); volatile int val = *up; return val; } clang++ -std=c++2a -g -O1 main.cpp ./bin/lldb -o “br se -p return” -o run -o “v *up” -o “expr *up” -b ``` ``` frame #4: std::lock_guard<std::mutex>::lock_guard frame #5: lldb_private::TypeSystemMap::GetTypeSystemForLanguage <<<< Lock #2 frame #6: lldb_private::TypeSystemMap::GetTypeSystemForLanguage frame #7: lldb_private::Target::GetScratchTypeSystemForLanguage ... frame #26: lldb_private::SwiftASTContext::LoadLibraryUsingPaths frame #27: lldb_private::SwiftASTContext::LoadModule frame #30: swift::ModuleDecl::collectLinkLibraries frame #31: lldb_private::SwiftASTContext::LoadModule frame #34: lldb_private::SwiftASTContext::GetCompileUnitImportsImpl frame #35: lldb_private::SwiftASTContext::PerformCompileUnitImports frame #36: lldb_private::TypeSystemSwiftTypeRefForExpressions::GetSwiftASTContext frame #37: lldb_private::TypeSystemSwiftTypeRefForExpressions::GetPersistentExpressionState frame #38: lldb_private::Target::GetPersistentSymbol frame #41: lldb_private::TypeSystemMap::ForEach <<<< Lock #1 frame #42: lldb_private::Target::GetPersistentSymbol frame #43: lldb_private::IRExecutionUnit::FindInUserDefinedSymbols frame #44: lldb_private::IRExecutionUnit::FindSymbol frame #45: lldb_private::IRExecutionUnit::MemoryManager::GetSymbolAddressAndPresence frame #46: lldb_private::IRExecutionUnit::MemoryManager::findSymbol frame #47: non-virtual thunk to lldb_private::IRExecutionUnit::MemoryManager::findSymbol frame #48: llvm::LinkingSymbolResolver::findSymbol frame #49: llvm::LegacyJITSymbolResolver::lookup frame #50: llvm::RuntimeDyldImpl::resolveExternalSymbols frame #51: llvm::RuntimeDyldImpl::resolveRelocations frame #52: llvm::MCJIT::finalizeLoadedModules frame #53: llvm::MCJIT::finalizeObject frame #54: lldb_private::IRExecutionUnit::ReportAllocations frame #55: lldb_private::IRExecutionUnit::GetRunnableInfo frame #56: lldb_private::ClangExpressionParser::PrepareForExecution frame #57: lldb_private::ClangUserExpression::TryParse frame #58: lldb_private::ClangUserExpression::Parse ``` Our solution is to simply iterate over a local copy of `m_map`. **Testing** * Confirmed on manual reproducer (would reproduce 100% of the time before the patch) Differential Revision: https://reviews.llvm.org/D149949
Currently, process of replacing bitwise operations consisting of `LSR`/`LSL` with `And` is performed by `DAGCombiner`. However, in certain cases, the `AND` generated by this process can be removed. Consider following case: ``` lsr x8, x8, #56 and x8, x8, #0xfc ldr w0, [x2, x8] ret ``` In this case, we can remove the `AND` by changing the target of `LDR` to `[X2, X8, LSL #2]` and right-shifting amount change to 56 to 58. after changed: ``` lsr x8, x8, #58 ldr w0, [x2, x8, lsl #2] ret ``` This patch checks to see if the `SHIFTING` + `AND` operation on load target can be optimized and optimizes it if it can.
This PR proposes an emulation mode for SYCL Graph. This provides the APIs of the sycl-graph-poc-v2 branch. The execution of the graph is however emulated by submitting each graph node when executing the graph (repeatedly). This addresses #20.
This provides the first step of merging the SYCL Graph POC into the mainline. The proposed process can be found here: intel#7627 (comment)