Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SYCL Graph emulation mode #56

Closed
wants to merge 2 commits into from

Conversation

julianmi
Copy link
Collaborator

This PR proposes an emulation mode for SYCL Graph. This provides the APIs of the sycl-graph-poc-v2 branch. The execution of the graph is however emulated by submitting each graph node when executing the graph (repeatedly). This addresses #20.

This provides the first step of merging the SYCL Graph POC into the mainline. The proposed process can be found here: intel#7627 (comment)

@julianmi julianmi added the Graph Implementation Related to DPC++ implementation and testing label Dec 13, 2022
@julianmi julianmi requested review from EwanC, reble and Bensuo December 13, 2022 18:23
@@ -1307,44 +1307,6 @@ pi_result _pi_context::getAvailableCommandList(
pi_queue Queue, pi_command_list_ptr_t &CommandList, bool UseCopyEngine,
bool AllowBatching, ze_command_queue_handle_t *ForcedCmdQueue) {

// This is a hack. TODO: Proper CommandList allocation per Executable Graph.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this emulation mode to be most useful, in my mind it wouldn't live in a separate branch that we would need to keep up-to-date with the PoC branch, but live in the PoC branch itself.

Rather than removing this code, have you considered making emulation mode an environment variable users can set to ON?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, Ewan. The idea of using a separate branch and removing the PI implementation was to ease the review and merging of the code into the mainline. My understanding of the proposed three-step merge process is to provide the Graph API in the emulation mode and extend it with plugin implementations. Thus, SYCL Graph codes could be compiled for all runtimes.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, that makes sense as the first part of the 3 stage approach. I'm a bit wary of us trying to maintain a lot of branches, so I think we should sync up on our call later about how soon we want to open that stage 1 PR. If it's ASAP then using sycl-graph-poc-emulation to do that I think is good. However, if it's more delayed and more changes start getting merged into the PoC branch, then maintaining sycl-graph-poc-emulation will be an overhead, and it might just be better to have the path in the PoC branch guarded by a preprocessor macro. Then, once we want to open a DPC++ PR with just stage 1, we can easily remove the macro and just keep the emulation path.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, Ewan. The idea of using a separate branch and removing the PI implementation was to ease the review and merging of the code into the mainline. My understanding of the proposed three-step merge process is to provide the Graph API in the emulation mode and extend it with plugin implementations. Thus, SYCL Graph codes could be compiled for all runtimes.

I think the 3-step merge process referred to in the comment on the POC PR is for a future final implementation and not for the POC (which this code is based on). Unless we are planning to close that open PR and resubmit this as the first step, I agree with the idea of having this be optional functionality in the POC guarded by a macro.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will close this in favor of a version that is guarded by the SYCL Graph macro as discussed. I will further split the testing part of this PR into a separate PR.

@julianmi julianmi closed this Jan 20, 2023
reble pushed a commit that referenced this pull request May 15, 2023
…callback

The `TypeSystemMap::m_mutex` guards against concurrent modifications
of members of `TypeSystemMap`. In particular, `m_map`.

`TypeSystemMap::ForEach` iterates through the entire `m_map` calling
a user-specified callback for each entry. This is all done while
`m_mutex` is locked. However, there's nothing that guarantees that
the callback itself won't call back into `TypeSystemMap` APIs on the
same thread. This lead to double-locking `m_mutex`, which is undefined
behaviour. We've seen this cause a deadlock in the swift plugin with
following backtrace:

```

int main() {
    std::unique_ptr<int> up = std::make_unique<int>(5);

    volatile int val = *up;
    return val;
}

clang++ -std=c++2a -g -O1 main.cpp

./bin/lldb -o “br se -p return” -o run -o “v *up” -o “expr *up” -b
```

```
frame #4: std::lock_guard<std::mutex>::lock_guard
frame #5: lldb_private::TypeSystemMap::GetTypeSystemForLanguage <<<< Lock #2
frame #6: lldb_private::TypeSystemMap::GetTypeSystemForLanguage
frame #7: lldb_private::Target::GetScratchTypeSystemForLanguage
...
frame #26: lldb_private::SwiftASTContext::LoadLibraryUsingPaths
frame #27: lldb_private::SwiftASTContext::LoadModule
frame #30: swift::ModuleDecl::collectLinkLibraries
frame #31: lldb_private::SwiftASTContext::LoadModule
frame #34: lldb_private::SwiftASTContext::GetCompileUnitImportsImpl
frame #35: lldb_private::SwiftASTContext::PerformCompileUnitImports
frame #36: lldb_private::TypeSystemSwiftTypeRefForExpressions::GetSwiftASTContext
frame #37: lldb_private::TypeSystemSwiftTypeRefForExpressions::GetPersistentExpressionState
frame #38: lldb_private::Target::GetPersistentSymbol
frame #41: lldb_private::TypeSystemMap::ForEach                 <<<< Lock #1
frame #42: lldb_private::Target::GetPersistentSymbol
frame #43: lldb_private::IRExecutionUnit::FindInUserDefinedSymbols
frame #44: lldb_private::IRExecutionUnit::FindSymbol
frame #45: lldb_private::IRExecutionUnit::MemoryManager::GetSymbolAddressAndPresence
frame #46: lldb_private::IRExecutionUnit::MemoryManager::findSymbol
frame #47: non-virtual thunk to lldb_private::IRExecutionUnit::MemoryManager::findSymbol
frame #48: llvm::LinkingSymbolResolver::findSymbol
frame #49: llvm::LegacyJITSymbolResolver::lookup
frame #50: llvm::RuntimeDyldImpl::resolveExternalSymbols
frame #51: llvm::RuntimeDyldImpl::resolveRelocations
frame #52: llvm::MCJIT::finalizeLoadedModules
frame #53: llvm::MCJIT::finalizeObject
frame #54: lldb_private::IRExecutionUnit::ReportAllocations
frame #55: lldb_private::IRExecutionUnit::GetRunnableInfo
frame #56: lldb_private::ClangExpressionParser::PrepareForExecution
frame #57: lldb_private::ClangUserExpression::TryParse
frame #58: lldb_private::ClangUserExpression::Parse
```

Our solution is to simply iterate over a local copy of `m_map`.

**Testing**

* Confirmed on manual reproducer (would reproduce 100% of the time
  before the patch)

Differential Revision: https://reviews.llvm.org/D149949
aarongreig pushed a commit that referenced this pull request Sep 24, 2024
Currently, process of replacing bitwise operations consisting of
`LSR`/`LSL` with `And` is performed by `DAGCombiner`.

However, in certain cases, the `AND` generated by this process
can be removed.

Consider following case:
```
        lsr x8, x8, #56
        and x8, x8, #0xfc
        ldr w0, [x2, x8]
        ret
```

In this case, we can remove the `AND` by changing the target of `LDR`
to `[X2, X8, LSL #2]` and right-shifting amount change to 56 to 58.

after changed:
```
        lsr x8, x8, #58
        ldr w0, [x2, x8, lsl #2]
        ret
```

This patch checks to see if the `SHIFTING` + `AND` operation on load
target can be optimized and optimizes it if it can.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Graph Implementation Related to DPC++ implementation and testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants