forked from torch/cutorch
merge from master #4
Merged
Conversation
This is similar to THCCachingHostAllocator_recordEvent(), but for CUDA allocations. It is useful for overlapping copies with computation. The workflow is approximately:
0. allocate the dst tensor on the copy stream
1. copy from CPU to GPU on the copy stream
2. synchronize the main stream with the copy stream via cudaStreamWaitEvent
3. THCCachingAllocator_recordStream(dst, main_stream)
The recordStream() call is necessary to prevent the dst tensor from being reused on the copy stream before the main stream finishes its work. Previously, you would have needed to insert a second cudaStreamWaitEvent before dst is freed to force the copy stream to wait on the main stream.
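A minimal sketch of this workflow using plain CUDA runtime calls; cudaMalloc stands in for the caching allocator, and the exact THCCachingAllocator_recordStream() signature is assumed from the description above rather than copied from the header:

```c
#include <stddef.h>
#include <cuda_runtime.h>

/* Sketch of the copy/compute overlap described above. In cutorch, dst would
 * come from the caching allocator while copy_stream is current; cudaMalloc
 * is only a stand-in here. */
void copy_then_compute(const float *src_host, size_t nbytes)
{
    cudaStream_t copy_stream, main_stream;
    cudaEvent_t copy_done;
    float *dst = NULL;

    cudaStreamCreate(&copy_stream);
    cudaStreamCreate(&main_stream);
    cudaEventCreate(&copy_done);

    /* 0. allocate dst on the copy stream (caching-allocator stand-in) */
    cudaMalloc((void **)&dst, nbytes);

    /* 1. copy from CPU to GPU on the copy stream */
    cudaMemcpyAsync(dst, src_host, nbytes, cudaMemcpyHostToDevice, copy_stream);

    /* 2. make the main stream wait for the copy via an event */
    cudaEventRecord(copy_done, copy_stream);
    cudaStreamWaitEvent(main_stream, copy_done, 0);

    /* 3. record dst's use on main_stream so its block is not handed back to
     *    the copy stream before main_stream finishes (signature assumed):
     * THCCachingAllocator_recordStream(dst, main_stream);
     */

    /* ... launch kernels that read dst on main_stream ... */

    cudaStreamSynchronize(main_stream);
    cudaFree(dst);
    cudaEventDestroy(copy_done);
    cudaStreamDestroy(copy_stream);
    cudaStreamDestroy(main_stream);
}
```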
Add THCCachingAllocator_recordStream()
Add CUDA caching allocator accessor
Check event_count before merging blocks
fix bug that invalidates all tests
key only block-wide bitonic sort
add implementation of inclusive scan via upsweep-downsweep
linspace and logspace for CUDA Tensors
Narrow V when returning only some right singular vectors
Make rinfo_ optional in btrifact
Use zero instead of mul when beta == 0 in addr
Update btrisolve argument order.
Time to get rid of warp-synchronous code. It will break!
For large 1D tensors thrust::inclusive_scan is much faster than our current implementation.
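For illustration, a minimal sketch of such a scan through thrust; the function name and stream-passing convention are illustrative, not the actual kernel code in this change:

```cpp
#include <cstddef>
#include <cuda_runtime.h>
#include <thrust/scan.h>
#include <thrust/system/cuda/execution_policy.h>

// Inclusive scan (cumulative sum) of a large 1-D device array via
// thrust::inclusive_scan, run on a caller-supplied stream.
void cumsum_1d(const float *d_in, float *d_out, size_t n, cudaStream_t stream)
{
    thrust::inclusive_scan(thrust::cuda::par.on(stream),
                           d_in, d_in + n, d_out);
}
```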
* move TopK to generic
* partial genericization of kernel code
* introduce TopKTypeConfig, specialize radix type and conversion for floats
* implement topk for byte tensor
* implement for char tensor
* implement for int tensor, extend test to check indices as well
* works for longs too
* make bitfield set/get a struct, add support for 64-bit types (see the sketch below)
* extend to double tensor
* implement for half tensor
* asserts; test fix
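The bitfield item is what allows 64-bit radix types. A hypothetical sketch of what such a struct interface could look like; the actual cutorch code may use inline PTX for the extraction, which this portable version does not attempt to reproduce:

```cpp
#include <cstdint>

// Illustrative bitfield helpers templated on an unsigned radix type,
// e.g. Bitfield<uint32_t> for float keys or Bitfield<uint64_t> for double/long.
template <typename T>
struct Bitfield {
  // Extract `len` bits of `val` starting at bit `pos`.
  static __host__ __device__ T getBitfield(T val, int pos, int len) {
    T mask = (len >= (int)(sizeof(T) * 8)) ? ~T(0) : ((T(1) << len) - 1);
    return (val >> pos) & mask;
  }

  // Return `val` with `len` bits starting at `pos` replaced by `toInsert`.
  static __host__ __device__ T setBitfield(T val, T toInsert, int pos, int len) {
    T mask = (len >= (int)(sizeof(T) * 8)) ? ~T(0) : ((T(1) << len) - 1);
    return (val & ~(mask << pos)) | ((toInsert & mask) << pos);
  }
};
```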
By default, the keepdim parameter is False -- a backwards-incompatible change, but one that follows numpy semantics, e.g. numpy.sum (numpy names the parameter "keepdims", since you can pass multiple dims to reduction functions). The old behavior is still desirable for normalization-type operations, where the tensor is immediately expanded out again, e.g. probs.sum(1).expand_as(probs), which no longer works because the dimension to expand is missing. This can be fixed by simply passing True as the "keepdim" argument to the reduction operation, e.g. probs.sum(1, keepdim=True).expand_as(probs).