forked from torch/cutorch
merge from master #4
Merged
Conversation
This is similar to THCCachingHostAllocator_recordEvent(), but for CUDA allocations. It is useful for overlapping copies with computation. The workflow is approximately:
0. allocate the dst tensor on the copy stream
1. copy from CPU to GPU on the copy stream
2. synchronize the main stream with the copy stream via cudaStreamWaitEvent
3. THCCachingAllocator_recordStream(dst, main_stream)
The recordStream() call is necessary to prevent the dst tensor from being reused on the copy stream before the main stream finishes its work. Previously, you would have needed to insert a second cudaStreamWaitEvent before dst is freed to force the copy stream to wait on the main stream.
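A minimal sketch of this workflow using plain CUDA runtime calls; cudaMalloc stands in for the caching allocator, and the exact THCCachingAllocator_recordStream() signature is assumed from the description above rather than copied from the header:

```c
#include <stddef.h>
#include <cuda_runtime.h>

/* Sketch of the copy/compute overlap described above. In cutorch, dst would
 * come from the caching allocator while copy_stream is current; cudaMalloc
 * is only a stand-in here. */
void copy_then_compute(const float *src_host, size_t nbytes)
{
    cudaStream_t copy_stream, main_stream;
    cudaEvent_t copy_done;
    float *dst = NULL;

    cudaStreamCreate(&copy_stream);
    cudaStreamCreate(&main_stream);
    cudaEventCreate(&copy_done);

    /* 0. allocate dst on the copy stream (caching-allocator stand-in) */
    cudaMalloc((void **)&dst, nbytes);

    /* 1. copy from CPU to GPU on the copy stream */
    cudaMemcpyAsync(dst, src_host, nbytes, cudaMemcpyHostToDevice, copy_stream);

    /* 2. make the main stream wait for the copy via an event */
    cudaEventRecord(copy_done, copy_stream);
    cudaStreamWaitEvent(main_stream, copy_done, 0);

    /* 3. record dst's use on main_stream so its block is not handed back to
     *    the copy stream before main_stream finishes (signature assumed):
     * THCCachingAllocator_recordStream(dst, main_stream);
     */

    /* ... launch kernels that read dst on main_stream ... */

    cudaStreamSynchronize(main_stream);
    cudaFree(dst);
    cudaEventDestroy(copy_done);
    cudaStreamDestroy(copy_stream);
    cudaStreamDestroy(main_stream);
}
```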
Add THCCachingAllocator_recordStream()
Add CUDA caching allocator accessor
Check event_count before merging blocks
fix bug that invalidates all tests
key only block-wide bitonic sort
add implementation of inclusive scan via upsweep-downsweep
linspace and logspace for CUDA Tensors
Narrow V when returning only some right singular vectors
Make rinfo_ optional in btrifact
Use zero instead of mul when beta == 0 in addr
Update btrisolve argument order.
Time to get rid of warp-synchronous code. It will break!
For large 1D tensors thrust::inclusive_scan is much faster than our current implementation.
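For illustration, a minimal sketch of such a scan through thrust; the function name and stream-passing convention are illustrative, not the actual kernel code in this change:

```cpp
#include <cstddef>
#include <cuda_runtime.h>
#include <thrust/scan.h>
#include <thrust/system/cuda/execution_policy.h>

// Inclusive scan (cumulative sum) of a large 1-D device array via
// thrust::inclusive_scan, run on a caller-supplied stream.
void cumsum_1d(const float *d_in, float *d_out, size_t n, cudaStream_t stream)
{
    thrust::inclusive_scan(thrust::cuda::par.on(stream),
                           d_in, d_in + n, d_out);
}
```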
* move TopK to generic
* partial genericization of kernel code
* introduce TopKTypeConfig, specialize radix type and conversion for floats
* implement topk for byte tensor
* implement for char tensor
* implement for int tensor, extend test to check indices as well
* works for longs too
* make bitfield set/get a struct, add support for 64-bit types (see the sketch below)
* extend to double tensor
* implement for half tensor
* asserts; test fix
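The bitfield item is what allows 64-bit radix types. A hypothetical sketch of what such a struct interface could look like; the actual cutorch code may use inline PTX for the extraction, which this portable version does not attempt to reproduce:

```cpp
#include <cstdint>

// Illustrative bitfield helpers templated on an unsigned radix type,
// e.g. Bitfield<uint32_t> for float keys or Bitfield<uint64_t> for double/long.
template <typename T>
struct Bitfield {
  // Extract `len` bits of `val` starting at bit `pos`.
  static __host__ __device__ T getBitfield(T val, int pos, int len) {
    T mask = (len >= (int)(sizeof(T) * 8)) ? ~T(0) : ((T(1) << len) - 1);
    return (val >> pos) & mask;
  }

  // Return `val` with `len` bits starting at `pos` replaced by `toInsert`.
  static __host__ __device__ T setBitfield(T val, T toInsert, int pos, int len) {
    T mask = (len >= (int)(sizeof(T) * 8)) ? ~T(0) : ((T(1) << len) - 1);
    return (val & ~(mask << pos)) | ((toInsert & mask) << pos);
  }
};
```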
By default, the keepdim parameter is False -- a backwards-incompatible change, but one that follows numpy semantics, e.g. numpy.sum (numpy names the parameter "keepdims", since you can pass multiple dims to reduction functions). The old behavior is still desirable for normalization-type operations, where the tensor is immediately expanded out again, e.g. probs.sum(1).expand_as(probs), which no longer works because the dimension to expand is missing. This can be fixed by simply passing True as the "keepdim" argument to the reduction operation, e.g. probs.sum(1, keepdim=True).expand_as(probs).