add implementation of inclusive scan via upsweep-downsweep #723

killeent · 2017-03-08T15:41:22Z

A slightly more efficient scan that uses an upsweep/downsweep-like mechanism.

Tested outside of the cutorch codebase on buffers of size 2, 3, 21, 32, 33, and 64, and verified that it computes the correct prefix sum when templated on an addition binary op.
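A check of this kind can be sketched on the host as follows. This is a hypothetical harness, not the test that was actually run for this PR: `referenceInclusiveScan`, `AddOp`, and `matchesReference` are illustrative names, and the candidate output is assumed to have been copied back from the device into a `std::vector<int>`.

```cpp
#include <cstddef>
#include <vector>

// Sequential reference: out[i] = in[0] op in[1] op ... op in[i].
template <typename T, class BinaryOp>
std::vector<T> referenceInclusiveScan(const std::vector<T>& in, BinaryOp op) {
  std::vector<T> out(in);
  for (std::size_t i = 1; i < out.size(); ++i) {
    out[i] = op(out[i - 1], in[i]);
  }
  return out;
}

// Hypothetical functor standing in for the addition binary op.
struct AddOp {
  int operator()(int a, int b) const { return a + b; }
};

// Returns true when `candidate` equals the reference prefix sum of `in`.
inline bool matchesReference(const std::vector<int>& in,
                             const std::vector<int>& candidate) {
  return candidate == referenceInclusiveScan(in, AddOp());
}
```

Running `matchesReference` once per buffer size (2, 3, 21, 32, 33, 64) would cover both the power-of-2 and the non-power-of-2 cases listed above.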

wickedfoo · 2017-03-08T16:07:00Z

lib/THC/THCScanUtils.cuh

+#pragma unroll
+  for (int stride = 1; stride < Power2ScanSize; stride <<= 1) {
+    int index = (threadIdx.x + 1) * stride * 2 - 1;
+    if (index < Power2ScanSize) {


"This code still works for collections that do not exactly contain a power of 2 number of elements, simply round up to the nearest power of 2 and then call."

wickedfoo (Mar 8, 2017): This is not true here; should you pass in an extra size parameter for the data instead?

killeent (Mar 8, 2017): I guess to clarify, the algorithm works as long as you have Power2ScanSize space in smem, but yes, we could add a size parameter to condition on as well.

wickedfoo (Mar 9, 2017): Actually, never mind, it won't work with a non-power-of-2 size; this is fine.

wickedfoo (Mar 9, 2017): Duh, I take that back, it will work. A size parameter sounds like a good addition, since not every scan will involve a power-of-2 size, especially the tail end of a set of data. Either that, or the unused smem slots will have to be filled with an identity value for the reduction (e.g., for +, they would have to be filled with 0).
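The identity-padding idea discussed in this thread can be sketched as a host-side simulation. This is not the code in the PR: the function name `inclusiveScanPadded`, the `identity` argument, the helper `roundUpToPow2`, and the `op(left, right)` argument order are all assumptions made for illustration. The inner `t` loop stands in for the threads of a block, and a real kernel would need a `__syncthreads()` after each stride step.

```cpp
#include <cstddef>
#include <vector>

// Smallest power of two that is >= n (for n >= 1).
inline std::size_t roundUpToPow2(std::size_t n) {
  std::size_t p = 1;
  while (p < n) p <<= 1;
  return p;
}

// Simulated upsweep/downsweep inclusive scan over a padded buffer.
// `smem` plays the role of the shared-memory array.
template <typename T, class BinaryOp>
std::vector<T> inclusiveScanPadded(const std::vector<T>& in, T identity,
                                   BinaryOp op) {
  if (in.empty()) return {};
  const std::size_t size = in.size();
  const std::size_t pow2 = roundUpToPow2(size);

  // Pad the tail with the identity so the extra slots cannot affect results.
  std::vector<T> smem(pow2, identity);
  for (std::size_t i = 0; i < size; ++i) smem[i] = in[i];

  const std::size_t threads = pow2 / 2;  // each simulated thread owns two elements

  // Upsweep: accumulate partial reductions into the right end of each block.
  for (std::size_t stride = 1; stride < pow2; stride <<= 1) {
    for (std::size_t t = 0; t < threads; ++t) {
      const std::size_t index = (t + 1) * stride * 2 - 1;
      if (index < pow2) {
        smem[index] = op(smem[index - stride], smem[index]);
      }
    }
    // A kernel would synchronize here.
  }

  // Downsweep: push completed prefixes into the midpoints of each block.
  for (std::size_t stride = pow2 / 4; stride > 0; stride >>= 1) {
    for (std::size_t t = 0; t < threads; ++t) {
      const std::size_t index = (t + 1) * stride * 2 - 1;
      if (index + stride < pow2) {
        smem[index + stride] = op(smem[index], smem[index + stride]);
      }
    }
    // A kernel would synchronize here.
  }

  smem.resize(size);  // drop the padded tail
  return smem;
}
```

For example, calling it on `{1, 2, 3}` with identity `0` and an addition lambda pads the buffer to `{1, 2, 3, 0}` and returns `{1, 3, 6}`. The alternative raised above, passing the real size and conditioning each read on it, avoids the fill at the cost of extra bounds checks inside both loops.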

pavanky · 2017-03-08T17:49:58Z

Can you note the performance improvements?

wickedfoo · 2017-03-08T17:53:46Z

lib/THC/THCScanUtils.cuh

+//                  15
+//         3     10    21
+template <typename T, class BinaryOp, int Power2ScanSize>
+__device__ void inclusivePrefixScan(T *smem, BinaryOp binop) {


wickedfoo (Mar 8, 2017): Also, this function should take an input and pass an output like the others, instead of assuming the values are already in shared memory?

killeent (Mar 8, 2017): In the other case each thread is responsible for a single element, whereas in this case each thread has two associated elements. So we could match it via:

__device__ void inclusivePrefixScan(T *smem, T a, T b, T *out, BinaryOp op) { ... }

but I think it is a little less clean than the others.

soumith · 2017-03-15T18:37:07Z

@pavanky these changes are all for the upcoming mode kernel that @killeent is working on.

add implementation of inclusive scan via upsweep-downsweep

3f2daf7

wickedfoo reviewed Mar 8, 2017

soumith merged commit bf5ad85 into torch:master Mar 15, 2017
