tensor scatter seems much slower than pytorch (cuda) #5

yinqiwen · 2024-01-29T11:10:34Z

Reproduce with the code below.

use candle_core::{Device, Tensor, D};

fn scatter_add() -> candle_core::Result<()> {
    // Run on the first CUDA device.
    let device = Device::new_cuda(0)?;
    let logits_idx_end = 32000_usize;
    let logits_idx = Tensor::arange(0_u32, logits_idx_end as u32, &device)?.reshape((1, 32000))?;
    let logits_idx_inv = Tensor::zeros_like(&logits_idx)?;
    let src = Tensor::arange(0_u32, logits_idx_end as u32, logits_idx.device())?
        .expand(logits_idx.shape())?
        .contiguous()?;
    let start = std::time::Instant::now();
    let logits_idx_inv = candle_ext::F::scatter(&logits_idx_inv, &logits_idx, &src, D::Minus1)?;
    match device {
        Device::Cuda(cuda_dev) => {
            cuda_dev.synchronize();
        }
        _ => {}
    }
    println!("scatter cost {:?}/{}", start.elapsed(), logits_idx_end);
    Ok(())
}

Rust result (run 2 times in the same process):

scatter cost 3.288861ms/32000
scatter cost 3.271358ms/32000

import time
import torch

logits_idx = torch.arange(0,32000, dtype=torch.int64, device = 'cuda').reshape(1,32000)
logits_idx_inv = torch.zeros_like(logits_idx)
src = torch.arange(0,32000, device = 'cuda').expand(logits_idx.shape)
torch.cuda.synchronize()
start_time = time.time_ns()
logits_idx_inv = torch.empty_like(logits_idx).scatter_(dim=-1,index=logits_idx,src=src)
torch.cuda.synchronize()
print("first cuda scatter cost ", time.time_ns() - start_time, "ns", logits_idx.shape,logits_idx_inv.shape)


logits_idx = torch.arange(0,32000, dtype=torch.int64, device = 'cuda').reshape(1,32000)
logits_idx_inv = torch.zeros_like(logits_idx)
src = torch.arange(0,32000, device = 'cuda').expand(logits_idx.shape)
torch.cuda.synchronize()
start_time = time.time_ns()
logits_idx_inv = torch.empty_like(logits_idx).scatter_(dim=-1,index=logits_idx,src=src)
torch.cuda.synchronize()
print("cuda scatter cost ", time.time_ns() - start_time, "ns", logits_idx.shape,logits_idx_inv.shape)

Python result (run 2 times in the same process):

first cuda scatter cost  3191597 ns torch.Size([1, 32000]) torch.Size([1, 32000])
cuda scatter cost  38734 ns torch.Size([1, 32000]) torch.Size([1, 32000])

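It seems PyTorch runs much faster after warmup: its second call takes about 38.7 µs, while the candle-ext call still takes about 3.27 ms on the second run.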
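To separate one-time costs (context setup, kernel loading) from steady-state kernel time, the measurement can be repeated after a few untimed calls. The helper below is a minimal sketch using only the Rust standard library; `bench_with_warmup` is a hypothetical name, not part of candle or candle-ext. For GPU work, the closure passed in is assumed to synchronize the device before returning, as the reproduction above does, otherwise only launch overhead is measured.

```rust
use std::time::{Duration, Instant};

/// Calls `f` `warmup` times without timing, then `iters` times with timing,
/// and returns the mean wall-clock time per timed call.
///
/// Hypothetical helper for illustration. For CUDA work, `f` should
/// synchronize the device before it returns.
pub fn bench_with_warmup<F: FnMut()>(warmup: u32, iters: u32, mut f: F) -> Duration {
    assert!(iters > 0, "need at least one timed iteration");

    // Untimed calls absorb one-time setup costs.
    for _ in 0..warmup {
        f();
    }

    // Timed calls measure the steady state.
    let start = Instant::now();
    for _ in 0..iters {
        f();
    }
    start.elapsed() / iters
}
```

Wrapping the scatter call plus the device synchronization in the closure, for example with `warmup = 3` and `iters = 100`, would give a per-call figure that is comparable to the second PyTorch measurement.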


mokeyish · 2024-01-29T12:18:48Z

This version was adapted from the official candle implementation:

https://github.com/huggingface/candle/blob/main/candle-kernels/src/indexing.cu
https://github.com/mokeyish/candle-ext/blob/main/src/kernels/indexing.cu

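The candle maintainers said they want to keep the number of operators small so that it is easier to port to other hardware platforms, which is why I wrote this op in an extension library.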

We may need to look at the PyTorch source code to see why it is so fast.

yinqiwen · 2024-01-30T09:39:46Z

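My guess is that the difference comes from the block/thread launch settings; both implementations simply assign element by element, so the kernels themselves are essentially the same.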

https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/cuda/ScatterGatherKernel.cu#L155
https://github.com/mokeyish/candle-ext/blob/main/src/scatter.rs#L260
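As a reference for what "assign element by element" means here, the sketch below implements scatter along the last dimension for a single row on the CPU, plus the usual ceiling division that turns an element count and a block size into a block count. Both function names are hypothetical, and the block sizes mentioned afterwards are illustrative values, not the settings used by PyTorch or candle-ext.

```rust
/// CPU reference for scatter along the last dimension of a `[1, n]` tensor:
/// `out[index[i]] = src[i]` for every position `i`.
/// Panics if an index is out of range for `dst`.
pub fn scatter_last_dim(dst: &[u32], index: &[u32], src: &[u32]) -> Vec<u32> {
    assert_eq!(index.len(), src.len(), "index and src must have the same length");
    let mut out = dst.to_vec();
    for (&i, &v) in index.iter().zip(src) {
        out[i as usize] = v;
    }
    out
}

/// Number of blocks needed to cover `n` elements when each thread handles
/// one element and each block has `threads_per_block` threads.
pub fn blocks_for(n: usize, threads_per_block: usize) -> usize {
    assert!(threads_per_block > 0, "block size must be positive");
    (n + threads_per_block - 1) / threads_per_block
}
```

For the 32000-element row in this issue, `blocks_for(32000, 1024)` gives 32 blocks and `blocks_for(32000, 128)` gives 250, so the launch geometry can differ a lot even when the per-element work is identical. Whether that accounts for the measured gap is still only a guess.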

yinqiwen · 2024-02-27T07:22:15Z

A faster CUDA scatter, ported from PyTorch:
https://github.com/yinqiwen/lmsf/blob/rust/tops/src/scatter.rs
