
[QST] RuntimeError - Something wrong with cuda runtime #1011

Closed · Arnie0426 opened this issue Aug 1, 2021 · 7 comments
Assignees: benfred
Labels: question (Further information is requested)

@Arnie0426
Contributor

Hi everyone, we're really impressed with the gains we have gotten with NVTabular, but we have started to see a cryptic CUDA error when we try to "fit" a workflow inside a Docker container.

Here's the error traceback; I would appreciate any pointers!

    workflow.fit(merged_dataset)
  File "/nvtabular/nvtabular/workflow.py", line 160, in fit
    results = [r.result() for r in self.client.compute(stats)]
  File "/nvtabular/nvtabular/workflow.py", line 160, in <listcomp>
    results = [r.result() for r in self.client.compute(stats)]
  File "/opt/conda/lib/python3.8/site-packages/distributed/client.py", line 220, in result
    raise exc.with_traceback(tb)
  File "/opt/conda/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/nvtabular/nvtabular/ops/categorify.py", line 874, in _write_uniques
    df = type(df)(new_cols)
  File "/opt/conda/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/dataframe.py", line 301, in __init__
    self._init_from_dict_like(data, index=index, columns=columns)
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/dataframe.py", line 459, in _init_from_dict_like
    data, index = self._align_input_series_indices(data, index=index)
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/dataframe.py", line 530, in _align_input_series_indices
    aligned_input_series = cudf.core.series._align_indices(
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/series.py", line 7203, in _align_indices
    result = [
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/series.py", line 7204, in <listcomp>
    sr._align_to_index(
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/series.py", line 6366, in _align_to_index
    if self.index.equals(index):
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/index.py", line 1767, in equals
    return super().equals(other)
  File "/opt/conda/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/index.py", line 231, in equals
    return super(Index, self).equals(other, check_types=check_types)
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/frame.py", line 558, in equals
    if not self_col.equals(other_col, check_dtypes=check_types):
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/column/column.py", line 183, in equals
    null_equals = self._null_equals(other)
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/column/column.py", line 187, in _null_equals
    return self.binary_operator("NULL_EQUALS", other)
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/column/numerical.py", line 132, in binary_operator
    return _numeric_column_binop(
  File "/opt/conda/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/column/numerical.py", line 723, in _numeric_column_binop
    out = libcudf.binaryop.binaryop(lhs, rhs, op, out_dtype)
  File "cudf/_lib/binaryop.pyx", line 194, in cudf._lib.binaryop.binaryop
  File "cudf/_lib/binaryop.pyx", line 110, in cudf._lib.binaryop.binaryop_v_v
RuntimeError: Compilation failed: NVRTC_ERROR_COMPILATION
Compiler options: "-std=c++14 -D__CUDACC_RTC__ -default-device -arch=sm_70"
Header names:
  algorithm
  binaryop/jit/operation-udf.hpp
  binaryop/jit/operation.hpp
  binaryop/jit/traits.hpp
  cassert
  cfloat
  climits
  cmath
  cstddef
  cstdint
  ctime
  cuda/std/chrono
  cuda/std/climits
  cuda/std/cstddef
  cuda/std/limits
  cuda/std/type_traits
  cuda_runtime.h
  cudf/detail/utilities/assert.cuh
  cudf/fixed_point/fixed_point.hpp
  cudf/types.hpp
  cudf/utilities/bit.hpp
  cudf/wrappers/durations.hpp
  cudf/wrappers/timestamps.hpp
  detail/__config
  detail/__pragma_pop
  detail/__pragma_push
  detail/libcxx/include/chrono
  detail/libcxx/include/climits
  detail/libcxx/include/cstddef
  detail/libcxx/include/ctime
  detail/libcxx/include/limits
  detail/libcxx/include/ratio
  detail/libcxx/include/type_traits
  detail/libcxx/include/version
  iterator
  libcxx/include/__config
  libcxx/include/__pragma_pop
  libcxx/include/__pragma_push
  libcxx/include/__undef_macros
  limits
  ratio
  string
  type_traits
  version

detail/libcxx/include/limits(33): error: floating constant is out of range

detail/libcxx/include/limits(39): error: floating constant is out of range

2 errors detected in the compilation of "binaryop/jit/kernel.cu".

For some more information: I am running the 0.5.2 NVIDIA Merlin image (nvcr.io/nvidia/merlin/merlin-pytorch-training:0.5.2) and git-pulling the 0.6.0 commit (886d5b85fee83acfefc3f60c282f723f41719d53) in /nvtabular.

The job runs in a Docker container on EKS (AWS), with the 460.73.01 NVIDIA driver and CUDA 11.2.

When I printenv, I do notice this environment variable, but I am not sure what to do about it:

_CUDA_COMPAT_STATUS=System has unsupported display driver / cuda driver combination (CUDA_ERROR_SYSTEM_DRIVER_MISMATCH) cuInit()=803.

Any ideas on what I may be missing here?
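
For anyone debugging something similar, here is a rough sketch of how to check which CUDA driver and runtime versions the Python process actually sees, since _CUDA_COMPAT_STATUS suggests the compat layer thinks the driver combination is unsupported (this assumes cupy is importable in the container; it ships alongside cudf):

# Rough diagnostic sketch: print the CUDA driver and runtime versions
# visible to this process.
import cupy

print("CUDA driver version: ", cupy.cuda.runtime.driverGetVersion())   # e.g. 11020 for CUDA 11.2
print("CUDA runtime version:", cupy.cuda.runtime.runtimeGetVersion())  # e.g. 11020 for CUDA 11.2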

Arnie0426 added the question label on Aug 1, 2021
@benfred
Member

benfred commented Aug 1, 2021

What does nvidia-smi show?

Also, can you try removing the .cudf directory from your home folder?
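
For context, ~/.cudf should be where libcudf keeps its JIT kernel cache, so removing it only forces the kernels to recompile. If HOME is unusual inside the container, here is a minimal Python sketch of the same cleanup (assuming the default cache location, and that LIBCUDF_KERNEL_CACHE_PATH, if set, overrides it):

# Minimal sketch: clear cudf's JIT kernel cache so kernels get recompiled.
# Assumes the cache lives at $LIBCUDF_KERNEL_CACHE_PATH, falling back to ~/.cudf.
import os
import pathlib
import shutil

cache = pathlib.Path(os.environ.get("LIBCUDF_KERNEL_CACHE_PATH",
                                    pathlib.Path.home() / ".cudf"))
if cache.exists():
    shutil.rmtree(cache)
    print(f"removed {cache}")
else:
    print(f"no JIT cache found at {cache}")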

@Arnie0426
Contributor Author

Arnie0426 commented Aug 1, 2021

root@dev-box-mrrdv:~# nvidia-smi
Sun Aug  1 04:17:00 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1B.0 Off |                    0 |
| N/A   45C    P0    65W / 300W |  14137MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:1C.0 Off |                    0 |
| N/A   47C    P0    67W / 300W |  13534MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:00:1D.0 Off |                    0 |
| N/A   44C    P0    70W / 300W |  13534MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   48C    P0    75W / 300W |  13534MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

It's a job that spins up a fresh Docker image each time, and there doesn't seem to be a ~/.cudf in that image:

root@dev-box-mrrdv:~# ls ~/.cudf
ls: cannot access '/usr/local/.cudf': No such file or directory

I did see a similar issue (NVIDIA/nvidia-docker#1256), which makes me wonder if it could be a problem with the image.

When I echo LD_LIBRARY_PATH, I see this in the container:
/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/lib:/repos/dist/lib

Edit - For more context, I am running amazon-eks-gpu-node-1.19-v20210504 AMI (https://github.com/awslabs/amazon-eks-ami/releases/tag/v20210504)

@Arnie0426
Contributor Author

Arnie0426 commented Aug 3, 2021

An update on this: I am not entirely sure what the root cause was, but I upgraded our AMI to amazon-eks-gpu-node-1.19-v20210722 and I haven't seen this error since. I looked at the release notes of the last few AMIs and I don't see any update to the underlying CUDA drivers, so I am not sure what changed.

For anyone curious, I confirmed on a barebones v20210504 AMI that the issue above can be replicated with:

import nvtabular as nvt
import cudf
df = cudf.DataFrame({
    'author': ['User_A', 'User_B', 'User_C', 'User_C', 'User_A', 'User_B', 'User_A', 'User_A', 'User_A', 'User_A', 'User_A'],
    'productID': [100, 101, 102, 101, 102, 103, 103, 104, 105, 106, 107],
    'label': [0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1]
})
dataset = nvt.Dataset(df)

grouped = ["author", "productID"] >> nvt.ops.Groupby(groupby_cols=["author"], aggs="count") 

workflow = nvt.Workflow(grouped)
non_spam_authors_dataset = workflow.fit_transform(dataset)
merged_dataset = nvt.Dataset.merge(non_spam_authors_dataset, dataset, on="author", how="inner")

operations = ["author", "productID", "label"] >> nvt.ops.Categorify()
wf2 = nvt.Workflow(operations)

wf2.fit(merged_dataset)

@benfred
Member

benfred commented Aug 3, 2021

We've seen something similar before in this issue rapidsai/cudf#7496 - does the reproducer I posted there also trigger this bug?

@Arnie0426
Contributor Author

Arnie0426 commented Aug 4, 2021

It does!

[screenshot: the rapidsai/cudf#7496 reproducer failing with the same NVRTC compilation error]

Also, I may have gotten excited a bit too quickly: another teammate saw the exact same runtime error on the updated AMI too. The Docker image I was playing around with doesn't have any explicit dependency on gevent, but I assume the Merlin image does.

root@jupyter-notebook-84bbc97f5f-qfhzd:~# pip freeze | grep gevent
gevent==21.1.2
geventhttpclient==1.4.4
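
For anyone trying to reproduce this outside of NVTabular, here is a hedged sketch of the pattern described in rapidsai/cudf#7496 (my paraphrase, not the exact reproducer posted there): import the prebuilt gevent wheel first, then run a cudf operation that needs NVRTC JIT compilation - for example, building a DataFrame from Series with mismatched indexes, which goes through the NULL_EQUALS alignment path in the traceback above.

# Hedged sketch, not the exact reproducer from rapidsai/cudf#7496.
import gevent  # noqa: F401  # the prebuilt binary wheel is the suspected trigger
import cudf

a = cudf.Series([1, 2, 3], index=[10, 20, 30])
b = cudf.Series([4, 5, 6], index=[10, 20, 31])

# Mismatched indexes force index alignment, which hits the NULL_EQUALS JIT
# binaryop shown in the traceback above and fails with NVRTC_ERROR_COMPILATION
# when the environment is broken.
df = cudf.DataFrame({"x": a, "y": b})
print(df)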

@Arnie0426
Contributor Author

Thanks to @benfred, we found a fix for this in our Docker container: reinstalling gevent from source, as suggested in rapidsai/cudf#7496, seems to resolve the issue:

RUN pip install --ignore-installed --no-binary gevent gevent

Could we perhaps make this part of the Merlin Docker images (if this is indeed the fix)?

@benfred
Member

benfred commented Aug 6, 2021

thanks! glad that you got this working.

Tracking the gevent changes in the containers here: NVIDIA-Merlin/Merlin#27

benfred closed this as completed on Aug 6, 2021
benfred self-assigned this on Aug 6, 2021
viswa-nvidia added this to the NVTabular-v21.09 milestone on Aug 31, 2021