
[QST] RuntimeError - Something wrong with cuda runtime #1011

Closed · Arnie0426 opened this issue Aug 1, 2021 · 7 comments
Assignees: benfred
Labels: question (Further information is requested)

@Arnie0426
Contributor

Hi everyone, we're really impressed with the gains we have gotten with NVTabular, but we have started to see a cryptic CUDA error when we try to "fit" a workflow inside a Docker container.

Here's the error traceback; I would appreciate any pointers!

    workflow.fit(merged_dataset)
  File "/nvtabular/nvtabular/workflow.py", line 160, in fit
    results = [r.result() for r in self.client.compute(stats)]
  File "/nvtabular/nvtabular/workflow.py", line 160, in <listcomp>
    results = [r.result() for r in self.client.compute(stats)]
  File "/opt/conda/lib/python3.8/site-packages/distributed/client.py", line 220, in result
    raise exc.with_traceback(tb)
  File "/opt/conda/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/nvtabular/nvtabular/ops/categorify.py", line 874, in _write_uniques
    df = type(df)(new_cols)
  File "/opt/conda/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/dataframe.py", line 301, in __init__
    self._init_from_dict_like(data, index=index, columns=columns)
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/dataframe.py", line 459, in _init_from_dict_like
    data, index = self._align_input_series_indices(data, index=index)
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/dataframe.py", line 530, in _align_input_series_indices
    aligned_input_series = cudf.core.series._align_indices(
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/series.py", line 7203, in _align_indices
    result = [
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/series.py", line 7204, in <listcomp>
    sr._align_to_index(
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/series.py", line 6366, in _align_to_index
    if self.index.equals(index):
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/index.py", line 1767, in equals
    return super().equals(other)
  File "/opt/conda/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/index.py", line 231, in equals
    return super(Index, self).equals(other, check_types=check_types)
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/frame.py", line 558, in equals
    if not self_col.equals(other_col, check_dtypes=check_types):
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/column/column.py", line 183, in equals
    null_equals = self._null_equals(other)
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/column/column.py", line 187, in _null_equals
    return self.binary_operator("NULL_EQUALS", other)
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/column/numerical.py", line 132, in binary_operator
    return _numeric_column_binop(
  File "/opt/conda/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/column/numerical.py", line 723, in _numeric_column_binop
    out = libcudf.binaryop.binaryop(lhs, rhs, op, out_dtype)
  File "cudf/_lib/binaryop.pyx", line 194, in cudf._lib.binaryop.binaryop
  File "cudf/_lib/binaryop.pyx", line 110, in cudf._lib.binaryop.binaryop_v_v
RuntimeError: Compilation failed: NVRTC_ERROR_COMPILATION
Compiler options: "-std=c++14 -D__CUDACC_RTC__ -default-device -arch=sm_70"
Header names:
  algorithm
  binaryop/jit/operation-udf.hpp
  binaryop/jit/operation.hpp
  binaryop/jit/traits.hpp
  cassert
  cfloat
  climits
  cmath
  cstddef
  cstdint
  ctime
  cuda/std/chrono
  cuda/std/climits
  cuda/std/cstddef
  cuda/std/limits
  cuda/std/type_traits
  cuda_runtime.h
  cudf/detail/utilities/assert.cuh
  cudf/fixed_point/fixed_point.hpp
  cudf/types.hpp
  cudf/utilities/bit.hpp
  cudf/wrappers/durations.hpp
  cudf/wrappers/timestamps.hpp
  detail/__config
  detail/__pragma_pop
  detail/__pragma_push
  detail/libcxx/include/chrono
  detail/libcxx/include/climits
  detail/libcxx/include/cstddef
  detail/libcxx/include/ctime
  detail/libcxx/include/limits
  detail/libcxx/include/ratio
  detail/libcxx/include/type_traits
  detail/libcxx/include/version
  iterator
  libcxx/include/__config
  libcxx/include/__pragma_pop
  libcxx/include/__pragma_push
  libcxx/include/__undef_macros
  limits
  ratio
  string
  type_traits
  version

detail/libcxx/include/limits(33): error: floating constant is out of range

detail/libcxx/include/limits(39): error: floating constant is out of range

2 errors detected in the compilation of "binaryop/jit/kernel.cu".

For some more information: I am running the 0.5.2 NVIDIA Merlin image (nvcr.io/nvidia/merlin/merlin-pytorch-training:0.5.2) and git-pulling the 0.6.0 commit (886d5b85fee83acfefc3f60c282f723f41719d53) in /nvtabular.

The job runs in a Docker container on EKS (AWS), with the 460.73.01 NVIDIA driver and CUDA 11.2.

When I printenv, I do notice this environment variable, but I am not sure what to do about it:

_CUDA_COMPAT_STATUS=System has unsupported display driver / cuda driver combination (CUDA_ERROR_SYSTEM_DRIVER_MISMATCH) cuInit()=803.

Any ideas on what I may be missing here?
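
For anyone debugging something similar, here is a rough sketch of how to check which CUDA driver and runtime versions the Python process actually sees, since _CUDA_COMPAT_STATUS suggests the compat layer thinks the driver combination is unsupported (this assumes cupy is importable in the container; it ships alongside cudf):

# Rough diagnostic sketch: print the CUDA driver and runtime versions
# visible to this process.
import cupy

print("CUDA driver version: ", cupy.cuda.runtime.driverGetVersion())   # e.g. 11020 for CUDA 11.2
print("CUDA runtime version:", cupy.cuda.runtime.runtimeGetVersion())  # e.g. 11020 for CUDA 11.2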

Arnie0426 added the question label on Aug 1, 2021
@benfred
Member

benfred commented Aug 1, 2021

What does nvidia-smi show?

Also, can you try removing the .cudf directory from your home folder?
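
For context, ~/.cudf should be where libcudf keeps its JIT kernel cache, so removing it only forces the kernels to recompile. If HOME is unusual inside the container, here is a minimal Python sketch of the same cleanup (assuming the default cache location, and that LIBCUDF_KERNEL_CACHE_PATH, if set, overrides it):

# Minimal sketch: clear cudf's JIT kernel cache so kernels get recompiled.
# Assumes the cache lives at $LIBCUDF_KERNEL_CACHE_PATH, falling back to ~/.cudf.
import os
import pathlib
import shutil

cache = pathlib.Path(os.environ.get("LIBCUDF_KERNEL_CACHE_PATH",
                                    pathlib.Path.home() / ".cudf"))
if cache.exists():
    shutil.rmtree(cache)
    print(f"removed {cache}")
else:
    print(f"no JIT cache found at {cache}")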

@Arnie0426
Contributor Author

Arnie0426 commented Aug 1, 2021

root@dev-box-mrrdv:~# nvidia-smi
Sun Aug  1 04:17:00 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1B.0 Off |                    0 |
| N/A   45C    P0    65W / 300W |  14137MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:1C.0 Off |                    0 |
| N/A   47C    P0    67W / 300W |  13534MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:00:1D.0 Off |                    0 |
| N/A   44C    P0    70W / 300W |  13534MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   48C    P0    75W / 300W |  13534MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

It's a job that spins up a fresh Docker image each time, and there doesn't seem to be a ~/.cudf in that image:

root@dev-box-mrrdv:~# ls ~/.cudf
ls: cannot access '/usr/local/.cudf': No such file or directory

I did see a similar issue (NVIDIA/nvidia-docker#1256), which makes me wonder if it could be a problem with the image.

When I echo LD_LIBRARY_PATH, I see this in the container:
/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/lib:/repos/dist/lib

Edit - For more context, I am running amazon-eks-gpu-node-1.19-v20210504 AMI (https://github.com/awslabs/amazon-eks-ami/releases/tag/v20210504)

@Arnie0426
Contributor Author

Arnie0426 commented Aug 3, 2021

An update on this: I am not entirely sure what the root cause was, but I upgraded our AMI to amazon-eks-gpu-node-1.19-v20210722 and I haven't seen this error since. I looked at the release notes of the last few AMIs and I don't see any update to the underlying CUDA drivers, so I am not sure what changed.

For anyone curious, I confirmed on a barebones v20210504 AMI that the issue above can be replicated with:

import nvtabular as nvt
import cudf
df = cudf.DataFrame({
    'author': ['User_A', 'User_B', 'User_C', 'User_C', 'User_A', 'User_B', 'User_A', 'User_A', 'User_A', 'User_A', 'User_A'],
    'productID': [100, 101, 102, 101, 102, 103, 103, 104, 105, 106, 107],
    'label': [0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1]
})
dataset = nvt.Dataset(df)

grouped = ["author", "productID"] >> nvt.ops.Groupby(groupby_cols=["author"], aggs="count") 

workflow = nvt.Workflow(grouped)
non_spam_authors_dataset = workflow.fit_transform(dataset)
merged_dataset = nvt.Dataset.merge(non_spam_authors_dataset, dataset, on="author", how="inner")

operations = ["author", "productID", "label"] >> nvt.ops.Categorify()
wf2 = nvt.Workflow(operations)

wf2.fit(merged_dataset)

@benfred
Member

benfred commented Aug 3, 2021

We've seen something similar before in this issue rapidsai/cudf#7496 - does the reproducer I posted there also trigger this bug?

@Arnie0426
Contributor Author

Arnie0426 commented Aug 4, 2021

It does!

[screenshot: the rapidsai/cudf#7496 reproducer failing with the same NVRTC compilation error]

Also, I may have gotten excited a bit too quickly: another teammate saw the exact same runtime error on the updated AMI too. The Docker image I was playing around with doesn't have any explicit dependency on gevent, but I assume the Merlin image does.

root@jupyter-notebook-84bbc97f5f-qfhzd:~# pip freeze | grep gevent
gevent==21.1.2
geventhttpclient==1.4.4
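
For anyone trying to reproduce this outside of NVTabular, here is a hedged sketch of the pattern described in rapidsai/cudf#7496 (my paraphrase, not the exact reproducer posted there): import the prebuilt gevent wheel first, then run a cudf operation that needs NVRTC JIT compilation - for example, building a DataFrame from Series with mismatched indexes, which goes through the NULL_EQUALS alignment path in the traceback above.

# Hedged sketch, not the exact reproducer from rapidsai/cudf#7496.
import gevent  # noqa: F401  # the prebuilt binary wheel is the suspected trigger
import cudf

a = cudf.Series([1, 2, 3], index=[10, 20, 30])
b = cudf.Series([4, 5, 6], index=[10, 20, 31])

# Mismatched indexes force index alignment, which hits the NULL_EQUALS JIT
# binaryop shown in the traceback above and fails with NVRTC_ERROR_COMPILATION
# when the environment is broken.
df = cudf.DataFrame({"x": a, "y": b})
print(df)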

@Arnie0426
Contributor Author

Thanks to @benfred, we found a fix for this in our Docker container: reinstalling gevent from source, as suggested in rapidsai/cudf#7496, seems to resolve the issue:

RUN pip install --ignore-installed --no-binary gevent gevent

Could we perhaps make this part of the Merlin Docker images (if this is indeed the fix)?

@benfred
Member

benfred commented Aug 6, 2021

thanks! glad that you got this working.

Tracking the gevent changes in the containers here: NVIDIA-Merlin/Merlin#27

benfred closed this as completed on Aug 6, 2021
benfred self-assigned this on Aug 6, 2021
viswa-nvidia added this to the NVTabular-v21.09 milestone on Aug 31, 2021