[QST] RuntimeError - Something wrong with cuda runtime #1011
Comments
What does nvidia-smi show? Also, can you try removing the .cudf directory from your home folder?
It's a job that spins up a fresh Docker image each time, and there doesn't seem to be a .cudf directory. I did see a similar issue here, which makes me wonder if it could be an issue with the image. When I echo ...

Edit: For more context, I am running ...
So, an update on this: I am not entirely sure what the root cause was, but I managed to upgrade our AMI to ... For anyone curious, I confirmed on a barebones 202010504 AMI that the issue above can be replicated with:

```python
import nvtabular as nvt
import cudf

df = cudf.DataFrame({
    'author': ['User_A', 'User_B', 'User_C', 'User_C', 'User_A', 'User_B', 'User_A', 'User_A', 'User_A', 'User_A', 'User_A'],
    'productID': [100, 101, 102, 101, 102, 103, 103, 104, 105, 106, 107],
    'label': [0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1]
})

dataset = nvt.Dataset(df)

# Group by author and count products per author
grouped = ["author", "productID"] >> nvt.ops.Groupby(groupby_cols=["author"], aggs="count")
workflow = nvt.Workflow(grouped)
non_spam_authors_dataset = workflow.fit_transform(dataset)

# Merge the grouped result back onto the original dataset
merged_dataset = nvt.Dataset.merge(non_spam_authors_dataset, dataset, on="author", how="inner")

# Categorify the merged dataset and fit a second workflow
operations = ["author", "productID", "label"] >> nvt.ops.Categorify()
wf2 = nvt.Workflow(operations)
wf2.fit(merged_dataset)
```
We've seen something similar before in this issue rapidsai/cudf#7496 - does the reproducer I posted there also trigger this bug?
It does! Also, I may have gotten excited a bit too quickly - another teammate saw the exact same runtime error on the updated AMI too. The Docker image I was playing around with doesn't have any explicit dependency on gevent, but I assume that the Merlin image does.
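For illustration only - this is a minimal sketch of how gevent's monkey-patching, applied before CUDA is initialized, has been reported to surface this kind of cuInit failure. It is not necessarily the exact reproducer from rapidsai/cudf#7496, and it assumes gevent and cudf are both installed in the container:

```python
# Hypothetical illustration: gevent monkey-patches os/threading/socket
# before any CUDA work happens.
from gevent import monkey
monkey.patch_all()

# cudf creates a CUDA context on first use; in affected environments this
# pattern has been reported to fail with "Something wrong with cuda runtime"
# style errors.
import cudf

print(cudf.Series([1, 2, 3]).sum())
```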
Thanks to @benfred, we found a fix for it in our Docker container.
Thanks! Glad that you got this working. Tracking the gevent changes in the containers here: NVIDIA-Merlin/Merlin#27
Hi everyone, I'm really impressed with a lot of the gains we have gotten with NVTabular, but we are starting to see this cryptic CUDA error when we try to fit a workflow inside a Docker container. Here's the error traceback, and I would appreciate any pointers!

For some background: I am running the 0.5.2 NVIDIA Merlin image (nvcr.io/nvidia/merlin/merlin-pytorch-training:0.5.2) and git pulling the 0.6.0 commit (886d5b85fee83acfefc3f60c282f723f41719d53) in /nvtabular. This job is run in a Docker container on EKS (AWS), with the 460.73.01 NVIDIA driver and CUDA 11.2.

When I printenv, I do notice this environment variable, but I am not sure what to do about it:

_CUDA_COMPAT_STATUS=System has unsupported display driver / cuda driver combination (CUDA_ERROR_SYSTEM_DRIVER_MISMATCH) cuInit()=803

Any ideas on what I may be missing here?
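As a quick sanity check of the driver/runtime combination inside the container, a minimal diagnostic sketch (not part of the original report) can be run before NVTabular; it assumes numba is available, which the merlin-pytorch-training image provides:

```python
# Hedged diagnostic sketch: verify that CUDA can actually be initialized
# inside this container before running any NVTabular workflow.
import os

# The NGC compatibility check records its result in this variable.
print(os.environ.get("_CUDA_COMPAT_STATUS", "<not set>"))

from numba import cuda

# cuda.detect() initializes the driver and prints the detected devices;
# if cuInit() fails (e.g. CUDA_ERROR_SYSTEM_DRIVER_MISMATCH), it will
# report the failure instead of listing GPUs.
cuda.detect()
```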