Is your feature request related to a problem? Please describe.
When we have many users and/or items, the sizes of these embedding tables quickly increase the amount of GPU memory we consume, leading to an OOM error even before training starts.
Describe the solution you'd like
Ideally, the embedding tables would instead live in CPU memory; during training, each batch would index into them and bring only that subset of the embeddings to the GPU for the model's forward and backward passes. After the batch, that subset of embeddings would be released from GPU memory.
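For concreteness, here is a minimal sketch of this behavior, assuming the model is built on PyTorch; the `CPUEmbeddingLookup` class and its names are hypothetical, not an existing API in the library:

```python
# A minimal sketch of the proposed behavior, assuming a PyTorch-based model.
# `CPUEmbeddingLookup` is a hypothetical name, not an existing API.
import torch
import torch.nn as nn

class CPUEmbeddingLookup(nn.Module):
    """Keeps the full embedding table in CPU memory and moves only the
    rows needed for the current batch onto the GPU."""

    def __init__(self, num_embeddings: int, dim: int):
        super().__init__()
        # sparse=True so the backward pass only produces gradients for
        # the rows actually looked up in this batch.
        self.table = nn.Embedding(num_embeddings, dim, sparse=True)  # stays on CPU
        self.device = "cuda" if torch.cuda.is_available() else "cpu"

    def forward(self, indices: torch.Tensor) -> torch.Tensor:
        # Index on CPU, then transfer just the gathered rows to the GPU.
        # Only this batch-sized slice occupies GPU memory, and it is
        # freed once the batch's backward pass completes.
        rows = self.table(indices.cpu())
        return rows.to(self.device)

# Usage: the optimizer steps the CPU-resident table as usual.
emb = CPUEmbeddingLookup(num_embeddings=1_000_000, dim=64)
opt = torch.optim.SGD(emb.parameters(), lr=0.1)  # SGD accepts sparse grads
batch = torch.randint(0, 1_000_000, (512,))
emb(batch).sum().backward()
opt.step()
```

Because the `.to()` transfer is differentiable, gradients flow back to the CPU-resident table, so only a batch-sized slice of rows ever touches GPU memory.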
Describe alternatives you've considered
It's also possible to accomplish this by splitting the embeddings across multiple GPUs in a model-parallel, multi-GPU setup. But the solution outlined above allows a scalable model on a single node, which may be more desirable for more users of the library.
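For comparison, a rough sketch of that model-parallel alternative, again assuming PyTorch; the row-wise sharding scheme and the `RowShardedEmbedding` name are illustrative only:

```python
# Illustrative only: one logical embedding table split row-wise across GPUs.
import torch
import torch.nn as nn

class RowShardedEmbedding(nn.Module):
    def __init__(self, num_embeddings: int, dim: int, devices: list):
        super().__init__()
        # Each shard holds a contiguous block of rows on its own device.
        self.shard_size = -(-num_embeddings // len(devices))  # ceil division
        self.devices = devices
        self.shards = nn.ModuleList(
            nn.Embedding(
                min(self.shard_size, num_embeddings - i * self.shard_size), dim
            ).to(dev)
            for i, dev in enumerate(devices)
        )

    def forward(self, indices: torch.Tensor, out_device: str) -> torch.Tensor:
        out = torch.empty(indices.numel(), self.shards[0].embedding_dim,
                          device=out_device)
        for i, (shard, dev) in enumerate(zip(self.shards, self.devices)):
            owned = (indices // self.shard_size) == i  # rows this shard owns
            if owned.any():
                local = (indices[owned] - i * self.shard_size).to(dev)
                out[owned] = shard(local).to(out_device)
        return out

# e.g. RowShardedEmbedding(1_000_000, 64, ["cuda:0", "cuda:1"])
```

This trades host-device traffic for cross-device traffic, but it requires a multi-GPU node, whereas the CPU-resident approach above scales on a single GPU.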
Additional context
See here for some related discussion on this.