Anyone able to run 7B on google colab? #120
Comments
Not for free users. The model must be loaded into CPU memory (if we are talking about this repo), but Colab only provides less than 13 GB. Therefore, it cannot be run on the free version of Colab. However, it may work if you have a Pro subscription. I look forward to hearing about your outcome.
What about loading it onto the TPU? KoboldAI can load 20B models (like Erebus 20B) just fine on TPU, barring the fact that it takes about 15 minutes to load a ~40 GB model; Erebus specifically is split into 23 parts, and once running it's pretty fast. Maybe there's a way to reduce the per-segment size from ~15 GB down to 2 GB by splitting the checkpoint into smaller parts, the same way Erebus is split.
I got it to run on a Shadow PC (#105), which has only 12 GB of RAM, so it churned through the page file a fair bit but still loaded the model in about 110 seconds, since it clears the RAM after moving the weights to the GPU. So it can work on a computer with less than 14 GB of RAM, but perhaps Google Colab doesn't have a page file? I don't know.
It's possible to make it work on the free version. Since Colab gives you more GPU VRAM than RAM, what you'll want to do is load the checkpoint into CUDA rather than CPU. Once you've done that, split the state dict on the layers and save the sharded state dict; then, after freeing your GPU memory (or in another run), sequentially load each shard into the model on the GPU, making sure to delete each shard once you're done with it. You'll save quite a bit of RAM during the loading process, and from there it should work; see the sketch below.
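A minimal sketch of that sharding trick, assuming a single-file checkpoint; the path, the key-grouping logic, and `build_model` are placeholders rather than the exact code from the repo or notebook:

```python
import gc
import torch

CKPT_PATH = "consolidated.00.pth"  # placeholder: path to the original checkpoint

# 1) Load the full checkpoint onto the GPU, since Colab gives more VRAM than RAM.
state_dict = torch.load(CKPT_PATH, map_location="cuda")

# 2) Group keys by transformer block ("layers.0", "layers.1", ...) and
#    write each group out as its own shard.
shards = {}
for key, tensor in state_dict.items():
    parts = key.split(".")
    group = ".".join(parts[:2]) if parts[0] == "layers" else parts[0]
    shards.setdefault(group, {})[key] = tensor

shard_names = sorted(shards)
num_shards = len(shard_names)
for i, name in enumerate(shard_names):
    torch.save({k: v.cpu() for k, v in shards[name].items()}, f"shard_{i}.pth")

# 3) Free the GPU before instantiating the model.
del state_dict, shards
gc.collect()
torch.cuda.empty_cache()

# 4) Build the model on the GPU and stream the shards in one at a time,
#    deleting each shard as soon as its weights have been copied in.
model = build_model().cuda()  # placeholder for the repo's model constructor
for i in range(num_shards):
    shard = torch.load(f"shard_{i}.pth", map_location="cuda")
    model.load_state_dict(shard, strict=False)
    del shard
    gc.collect()
    torch.cuda.empty_cache()
```

The point is that the full set of weights never has to exist in CPU RAM at once; only one shard at a time passes through system memory.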
Update: able to run it on Google Colab Pro. It seems the 12 GB of RAM is the issue.
I attempted loading it on the GPU, and it still can't fully load: CUDA out of memory.
Here's a notebook that goes through the steps I just mentioned and works for me using Colab Pro's standard GPU (a 15 GB T4) and the regular RAM runtime (~12.7 GB), which I think is identical to the free version, though I'm not completely certain. If free Colab gives less VRAM than Pro's standard GPU, it may indeed be impossible, but the approach should at least use compute units more efficiently on Pro. If you have Colab Pro, there's also an option to run 13B that should work, though you'll have to be patient while the second cell executes. Colab is slow to save files, so you may have to wait and check your Drive to make sure everything has saved as it should before proceeding.
I've gotten a notebook from a 4chan user to work for me on the free tier. It's VERY cumbersome to get working, but it does work. All I changed when I ran it was to skip Google Drive and instead fetch the model from somebody who mirrored it on Hugging Face (brave soul, but the upload got flagged and is probably going to vanish from there). It splits the model like I mentioned, so if somebody could get it working on a TPU and split the models the way this notebook does, then maybe the higher-parameter models would be workable without a Colab Pro subscription.
I was able to run the model on Colab Pro. For this, I recommend switching to the TPU runtime (it has 35 GB of RAM) and adding low_cpu_mem_usage=True to from_pretrained; a sketch is below.
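For illustration, a minimal sketch of that call, assuming the weights have already been converted to the Hugging Face format; the model path is a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "/content/llama-7b-hf"  # placeholder: directory with converted weights

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    low_cpu_mem_usage=True,   # avoid materializing a second full copy of the weights in RAM
    torch_dtype=torch.float16,
)
```

With low_cpu_mem_usage=True, transformers builds the model with empty weights and fills them in from the checkpoint, which keeps peak RAM usage close to one copy of the model instead of two.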
I was able to run it in regular Colab, but it's horribly slow because the model is loaded from Google Drive. Can anyone help me make the loading time, and the time the model takes to type out a response, faster? A link to code or commands would help, because it's a Linux environment: https://colab.research.google.com/drive/1otfwOihFBtNznj7ZXqiUJV_OXPm_BnN3?usp=sharing
I'm writing this a few months later, but it's easy to run the model if you use llama.cpp and a quantized version of the weights. You can even run a model over 30B that way; a rough sketch is below.
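As a hedged example, this uses the llama-cpp-python bindings rather than the llama.cpp CLI itself, and the model filename is a placeholder for whichever quantized file you have downloaded or converted:

```python
# pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-7b.Q4_K_M.gguf",  # placeholder: any 4-bit quantized 7B file
    n_ctx=512,                            # small context keeps memory use modest
)

output = llm("Q: What is the capital of France? A:", max_tokens=32)
print(output["choices"][0]["text"])
```

A 4-bit quantized 7B model is only a few GB, so it fits comfortably in free-tier Colab RAM (or even a phone) without any of the sharding gymnastics above.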
I'm doing some edge-computing research; mind if I ask how you run it on a phone?
llama.cpp supports Android. Ref: https://github.com/ggerganov/llama.cpp#android |
Interested to see if anyone is able to run 7B on Google Colab. It seems like 16 GB should be enough, and that is often granted on Colab free. Not sure whether Colab Pro does anything better, but if anyone is able to, advice would be much appreciated.