
Anyone able to run 7B on google colab? #120

Closed
andrewmlu opened this issue Mar 5, 2023 · 13 comments

Comments

@andrewmlu

Interested to see if anyone has been able to run the 7B model on Google Colab. It seems like 16 GB should be enough, and the free tier often grants that much. Not sure whether Colab Pro would do any better, but if anyone has gotten it working, advice would be much appreciated.

@reycn

reycn commented Mar 5, 2023

Not for free users. The model must be loaded into CPU memory (if we are talking about this repo), but free Colab provides less than 13 GB of RAM, so it cannot be run on the free tier.

However, it may work if you have a Pro subscription. I look forward to hearing about your outcome.
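A quick way to check what a runtime actually provides (a minimal sketch; `psutil` and `torch` are preinstalled on Colab, but exact capacities vary by runtime type):

```python
import psutil
import torch

# Total system RAM visible to the runtime (free Colab is usually ~12-13 GB).
print(f"System RAM: {psutil.virtual_memory().total / 1e9:.1f} GB")

# GPU VRAM, if a GPU runtime is attached.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1e9:.1f} GB")
```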

@Daviljoe193

Daviljoe193 commented Mar 5, 2023

What about loading it onto the TPU? KoboldAI can load 20B models (like Erebus 20B) onto the TPU just fine, barring the fact that it takes about 15 minutes to load a ~40 GB model (Erebus specifically is split into 23 parts), though once running they're pretty fast. Maybe there's a way to reduce the per-segment size from ~15 GB down to 2 GB by splitting the checkpoint into smaller parts, the same way Erebus is.
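A minimal sketch of that kind of size-based split, assuming the 7B checkpoint's `consolidated.00.pth` file and a ~2 GB target per shard (the filenames are illustrative):

```python
import torch

MAX_SHARD_BYTES = 2 * 1024**3  # target roughly 2 GB per shard

# Loading the consolidated checkpoint still needs enough CPU RAM to hold the
# full state dict once; the payoff is that later loads can go shard by shard.
state_dict = torch.load("consolidated.00.pth", map_location="cpu")

shard, shard_bytes, shard_idx = {}, 0, 0
for name, tensor in state_dict.items():
    size = tensor.numel() * tensor.element_size()
    if shard and shard_bytes + size > MAX_SHARD_BYTES:
        torch.save(shard, f"shard_{shard_idx:02d}.pth")
        shard, shard_bytes, shard_idx = {}, 0, shard_idx + 1
    shard[name] = tensor
    shard_bytes += size
if shard:
    torch.save(shard, f"shard_{shard_idx:02d}.pth")
```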

@elephantpanda

I got it to run on a Shadow PC (#105), which has only 12 GB of RAM. It crunched the page file a fair bit, but it still loaded the model in about 110 seconds, since it clears the RAM after moving the weights to the GPU.

So it can work on a computer with less than 14 GB of RAM, but perhaps Google Colab doesn't have a page file? I don't know.

@brendan-donohoe

brendan-donohoe commented Mar 5, 2023

It's possible to make it work on the free version. Since Colab gives you more GPU VRAM than RAM, what you'll want to do is load the checkpoint into CUDA rather than CPU. Once you've done that, split the state dict on the layers, save the sharded state dict, and then, after freeing your GPU memory (or in another run), sequentially load each shard into the model on the GPU, deleting each shard once you're done with it. You'll save quite a bit of RAM during the loading process, and from there it should work.
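In code, the first phase (split and save) might look roughly like this; a minimal sketch only, assuming the 7B `consolidated.00.pth` checkpoint and its `layers.N.*` key layout:

```python
import re
import torch

# Load the checkpoint straight onto the GPU (Colab's VRAM exceeds its RAM),
# then split the state dict per transformer layer.
state_dict = torch.load("consolidated.00.pth", map_location="cuda")

shards = {}
for name, tensor in state_dict.items():
    # Keys like "layers.0.attention.wq.weight" are grouped by layer index;
    # everything else (embeddings, norm, output) goes into a "misc" shard.
    match = re.match(r"layers\.(\d+)\.", name)
    key = f"layer_{int(match.group(1)):02d}" if match else "misc"
    shards.setdefault(key, {})[name] = tensor

for key, shard in shards.items():
    # Move tensors back to CPU one shard at a time, so CPU RAM only ever holds
    # a single shard's worth of weights.
    torch.save({n: t.cpu() for n, t in shard.items()}, f"{key}.pth")

# Free the GPU before constructing the model.
del state_dict, shards
torch.cuda.empty_cache()
```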

@andrewmlu
Author

Update: I was able to run it on Google Colab Pro. It seems the 12 GB of RAM on the free tier is the issue.

@andrewmlu
Author

> It's possible to make it work on the free version. Since Colab gives you more GPU VRAM than RAM, what you'll want to do is load the checkpoint into CUDA rather than CPU. Once you've done that, split the state dict on the layers, save the sharded state dict, and then, after freeing your GPU memory (or in another run), sequentially load each shard into the model on the GPU, deleting each shard once you're done with it. You'll save quite a bit of RAM during the loading process, and from there it should work.

I attempted loading on the GPU, and it still fails to load fully: CUDA out of memory.

@brendan-donohoe

brendan-donohoe commented Mar 5, 2023

Here's a notebook that goes through the steps I just mentioned. It works for me on Colab Pro's standard GPU (~15 GB VRAM) and regular RAM runtime (~12.7 GB RAM), which I think is identical to the free tier, though I'm not completely certain. If free Colab gives less VRAM than the Pro standard runtime, it may indeed be impossible, but the approach should at least use compute units more efficiently on Pro:

https://pastebin.com/Le2zaJCy

This uses a 15 GB T4 GPU. If you have Colab Pro, there's an option to run 13B that should work as well, though you'll have to be patient executing the second cell. Colab is slow to save files, so you may have to wait and check your Drive to make sure everything has saved as it should before proceeding.
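For reference, the second phase (sequentially loading the saved shards) might look roughly like this; a minimal sketch assuming the model skeleton has already been built on the GPU as in this repo's example.py, and that the shards use the illustrative `layer_*.pth`/`misc.pth` naming from the earlier sketch:

```python
import glob
import torch
from torch import nn


def load_shards_sequentially(model: nn.Module) -> None:
    """Stream per-layer shards into an already-constructed model, one at a time."""
    for path in sorted(glob.glob("layer_*.pth")) + ["misc.pth"]:
        shard = torch.load(path, map_location="cuda")
        # strict=False: each shard covers only a subset of the parameters.
        model.load_state_dict(shard, strict=False)
        del shard
        torch.cuda.empty_cache()
```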

@Daviljoe193

Daviljoe193 commented Mar 6, 2023

I've gotten this one notebook from a 4chan user to work for me on the free tier. It's VERY cumbersome to get working, but it does work. The only change I made was to skip Google Drive and instead pull the model from somebody who mirrored it on Hugging Face (brave soul, but the model got flagged and will probably vanish from there). It splits the model like I mentioned, so again, if somebody could get it working on a TPU and split the models the way this notebook does, then maybe the higher-parameter models would be workable without a Colab Pro subscription.

@usmanovaa

I was able to run the model on Colab Pro. It took 27 GB of RAM for me.

For this I recommend switching to the TPU runtime (it has about 35 GB of RAM) and adding `low_cpu_mem_usage=True` to the `from_pretrained` call.
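A minimal sketch of that approach, assuming the weights have already been converted to the Hugging Face format ("path/to/llama-7b-hf" is a placeholder for wherever the converted weights live):

```python
from transformers import LlamaForCausalLM, LlamaTokenizer

# low_cpu_mem_usage avoids materializing a second full copy of the weights
# in RAM while the model is being loaded.
tokenizer = LlamaTokenizer.from_pretrained("path/to/llama-7b-hf")
model = LlamaForCausalLM.from_pretrained(
    "path/to/llama-7b-hf",
    low_cpu_mem_usage=True,
)

inputs = tokenizer("The capital of France is", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```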

@zeeboi9

zeeboi9 commented May 7, 2023

> Interested to see if anyone has been able to run the 7B model on Google Colab. It seems like 16 GB should be enough, and the free tier often grants that much. Not sure whether Colab Pro would do any better, but if anyone has gotten it working, advice would be much appreciated.

I was able to run it on normal Colab, but it is horribly slow because the model is loaded from Google Drive. Can anyone help me make the loading time, and the time the model takes to type out its response, faster?

Link to the code/commands (because it is a Linux environment): https://colab.research.google.com/drive/1otfwOihFBtNznj7ZXqiUJV_OXPm_BnN3?usp=sharing

@johnwick123f

I'm writing this a few months later, but it's easy to run the model if you use llama.cpp and a quantized version of the model. You can even run a model over 30B that way.
You don't even need Colab: on my phone it's possible to run a 3B model, and it outputs about half a token to one token per second, which is slow but pretty surprising given it's running on a phone!
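A minimal sketch of that route using the llama-cpp-python bindings ("path/to/llama-7b-q4_0.bin" is a placeholder for whatever quantized file llama.cpp's conversion and quantization tools produced):

```python
from llama_cpp import Llama

# Load a quantized model; n_ctx is the context window in tokens.
llm = Llama(model_path="path/to/llama-7b-q4_0.bin", n_ctx=512)

output = llm("Q: What is the capital of France? A:", max_tokens=32, stop=["Q:"])
print(output["choices"][0]["text"])
```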

@liushiyi1994

liushiyi1994 commented Jul 21, 2023

> I'm writing this a few months later, but it's easy to run the model if you use llama.cpp and a quantized version of the model. You can even run a model over 30B that way. You don't even need Colab: on my phone it's possible to run a 3B model, and it outputs about half a token to one token per second, which is slow but pretty surprising given it's running on a phone!

I'm doing some edge computing research; mind if I ask how you run it on the phone?

@windmaple

> > I'm writing this a few months later, but it's easy to run the model if you use llama.cpp and a quantized version of the model. You can even run a model over 30B that way. You don't even need Colab: on my phone it's possible to run a 3B model, and it outputs about half a token to one token per second, which is slow but pretty surprising given it's running on a phone!
>
> I'm doing some edge computing research; mind if I ask how you run it on the phone?

llama.cpp supports Android. Ref: https://github.com/ggerganov/llama.cpp#android
