Update README.md
cmp-nct authored Jun 22, 2023
1 parent 27b3370 commit 4195227
Showing 1 changed file with 14 additions and 6 deletions.
README.md

ggllm.cpp is a llama.cpp modification to run Falcon (work in progress)

**TheBloke features fine-tuned weights in GGML v3 with various quantization options:**
https://huggingface.co/TheBloke/falcon-40b-instruct-GGML
The original Falcon models from TII on Hugging Face:
https://huggingface.co/tiiuae/falcon-7b-instruct
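Either repository can be fetched with the usual Hugging Face direct-download URL pattern; the file name in the sketch below is an assumption, so pick an actual file from the repository's file list:
```
# Illustrative download of one pre-quantized GGML file; the exact file name is an
# assumption - check the repository's file list for the real names
wget https://huggingface.co/TheBloke/falcon-40b-instruct-GGML/resolve/main/falcon-40b-instruct.ggmlv3.q5_1.bin
```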

**Conversion:**
1) use falcon_convert.py to produce a GGML v1 binary from the HF weights - not recommended for direct use
2) use examples/falcon_quantize to convert that into a memory-aligned GGMLv3 binary of your choice, with mmap support from there on (see the sketch below)
_The Falcon 7B model features tensor sizes which are not yet supported by the K-type quantizers - use the traditional quantization types for those_
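A rough sketch of the two steps, assuming falcon_convert.py takes the HF model directory plus an output path and that falcon_quantize follows the usual llama.cpp quantize convention (input, output, type); paths, argument order and the chosen type are illustrative, not verified:
```
# Step 1: HF weights -> GGML v1 (arguments are assumptions - check the script's help)
python falcon_convert.py ~/models/falcon-7b-instruct ~/models/falcon-7b-instruct-ggml-v1.bin
# Step 2: GGML v1 -> memory-aligned GGMLv3 with the chosen quantization type
# (q5_1 here, since the 7B tensor sizes do not yet work with K-type quantizers)
./build/bin/falcon_quantize ~/models/falcon-7b-instruct-ggml-v1.bin ~/models/falcon-7b-instruct-q5_1.bin q5_1
```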

**Status/Bugs:**
* On Linux with Q5_1 7B, a user reports a batch token ingestion context memory issue; with -b 1 it is gone (see the example below). Not reproduced on Windows.
* Cumulative token slowdown over increasing context
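If you run into the batch ingestion issue, a workaround run with batch size 1 might look like this sketch; thread count, model path and prompt are placeholders:
```
# Workaround sketch: -b 1 ingests the prompt one token at a time; paths and values are illustrative
./build/bin/falcon_main -t 8 -m ~/models/falcon-7b/q5_1 -p "Love relates to hate like" -n 50 -b 1
```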

**How to compile:**
```
rm -rf build; mkdir build; cd build
cmake -DLLAMA_CUBLAS=1 ..
cmake --build . --config Release
# find binaries in ./bin
```

# Troubles with CUDA not being found on Linux?
```
export PATH="/usr/local/cuda/bin:$PATH"
cmake -DLLAMA_CUBLAS=1 -DCUDAToolkit_ROOT=/usr/local/cuda/ ..
```

2) Installing on WSL (Windows Subsystem for Linux)
```
# I am getting slightly better timings on WSL than on native Windows
# Use --no-mmap in WSL OR copy the model into a native directory (not /mnt/) or it will get stuck loading (thanks @nauful)
# Choose a current distro:
# ...
export LD_LIBRARY_PATH="/usr/local/cuda-12.1/lib64:$LD_LIBRARY_PATH"
export PATH="/usr/local/cuda-12.1/bin:$PATH"
# now start with a fresh cmake and all should work
```


**CUDA:**
Only some tensors are supported currently; only the mul_mat operation is supported (see the offload sketch below).
q3_k timing on 3090 of Falcon 40B:

CUDA sidenote:
It appears the Q5 Falcon 40B inference time on CPU is as fast as the A100 fp16 inference time at 2 tk/second
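To try the partial CUDA support, layers are offloaded with -ngl (the CPU example below pins everything to the CPU with -ngl 0); a GPU-assisted run might look like this sketch, with the layer count and model path as illustrative assumptions:
```
# Sketch: offload a number of layers to the GPU via cuBLAS; -ngl value and model path are assumptions
./build/bin/falcon_main -t 8 -m ~/models/falcon-40b/q3_k -p "Love relates to hate like" -n 50 -ngl 20
```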
CPU inference examples:
```
Q:\ggllm.cpp> .\build\bin\Release\falcon_main.exe -t 31 -m Q:\models\falcon-40b\q5_1 -p "Love relates to hate like" -n 50 -ngl 0
main: build = 677 (dd3d346)
main: seed = 1687010794
...
```
