Wanting to try out L3-70b-euryale on my computer, but don't know what version to choose
Posted by EEEEEEEEEEEEEEEE_Man@reddit | LocalLLaMA | 2 comments
So I'm interested in using l3-70b-euryale as a chatbot, but I don't know which version to choose. I checked guides on what to pick for performance, but they're WAY too confusing to follow; there are basically zero examples tied to an actual PC build. And I'm pretty new to local AI.
Specs:
CPU: AMD Ryzen 7 5700X
RAM: 16GB DDR4
GPU: RTX 3070
OS: Windows 10
Linkpharm2@reddit
You can't really run this model. L3-70b is the model; 70b refers to 70 billion parameters. The smallest you can compress it down to is about 24GB. Your 3070 has 8GB of VRAM, and for usable speed you can only run models that fit in your VRAM.
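Rough math, if you want to sanity-check that (a quick Python sketch; the bits-per-weight figures are approximate averages for common GGUF quants, not exact):

```python
# Rough GGUF size estimate: params * bits-per-weight / 8.
# Bit-widths below are approximate averages for each quant type.
PARAMS = 70e9  # Llama-3 70B

quants = {
    "FP16":   16.0,
    "Q8_0":    8.5,
    "Q4_K_M":  4.8,
    "Q2_K":    2.6,
}

for name, bits in quants.items():
    gb = PARAMS * bits / 8 / 1e9
    print(f"{name}: ~{gb:.0f} GB")
```

That puts even the heavily lobotomized Q2 quant around 23GB, way over 8GB of VRAM.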
alamacra@reddit
By your own (very convenient!) table, it will run in Q2 if the guy uses koboldcpp / llamacpp / any other launcher that can offload layers to RAM. It will be slow though, since it'll be DDR4 dual channel at most, so about 1 token/s if you are lucky.
So I suggest trying it, but I suspect the speed won't be enough. The quality is actually surprisingly decent for a 70b at Q2.
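If you do try the offload route, here's a minimal sketch with llama-cpp-python (the filename and layer count are placeholders I made up; tune n_gpu_layers to whatever fits in 8GB without running out of memory):

```python
# Minimal sketch: push as many transformer layers as fit in the 3070's
# 8GB of VRAM onto the GPU and leave the rest in system RAM.
from llama_cpp import Llama

llm = Llama(
    model_path="L3-70B-Euryale-Q2_K.gguf",  # hypothetical filename; use whatever quant you downloaded
    n_gpu_layers=20,  # rough guess for 8GB VRAM at Q2; lower it if you hit OOM errors
    n_ctx=4096,
)

out = llm("Hello!", max_tokens=64)
print(out["choices"][0]["text"])
```

Either way most of the layers will still be running from dual-channel DDR4, so the ~1 token/s estimate above stands.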