For those who don't know what different model formats (GGUF, GPTQ, AWQ, EXL2, etc.) mean ↓

Posted by nderstand2grow@reddit | LocalLLaMA | View on Reddit | 10 comments

GGML and GGUF refer to the same concept, with GGUF being the newer version that incorporates additional data about the model. This enhancement allows for better support of multiple architectures and includes prompt templates. GGUF can be executed solely on a CPU or partially/fully offloaded to a GPU. By utilizing K quants, the GGUF can range from 2 bits to 8 bits. Previously, GPTQ served as a GPU-only optimized quantization method. However, it has been surpassed by AWQ, which is approximately twice as fast. The latest advancement in this area is EXL2, which offers even better performance. Typically, these quantization methods are implemented using 4 bits. Safetensors and PyTorch bin files are examples of raw float16 model files. These files are primarily utilized for continued fine-tuning purposes. pth can include Python code (PyTorch code) for inference. TF includes the complete static graph.