How do llama.cpp or other implementations handle tokenization without tiktoken?

Posted by EricHermosis@reddit | LocalLLaMA

Hi! I built my own tensor library in C++ and got Llama 3 running on it. I set up a simple socket server that can send and receive tensors from a Python client, so I tokenize with tiktoken on the Python side, send the tensor to my C++ transformer, and get the result back.

I'm getting good results with Llama 3 1B, decent considering I haven't made any optimizations yet, but I'd like to get rid of Python and do everything in C++. The problem is that tiktoken is Rust/Python. What do you think I should do? Implement it from scratch, look for someone else's implementation, or try to use the original Rust version? How do llama.cpp and other LLM implementations handle this?
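For context on the "from scratch" option: as far as I can tell, llama.cpp does not use tiktoken at all. It ships its own tokenizer code in C++ (exposed through `llama_tokenize`) and loads the vocabulary and BPE merge data from the GGUF model file, so the encoder could in principle be reimplemented directly. Below is a rough sketch of the core greedy merge loop such a tokenizer would need, assuming the merge ranks and token→id vocab have already been exported from the tiktoken files; the container names and the loading/pre-tokenization steps are placeholders, not llama.cpp's actual API.

```cpp
// Minimal sketch of rank-based BPE merging, the core of a from-scratch
// C++ tokenizer. Assumes the merge table (pair -> rank) and the vocab
// (token string -> id) were exported from the tiktoken model beforehand.
#include <climits>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical containers, filled once from exported files.
using MergeRanks = std::map<std::pair<std::string, std::string>, int>;
using Vocab      = std::map<std::string, int32_t>;

// Greedy BPE: repeatedly merge the adjacent pair with the lowest rank
// until no mergeable pair remains.
std::vector<int32_t> bpe_encode(const std::string& piece,
                                const MergeRanks& ranks,
                                const Vocab& vocab) {
    // Start from single bytes. Real byte-level BPE first remaps raw bytes
    // to printable unicode and splits the input with a regex pre-tokenizer;
    // both details are skipped here.
    std::vector<std::string> parts;
    for (char c : piece) parts.emplace_back(1, c);

    while (parts.size() > 1) {
        int best_rank = INT_MAX;
        size_t best_i = 0;
        for (size_t i = 0; i + 1 < parts.size(); ++i) {
            auto it = ranks.find({parts[i], parts[i + 1]});
            if (it != ranks.end() && it->second < best_rank) {
                best_rank = it->second;
                best_i = i;
            }
        }
        if (best_rank == INT_MAX) break;  // no mergeable pair left
        parts[best_i] += parts[best_i + 1];
        parts.erase(parts.begin() + best_i + 1);
    }

    std::vector<int32_t> ids;
    for (const auto& p : parts) {
        auto it = vocab.find(p);
        if (it != vocab.end()) ids.push_back(it->second);
        // A real tokenizer would need byte-fallback / unknown handling here.
    }
    return ids;
}
```

On top of this loop, a full replacement would still need the byte-to-unicode mapping, the regex pre-tokenization split, and special-token handling, which is roughly the extra machinery llama.cpp carries in its own tokenizer code. The upside is that the vocab/merges only have to be exported once, so there is no Python at inference time.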