BPE tokenizer in Rust - would love feedback from the community
Posted by farhan-dev@reddit | LocalLLaMA | View on Reddit | 8 comments
Hey everyone,
I've been working on a side project called Splintr - a BPE tokenizer written in Rust with Python bindings. It's compatible with OpenAI's tiktoken vocabularies (cl100k_base, o200k_base).
What it does:
- Single text encoding: ~3-4x faster than tiktoken
- Batch encoding: ~10-12x faster than tiktoken
- Streaming decoder for real-time LLM output
- 54 special tokens for training and building chat/agent applications
Quick example:
pip install splintr-rs
from splintr import Tokenizer
tokenizer = Tokenizer.from_pretrained("cl100k_base")
tokens = tokenizer.encode("Hello, world!")
text = tokenizer.decode(tokens)
# Batch encode (where it really shines)
texts = ["Hello", "World"] * 1000
batch_tokens = tokenizer.encode_batch(texts)
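The streaming decoder is there because multi-byte UTF-8 characters (accents, CJK, emoji) can span token boundaries, so naively decoding token-by-token can emit broken characters. Just to illustrate the general idea (a concept sketch in plain Python, not Splintr's actual API), an incremental decoder buffers bytes until they form complete characters:

import codecs
import tiktoken

# tiktoken is used here only for the token -> bytes mapping
enc = tiktoken.get_encoding("cl100k_base")

def stream_decode(token_stream):
    # The incremental UTF-8 decoder holds back the bytes of an incomplete
    # multi-byte character until the rest arrives, so partial characters
    # are never emitted to the user.
    decoder = codecs.getincrementaldecoder("utf-8")()
    for token in token_stream:
        piece = decoder.decode(enc.decode_single_token_bytes(token))
        if piece:
            yield piece
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail

tokens = enc.encode("héllo 👋")
print("".join(stream_decode(tokens)))  # prints: héllo 👋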
I spent some time benchmarking and optimizing - turns out sequential encoding beats parallel for most text sizes (Rayon overhead only pays off at ~1MB+). Sometimes simpler is faster.
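If you want to sanity-check the speed claims on your own hardware, a rough micro-benchmark against tiktoken looks something like this (just a sketch; swap in whatever corpus you actually care about):

import time
import tiktoken
from splintr import Tokenizer

texts = ["The quick brown fox jumps over the lazy dog. " * 20] * 5000

tik = tiktoken.get_encoding("cl100k_base")
spl = Tokenizer.from_pretrained("cl100k_base")

start = time.perf_counter()
tik.encode_batch(texts)
print(f"tiktoken encode_batch: {time.perf_counter() - start:.3f}s")

start = time.perf_counter()
spl.encode_batch(texts)
print(f"splintr encode_batch:  {time.perf_counter() - start:.3f}s")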
GitHub: https://github.com/farhan-syah/splintr
Would really appreciate if you could give it a try and let me know:
- Does it work for your use case?
- Any issues or rough edges?
- What features would be useful?
Still early days, but happy to hear any feedback. Thanks for reading!
ReturningTarzan@reddit
The main thing about tokenization is always correctness. Throughput is nice but secondary. A wishlist could be:
Chromix_@reddit
This! It's very easy to have (and keep) subtle bugs in tokenizers, as sometimes the degraded result quality is really only noticeable in benchmarks. llama.cpp had issues and drama with that.
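Since Splintr targets the tiktoken vocabularies, a cheap safety net is a differential test: encode the same strings with both libraries and compare the IDs. A minimal sketch, assuming both packages are installed:

import tiktoken
from splintr import Tokenizer

tik = tiktoken.get_encoding("cl100k_base")
spl = Tokenizer.from_pretrained("cl100k_base")

# Edge cases where tokenizer bugs tend to hide: irregular whitespace,
# accents, mixed scripts, emoji, numbers, code.
samples = [
    "Hello, world!",
    "  leading and   irregular   spaces ",
    "naïve café résumé",
    "日本語のテキストと emoji 🤖🚀",
    "1234567890 3.14159 0xDEADBEEF",
    "def f(x):\n    return x ** 2\n",
]

for s in samples:
    a = tik.encode(s)
    b = list(spl.encode(s))  # coerce in case splintr returns a non-list sequence
    assert a == b, f"mismatch on {s!r}: {a} vs {b}"
print("all samples match")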
andreclaudino@reddit
I am interested in helping, mainly with the Rust API. I took a quick look but didn't see how to train a tokenizer with it. Is that possible?
farhan-dev@reddit (OP)
In the future, perhaps. I haven't planned for that yet. For now, I am focusing on using the existing pre-trained tokenizers, because in my use case I am using this tokenizer to build datasets to train my own LLM. My priority at the moment is to add support for the other vocabs first.
DeltaSqueezer@reddit
This is great! I'd be interested in more support for other popular vocabs.
farhan-dev@reddit (OP)
Will do. Since the core is already there, I will add the rest of the vocabs one by one and test them.
__Maximum__@reddit
Oh man, my favourite sub on reddit!
Looobay@reddit
Have you tested it against tokenizers? It's also written in Rust and maintained by Hugging Face.
Anyway, congrats, that's a cool project!
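For a rough speed comparison, both libraries expose a batch API, though the vocabularies differ (GPT-2 BPE vs cl100k_base), so this only says something about relative throughput, not correctness. A sketch, assuming both packages are installed:

import time
from tokenizers import Tokenizer as HFTokenizer
from splintr import Tokenizer

texts = ["The quick brown fox jumps over the lazy dog. " * 20] * 5000

hf = HFTokenizer.from_pretrained("gpt2")        # any Hub repo with a tokenizer.json
spl = Tokenizer.from_pretrained("cl100k_base")

start = time.perf_counter()
hf.encode_batch(texts)
print(f"tokenizers: {time.perf_counter() - start:.3f}s")

start = time.perf_counter()
spl.encode_batch(texts)
print(f"splintr:    {time.perf_counter() - start:.3f}s")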