BPE tokenizer in Rust - would love feedback from the community
Posted by farhan-dev@reddit | LocalLLaMA | View on Reddit | 8 comments
Hey everyone,
I've been working on a side project called Splintr - a BPE tokenizer written in Rust with Python bindings. It's compatible with OpenAI's tiktoken vocabularies (cl100k_base, o200k_base).
What it does:
- Single text encoding: ~3-4x faster than tiktoken
- Batch encoding: ~10-12x faster than tiktoken
- Streaming decoder for real-time LLM output
- 54 special tokens for training and building chat/agent applications
Quick example:
pip install splintr-rs
from splintr import Tokenizer
tokenizer = Tokenizer.from_pretrained("cl100k_base")
tokens = tokenizer.encode("Hello, world!")
text = tokenizer.decode(tokens)
# Batch encode (where it really shines)
texts = ["Hello", "World"] * 1000
batch_tokens = tokenizer.encode_batch(texts)
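The streaming decoder is there because multi-byte UTF-8 characters (accents, CJK, emoji) can span token boundaries, so naively decoding token-by-token can emit broken characters. Just to illustrate the general idea (a concept sketch in plain Python, not Splintr's actual API), an incremental decoder buffers bytes until they form complete characters:

import codecs
import tiktoken

# tiktoken is used here only for the token -> bytes mapping
enc = tiktoken.get_encoding("cl100k_base")

def stream_decode(token_stream):
    # The incremental UTF-8 decoder holds back the bytes of an incomplete
    # multi-byte character until the rest arrives, so partial characters
    # are never emitted to the user.
    decoder = codecs.getincrementaldecoder("utf-8")()
    for token in token_stream:
        piece = decoder.decode(enc.decode_single_token_bytes(token))
        if piece:
            yield piece
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail

tokens = enc.encode("héllo 👋")
print("".join(stream_decode(tokens)))  # prints: héllo 👋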
I spent some time benchmarking and optimizing - turns out sequential encoding beats parallel for most text sizes (Rayon overhead only pays off at ~1MB+). Sometimes simpler is faster.
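If you want to sanity-check the speed claims on your own hardware, a rough micro-benchmark against tiktoken looks something like this (just a sketch; swap in whatever corpus you actually care about):

import time
import tiktoken
from splintr import Tokenizer

texts = ["The quick brown fox jumps over the lazy dog. " * 20] * 5000

tik = tiktoken.get_encoding("cl100k_base")
spl = Tokenizer.from_pretrained("cl100k_base")

start = time.perf_counter()
tik.encode_batch(texts)
print(f"tiktoken encode_batch: {time.perf_counter() - start:.3f}s")

start = time.perf_counter()
spl.encode_batch(texts)
print(f"splintr encode_batch:  {time.perf_counter() - start:.3f}s")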
GitHub: https://github.com/farhan-syah/splintr
Would really appreciate if you could give it a try and let me know:
- Does it work for your use case?
- Any issues or rough edges?
- What features would be useful?
Still early days, but happy to hear any feedback. Thanks for reading!
ReturningTarzan@reddit
The main thing about tokenization is always correctness. Throughput is nice but secondary. A wishlist could be:
Chromix_@reddit
This! It's very easy to have (and keep) subtle bugs in tokenizers, as sometimes the degraded result quality is really only noticeable in benchmarks. llama.cpp had issues and drama with that.
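Since Splintr targets the tiktoken vocabularies, a cheap safety net is a differential test: encode the same strings with both libraries and compare the IDs. A minimal sketch, assuming both packages are installed:

import tiktoken
from splintr import Tokenizer

tik = tiktoken.get_encoding("cl100k_base")
spl = Tokenizer.from_pretrained("cl100k_base")

# Edge cases where tokenizer bugs tend to hide: irregular whitespace,
# accents, mixed scripts, emoji, numbers, code.
samples = [
    "Hello, world!",
    "  leading and   irregular   spaces ",
    "naïve café résumé",
    "日本語のテキストと emoji 🤖🚀",
    "1234567890 3.14159 0xDEADBEEF",
    "def f(x):\n    return x ** 2\n",
]

for s in samples:
    a = tik.encode(s)
    b = list(spl.encode(s))  # coerce in case splintr returns a non-list sequence
    assert a == b, f"mismatch on {s!r}: {a} vs {b}"
print("all samples match")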
andreclaudino@reddit
I am interested in helping, mainly with the Rust API. I took a quick look but didn't see how to train a tokenizer with it. Is that possible?
farhan-dev@reddit (OP)
In the future, perhaps. I haven't planned for that yet. For now, I am focusing on using the existing pre-trained tokenizers, because in my use case I am using this tokenizer to build datasets to train my own LLM. My priority at the moment is to add support for the other vocabs first.
DeltaSqueezer@reddit
This is great! I'd be interested in more support for other popular vocabs.
farhan-dev@reddit (OP)
Will do. Since the core is already there, I will add the rest of the vocabs one by one and test them.
__Maximum__@reddit
Oh man, my favourite sub on reddit!
Looobay@reddit
Have you tested it against tokenizers? It's also written in Rust and maintained by Hugging Face.
Anyway, congrats, that's a cool project!
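For a rough speed comparison, both libraries expose a batch API, though the vocabularies differ (GPT-2 BPE vs cl100k_base), so this only says something about relative throughput, not correctness. A sketch, assuming both packages are installed:

import time
from tokenizers import Tokenizer as HFTokenizer
from splintr import Tokenizer

texts = ["The quick brown fox jumps over the lazy dog. " * 20] * 5000

hf = HFTokenizer.from_pretrained("gpt2")        # any Hub repo with a tokenizer.json
spl = Tokenizer.from_pretrained("cl100k_base")

start = time.perf_counter()
hf.encode_batch(texts)
print(f"tokenizers: {time.perf_counter() - start:.3f}s")

start = time.perf_counter()
spl.encode_batch(texts)
print(f"splintr:    {time.perf_counter() - start:.3f}s")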