Is there a way to cache the bpe when using num_tokens_from_messages #81

Open
xd009642 opened this issue Aug 13, 2024 · 3 comments

@xd009642

We've found that num_tokens_from_messages is a significant bottleneck in our application. Benchmarking shows that it appears to load the BPE every time it's called, and that load dominates the runtime. Is there a way to cache the BPE so we only pay the loading cost once, or would that require a change to tiktoken-rs' internals (and if so, what would the change need to be)?

Section of a flamegraph showing the breakdown of num_tokens_from_messages

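For context, the kind of caller-side caching we're after looks roughly like this. A sketch only: it assumes tiktoken-rs exposes a `cl100k_base()` constructor and `CoreBPE::encode_with_special_tokens`, and it ignores the per-message overhead that num_tokens_from_messages accounts for, so the counts are approximate.

```rust
use std::sync::OnceLock;

use tiktoken_rs::CoreBPE;

// Load the BPE once and reuse it for every subsequent call.
// `cl100k_base()` is an assumption; use whichever encoding matches the model.
fn cached_bpe() -> &'static CoreBPE {
    static BPE: OnceLock<CoreBPE> = OnceLock::new();
    BPE.get_or_init(|| tiktoken_rs::cl100k_base().expect("failed to load cl100k_base"))
}

// Rough count using the cached encoder; the per-message constants that
// num_tokens_from_messages adds are deliberately left out of this sketch.
fn approx_token_count(texts: &[&str]) -> usize {
    let bpe = cached_bpe();
    texts
        .iter()
        .map(|t| bpe.encode_with_special_tokens(t).len())
        .sum()
}
```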

@xd009642
Author

As an aside on performance: having a mutex around the singleton models seems counterproductive. Generally, with model code you want the model to be either:

  1. Callable from multiple threads/contexts, with any mutable state stored in a separate type so the model type itself stays immutable, or
  2. Cheap to clone (usually via reference-counted parameters), so that even if mutable state can't be separated out the model is still reusable.

Having mutexes around the singletons drastically limits their utility, given that the tokenisers run on the CPU and aren't that complex.
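For illustration, the mutex-free shape would be roughly the following. This is a sketch that assumes CoreBPE's encode methods take `&self` and that the type is Send + Sync, so an `Arc` alone is enough to share it.

```rust
use std::sync::Arc;
use std::thread;

use tiktoken_rs::CoreBPE;

fn main() {
    // Build the encoder once; if encoding only needs &self, the Arc alone is
    // enough to share it across threads -- no Mutex required.
    let bpe: Arc<CoreBPE> =
        Arc::new(tiktoken_rs::cl100k_base().expect("failed to load cl100k_base"));

    let handles: Vec<_> = (0..4)
        .map(|i| {
            let bpe = Arc::clone(&bpe);
            thread::spawn(move || bpe.encode_with_special_tokens(&format!("worker {i}")).len())
        })
        .collect();

    for handle in handles {
        println!("tokens: {}", handle.join().unwrap());
    }
}
```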

Also, for benchmarking you should look at criterion or divan; the built-in Rust bench harness isn't stable and gives you less useful statistics/measurements.
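For example, a divan harness in the shape of the tables below is only a few lines. This sketch benches a plain load-and-encode via `cl100k_base()` as a placeholder; swap in the actual num_tokens_from_messages call and a realistic message fixture.

```rust
// benches/token_counting.rs -- needs divan as a dev-dependency and a
// [[bench]] entry in Cargo.toml with `harness = false`.
fn main() {
    divan::main();
}

// Placeholder benchmark: loads the BPE and encodes a short string each
// iteration, so it includes the initialization cost being discussed.
#[divan::bench]
fn openai35(bencher: divan::Bencher) {
    bencher.bench(|| {
        let bpe = tiktoken_rs::cl100k_base().unwrap();
        divan::black_box(bpe.encode_with_special_tokens("Hello, how are you?").len())
    });
}
```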

@xd009642
Author

Some benchmarking results to back this up: we took the implementation of num_tokens_from_messages and passed in an already-loaded model to avoid reloading, and we got:

assistant_benches       fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ token_counting                     │               │               │               │         │
│  ├─ Azure35           2.299 ms      │ 97.9 ms       │ 4.02 ms       │ 5.278 ms      │ 100     │ 100
│  ├─ Azure4            2.048 ms      │ 11.42 ms      │ 4.429 ms      │ 4.66 ms       │ 100     │ 100
│  ├─ Llama2            2.373 ms      │ 33.58 ms      │ 5.256 ms      │ 6.082 ms      │ 100     │ 100
│  ├─ OpenAI35          3.006 ms      │ 15.72 ms      │ 8.08 ms       │ 8.418 ms      │ 100     │ 100
│  ╰─ OpenAI4           3.22 ms       │ 30 ms         │ 7.46 ms       │ 8.057 ms      │ 100     │ 100

Using the version in this library we get:

Timer precision: 25 ns
assistant_benches       fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ token_counting                     │               │               │               │         │
│  ├─ Azure35           82.79 ms      │ 174.3 ms      │ 92.74 ms      │ 97.51 ms      │ 100     │ 100
│  ├─ Azure4            208.7 ms      │ 694.2 ms      │ 238.2 ms      │ 274.8 ms      │ 100     │ 100
│  ├─ Llama2            83.74 ms      │ 123.6 ms      │ 99.9 ms       │ 99.07 ms      │ 100     │ 100
│  ├─ OpenAI35          83.85 ms      │ 125.5 ms      │ 97.92 ms      │ 98.48 ms      │ 100     │ 100
│  ╰─ OpenAI4           208.9 ms      │ 528.9 ms      │ 241.8 ms      │ 249.7 ms      │ 100     │ 100

@OTheDev

OTheDev commented Oct 14, 2024

I can reproduce the performance gains of using a pre-initialized BPE.

In fact, it would be possible to implement something like a num_tokens_from_messages_cached_bpe that initializes the two chat BPEs on its first call and retrieves references to the already-initialized BPEs on subsequent calls. The arguments to this function would be exactly the same as those of num_tokens_from_messages (no BPE would need to be supplied), which makes it easier for users who don't want to obtain a BPE themselves.
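Roughly the shape I have in mind. Just a sketch: the statics, the model-to-encoding mapping, and the message-field access are assumptions rather than the library's current internals, and the real per-message token accounting is elided.

```rust
use std::sync::OnceLock;

use anyhow::Result;
use tiktoken_rs::{ChatCompletionRequestMessage, CoreBPE};

// Chat BPEs built on first use and reused afterwards.
static CL100K: OnceLock<CoreBPE> = OnceLock::new();
static O200K: OnceLock<CoreBPE> = OnceLock::new();

fn chat_bpe(model: &str) -> &'static CoreBPE {
    // Simplified model-to-encoding mapping, for illustration only.
    if model.starts_with("gpt-4o") {
        O200K.get_or_init(|| tiktoken_rs::o200k_base().expect("failed to load o200k_base"))
    } else {
        CL100K.get_or_init(|| tiktoken_rs::cl100k_base().expect("failed to load cl100k_base"))
    }
}

// Same arguments as num_tokens_from_messages, but no BPE reload per call.
// The per-message/per-name overhead constants are omitted from this sketch.
pub fn num_tokens_from_messages_cached_bpe(
    model: &str,
    messages: &[ChatCompletionRequestMessage],
) -> Result<usize> {
    let bpe = chat_bpe(model);
    let mut total = 0;
    for message in messages {
        if let Some(content) = &message.content {
            total += bpe.encode_with_special_tokens(content).len();
        }
    }
    Ok(total)
}
```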
