cache
OutputCache
A cache for storing tensor outputs with optional CPU offloading.
This cache stores tensors along with their original devices and can optionally move tensors to CPU to save GPU memory. When a tensor is retrieved, it is moved back to its original device.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `maxsize` | `int` | Maximum number of items to store in the cache | *required* |
| `move_to_cpu` | `bool` | If `True`, tensors are moved to CPU when cached | `False` |
Source code in genlm_backend/cache.py
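The behavior described above can be illustrated with a minimal sketch. This is not the actual `genlm_backend` implementation; the internal structure and method names are assumptions, and a tiny stand-in tensor class is used so the example runs without `torch` installed (a real tensor with `.device` and `.to()` would work the same way).

```python
from collections import OrderedDict

class OutputCache:
    """Sketch of an LRU cache with optional CPU offloading (illustrative only)."""

    def __init__(self, maxsize, move_to_cpu=False):
        self.maxsize = maxsize
        self.move_to_cpu = move_to_cpu
        self._data = OrderedDict()  # key -> (tensor, original device)

    def __setitem__(self, key, tensor):
        device = tensor.device
        if self.move_to_cpu:
            tensor = tensor.to("cpu")  # offload to save GPU memory
        self._data[key] = (tensor, device)
        self._data.move_to_end(key)
        if len(self._data) > self.maxsize:
            self._data.popitem(last=False)  # evict least recently used

    def __getitem__(self, key):
        tensor, device = self._data[key]
        self._data.move_to_end(key)
        return tensor.to(device)  # restore the original device

    def __contains__(self, key):
        return key in self._data

# Stand-in for a torch.Tensor, just for this demo.
class _Tensor:
    def __init__(self, data, device="cuda:0"):
        self.data, self.device = data, device
    def to(self, device):
        return _Tensor(self.data, device)

cache = OutputCache(maxsize=2, move_to_cpu=True)
cache["a"] = _Tensor([1, 2], device="cuda:0")
t = cache["a"]          # comes back on its original device
cache["b"] = _Tensor([3], device="cuda:0")
cache["c"] = _Tensor([4], device="cuda:0")  # "a" is evicted (maxsize=2)
```

Using an `OrderedDict` keeps eviction order cheap to maintain; the key point is that each entry records the device the tensor came from, so retrieval can transparently undo the CPU offload.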
TokenTrie
Class used internally to cache language model results.
The TokenTrie maintains a tree of token sequences, storing logits and key-value states for each path.
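A token trie of this kind can be sketched as follows. The node fields and helper functions here are hypothetical, not the library's actual API; the sketch only shows the core idea of caching logits (and, analogously, key-value states) at each prefix of a token sequence.

```python
class TokenTrieNode:
    """Sketch of a trie node caching per-prefix model results (illustrative only)."""

    def __init__(self):
        self.children = {}    # token id -> TokenTrieNode
        self.logits = None    # logits computed after this prefix, if cached
        self.kv_state = None  # key-value cache for this prefix, if cached

def insert(root, tokens, logits=None):
    """Walk (and extend) the trie along a token sequence; return the final node."""
    node = root
    for tok in tokens:
        node = node.children.setdefault(tok, TokenTrieNode())
    if logits is not None:
        node.logits = logits
    return node

def longest_prefix(root, tokens):
    """Return the deepest cached node matching a prefix of tokens, and its depth."""
    node, depth = root, 0
    for tok in tokens:
        if tok not in node.children:
            break
        node = node.children[tok]
        depth += 1
    return node, depth

# Cache results for the sequence [1, 2, 3], then reuse them for [1, 2, 3, 4]:
# only the suffix past the matched prefix needs fresh model computation.
root = TokenTrieNode()
leaf = insert(root, [1, 2, 3], logits=[0.1, 0.9])
node, depth = longest_prefix(root, [1, 2, 3, 4])
```

The payoff of the trie layout is that sequences sharing a prefix share cached state, so re-querying the model for an extended sequence only pays for the unmatched suffix.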