bytes
Functions to get the byte vocabulary from a Hugging Face tokenizer.
check_byte_decoder(tokenizer, byte_decoder)
Verify that a byte decoder can properly handle all tokens.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| tokenizer | | A Hugging Face tokenizer instance | required |
| byte_decoder | dict | Dictionary mapping characters to bytes | required |
Raises:

| Type | Description |
|---|---|
| ByteDecoderError | If byte decoder fails validation checks |
Source code in genlm_backend/tokenization/bytes.py
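The sketch below shows one way this check might be invoked. It assumes a slow GPT-2 tokenizer, which exposes a byte_decoder attribute mapping characters back to byte values; the model name is illustrative.

```python
from transformers import AutoTokenizer

from genlm_backend.tokenization.bytes import check_byte_decoder

# Slow GPT-2 tokenizers carry a character -> byte mapping as byte_decoder.
tokenizer = AutoTokenizer.from_pretrained("gpt2", use_fast=False)

# Raises ByteDecoderError if the mapping cannot reconstruct every token.
check_byte_decoder(tokenizer, tokenizer.byte_decoder)
```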
get_byte_tokens_by_encoding_token_strings(tokenizer)
Convert tokens to bytes by encoding token strings directly.
This function attempts to convert each token in the vocabulary to its byte representation by directly encoding the token strings. It handles special tokens separately and has multiple fallback strategies for encoding regular tokens:
- For special tokens, uses the string representation from the tokenizer's added vocab
- For regular tokens:
    a. If the token is already bytes, uses it directly
    b. If the token is a string and the tokenizer has convert_tokens_to_string:
        - Converts single token to string
        - Verifies roundtrip encoding matches original token ID
        - Falls back to byte decoder if roundtrip fails
    c. If the token is a string without convert_tokens_to_string:
        - Directly encodes the token string
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| tokenizer | | A Hugging Face tokenizer instance. | required |
Returns:

| Name | Type | Description |
|---|---|---|
| byte_tokens | list[bytes] | List of byte representations for each token in the vocabulary. |
Raises:

| Type | Description |
|---|---|
| ValueError | If token encoding fails (roundtrip produces multiple tokens), or if a token has an unexpected type (not str or bytes). |
Source code in genlm_backend/tokenization/bytes.py
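The sketch below illustrates the per-token roundtrip check described above. It is a simplified illustration, not the library's implementation; the helper name token_to_bytes is hypothetical, and raising on roundtrip failure is a simplification (the documented behavior falls back to a byte decoder at that point).

```python
def token_to_bytes(tokenizer, token, token_id):
    # Hypothetical helper: simplified version of the fallback strategy above.
    if isinstance(token, bytes):
        return token  # already bytes: use directly
    if isinstance(token, str):
        if hasattr(tokenizer, "convert_tokens_to_string"):
            # Convert the single token to text, then verify that re-encoding
            # the text recovers exactly the original token ID.
            text = tokenizer.convert_tokens_to_string([token])
            if tokenizer.encode(text, add_special_tokens=False) == [token_id]:
                return text.encode("utf-8")
            # Roundtrip failed: the real function would fall back to a byte
            # decoder here; this sketch simply raises.
            raise ValueError(f"Roundtrip failed for token id {token_id}")
        # No convert_tokens_to_string: encode the token string directly.
        return token.encode("utf-8")
    raise ValueError(f"Unexpected token type: {type(token)}")
```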
get_byte_tokens_from_byte_decoder(tokenizer, byte_decoder)
Convert tokens to bytes using a byte decoder mapping.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| tokenizer | | A Hugging Face tokenizer instance | required |
| byte_decoder | dict | Dictionary mapping characters to bytes | required |
Returns:

| Name | Type | Description |
|---|---|---|
| byte_tokens | list[bytes] | List of byte representations for each token |
Source code in genlm_backend/tokenization/bytes.py
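The sketch below illustrates the character-to-byte mapping this conversion relies on, assuming a GPT-2-style byte decoder; the helper name decode_token is hypothetical.

```python
def decode_token(token_str, byte_decoder):
    # Map every character of the token string to its original byte value.
    return bytes(byte_decoder[ch] for ch in token_str)

# With GPT-2's byte decoder, the token "Ġhello" maps to b" hello",
# since "Ġ" decodes to the space byte 0x20.
```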
get_byte_tokens_from_sp(tokenizer)
Convert tokens to their byte representations using a SentencePiece model.
Uses the SentencePiece model's id_to_piece method to get the raw byte representation of each token, handling special tokens separately. Converts any hex-encoded bytes (in <0xXX> format) to their actual byte values and replaces the SentencePiece prefix space marker with a regular space.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| tokenizer | | A Hugging Face tokenizer instance with a SentencePiece model | required |
Returns:

| Name | Type | Description |
|---|---|---|
| byte_tokens | list[bytes] | List of byte representations for each token in the vocabulary |
Note
Special tokens are handled by directly encoding their string representation, while normal tokens go through the SentencePiece conversion process.
Source code in genlm_backend/tokenization/bytes.py
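The sketch below illustrates how a single SentencePiece piece might be converted, per the description above; it is illustrative only, not the library's exact implementation, and the helper name piece_to_bytes is hypothetical.

```python
import re

def piece_to_bytes(piece):
    # Hex-encoded byte pieces such as "<0x0A>" become the byte they name.
    match = re.fullmatch(r"<0x([0-9A-Fa-f]{2})>", piece)
    if match:
        return bytes([int(match.group(1), 16)])
    # Replace the SentencePiece prefix-space marker (U+2581) with a space.
    return piece.replace("\u2581", " ").encode("utf-8")

# piece_to_bytes("<0x0A>")       -> b"\n"
# piece_to_bytes("\u2581hello")  -> b" hello"
```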
get_byte_vocab(tokenizer)
Extract byte vocabulary from a tokenizer using various methods.
This function attempts to extract the byte representation of each token in the vocabulary using multiple methods, trying each in sequence until one succeeds:
- If the tokenizer has a byte_decoder attribute, attempt to use that directly
- If the tokenizer has an sp_model (SentencePiece) attribute, use that
- Try encoding the token strings directly
- Fall back to using the default GPT2 byte decoder
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| tokenizer | | A Hugging Face tokenizer instance. | required |
Returns:

| Type | Description |
|---|---|
| list[bytes] | List of byte representations of tokens. |
Raises:

| Type | Description |
|---|---|
| ByteVocabError | If the vocabulary cannot be decoded using any of the available methods. |
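A minimal usage sketch is shown below; the model name is illustrative.

```python
from transformers import AutoTokenizer

from genlm_backend.tokenization.bytes import get_byte_vocab

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# One bytes object per token in the vocabulary.
byte_vocab = get_byte_vocab(tokenizer)
print(len(byte_vocab), byte_vocab[:3])
```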