# vocab

Functions to get and check Hugging Face tokenizer vocabularies.
## assert_roundtrip(test_case, tokenizer, vocab, vocab_type)

Assert that encoding and decoding a test case matches the tokenizer's output. This unified function handles both string and byte vocabularies.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `test_case` | `str` | String to test encoding/decoding roundtrip | *required* |
| `tokenizer` | | Hugging Face tokenizer instance | *required* |
| `vocab` | `list` | List of token representations (either strings or bytes) | *required* |
| `vocab_type` | `str` | Type of vocabulary, either `'str'` or `'byte'` | *required* |
Raises:

| Type | Description |
|---|---|
| `AssertionError` | If the roundtrip result doesn't match the tokenizer's direct decoding |
| `ValueError` | If `vocab_type` is not `'str'` or `'byte'` |
Source code in genlm_backend/tokenization/vocab.py
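The invariant this assertion checks can be sketched with a stub tokenizer. The stub, its two-token vocabulary, and the `roundtrip` helper below are all hypothetical, illustration-only stand-ins for the real function:

```python
class StubTokenizer:
    """Toy tokenizer over a fixed two-token vocabulary (illustration only)."""

    def encode(self, text):
        assert text == "hello"
        return [0, 1]  # "hello" -> ["he", "llo"]

    def decode(self, ids):
        return "".join(["he", "llo"][i] for i in ids)


def roundtrip(test_case, tokenizer, vocab, vocab_type):
    """Sketch of the check: joining the vocab entries for the encoded ids
    must reproduce tokenizer.decode on those same ids."""
    ids = tokenizer.encode(test_case)
    if vocab_type == "str":
        result = "".join(vocab[i] for i in ids)
    elif vocab_type == "byte":
        result = b"".join(vocab[i] for i in ids).decode("utf-8")
    else:
        raise ValueError("vocab_type must be 'str' or 'byte'")
    assert result == tokenizer.decode(ids)
    return result


print(roundtrip("hello", StubTokenizer(), ["he", "llo"], "str"))   # hello
print(roundtrip("hello", StubTokenizer(), [b"he", b"llo"], "byte"))  # hello
```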
## assert_roundtrip_bytes(test_case, tokenizer, byte_vocab)

Assert that encoding and decoding a test case using a byte vocabulary matches the tokenizer's output.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `test_case` | `str` | String to test encoding/decoding roundtrip | *required* |
| `tokenizer` | | Hugging Face tokenizer instance | *required* |
| `byte_vocab` | `list` | List of byte representations of tokens | *required* |
Raises:

| Type | Description |
|---|---|
| `AssertionError` | If the roundtrip result doesn't match the tokenizer's direct decoding |
Source code in genlm_backend/tokenization/vocab.py
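A note on why the byte variant matters: individual byte tokens may split a multi-byte UTF-8 character, so only the concatenation of the encoded ids' bytes is guaranteed decodable. A toy example (the vocabulary and ids below are made up for illustration):

```python
# A toy byte vocabulary whose tokens split "café" mid-character.
byte_vocab = [b"caf", b"\xc3", b"\xa9"]  # b"\xc3\xa9" is the UTF-8 for "é"
ids = [0, 1, 2]

# Neither b"\xc3" nor b"\xa9" is valid UTF-8 on its own,
# but joining the raw bytes first decodes cleanly:
joined = b"".join(byte_vocab[i] for i in ids)
print(joined.decode("utf-8"))  # café
```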
## assert_roundtrip_strs(test_case, tokenizer, str_vocab)

Assert that encoding and decoding a test case using a string vocabulary matches the tokenizer's output.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `test_case` | `str` | String to test encoding/decoding roundtrip | *required* |
| `tokenizer` | | Hugging Face tokenizer instance | *required* |
| `str_vocab` | `list` | List of string representations of tokens | *required* |
Raises:

| Type | Description |
|---|---|
| `AssertionError` | If the roundtrip result doesn't match the tokenizer's direct decoding |
Source code in genlm_backend/tokenization/vocab.py
## bytes_to_strs(tokenizer, byte_vocab, byte2str_fallback)

Convert byte representations to UTF-8 strings.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `tokenizer` | | A Hugging Face tokenizer instance | *required* |
| `byte_vocab` | `list[bytes]` | List of byte representations of tokens | *required* |
| `byte2str_fallback` | `str` | Strategy for converting invalid UTF-8 bytes to strings: `'tokenizer'` uses the tokenizer's `convert_ids_to_tokens` (default); `'latin1'` decodes using the Latin-1 encoding; `'replace'` uses the Unicode replacement character `'�'` | *required* |
Returns:

| Type | Description |
|---|---|
| `list[str]` | List of string representations of tokens |
**Note:** May produce duplicate strings for different token IDs. A warning is issued if duplicates are found.
Source code in genlm_backend/tokenization/vocab.py
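The fallback strategies differ only in how they handle tokens whose bytes are not valid UTF-8. A stdlib-only sketch of the `'latin1'` and `'replace'` branches (the helper name is hypothetical, and the real function's `'tokenizer'` strategy is omitted because it needs a tokenizer instance):

```python
def byte_token_to_str(token: bytes, fallback: str) -> str:
    """Illustration-only sketch of the non-tokenizer fallback strategies."""
    try:
        return token.decode("utf-8")  # valid UTF-8 passes through unchanged
    except UnicodeDecodeError:
        if fallback == "latin1":
            return token.decode("latin1")  # every byte maps to some character
        if fallback == "replace":
            return token.decode("utf-8", errors="replace")  # lossy: U+FFFD
        raise ValueError(f"unknown fallback: {fallback}")


print(byte_token_to_str(b"hi", "replace"))    # hi
print(byte_token_to_str(b"\xc3", "latin1"))   # Ã
print(byte_token_to_str(b"\xc3", "replace"))  # �
```

The lossy `'replace'` branch is one source of the duplicate warning noted above: distinct invalid-UTF-8 tokens can all collapse to the same `'�'` string.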
## decode_vocab(tokenizer, byte2str_fallback='tokenizer')

Convert tokenizer vocabulary into byte and string representations.

**Warning:** The byte representation is the canonical form. The string representation is provided for convenience but may not decode properly for all tokens, especially those containing invalid UTF-8 sequences.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `tokenizer` | | A Hugging Face tokenizer instance | *required* |
| `byte2str_fallback` | `str` | Strategy for converting invalid UTF-8 bytes to strings: `'tokenizer'` uses the tokenizer's `convert_ids_to_tokens`; `'latin1'` decodes using the Latin-1 encoding; `'replace'` uses the Unicode replacement character `'�'` | `'tokenizer'` |
Returns:

| Type | Description |
|---|---|
| `tuple` | `(byte_vocab, str_vocab)` |
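The warning above, in runnable form: a string representation can irreversibly lose information that the byte representation preserves. The token split below is an illustrative example, not taken from any particular tokenizer:

```python
token_bytes = "é".encode("utf-8")  # b'\xc3\xa9'
first, second = token_bytes[:1], token_bytes[1:]  # split mid-character

# Converting each half to a string via 'replace' is lossy:
# both halves collapse to the replacement character U+FFFD.
lossy = (first.decode("utf-8", errors="replace")
         + second.decode("utf-8", errors="replace"))
assert lossy == "\ufffd\ufffd"

# The canonical byte forms, concatenated, still decode exactly.
assert (first + second).decode("utf-8") == "é"
```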