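Tokenization converts text into data that a model can process. What exactly happens inside the tokenization pipeline? These notes cover the main tokenization algorithms and the general tokenization flow.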

Tokenization algorithms

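word-based: splits the raw text into words and assigns each word a numeric representation. A basic version can be implemented with Python's split() function.
character-based: splits the text into characters rather than words.
Subword tokenization: relies on the principle that frequently used words should not be split into smaller subwords, while rare words should be decomposed into meaningful subwords. Common subword algorithms include:
Byte-level BPE: used in GPT-2
WordPiece: used in BERT
SentencePiece or Unigram: used in multilingual models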
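
To make the difference between the first two approaches concrete, here is a minimal sketch in plain Python. The sample sentence and the tiny vocabulary built from it are assumptions chosen only for illustration; real tokenizers use a fixed vocabulary learned from a large corpus.

```python
# Toy comparison of word-based and character-based tokenization.
text = "Jim Henson was a puppeteer"  # hypothetical sample sentence

# Word-based: split on whitespace.
word_tokens = text.split()

# Character-based: every character (spaces included) becomes a token.
char_tokens = list(text)

# Give each distinct word an integer ID (a tiny, made-up vocabulary).
word_vocab = {tok: i for i, tok in enumerate(sorted(set(word_tokens)))}
word_ids = [word_vocab[tok] for tok in word_tokens]

print(word_tokens)                          # 5 word tokens
print(len(word_tokens), len(char_tokens))   # far more character tokens
print(word_ids)
```

The word-based split yields few tokens but needs a very large vocabulary to cover a language, while the character-based split needs a tiny vocabulary but produces much longer sequences. Subword tokenization sits between the two.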

The general tokenization flow

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokenizer("Using a Transformer network is simple")

{'input_ids': [101, 7993, 170, 11303, 1200, 2443, 1110, 3014, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

1. Encoding: converting text to numbers

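Encoding takes two steps. First, the text is split into words, parts of words, punctuation marks, and so on, which are called tokens. Second, the tokens are converted into numbers.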

Step 1. Tokenization

Call the tokenize method to split the text into tokens:

sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)
['Using', 'a', 'transform', '##er', 'network', 'is', 'simple']

Step 2. From tokens to input IDs

Call the convert_tokens_to_ids method to convert the tokens into numbers:

ids = tokenizer.convert_tokens_to_ids(tokens)
[7993, 170, 11303, 1200, 2443, 1110, 3014]
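
The two encoding steps can be sketched without the library. The following is a simplified greedy longest-match-first splitter in the spirit of WordPiece, not the library's actual implementation. The vocabulary is a toy one whose tokens and IDs are copied from the outputs above, the `[UNK]` entry is an assumption added for illustration, and the function names `toy_tokenize` and `toy_convert_tokens_to_ids` are hypothetical. The input uses a lowercase "transformer" because this sketch does no text normalization.

```python
# Toy vocabulary: tokens and IDs copied from the outputs above (illustration only).
vocab = {
    "[UNK]": 100,  # assumed entry for unknown words
    "Using": 7993, "a": 170, "transform": 11303, "##er": 1200,
    "network": 2443, "is": 1110, "simple": 3014,
}

def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedily split one word into the longest subwords found in vocab."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word continuation
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no subword matches: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

def toy_tokenize(text, vocab):
    """Step 1: text -> tokens."""
    tokens = []
    for word in text.split():
        tokens.extend(wordpiece_split(word, vocab))
    return tokens

def toy_convert_tokens_to_ids(tokens, vocab):
    """Step 2: tokens -> input IDs via a vocabulary lookup."""
    return [vocab[token] for token in tokens]

tokens = toy_tokenize("Using a transformer network is simple", vocab)
ids = toy_convert_tokens_to_ids(tokens, vocab)
print(tokens)
print(ids)
```

With this toy vocabulary, "transformer" is not a known word, so it is split into "transform" and "##er", and step 2 is nothing more than a dictionary lookup.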

2. Decoding

Use the decode method, which converts input IDs back into text.

decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded_string)
'Using a Transformer network is simple'

The decode method does more than convert the IDs back to tokens: it also merges tokens that belong to the same word to produce a readable sentence.
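
The merging step can be illustrated with a small sketch. It assumes the `##` prefix marks a token that continues the previous word, as in the token list shown earlier; the lookup table reuses the tokens and IDs from the outputs above, and `toy_decode` is a hypothetical helper, not the library's decode method.

```python
# Reverse lookup table built from the tokens and IDs shown earlier.
id_to_token = {
    7993: "Using", 170: "a", 11303: "transform", 1200: "##er",
    2443: "network", 1110: "is", 3014: "simple",
}

def toy_decode(ids, id_to_token):
    """Map IDs back to tokens, gluing '##' continuations onto the previous word."""
    words = []
    for token_id in ids:
        token = id_to_token[token_id]
        if token.startswith("##") and words:
            words[-1] += token[2:]  # continuation: append without a space
        else:
            words.append(token)     # start of a new word
    return " ".join(words)

decoded = toy_decode([7993, 170, 11303, 1200, 2443, 1110, 3014], id_to_token)
print(decoded)
```

Because the toy table stores the lowercase token "transform", the sketch reconstructs "transformer" in lowercase; the point is only that "transform" and "##er" are joined back into one word.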

Batched inputs and the attention mask

Transformers models expect their input as a batch of sequences.

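The key feature of Transformer models is their attention layers, which give each token its context. As a result, the same sentence produces different outputs with and without padding. We need to tell the attention layers to ignore the padding tokens, and this is done with an attention mask.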

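import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumption: the notes never define `checkpoint`. Any sequence-classification
# checkpoint works here; this one is only an example.
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
# The shorter sequence is padded so that both rows have the same length.
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
print(model(torch.tensor(batched_ids)).logits)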


tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward>)  
tensor([[ 1.5694, -1.3895],
        [ 1.3373, -1.2163]], grad_fn=<AddmmBackward>)

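Notice that the second sentence gets different outputs in the two cases, because one input has padding and the other does not: attention takes every token into account, padding included. An attention mask tells the model to ignore the padding: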

batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]
# The attention mask: 1 = attend to this token, 0 = ignore it (padding)
attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)

tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward>)

With the attention mask applied, the batched output for the second sentence matches its unbatched output.
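
Why the mask restores the original result can be seen in a stripped-down sketch of the softmax step inside attention. The scores below are made-up numbers, and a real model applies this per head and per layer on tensors, so treat this as an illustration of the idea rather than the library's implementation.

```python
import math

def softmax(scores, mask=None):
    """Softmax over scores; positions where mask == 0 get zero weight."""
    if mask is not None:
        scores = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical attention scores of one token over the sequence.
scores_two = [2.0, 1.0]          # two real tokens, no padding
scores_padded = [2.0, 1.0, 0.5]  # same tokens plus one padding position

no_pad = softmax(scores_two)
unmasked = softmax(scores_padded)                # padding steals weight
masked = softmax(scores_padded, mask=[1, 1, 0])  # padding is ignored

print(no_pad)
print(unmasked)
print(masked)
```

Without the mask, the padding position receives a share of the attention weight, which shifts the weights of the real tokens. With the mask, the padding weight is exactly zero and the remaining weights equal those of the unpadded sequence.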

Longer sequences

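Most Transformer models handle sequences of up to 512 or 1024 tokens and will fail when asked to process longer ones. So either choose a model that supports longer sequences, or truncate the sequences that are too long.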

Supported sequence lengths vary by model, and some models specialize in very long sequences, for example Longformer and LED.

For the second option, truncate your sequences to a chosen max_sequence_length to avoid the failure: sequence = sequence[:max_sequence_length]
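
A minimal sketch of that slicing, wrapped in a hypothetical helper named `truncate`. The limit of 512 and the 600 dummy IDs are assumptions for illustration.

```python
def truncate(ids, max_sequence_length):
    """Keep at most max_sequence_length leading IDs."""
    if max_sequence_length < 0:
        # A negative bound would silently drop items from the end instead.
        raise ValueError("max_sequence_length must be non-negative")
    return ids[:max_sequence_length]

long_ids = list(range(600))      # stand-in for an over-long sequence of IDs
short_ids = truncate(long_ids, 512)
print(len(short_ids))

unchanged = truncate([7993, 170], 512)  # shorter inputs pass through as-is
print(unchanged)
```

Slicing drops everything after the limit, so information at the end of the text is lost; that is the trade-off against switching to a long-sequence model.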

More flexible tokenizer usage in the Transformers library

# The input can be a single sentence
sequence = "I've been waiting for a HuggingFace course my whole life."
model_inputs = tokenizer(sequence)

# The input can also be a batch of sentences
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]
model_inputs = tokenizer(sequences)

Padding strategies

# Will pad the sequences up to the length of the longest sequence in the batch
model_inputs = tokenizer(sequences, padding="longest")

# Will pad the sequences up to the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")

# Will pad the sequences up to the specified max length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)
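
What these padding options amount to can be sketched in plain Python. `pad_batch` is a hypothetical helper, the pad ID of 0 is an assumption, and the sketch mirrors only the two strategies shown above; it is not the library's implementation.

```python
def pad_batch(batch, pad_id=0, padding="longest", max_length=None):
    """Pad every sequence to a common length and build the attention mask."""
    if padding == "longest":
        target = max(len(seq) for seq in batch)
    elif padding == "max_length":
        if max_length is None:
            raise ValueError("max_length is required for padding='max_length'")
        target = max_length
    else:
        raise ValueError(f"unknown padding strategy: {padding}")

    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = max(target - len(seq), 0)  # padding never shortens a sequence
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = [[200, 200, 200], [200, 200]]
longest = pad_batch(batch)                                     # pad to length 3
fixed = pad_batch(batch, padding="max_length", max_length=5)   # pad to length 5
print(longest)
print(fixed)
```

Padding to the longest sequence keeps batches as small as possible, while padding to a fixed length gives every batch the same shape at the cost of extra padding tokens.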

Truncation parameters

sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]
# Will truncate the sequences that are longer than the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, truncation=True)

# Will truncate the sequences that are longer than the specified max length
model_inputs = tokenizer(sequences, max_length=8, truncation=True)

The tokenizer object can also convert its output to framework-specific tensors: "pt" returns PyTorch tensors and "np" returns NumPy arrays:

sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Returns PyTorch tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")

# Returns NumPy arrays
model_inputs = tokenizer(sequences, padding=True, return_tensors="np")

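Note: the tokenizer added the special token [CLS] at the beginning of the sentence and the special token [SEP] at the end. The model was pretrained with these tokens, so to get the same results at inference time we need to add them as well.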
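
The effect is visible when comparing the two ID lists shown earlier: the output of convert_tokens_to_ids has no special tokens, while the output of calling the tokenizer directly starts with 101 and ends with 102. The sketch below reproduces that difference by hand. The IDs 101 and 102 are read off the `input_ids` output above, and `add_special_tokens` is a hypothetical helper name used only for this illustration.

```python
CLS_ID, SEP_ID = 101, 102  # read off the input_ids output shown earlier

def add_special_tokens(ids):
    """Wrap a list of token IDs with the [CLS] and [SEP] IDs."""
    return [CLS_ID] + list(ids) + [SEP_ID]

ids = [7993, 170, 11303, 1200, 2443, 1110, 3014]  # from convert_tokens_to_ids
with_special = add_special_tokens(ids)
print(with_special)
```

The result has two more IDs than the plain list, which is why encoding by hand with tokenize and convert_tokens_to_ids does not give the same input as calling the tokenizer directly.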

Note

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("gpt2")

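These two objects cannot be used together because they come from different pretrained models. The tokenizer and the model should always come from the same checkpoint.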

