2019-08-03

Text Classification (2) - Text Preprocessing II - Word Encoding

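Following on from Text Preprocessing (1), this note covers how to encode the word-segmented records as numeric sequences a computer can work with, for example turning [你好呀，参加比赛了吗] ("Hello, did you enter the competition?") into [91, 57, 1, 31, 14, 6, 5]. The category labels are encoded as well.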

Word Encoding

Here is another look at the vocab_file produced in the previous note:

1 <UNK>   10000000
2 ，  1871208
3 的  1390830
4 。  822140
5 在  303879
6 、  258508
7 了  248160
8 是  240938
...   
508 同样    5874
509 正式    5868
510 故事    5867
511 13  5855
512 建筑    5854
513 代表    5850
514 主持人  5843
515 水平    5833
...
359234 各偏    1
359235 1.8782  1
359236 1.0307  1
359237 0.763   1
359238 87.82%  1
359239 0.5376  1

Each word corresponds one-to-one to a number, so a word can be encoded by its number. That is the main task below.

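1. One-to-one mapping

First, read vocab_file to get each word and its frequency, assign each kept word an id, and store {word: id} entries in a dictionary. An integer threshold is also specified: a word whose frequency is below threshold is dropped, because words that occur too rarely carry little statistical signal. This is done for every line of the file, as follows: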

def _read_dict(self, filename):
    """ read filename and generate {word: id} dict """
    with open(filename, 'r') as f:
        lines = f.readlines()
    for line in lines:    # each line holds a word and its frequency
        word, frequency = line.strip('\r\n').split('\t')
        frequency = int(frequency)
        if frequency < self._num_word_threshold:
            continue      # skip words rarer than the threshold
        idx = len(self._word_to_id)  # idx grows with the size of _word_to_id
        if word == '<UNK>':   # remember the id of the special <UNK> token
            self._unk = idx
        self._word_to_id[word] = idx  # build the dict: {word: idx}

The resulting dict:

{'<UNK>': 0, '，': 1, '的': 2, '。': 3, '在': 4, '、': 5, '了': 6, '是': 7,...,'铭记': 20350, '多时': 20351, '轩然大波': 20352,...,'孵化器': 39545, '党史': 39546, '纸飞机': 39547,...}
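The same idea can be tried without the surrounding class. The following is a minimal sketch, not the author's code: build_word_to_id and toy_lines are hypothetical names, and the input is assumed to be lines of the form word<TAB>frequency, as _read_dict expects.

```python
def build_word_to_id(lines, num_word_threshold):
    """Build a {word: id} dict from 'word<TAB>frequency' lines."""
    word_to_id = {}
    for line in lines:
        word, frequency = line.rstrip('\r\n').split('\t')
        if int(frequency) < num_word_threshold:
            continue  # drop words rarer than the threshold
        word_to_id[word] = len(word_to_id)  # ids grow with the dict size
    return word_to_id


# Four entries copied from the vocab_file listing above (hypothetical toy input).
toy_lines = ["<UNK>\t10000000\n", "，\t1871208\n", "的\t1390830\n", "各偏\t1\n"]
toy_vocab = build_word_to_id(toy_lines, num_word_threshold=10)
print(toy_vocab)  # {'<UNK>': 0, '，': 1, '的': 2}
```

The word 各偏 has frequency 1, which is below the threshold of 10, so it receives no id.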

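2. Encoding

In the second step, for a cleaned sample such as [你好呀，参加比赛了吗], look up each word as a key in the dict from the previous step and take its value, producing a sequence such as [91, 57, 1, 31, 14, 6, 5]. The implementation: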

def sentence_to_id(self, sentence):
    """ encode a sentence as the idx of each of its words """
    # split the segmented sentence on whitespace and look up each word's idx
    word_ids = [self.word_to_id(cur_word) for cur_word in sentence.split()]
    return word_ids

Now test it on one sample:

test_word = '你好呀，参加比赛了吗'

The actual result returned:

[9901, 5667, 1, 381, 124, 6, 445]
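The word_to_id method called inside sentence_to_id is not shown in this note. The sketch below assumes that it falls back to the id of <UNK> for words missing from the dict, and that the sentence has already been segmented into whitespace-separated words (which is what sentence.split() relies on). The names sentence_to_ids and toy_vocab are hypothetical.

```python
def sentence_to_ids(sentence, word_to_id, unk_id=0):
    """Encode a whitespace-segmented sentence as a list of word ids."""
    # Assumption: words missing from the dict map to the <UNK> id.
    return [word_to_id.get(word, unk_id) for word in sentence.split()]


# Made-up toy vocabulary; the ids do not match the real vocab_file.
toy_vocab = {'<UNK>': 0, '，': 1, '参加': 2, '比赛': 3}
ids = sentence_to_ids('你好 ， 参加 比赛', toy_vocab)
print(ids)  # [0, 1, 2, 3]
```

Here 你好 is not in the toy vocabulary, so it is encoded as 0, the id of <UNK>.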

Labels are handled the same way, for example:

id of 科技 (technology): 8
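The label dictionary itself is not listed in this note. As a rough sketch of what "the same way" could look like, the hypothetical helper below assigns consecutive ids to category names in order of first appearance; the real code may order or number the categories differently.

```python
def build_category_to_id(categories):
    """Assign consecutive ids to category names in order of first appearance."""
    category_to_id = {}
    for name in categories:
        if name not in category_to_id:
            category_to_id[name] = len(category_to_id)
    return category_to_id


# Made-up label column, for illustration only.
toy_categories = ['体育', '科技', '体育']
toy_category_to_id = build_category_to_id(toy_categories)
print(toy_category_to_id)  # {'体育': 0, '科技': 1}
```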

The complete code for the steps above is here.

At this point every sample has been encoded as numbers. The next step is to feed the data to the model, batch by batch.

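3. Generating batches

First, read the cleaned sample file. For each record, encode the category and the content as described above and append them to two separate lists, one record at a time. The result is a content list and a category list.

The implementation is as follows: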

with open(filename, 'r') as f:
    lines = f.readlines()
for line in lines:    # for each line,
    label, content = line.strip('\r\n').split('\t')
    # convert label and content to sequence of ids
    id_label = self._catego_dict.category_to_id(label)
    id_words = self._vocab_dict.sentence_to_id(content)
    id_words = id_words[0: self._encoded_length]           # cut
    padding_length = self._encoded_length - len(id_words)  # pad
    id_words = id_words + [self._vocab_dict.unk for _ in range(padding_length)]

    self._input.append(id_words)
    self._output.append(id_label)
# convert to numpy array
self._input = np.asarray(self._input, dtype=np.int32)
self._output = np.asarray(self._output, dtype=np.int32)

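self._input stores each encoded content record, and self._output stores the corresponding encoded category. Note that the encoded records differ in length, some long and some short, so self._encoded_length in the code sets how many words are kept per record. Longer records are cut, and shorter ones are padded; the code above pads with self._vocab_dict.unk, the id of <UNK>.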
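The cut and pad lines in the loop above can be checked in isolation. This is a minimal sketch with a hypothetical helper name, cut_or_pad; pad_id is left as a parameter, and the source code passes the <UNK> id in that position.

```python
def cut_or_pad(ids, encoded_length, pad_id):
    """Return ids truncated or right-padded to exactly encoded_length items."""
    ids = ids[:encoded_length]                       # cut long records
    padding_length = encoded_length - len(ids)       # 0 if already long enough
    return ids + [pad_id] * padding_length           # pad short records


print(cut_or_pad([9, 8, 7, 6, 5], 3, pad_id=0))  # [9, 8, 7]
print(cut_or_pad([9, 8], 3, pad_id=0))           # [9, 8, 0]
```

Because every record ends up with the same length, the list of records can be converted to a rectangular NumPy array.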

In addition, self._input and self._output are randomly shuffled:

p = np.random.permutation(len(self._input))
self._input = self._input[p]
self._output = self._output[p]

This makes the data distribution in each batch as consistent as possible, and as representative as possible of the whole dataset.
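One detail worth checking is that indexing both arrays with the same permutation keeps every record next to its own label. The sketch below uses made-up toy arrays in which the first column of each row encodes its label, so the pairing can be verified after shuffling.

```python
import numpy as np

# Toy data: row i starts with 10 * label_i, so the pairing is easy to check.
inputs = np.array([[10, 11], [20, 21], [30, 31], [40, 41]], dtype=np.int32)
labels = np.array([1, 2, 3, 4], dtype=np.int32)

p = np.random.permutation(len(inputs))   # one permutation shared by both arrays
shuffled_inputs = inputs[p]
shuffled_labels = labels[p]

# Each row still sits next to its own label, whatever order p produced.
still_aligned = bool((shuffled_inputs[:, 0] // 10 == shuffled_labels).all())
print(still_aligned)  # True
```

Shuffling the two arrays with two separate permutations would break this pairing.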

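The second step is generating the batches. The process is shown in the figure below:

[Figure: after shuffling, continue taking the next batch]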

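When the data left at the end (marked #* in the figure) is not enough for a full batch, all the data is reshuffled so that another full batch can be taken. That batch will contain some samples that have already been used, which is not a problem. The code: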

def next_batch(self, batch_size):
    """
    get next batch data
    :param batch_size: number of samples per batch
    :return: the next batch of input and output
    """
    end_indicator = self._indicator + batch_size
    if end_indicator > len(self._input):
        # not enough samples left for a full batch: reshuffle and start over
        self._random_shuffle()
        self._indicator = 0
        end_indicator = batch_size
    if end_indicator > len(self._input):
        raise Exception("batch size : %d is too large" % batch_size)

    batch_input = self._input[self._indicator: end_indicator]
    batch_output = self._output[self._indicator: end_indicator]
    self._indicator = end_indicator
    # return what we require
    return batch_input, batch_output
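To see the wrap-around behaviour in action, here is a small self-contained sketch. ToyBatcher is a hypothetical stand-in for the dataset class that owns next_batch; it follows the same logic as the method above but is not the author's class.

```python
import numpy as np


class ToyBatcher:
    """Hypothetical stand-in for the dataset class that owns next_batch."""

    def __init__(self, inputs, outputs):
        self._input = np.asarray(inputs, dtype=np.int32)
        self._output = np.asarray(outputs, dtype=np.int32)
        self._indicator = 0

    def _random_shuffle(self):
        p = np.random.permutation(len(self._input))
        self._input, self._output = self._input[p], self._output[p]

    def next_batch(self, batch_size):
        if batch_size > len(self._input):
            raise ValueError("batch size %d is too large" % batch_size)
        end_indicator = self._indicator + batch_size
        if end_indicator > len(self._input):
            # Not enough samples left: reshuffle and restart from the top.
            self._random_shuffle()
            self._indicator, end_indicator = 0, batch_size
        batch = (self._input[self._indicator:end_indicator],
                 self._output[self._indicator:end_indicator])
        self._indicator = end_indicator
        return batch


# Five toy samples whose single feature equals their label.
toy = ToyBatcher([[1], [2], [3], [4], [5]], [1, 2, 3, 4, 5])
batch_sizes = [len(toy.next_batch(2)[0]) for _ in range(4)]
print(batch_sizes)  # [2, 2, 2, 2]
```

With five samples and a batch size of 2, the third call finds only one sample left, so the data is reshuffled and the batch is taken from the start again; every batch therefore has exactly two rows.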

Test: with _encoded_length set to 50 and batch_size set to 2, the output might look like this:

(array([[ 5639,  5529, 28692, 14277,   108,     0,   825,    87,  7763,
        22153, 17930,    17,   250,    16,   156,   481,   456,    45,
            6,   102,    45,     1,  5639, 20799,    30, 39057,   949,
         3640, 22153,    92,    15,  5639,    30, 20562, 21187,    14,
         3193, 18589,    30,  5529,    17, 24157,     0,    16,   831,
         3810,     4,  4406,  2849,  3092],
       [24102, 11066,  1375,    52,  7379,   224,  1027,   956,  2962,
        17510,    17,   250,    16,     4,   138, 13230,     2, 10477,
         1375,    21,     1,    13,   119,   406,  3380,    41,     1,
        13230,    17,  1647,    16, 32614,     2,     0,   147,   216,
            6,  5447,     5,     0,   136,     5,  8616,     5, 41351,
        25425,   136,     5, 12401,    45]], dtype=int32), array([1, 4], dtype=int32))

There are two records: the first has category 1 and the other has category 4.