2019-08-04

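Text Classification (3): Building the Model I, Built-in LSTM

This note records how to build a text classification model with TensorFlow's built-in LSTM. The data comes from Text Preprocessing (2): Word Encoding.

Hyperparameters

First define the hyperparameters used by the model. They are managed with tf.contrib.training.HParams():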

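import tensorflow as tf  # TensorFlow 1.x; tf.contrib was removed in 2.x

def hyper_param():
    return tf.contrib.training.HParams(
        num_embedding_size=16,    # length of each word's embedding vector
        encoded_length=50,        # length each encoded sample is truncated or padded to
        num_word_threshold=20,    # words with frequency <= this value are ignored
        num_lstm_nodes=[32, 32],  # number of units in each LSTM layer
        num_lstm_layers=2,        # number of LSTM layers
        num_fc_nodes=32,          # number of units in the fully connected layer
        batch_size=100,           # number of samples per batch
        learning_rate=0.001,      # learning rate
        clip_lstm_grads=1.0,      # gradient clipping threshold, guards against exploding gradients
    )

Every parameter has a default value, and the comments explain what each one means. Two of them deserve a closer look:

num_embedding_size: every word is represented by a vector, and this value is the length of that vector. The vector itself is learned.
clip_lstm_grads: the upper bound on the gradient norm. The gradients are clipped with tf.clip_by_global_norm() later in this note: when the global norm of all gradients exceeds this bound, every gradient is scaled down by the same factor so that the global norm equals the bound. This prevents exploding gradients.

Calling the function returns an object, and each parameter is accessed as object_name.parameter_name:

hps = hyper_param()
encoded_length = hps.encoded_length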
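tf.contrib only exists in TensorFlow 1.x. For readers who do not have it installed, the attribute-style access shown above can be reproduced with the standard library. This is a minimal sketch, assuming types.SimpleNamespace is an acceptable stand-in (it gives attribute access only, none of the parsing or serialization helpers of HParams); make_hparams is a hypothetical name, not part of the original code:

```python
from types import SimpleNamespace


def make_hparams():
    """Hypothetical stand-in for tf.contrib.training.HParams (attribute access only)."""
    return SimpleNamespace(
        num_embedding_size=16,
        encoded_length=50,
        num_word_threshold=20,
        num_lstm_nodes=[32, 32],
        num_lstm_layers=2,
        num_fc_nodes=32,
        batch_size=100,
        learning_rate=0.001,
        clip_lstm_grads=1.0,
    )


hps_demo = make_hparams()
# One entry in num_lstm_nodes is expected per LSTM layer.
assert len(hps_demo.num_lstm_nodes) == hps_demo.num_lstm_layers
print(hps_demo.encoded_length, hps_demo.num_lstm_nodes)
```

The consistency check between num_lstm_nodes and num_lstm_layers is worth keeping in any variant, since the two values are specified separately and can drift apart.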

Defining the computation graph

First define the inputs:

inputs = tf.placeholder(tf.int32, (batch_size, encoded_length))
outputs = tf.placeholder(tf.int32, (batch_size, ))

Define the dropout keep probability:

keep_prob = tf.placeholder(tf.float32, name='keep_prob')

Keep track of which step training has reached:

global_step = tf.Variable(tf.zeros([], tf.int64), name='global_step', trainable=False)

1. Embedding layer

Initialize with a uniform distribution:

embedding_init = tf.random_uniform_initializer(-1.0, 1.0)

Define the embedding:

embedding = tf.get_variable(
    'embedding',
    [vocab_size, hps.num_embedding_size],   # shape of the embedding matrix
    tf.float32
)

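Notes:

get_variable() creates the variable if it does not exist yet. When the enclosing variable_scope has reuse enabled (for example reuse=tf.AUTO_REUSE), it returns the existing variable instead.
Embedding matrix [vocab_size, hps.num_embedding_size]: the number of words in the vocabulary by the length of the vector that represents each word.

Next, look up the vector of every word of every input record in the embedding matrix. For example, if the current word id is 12, take the row at index 12 out of the embedding matrix. Do this for every word in a record:

[2, 34, 5, 67]->[[234,565,1,45,57,73],
                 [12,76,23,54,123,48],
                 [43,87,239,57,13,14],
                 [98,64,421,13,63,36]]

Here each word id is represented by a vector of length 6 (the integers are only illustrative; real embedding values are floats). This can be seen as a further encoding of each record, and this encoding is learned during training.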
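The lookup described above is just row selection. A minimal sketch in plain Python, with no TensorFlow required; the toy sizes and the helper name embedding_lookup are assumptions made for illustration:

```python
import random


def embedding_lookup(embedding, ids):
    """Return the embedding row for every word id, in order."""
    return [embedding[i] for i in ids]


random.seed(0)
vocab_size, embedding_size = 100, 6   # toy sizes, chosen for illustration only
# Uniform initialization in [-1, 1], mirroring random_uniform_initializer(-1.0, 1.0).
embedding = [[random.uniform(-1.0, 1.0) for _ in range(embedding_size)]
             for _ in range(vocab_size)]

record = [2, 34, 5, 67]               # one encoded record of four word ids
embedded = embedding_lookup(embedding, record)

print(len(embedded), len(embedded[0]))  # 4 6
```

A record of 4 ids becomes a 4 x 6 matrix. With a batch dimension in front, this is the [batch_size, encoded_length, num_embedding_size] tensor that the LSTM layer consumes.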

The complete embedding layer:

embedding_init = tf.random_uniform_initializer(-1.0, 1.0)
with tf.variable_scope('embedding', initializer=embedding_init):
    embedding = tf.get_variable(
        'embedding',
        [vocab_size, hps.num_embedding_size],   # shape of the embedding matrix
        tf.float32
    )
    # look up the embedding vector of every word id in inputs
    # embedded_inputs: [batch_size, encoded_length, num_embedding_size]
    embedded_inputs = tf.nn.embedding_lookup(embedding, inputs)

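2. LSTM layer

Define the initializer:

import math
scale = 1.0 / math.sqrt(hps.num_embedding_size + hps.num_lstm_nodes[-1]) / 3.0
lstm_init = tf.random_uniform_initializer(-scale, scale)

This is a Xavier-style (Glorot) initialization: the range shrinks with sqrt(input size + output size), although the constant differs from the standard Glorot uniform bound sqrt(6 / (fan_in + fan_out)).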
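To get a feel for the numbers, the initializer range can be evaluated for the hyperparameters used in this note and compared with the standard Glorot uniform bound. A small sketch; the variable names are chosen here for illustration:

```python
import math

num_embedding_size = 16   # input size of the first LSTM layer
num_lstm_nodes = [32, 32]

fan_sum = num_embedding_size + num_lstm_nodes[-1]
scale = 1.0 / math.sqrt(fan_sum) / 3.0
glorot_limit = math.sqrt(6.0 / fan_sum)   # standard Glorot/Xavier uniform bound

print(round(scale, 4), round(glorot_limit, 4))  # 0.0481 0.3536
```

The range used in this note is considerably narrower than the standard bound, so the LSTM weights start out closer to zero.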

Define the two LSTM layers:

cells = []
for i in range(2):
    cell = tf.contrib.rnn.BasicLSTMCell(
        hps.num_lstm_nodes[i],
        state_is_tuple=True
    )
    cell = tf.contrib.rnn.DropoutWrapper(
        cell,
        output_keep_prob=keep_prob
    )
    cells.append(cell)

The list cells collects the layers. BasicLSTMCell creates one LSTM layer, and DropoutWrapper then applies dropout to its output. After the loop, cells holds two LSTM layers.

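Then MultiRNNCell stacks the two LSTM layers, so that the output of the first cell is the input of the second:

cell = tf.contrib.rnn.MultiRNNCell(cells)

The two-layer LSTM can now be handled as a single layer of the model.

Next, initialize the state of the LSTM cell:

initialize_state = cell.zero_state(batch_size, tf.float32)

dynamic_rnn can now feed the sequence input through the LSTM layer. It returns the outputs at every time step together with the final state:

rnn_outputs, _ = tf.nn.dynamic_rnn(cell, embedded_inputs, initial_state=initialize_state)

Here _ is the final state, which is not needed. rnn_outputs contains the outputs of all time steps. For a many-to-one problem we only need the last one:

last = rnn_outputs[:, -1, :]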
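The slice rnn_outputs[:, -1, :] keeps every sample and every feature but only the last time step. A plain-Python sketch of the same selection on nested lists; the helper name and the toy sizes are assumptions for illustration:

```python
def last_time_step(rnn_outputs):
    """Pure-Python equivalent of rnn_outputs[:, -1, :] for nested lists
    shaped [batch_size][time_steps][num_units]."""
    return [sample[-1] for sample in rnn_outputs]


batch_size, time_steps, num_units = 2, 3, 4   # toy sizes for illustration
# Fill each position with a recognisable value: 100*b + 10*t + u.
rnn_outputs = [[[100 * b + 10 * t + u for u in range(num_units)]
                for t in range(time_steps)]
               for b in range(batch_size)]

last = last_time_step(rnn_outputs)
print(last)  # [[20, 21, 22, 23], [120, 121, 122, 123]]
```

The result has shape [batch_size, num_units], which is what the fully connected layer expects.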

The complete LSTM layer:

scale = 1.0 / math.sqrt(hps.num_embedding_size + hps.num_lstm_nodes[-1]) / 3.0
lstm_init = tf.random_uniform_initializer(-scale, scale)
with tf.variable_scope('lstm', initializer=lstm_init):
    # collect the LSTM layers
    cells = []
    for i in range(hps.num_lstm_layers):
        cell = tf.contrib.rnn.BasicLSTMCell(
            hps.num_lstm_nodes[i],
            state_is_tuple=True
        )
        cell = tf.contrib.rnn.DropoutWrapper(
            cell,
            output_keep_prob=keep_prob
        )
        cells.append(cell)
    # stack the LSTM layers: the output of the first cell is the input of the second
    cell = tf.contrib.rnn.MultiRNNCell(cells)
    # zero initial state
    initialize_state = cell.zero_state(batch_size, tf.float32)
    # feed the embedded sentences through the stacked cell

    # rnn_outputs: [batch_size, encoded_length, num_lstm_nodes[-1]]
    rnn_outputs, _ = tf.nn.dynamic_rnn(
        cell, embedded_inputs, initial_state=initialize_state)
    # keep only the output of the last time step: [batch_size, num_lstm_nodes[-1]]
    last = rnn_outputs[:, -1, :]

3. Fully connected layers

Build a fully connected layer with dense(), using ReLU as the activation function:

fc1 = tf.layers.dense(last, hps.num_fc_nodes, activation=tf.nn.relu, name='fc1')

Then apply dropout:

fc1_dropout = tf.contrib.layers.dropout(fc1, keep_prob)

Finally add one more fully connected layer:

logits = tf.layers.dense(fc1_dropout, classes_size, name='fc2')

The complete fully connected part:

fc_init = tf.uniform_unit_scaling_initializer(factor=1.0)
with tf.variable_scope('fc', initializer=fc_init):
    fc1 = tf.layers.dense(last, hps.num_fc_nodes, activation=tf.nn.relu, name='fc1')
    fc1_dropout = tf.contrib.layers.dropout(fc1, keep_prob)
    logits = tf.layers.dense(fc1_dropout, classes_size, name='fc2')

4. Model output

First:

softmax_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=outputs)

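tf.nn.sparse_softmax_cross_entropy_with_logits() does three things:

Applies softmax to the logits to obtain a probability for each class.
Treats each integer label as the index of the true class, which is equivalent to one-hot encoding the label.
Computes the cross-entropy per sample, i.e. the negative log of the probability assigned to the true class.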
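What tf.nn.sparse_softmax_cross_entropy_with_logits() computes can be reproduced in a few lines of plain Python. This is a sketch of the mathematical definition (softmax, select the true class by its integer label, negative log), not the library's actual implementation; the function name and sample values are chosen here for illustration:

```python
import math


def sparse_softmax_cross_entropy(logits, labels):
    """Per-sample cross-entropy for integer class labels.

    logits: list of rows, one row of raw scores per sample.
    labels: list of integer class indices, one per sample.
    """
    losses = []
    for row, label in zip(logits, labels):
        # Step 1: softmax, shifted by the row maximum for numerical stability.
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        probs = [e / total for e in exps]
        # Step 2: the integer label selects the probability of the true class.
        p_true = probs[label]
        # Step 3: cross-entropy is the negative log of that probability.
        losses.append(-math.log(p_true))
    return losses


logits = [[0.0, 0.0, 0.0], [5.0, 0.0, 0.0]]
labels = [1, 0]
losses = sparse_softmax_cross_entropy(logits, labels)
print(losses, sum(losses) / len(losses))
```

The first sample has uniform scores over three classes, so its loss is log(3). The second sample is confidently correct and its loss is close to zero. Averaging the per-sample values corresponds to the tf.reduce_mean() call used for loss.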

Next, compute the loss and the model's predictions:

loss = tf.reduce_mean(softmax_loss)
y_pred = tf.argmax(tf.nn.softmax(logits), 1, output_type=tf.int32)

Finally, use plain accuracy to measure model performance:

correct_pred = tf.equal(outputs, y_pred)
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
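The prediction and accuracy computation can be sketched in plain Python as follows; the helper names and sample values are assumptions for illustration:

```python
def argmax(row):
    """Index of the largest value in a list (first one on ties)."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy(logits, labels):
    """Fraction of samples whose highest-scoring class equals the label."""
    y_pred = [argmax(row) for row in logits]
    correct = [int(p == y) for p, y in zip(y_pred, labels)]
    return sum(correct) / len(correct)


logits = [[2.0, 0.1, -1.0], [0.3, 0.2, 0.9], [1.5, 1.6, 0.0], [0.0, 3.0, 1.0]]
labels = [0, 2, 0, 1]
print(accuracy(logits, labels))  # 0.75
```

The sketch takes the argmax of the raw logits. Softmax is monotonic, so the argmax of the logits and the argmax of the softmax probabilities are the same class; the tf.nn.softmax() call in y_pred does not change the prediction.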

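The difference between the following two:

tf.variable_scope: manages variables created with tf.get_variable(). It prefixes their names and can carry a default initializer and control variable reuse.
tf.name_scope: only groups ops under a name prefix. No initializer is involved, and tf.get_variable() ignores it.

The complete implementation of this part: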

with tf.name_scope('metrics'):
    softmax_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
        logits=logits, labels=outputs
    )
    loss = tf.reduce_mean(softmax_loss)
    y_pred = tf.argmax(tf.nn.softmax(logits), 1, output_type=tf.int32)
    correct_pred = tf.equal(outputs, y_pred)
    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

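5. Obtaining train_op

Because an upper bound was set for the gradients, we need to obtain the clipped gradients and apply them to all trainable variables. So the first step is to collect all trainable variables:

trainable_vars = tf.trainable_variables()

All trainable variables can be inspected:

for var in trainable_vars:
    print('variable name: %s' % (var.name))

Compute the gradients of the loss with respect to all trainable variables, then clip them:

grads, _ = tf.clip_by_global_norm(tf.gradients(loss, trainable_vars), hps.clip_lstm_grads)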
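tf.clip_by_global_norm() does not cap each gradient value separately. It rescales all gradients by one common factor so that their joint L2 norm does not exceed the bound. A plain-Python sketch of that rule, with gradients represented as flat lists of floats (a simplification made here for illustration):

```python
import math


def clip_by_global_norm(grads, clip_norm):
    """Scale all gradients by one common factor so their joint L2 norm is at most clip_norm.

    grads: list of gradients, each a flat list of floats.
    Returns (clipped_grads, global_norm).
    """
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # If global_norm <= clip_norm the factor is 1 and nothing changes.
    factor = clip_norm / max(global_norm, clip_norm)
    clipped = [[g * factor for g in grad] for grad in grads]
    return clipped, global_norm


grads = [[3.0, 0.0], [0.0, 4.0]]           # global norm = sqrt(9 + 16) = 5
clipped, norm = clip_by_global_norm(grads, clip_norm=1.0)
print(norm, clipped)
```

Because every gradient is multiplied by the same factor, the direction of the overall update is preserved and only its length shrinks.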

Choose the optimizer and apply the clipped gradients to all trainable variables. This gives the training op:

optimizer = tf.train.AdamOptimizer(hps.learning_rate)
train_op = optimizer.apply_gradients(zip(grads, trainable_vars), global_step=global_step)

Complete implementation:

with tf.name_scope('train_op'):
    trainable_vars = tf.trainable_variables()
    for var in trainable_vars:
        print('variable name: %s' % (var.name))
    grads, _ = tf.clip_by_global_norm(
        tf.gradients(loss, trainable_vars), hps.clip_lstm_grads
    )
    optimizer = tf.train.AdamOptimizer(hps.learning_rate)
    train_op = optimizer.apply_gradients(
        zip(grads, trainable_vars), global_step=global_step
    )

6. Return value

Finally, specify the return value of the function:

return ((inputs, outputs, keep_prob),  #  all placeholders
        (loss, accuracy),              # loss & accuracy
        (train_op, global_step))       # train_op & global_step

The computation graph is now complete.

Assume the graph definition above is wrapped in a function create_model(). A quick test:

from dataPreProcess import encodeWords
vocab_instance = encodeWords.VocabDict(vocab_file, hps.num_word_threshold)
catego_instance = encodeWords.CategoryDict(category_file)
placeholders, metrics, others = create_model(hps,
                                             vocab_instance.size(),
                                             catego_instance.size())

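encodeWords contains the two classes VocabDict and CategoryDict implemented in Text Preprocessing (2): Word Encoding. Calling .size() on each returns the number of words and the number of categories respectively.

Printing all trainable variables gives this console output: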

variable name: <tf.Variable 'embedding/embedding:0' shape=(50513, 16) dtype=float32_ref>
variable name: <tf.Variable 'lstm/rnn/multi_rnn_cell/cell_0/basic_lstm_cell/kernel:0' shape=(48, 128) dtype=float32_ref>
variable name: <tf.Variable 'lstm/rnn/multi_rnn_cell/cell_0/basic_lstm_cell/bias:0' shape=(128,) dtype=float32_ref>
variable name: <tf.Variable 'lstm/rnn/multi_rnn_cell/cell_1/basic_lstm_cell/kernel:0' shape=(64, 128) dtype=float32_ref>
variable name: <tf.Variable 'lstm/rnn/multi_rnn_cell/cell_1/basic_lstm_cell/bias:0' shape=(128,) dtype=float32_ref>
variable name: <tf.Variable 'fc/fc1/kernel:0' shape=(32, 32) dtype=float32_ref>
variable name: <tf.Variable 'fc/fc1/bias:0' shape=(32,) dtype=float32_ref>
variable name: <tf.Variable 'fc/fc2/kernel:0' shape=(32, 10) dtype=float32_ref>
variable name: <tf.Variable 'fc/fc2/bias:0' shape=(10,) dtype=float32_ref>

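Note that no computation has been executed yet; this only prints the trainable variables of the graph. The output shows three parts:

the embedding layer
the weights and biases of the two LSTM layers
the weights and biases of the two fully connected layers

The shape of every parameter is visible as well.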
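The LSTM kernel shapes in this output follow a simple rule: the input and the previous hidden state are concatenated (rows), and the four gate and candidate transforms are stored side by side (columns). A short sketch that reproduces the two kernel shapes and counts the parameters listed in the console output; the helper names are chosen here for illustration:

```python
def lstm_kernel_shape(input_size, num_units):
    """Kernel shape of one LSTM layer: (input + hidden state, 4 transforms * num_units)."""
    return (input_size + num_units, 4 * num_units)


def num_params(shape):
    """Number of scalar parameters in a tensor of the given shape."""
    n = 1
    for d in shape:
        n *= d
    return n


embedding_size = 16
num_lstm_nodes = [32, 32]

input_size = embedding_size
kernel_shapes = []
for num_units in num_lstm_nodes:
    kernel_shapes.append(lstm_kernel_shape(input_size, num_units))
    input_size = num_units   # the next layer consumes this layer's output
print(kernel_shapes)  # [(48, 128), (64, 128)]

# Shapes copied from the console output above.
shapes = [(50513, 16), (48, 128), (128,), (64, 128), (128,),
          (32, 32), (32,), (32, 10), (10,)]
print(sum(num_params(s) for s in shapes))  # 824186
```

Almost all of the parameters (808208 of 824186) sit in the embedding matrix, which is typical for a small model with a large vocabulary.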

Running the computation

First call create_model():

placeholders, metrics, others = create_model(hps,
                                             vocab_instance.size(),
                                             catego_instance.size())
inputs, outputs, keep_prob = placeholders
loss, accuracy = metrics
train_op, global_step = others

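Then create the op that initializes the whole network, set the keep_prob used during training, and specify the number of training steps: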

init_op = tf.global_variables_initializer()
train_keep_prob = 0.8
num_train_steps = 1000

Finally create a tf.Session to execute the graph, and run it:

with tf.Session() as sess:
    # init whole network
    sess.run(init_op)
    for i in range(num_train_steps):
        batch_inputs, batch_label = train_dataset.next_batch(hps.batch_size)
        # one training step; global_step is incremented each time train_op runs
        outputs_val = sess.run([loss, accuracy, train_op, global_step],
                               feed_dict={
                                   inputs: batch_inputs,
                                   outputs: batch_label,
                                   keep_prob: train_keep_prob
                               })
        # unpack the four fetched values (the train_op result is discarded)
        loss_val, accuracy_val, _, global_step_val = outputs_val

        # print every 200 steps
        if global_step_val % 200 == 0:
            print("step: %5d, loss: %3.3f, accuracy: %3.5f" %
                  (global_step_val, loss_val, accuracy_val)
                  )

One important step here is feeding values to the placeholders. So before running, create an EncodedDataset object from the training set; train_dataset.next_batch(hps.batch_size) can then be called:

train_dataset = createEncodedDataset.EncodedDataset(
    seg_train_file, vocab_instance, catego_instance, hps.encoded_length)
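EncodedDataset comes from the earlier preprocessing post and is not shown here. To make the feeding step concrete, the following is a purely hypothetical sketch of a dataset with a next_batch() method. ToyDataset, pad_or_truncate, the padding id 0 and the reshuffling policy are all assumptions for illustration, not the actual class:

```python
import random


def pad_or_truncate(ids, length, pad_id=0):
    """Cut a list of word ids to `length`, or right-pad it with pad_id (an assumed padding id)."""
    return ids[:length] + [pad_id] * max(0, length - len(ids))


class ToyDataset:
    """Hypothetical stand-in for EncodedDataset; not the class from the earlier post."""

    def __init__(self, samples, labels, encoded_length, seed=0):
        self._inputs = [pad_or_truncate(s, encoded_length) for s in samples]
        self._labels = list(labels)
        self._order = list(range(len(self._inputs)))
        self._rng = random.Random(seed)
        self._rng.shuffle(self._order)
        self._pos = 0

    def next_batch(self, batch_size):
        """Return (batch_inputs, batch_labels); reshuffle when the data runs out."""
        if batch_size > len(self._order):
            raise ValueError('batch_size exceeds dataset size')
        if self._pos + batch_size > len(self._order):
            self._rng.shuffle(self._order)
            self._pos = 0
        idx = self._order[self._pos:self._pos + batch_size]
        self._pos += batch_size
        return [self._inputs[i] for i in idx], [self._labels[i] for i in idx]


data = ToyDataset([[5, 8, 2], [7], [1, 2, 3, 4, 5, 6]], [0, 1, 2], encoded_length=4)
batch_inputs, batch_labels = data.next_batch(2)
print(batch_inputs, batch_labels)
```

Every row of batch_inputs has exactly encoded_length entries, which is what lets the batch be fed into the fixed-shape inputs placeholder.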

The results of running 1000 steps:

step:   200, loss: 1.732, accuracy: 0.25000
step:   400, loss: 1.720, accuracy: 0.32000
step:   600, loss: 1.537, accuracy: 0.39000
step:   800, loss: 1.279, accuracy: 0.53000
step:  1000, loss: 0.994, accuracy: 0.67000

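At the very least this shows that the model works. See here for the complete implementation.
This is only the first step; the model can be optimized further from here. This note only covers building a basic LSTM text classification model with TensorFlow's built-in LSTM module. Optimization and hyperparameter tuning are left for later.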

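One last point*

The parameter list contains num_lstm_nodes. What does it mean? The core function when building the LSTM is:

cell = tf.contrib.rnn.BasicLSTMCell(hps.num_lstm_nodes[i], state_is_tuple=True)

Pay attention!

According to the official documentation, the first argument, num_units, is the number of neurons inside the LSTM cell, i.e. the number of output neurons. The diagram of an LSTM node shows 5 main nonlinear transforms, and each of them corresponds to one neuron of an ordinary neural network. Even the XOR problem needs 3 neurons (logic gates), so a network that solves a complex problem needs far more than one neuron.

By the same reasoning, a single LSTM node with its 5 nonlinear transforms is certainly not enough for a complex problem. The diagram is only schematic and shows one node; in practice there are many. Seen from the Xavier-style initialization of the LSTM layer, sqrt(hps.num_embedding_size + hps.num_lstm_nodes[-1]) stands for sqrt(input size + output size), and hps.num_lstm_nodes[-1] is exactly the output size of this layer.

But wait, the diagram also shows several LSTM nodes! Those nodes are one and the same cell unrolled logically along the time axis; in space there is only one. What is discussed here is LSTM nodes along a different dimension, so there is no contradiction. The code in the next note makes this concrete.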
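The two dimensions can be told apart in a small plain-Python sketch: num_units fixes the length of the hidden and cell state vectors, while unrolling over time just applies the same weights again and again. This follows the standard LSTM equations; the gate ordering inside the kernel is an arbitrary choice for this sketch, the forget-gate bias offset used by some implementations is left out, and all names and toy sizes are assumptions:

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h, c, kernel, bias, num_units):
    """One time step of a standard LSTM cell.

    kernel has shape [len(x) + num_units][4 * num_units]; its columns hold the
    input gate, candidate, forget gate and output gate, num_units columns each.
    """
    xh = x + h   # concatenate the input and the previous hidden state
    z = [sum(xh[r] * kernel[r][col] for r in range(len(xh))) + bias[col]
         for col in range(4 * num_units)]
    i, j, f, o = (z[k * num_units:(k + 1) * num_units] for k in range(4))
    new_c = [c[u] * sigmoid(f[u]) + sigmoid(i[u]) * math.tanh(j[u])
             for u in range(num_units)]
    new_h = [math.tanh(new_c[u]) * sigmoid(o[u]) for u in range(num_units)]
    return new_h, new_c


input_size, num_units, time_steps = 3, 4, 5   # toy sizes for illustration
kernel = [[0.01 * (r + col) for col in range(4 * num_units)]
          for r in range(input_size + num_units)]
bias = [0.0] * (4 * num_units)

h, c = [0.0] * num_units, [0.0] * num_units
sequence = [[0.1 * t] * input_size for t in range(time_steps)]
for x in sequence:   # the same kernel and bias are applied at every time step
    h, c = lstm_step(x, h, c, kernel, bias, num_units)

print(len(h), len(c))  # 4 4
```

Changing time_steps does not change the number of weights, whereas changing num_units changes both the kernel shape and the size of the output vector.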

Understanding this is important!