Caffe - Data & Models - Model Output Log

After launching training:

~/caffe-master/build/tools/caffe train \
--solver=/media/junhui/DATA/caffe_workspace/my_linearReggresion/lr_solver.prototxt

The training process prints to the terminal, and the log lines are in glog format. Each line carries the current date and time, the thread ID, the source file name and line number, and the message itself; from these you can see which step the network is currently executing. Below we analyze the log of classifying MNIST with linear regression. The example is small, but every piece of the pipeline is present. Comments have been added where useful:

I0610 04:53:35.880447 13919 caffe.cpp:204] Using GPUs 0
I0610 04:53:35.903647 13919 caffe.cpp:209] GPU 0: GeForce GTX 1050
# Parse the solver hyper-parameter file lr_solver.prototxt and initialize the solver
I0610 04:53:36.096128 13919 solver.cpp:45] Initializing solver from parameters:
test_iter: 100
test_interval: 500
base_lr: 0.01
max_iter: 10000
lr_policy: "inv"
gamma: 0.0001
power: 0.75
momentum: 0.9
weight_decay: 0.0005
snapshot: 5000
snapshot_prefix: "my_lr"
solver_mode: GPU
device_id: 0
net: "/media/junhui/DATA/caffe_workspace/my_linearReggresion/mylr.prototxt"
train_state {
level: 0
stage: ""
}

# ~~~~~~~~~~~~~~~~~ Training-net construction starts ~~~~~~~~~~~~~~~~~
# Create the Net described by the prototxt "blueprint"
I0610 04:53:36.096751 13919 solver.cpp:102] Creating training net from net file: /media/junhui/DATA/caffe_workspace/my_linearReggresion/mylr.prototxt
# These lines report that the data layer and the accuracy layer, whose phase is TEST, will not be used in the TRAIN phase
I0610 04:53:36.097002 13919 net.cpp:296] The NetState phase (0) differed from the phase (1) specified by a rule in layer mnist
I0610 04:53:36.097012 13919 net.cpp:296] The NetState phase (0) differed from the phase (1) specified by a rule in layer accuracy
# Parse the network definition file mylr.prototxt and initialize the net used for TRAIN
I0610 04:53:36.097085 13919 net.cpp:53] Initializing net from parameters:
# The three layers below, stacked together, form the training net; with that in mind, the TEST net further down is easy to read
name: "lrNet"
state {
phase: TRAIN
level: 0
stage: ""
}
# 1. Data layer, producing two tops: LMDB -> "data" & "label"
layer {
name: "mnist"
type: "Data"
top: "data"
top: "label"
include {
phase: TRAIN
}
transform_param {
scale: 0.0039063
}
data_param {
source: "/media/junhui/DATA/caffe_workspace/my_linearReggresion/mnist_train_lmdb"
batch_size: 64
backend: LMDB
}
}
# 2. Fully connected (InnerProduct) layer: "data" -> "ip"
layer {
name: "ip"
type: "InnerProduct"
bottom: "data"
top: "ip"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
inner_product_param {
num_output: 10
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
# 3. Softmax loss layer: "ip" & "label" -> "loss"
layer {
name: "loss"
type: "SoftmaxWithLoss"
bottom: "ip"
bottom: "label"
top: "loss"
}
# Layer-by-layer setup begins: open the training-data LMDB
I0610 04:53:36.097178 13919 layer_factory.hpp:77] Creating layer mnist
I0610 04:53:36.097410 13919 db_lmdb.cpp:35] Opened lmdb /media/junhui/DATA/caffe_workspace/my_linearReggresion/mnist_train_lmdb
# 1. Create the data layer, which produces two blobs: "data" & "label"
I0610 04:53:36.097434 13919 net.cpp:86] Creating Layer mnist
I0610 04:53:36.097456 13919 net.cpp:382] mnist -> data
I0610 04:53:36.097476 13919 net.cpp:382] mnist -> label
# The data layer's output size is [64,1,28,28]
I0610 04:53:36.098232 13919 data_layer.cpp:45] output data size: 64,1,28,28
I0610 04:53:36.099445 13919 net.cpp:124] Setting up mnist
I0610 04:53:36.099474 13919 net.cpp:131] Top shape: 64 1 28 28 (50176)
I0610 04:53:36.099484 13919 net.cpp:131] Top shape: 64 (64)
# Memory accounting; this running total accumulates layer by layer during setup
I0610 04:53:36.099489 13919 net.cpp:139] Memory required for data: 200960
# 2. Create the ip (fully connected) layer
I0610 04:53:36.099498 13919 layer_factory.hpp:77] Creating layer ip
I0610 04:53:36.099509 13919 net.cpp:86] Creating Layer ip
# From "data" it generates "ip", this layer's output
I0610 04:53:36.099529 13919 net.cpp:408] ip <- data
I0610 04:53:36.099541 13919 net.cpp:382] ip -> ip
I0610 04:53:36.100448 13919 net.cpp:124] Setting up ip
I0610 04:53:36.100461 13919 net.cpp:131] Top shape: 64 10 (640)
I0610 04:53:36.100478 13919 net.cpp:139] Memory required for data: 203520
# 3. Create the final layer, which yields the loss
I0610 04:53:36.100493 13919 layer_factory.hpp:77] Creating layer loss
I0610 04:53:36.100505 13919 net.cpp:86] Creating Layer loss
# Its inputs are "ip" & "label"; its output is "loss"
I0610 04:53:36.100510 13919 net.cpp:408] loss <- ip
I0610 04:53:36.100517 13919 net.cpp:408] loss <- label
I0610 04:53:36.100523 13919 net.cpp:382] loss -> loss
I0610 04:53:36.100535 13919 layer_factory.hpp:77] Creating layer loss
I0610 04:53:36.643620 13919 net.cpp:124] Setting up loss
# The loss output is a scalar (shape (1)) with loss weight 1
I0610 04:53:36.643661 13919 net.cpp:131] Top shape: (1)
I0610 04:53:36.643664 13919 net.cpp:134] with loss weight 1
# Memory used so far: 203524 bytes, i.e. about 200 KB
I0610 04:53:36.643699 13919 net.cpp:139] Memory required for data: 203524
# Backward computation runs from back to front, and only for the layers that need it
I0610 04:53:36.643705 13919 net.cpp:200] loss needs backward computation.
I0610 04:53:36.643714 13919 net.cpp:200] ip needs backward computation.
I0610 04:53:36.643719 13919 net.cpp:202] mnist does not need backward computation.
# The TRAIN net outputs only the loss
I0610 04:53:36.643726 13919 net.cpp:244] This network produces output loss
I0610 04:53:36.643734 13919 net.cpp:257] Network initialization done.
# ~~~~~~~~~~~~~~~~~ Training-net construction ends ~~~~~~~~~~~~~~~~~

# ~~~~~~~~~~~~~~~~~ Test-net construction starts ~~~~~~~~~~~~~~~~~
I0610 04:53:36.644055 13919 solver.cpp:190] Creating test net (#0) specified by net file: /media/junhui/DATA/caffe_workspace/my_linearReggresion/mylr.prototxt
# This reports that the data layer whose phase is TRAIN will not be used in the TEST phase
I0610 04:53:36.644089 13919 net.cpp:296] The NetState phase (1) differed from the phase (0) specified by a rule in layer mnist
I0610 04:53:36.644132 13919 net.cpp:53] Initializing net from parameters:
# Likewise, the full structure of the TEST net follows
name: "lrNet"
state {
phase: TEST # used for testing
}
# Data layer
layer {
name: "mnist"
type: "Data"
top: "data"
top: "label"
include {
phase: TEST
}
transform_param {
scale: 0.0039063
}
data_param {
source: "/media/junhui/DATA/caffe_workspace/my_linearReggresion/mnist_test_lmdb"
batch_size: 100
backend: LMDB
}
}
# Fully connected layer
layer {
name: "ip"
type: "InnerProduct"
bottom: "data"
top: "ip"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
inner_product_param {
num_output: 10
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
# Compute accuracy
layer {
name: "accuracy"
type: "Accuracy"
bottom: "ip"
bottom: "label"
top: "accuracy"
include {
phase: TEST
}
}
# Compute loss
layer {
name: "loss"
type: "SoftmaxWithLoss"
bottom: "ip"
bottom: "label"
top: "loss"
}
I0610 04:53:36.644286 13919 layer_factory.hpp:77] Creating layer mnist
I0610 04:53:36.644841 13919 db_lmdb.cpp:35] Opened lmdb /media/junhui/DATA/caffe_workspace/my_linearReggresion/mnist_test_lmdb
I0610 04:53:36.644881 13919 net.cpp:86] Creating Layer mnist
I0610 04:53:36.644889 13919 net.cpp:382] mnist -> data
I0610 04:53:36.644898 13919 net.cpp:382] mnist -> label
I0610 04:53:36.645038 13919 data_layer.cpp:45] output data size: 100,1,28,28
I0610 04:53:36.646373 13919 net.cpp:124] Setting up mnist
I0610 04:53:36.646389 13919 net.cpp:131] Top shape: 100 1 28 28 (78400)
I0610 04:53:36.646394 13919 net.cpp:131] Top shape: 100 (100)
I0610 04:53:36.646397 13919 net.cpp:139] Memory required for data: 314000
# This layer, label_mnist_1_split, is inserted automatically by Caffe during parsing
I0610 04:53:36.646402 13919 layer_factory.hpp:77] Creating layer label_mnist_1_split
I0610 04:53:36.646409 13919 net.cpp:86] Creating Layer label_mnist_1_split
I0610 04:53:36.646426 13919 net.cpp:408] label_mnist_1_split <- label
# The label is split into two copies: one for the final accuracy, one for the final loss
I0610 04:53:36.646445 13919 net.cpp:382] label_mnist_1_split -> label_mnist_1_split_0
I0610 04:53:36.646454 13919 net.cpp:382] label_mnist_1_split -> label_mnist_1_split_1
I0610 04:53:36.646559 13919 net.cpp:124] Setting up label_mnist_1_split
I0610 04:53:36.646585 13919 net.cpp:131] Top shape: 100 (100)
I0610 04:53:36.646590 13919 net.cpp:131] Top shape: 100 (100)
I0610 04:53:36.646595 13919 net.cpp:139] Memory required for data: 314800
I0610 04:53:36.646598 13919 layer_factory.hpp:77] Creating layer ip
I0610 04:53:36.646606 13919 net.cpp:86] Creating Layer ip
I0610 04:53:36.646611 13919 net.cpp:408] ip <- data
I0610 04:53:36.646617 13919 net.cpp:382] ip -> ip
I0610 04:53:36.646811 13919 net.cpp:124] Setting up ip
I0610 04:53:36.646819 13919 net.cpp:131] Top shape: 100 10 (1000)
I0610 04:53:36.646824 13919 net.cpp:139] Memory required for data: 318800
# This layer, ip_ip_0_split, is likewise inserted automatically by Caffe
I0610 04:53:36.646834 13919 layer_factory.hpp:77] Creating layer ip_ip_0_split
I0610 04:53:36.646840 13919 net.cpp:86] Creating Layer ip_ip_0_split
I0610 04:53:36.646845 13919 net.cpp:408] ip_ip_0_split <- ip
# The ip output is split into two copies: one feeds accuracy, the other feeds loss
I0610 04:53:36.646852 13919 net.cpp:382] ip_ip_0_split -> ip_ip_0_split_0
I0610 04:53:36.646859 13919 net.cpp:382] ip_ip_0_split -> ip_ip_0_split_1
I0610 04:53:36.646891 13919 net.cpp:124] Setting up ip_ip_0_split
I0610 04:53:36.646898 13919 net.cpp:131] Top shape: 100 10 (1000)
I0610 04:53:36.646914 13919 net.cpp:131] Top shape: 100 10 (1000)
I0610 04:53:36.646919 13919 net.cpp:139] Memory required for data: 326800
# The _0 copies feed the accuracy layer
I0610 04:53:36.646940 13919 layer_factory.hpp:77] Creating layer accuracy
I0610 04:53:36.646947 13919 net.cpp:86] Creating Layer accuracy
I0610 04:53:36.646952 13919 net.cpp:408] accuracy <- ip_ip_0_split_0
I0610 04:53:36.646957 13919 net.cpp:408] accuracy <- label_mnist_1_split_0
I0610 04:53:36.646963 13919 net.cpp:382] accuracy -> accuracy
I0610 04:53:36.646972 13919 net.cpp:124] Setting up accuracy
I0610 04:53:36.646977 13919 net.cpp:131] Top shape: (1)
I0610 04:53:36.646981 13919 net.cpp:139] Memory required for data: 326804
# The _1 copies feed the loss layer
I0610 04:53:36.646986 13919 layer_factory.hpp:77] Creating layer loss
I0610 04:53:36.646992 13919 net.cpp:86] Creating Layer loss
I0610 04:53:36.646997 13919 net.cpp:408] loss <- ip_ip_0_split_1
I0610 04:53:36.647002 13919 net.cpp:408] loss <- label_mnist_1_split_1
I0610 04:53:36.647024 13919 net.cpp:382] loss -> loss
I0610 04:53:36.647034 13919 layer_factory.hpp:77] Creating layer loss
I0610 04:53:36.647753 13919 net.cpp:124] Setting up loss
I0610 04:53:36.647764 13919 net.cpp:131] Top shape: (1)
I0610 04:53:36.647770 13919 net.cpp:134] with loss weight 1
I0610 04:53:36.647779 13919 net.cpp:139] Memory required for data: 326808
# Report which layers need backward computation and which do not
I0610 04:53:36.647785 13919 net.cpp:200] loss needs backward computation.
I0610 04:53:36.647792 13919 net.cpp:202] accuracy does not need backward computation.
I0610 04:53:36.647799 13919 net.cpp:200] ip_ip_0_split needs backward computation.
I0610 04:53:36.647804 13919 net.cpp:200] ip needs backward computation.
I0610 04:53:36.647809 13919 net.cpp:202] label_mnist_1_split does not need backward computation.
I0610 04:53:36.647814 13919 net.cpp:202] mnist does not need backward computation.
# The TEST net outputs both accuracy and loss
I0610 04:53:36.647819 13919 net.cpp:244] This network produces output accuracy
I0610 04:53:36.647826 13919 net.cpp:244] This network produces output loss
I0610 04:53:36.647835 13919 net.cpp:257] Network initialization done.
I0610 04:53:36.647861 13919 solver.cpp:57] Solver scaffolding done.
# ~~~~~~~~~~~~~~~~~ Test-net construction ends ~~~~~~~~~~~~~~~~~

# ~~~~~~~~~~~~~~~~~ Training/testing begins ~~~~~~~~~~~~~~~~~
I0610 04:53:36.647935 13919 caffe.cpp:239] Starting Optimization
I0610 04:53:36.647941 13919 solver.cpp:289] Solving lrNet
I0610 04:53:36.647945 13919 solver.cpp:290] Learning Rate Policy: inv
# Iteration 0: run the test net and print its results
I0610 04:53:36.647997 13919 solver.cpp:347] Iteration 0, Testing net (#0)
I0610 04:53:36.648779 13919 blocking_queue.cpp:49] Waiting for data
I0610 04:53:36.698768 13926 data_layer.cpp:73] Restarting data prefetching from start.
I0610 04:53:36.699136 13919 solver.cpp:414] Test net output #0: accuracy = 0.1184
I0610 04:53:36.699177 13919 solver.cpp:414] Test net output #1: loss = 2.31538 (* 1 = 2.31538 loss)
# After 500 training iterations, print test results
I0610 04:53:36.872504 13919 solver.cpp:347] Iteration 500, Testing net (#0)
I0610 04:53:36.922243 13926 data_layer.cpp:73] Restarting data prefetching from start.
I0610 04:53:36.923786 13919 solver.cpp:414] Test net output #0: accuracy = 0.8987
I0610 04:53:36.923828 13919 solver.cpp:414] Test net output #1: loss = 0.378121 (* 1 = 0.378121 loss)
I0610 04:53:37.032925 13919 blocking_queue.cpp:49] Waiting for data
I0610 04:53:37.075943 13925 data_layer.cpp:73] Restarting data prefetching from start.
# After another 500 training iterations, print test results
I0610 04:53:37.098423 13919 solver.cpp:347] Iteration 1000, Testing net (#0)
I0610 04:53:37.149353 13926 data_layer.cpp:73] Restarting data prefetching from start.
I0610 04:53:37.149734 13919 solver.cpp:414] Test net output #0: accuracy = 0.9064
I0610 04:53:37.149776 13919 solver.cpp:414] Test net output #1: loss = 0.3422 (* 1 = 0.3422 loss)
I0610 04:53:37.316992 13919 solver.cpp:347] Iteration 1500, Testing net (#0)
I0610 04:53:37.366453 13926 data_layer.cpp:73] Restarting data prefetching from start.
I0610 04:53:37.366799 13919 solver.cpp:414] Test net output #0: accuracy = 0.912
I0610 04:53:37.366842 13919 solver.cpp:414] Test net output #1: loss = 0.321062 (* 1 = 0.321062 loss)
# Further iteration output omitted...
I0610 04:53:38.582134 13919 blocking_queue.cpp:49] Waiting for data
I0610 04:53:38.680619 13919 solver.cpp:347] Iteration 4500, Testing net (#0)
I0610 04:53:38.731168 13926 data_layer.cpp:73] Restarting data prefetching from start.
I0610 04:53:38.731528 13919 solver.cpp:414] Test net output #0: accuracy = 0.9195
I0610 04:53:38.731568 13919 solver.cpp:414] Test net output #1: loss = 0.292936 (* 1 = 0.292936 loss)
I0610 04:53:38.796696 13925 data_layer.cpp:73] Restarting data prefetching from start.
# Snapshot the model weights and solver state at this point
I0610 04:53:38.909629 13919 solver.cpp:464] Snapshotting to binary proto file my_lr_iter_5000.caffemodel
I0610 04:53:38.910568 13919 sgd_solver.cpp:284] Snapshotting solver state to binary proto file my_lr_iter_5000.solverstate
I0610 04:53:38.910995 13919 solver.cpp:347] Iteration 5000, Testing net (#0)
I0610 04:53:38.961858 13926 data_layer.cpp:73] Restarting data prefetching from start.
I0610 04:53:38.962236 13919 solver.cpp:414] Test net output #0: accuracy = 0.9202
I0610 04:53:38.962280 13919 solver.cpp:414] Test net output #1: loss = 0.289039 (* 1 = 0.289039 loss)
I0610 04:53:38.978883 13919 blocking_queue.cpp:49] Waiting for data
I0610 04:53:39.144309 13919 solver.cpp:347] Iteration 5500, Testing net (#0)
I0610 04:53:39.193158 13926 data_layer.cpp:73] Restarting data prefetching from start.
I0610 04:53:39.193552 13919 solver.cpp:414] Test net output #0: accuracy = 0.92
I0610 04:53:39.193594 13919 solver.cpp:414] Test net output #1: loss = 0.290407 (* 1 = 0.290407 loss)
I0610 04:53:39.237748 13925 data_layer.cpp:73] Restarting data prefetching from start.
# Further iteration output omitted...
I0610 04:53:40.892292 13919 blocking_queue.cpp:49] Waiting for data
I0610 04:53:40.902132 13925 data_layer.cpp:73] Restarting data prefetching from start.
I0610 04:53:40.947559 13919 solver.cpp:347] Iteration 9500, Testing net (#0)
I0610 04:53:40.996812 13926 data_layer.cpp:73] Restarting data prefetching from start.
I0610 04:53:40.997177 13919 solver.cpp:414] Test net output #0: accuracy = 0.9214
I0610 04:53:40.997220 13919 solver.cpp:414] Test net output #1: loss = 0.282299 (* 1 = 0.282299 loss)
# Snapshot the model weights and solver state at this point
I0610 04:53:41.165763 13919 solver.cpp:464] Snapshotting to binary proto file my_lr_iter_10000.caffemodel
I0610 04:53:41.166873 13919 sgd_solver.cpp:284] Snapshotting solver state to binary proto file my_lr_iter_10000.solverstate
I0610 04:53:41.167291 13919 solver.cpp:347] Iteration 10000, Testing net (#0)
I0610 04:53:41.218529 13926 data_layer.cpp:73] Restarting data prefetching from start.
I0610 04:53:41.220217 13919 solver.cpp:414] Test net output #0: accuracy = 0.922
I0610 04:53:41.220261 13919 solver.cpp:414] Test net output #1: loss = 0.281314 (* 1 = 0.281314 loss)
I0610 04:53:41.220270 13919 solver.cpp:332] Optimization Done.
I0610 04:53:41.220276 13919 caffe.cpp:250] Optimization Done.
# ~~~~~~~~~~~~~~~~~ Training/testing ends ~~~~~~~~~~~~~~~~~
# Done

Summary:

  • The overall flow: parse the TRAIN and TEST nets from the network definition file and the solver hyper-parameter file, build the two nets, and then start training.
  • Since every log line carries its source location, you can trace execution in the code. For example, to see how "Memory required for data" accumulates the memory usage, look at net.cpp line 139:
// memory_used_ holds the net's current memory footprint (in elements)
// Iterate over every layer of the net
for (int layer_id = 0; layer_id < param.layer_size(); ++layer_id) {
  ...
  // Iterate over this layer's top blobs
  for (int top_id = 0; top_id < top_vecs_[layer_id].size(); ++top_id) {
    ...
    // Accumulate the element count
    memory_used_ += top_vecs_[layer_id][top_id]->count();
  }
  // Print the running total (converted to bytes) after this layer
  LOG_IF(INFO, Caffe::root_solver())
      << "Memory required for data: " << memory_used_ * sizeof(Dtype);
  ...
}

Questions

  1. Why do some layers of the TEST net also need backward computation?
  2. The run above used the GPU, yet the log never names any .cu file. CUDA kernels were certainly executed; they are presumably just not logged.