Caffe - Data & Models - Model Output Log

After launching training:

~/caffe-master/build/tools/caffe train \
--solver=/media/junhui/DATA/caffe_workspace/my_linearReggresion/lr_solver.prototxt

The training process prints to the terminal, and the log lines are in glog format. Each line carries the current date and time, the thread ID, the source file name and line number, and the message itself; from these you can see which step the network is currently executing. Below we analyze the log of classifying MNIST with linear regression. The example is small, but every piece of the pipeline is present. Comments have been added where useful:

I0610 04:53:35.880447 13919 caffe.cpp:204] Using GPUs 0
I0610 04:53:35.903647 13919 caffe.cpp:209] GPU 0: GeForce GTX 1050
# Parse the solver hyper-parameter file lr_solver.prototxt and initialize the solver
I0610 04:53:36.096128 13919 solver.cpp:45] Initializing solver from parameters:
test_iter: 100
test_interval: 500
base_lr: 0.01
max_iter: 10000
lr_policy: "inv"
gamma: 0.0001
power: 0.75
momentum: 0.9
weight_decay: 0.0005
snapshot: 5000
snapshot_prefix: "my_lr"
solver_mode: GPU
device_id: 0
net: "/media/junhui/DATA/caffe_workspace/my_linearReggresion/mylr.prototxt"
train_state {
level: 0
stage: ""
}

# ~~~~~~~~~~~~~~~~~ Training-net construction starts ~~~~~~~~~~~~~~~~~
# Create the Net described by the prototxt "blueprint"
I0610 04:53:36.096751 13919 solver.cpp:102] Creating training net from net file: /media/junhui/DATA/caffe_workspace/my_linearReggresion/mylr.prototxt
# These lines report that the data layer and the accuracy layer, whose phase is TEST, will not be used in the TRAIN phase
I0610 04:53:36.097002 13919 net.cpp:296] The NetState phase (0) differed from the phase (1) specified by a rule in layer mnist
I0610 04:53:36.097012 13919 net.cpp:296] The NetState phase (0) differed from the phase (1) specified by a rule in layer accuracy
# Parse the network definition file mylr.prototxt and initialize the net used for TRAIN
I0610 04:53:36.097085 13919 net.cpp:53] Initializing net from parameters:
# The three layers below, stacked together, form the training net; with that in mind, the TEST net further down is easy to read
name: "lrNet"
state {
phase: TRAIN
level: 0
stage: ""
}
# 1. Data layer, producing two tops: LMDB -> "data" & "label"
layer {
name: "mnist"
type: "Data"
top: "data"
top: "label"
include {
phase: TRAIN
}
transform_param {
scale: 0.0039063
}
data_param {
source: "/media/junhui/DATA/caffe_workspace/my_linearReggresion/mnist_train_lmdb"
batch_size: 64
backend: LMDB
}
}
# 2. Fully connected (InnerProduct) layer: "data" -> "ip"
layer {
name: "ip"
type: "InnerProduct"
bottom: "data"
top: "ip"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
inner_product_param {
num_output: 10
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
# 3. Softmax loss layer: "ip" & "label" -> "loss"
layer {
name: "loss"
type: "SoftmaxWithLoss"
bottom: "ip"
bottom: "label"
top: "loss"
}
# Layer-by-layer setup begins: open the training-data LMDB
I0610 04:53:36.097178 13919 layer_factory.hpp:77] Creating layer mnist
I0610 04:53:36.097410 13919 db_lmdb.cpp:35] Opened lmdb /media/junhui/DATA/caffe_workspace/my_linearReggresion/mnist_train_lmdb
# 1. Create the data layer, which produces two blobs: "data" & "label"
I0610 04:53:36.097434 13919 net.cpp:86] Creating Layer mnist
I0610 04:53:36.097456 13919 net.cpp:382] mnist -> data
I0610 04:53:36.097476 13919 net.cpp:382] mnist -> label
# The data layer's output size is [64,1,28,28]
I0610 04:53:36.098232 13919 data_layer.cpp:45] output data size: 64,1,28,28
I0610 04:53:36.099445 13919 net.cpp:124] Setting up mnist
I0610 04:53:36.099474 13919 net.cpp:131] Top shape: 64 1 28 28 (50176)
I0610 04:53:36.099484 13919 net.cpp:131] Top shape: 64 (64)
# Memory accounting; this running total accumulates layer by layer during setup
I0610 04:53:36.099489 13919 net.cpp:139] Memory required for data: 200960
# 2. Create the ip (fully connected) layer
I0610 04:53:36.099498 13919 layer_factory.hpp:77] Creating layer ip
I0610 04:53:36.099509 13919 net.cpp:86] Creating Layer ip
# From "data" it generates "ip", this layer's output
I0610 04:53:36.099529 13919 net.cpp:408] ip <- data
I0610 04:53:36.099541 13919 net.cpp:382] ip -> ip
I0610 04:53:36.100448 13919 net.cpp:124] Setting up ip
I0610 04:53:36.100461 13919 net.cpp:131] Top shape: 64 10 (640)
I0610 04:53:36.100478 13919 net.cpp:139] Memory required for data: 203520
# 3. Create the final layer, which yields the loss
I0610 04:53:36.100493 13919 layer_factory.hpp:77] Creating layer loss
I0610 04:53:36.100505 13919 net.cpp:86] Creating Layer loss
# Its inputs are "ip" & "label"; its output is "loss"
I0610 04:53:36.100510 13919 net.cpp:408] loss <- ip
I0610 04:53:36.100517 13919 net.cpp:408] loss <- label
I0610 04:53:36.100523 13919 net.cpp:382] loss -> loss
I0610 04:53:36.100535 13919 layer_factory.hpp:77] Creating layer loss
I0610 04:53:36.643620 13919 net.cpp:124] Setting up loss
# The loss output is a scalar (shape (1)) with loss weight 1
I0610 04:53:36.643661 13919 net.cpp:131] Top shape: (1)
I0610 04:53:36.643664 13919 net.cpp:134] with loss weight 1
# Memory used so far: 203524 bytes, i.e. about 200 KB
I0610 04:53:36.643699 13919 net.cpp:139] Memory required for data: 203524
# Backward computation runs from back to front, and only for the layers that need it
I0610 04:53:36.643705 13919 net.cpp:200] loss needs backward computation.
I0610 04:53:36.643714 13919 net.cpp:200] ip needs backward computation.
I0610 04:53:36.643719 13919 net.cpp:202] mnist does not need backward computation.
# The TRAIN net outputs only the loss
I0610 04:53:36.643726 13919 net.cpp:244] This network produces output loss
I0610 04:53:36.643734 13919 net.cpp:257] Network initialization done.
# ~~~~~~~~~~~~~~~~~ Training-net construction ends ~~~~~~~~~~~~~~~~~

# ~~~~~~~~~~~~~~~~~ Test-net construction starts ~~~~~~~~~~~~~~~~~
I0610 04:53:36.644055 13919 solver.cpp:190] Creating test net (#0) specified by net file: /media/junhui/DATA/caffe_workspace/my_linearReggresion/mylr.prototxt
# This reports that the data layer whose phase is TRAIN will not be used in the TEST phase
I0610 04:53:36.644089 13919 net.cpp:296] The NetState phase (1) differed from the phase (0) specified by a rule in layer mnist
I0610 04:53:36.644132 13919 net.cpp:53] Initializing net from parameters:
# Likewise, the full structure of the TEST net follows
name: "lrNet"
state {
phase: TEST # used for testing
}
# Data layer
layer {
name: "mnist"
type: "Data"
top: "data"
top: "label"
include {
phase: TEST
}
transform_param {
scale: 0.0039063
}
data_param {
source: "/media/junhui/DATA/caffe_workspace/my_linearReggresion/mnist_test_lmdb"
batch_size: 100
backend: LMDB
}
}
# Fully connected layer
layer {
name: "ip"
type: "InnerProduct"
bottom: "data"
top: "ip"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
inner_product_param {
num_output: 10
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
# Compute accuracy
layer {
name: "accuracy"
type: "Accuracy"
bottom: "ip"
bottom: "label"
top: "accuracy"
include {
phase: TEST
}
}
# Compute loss
layer {
name: "loss"
type: "SoftmaxWithLoss"
bottom: "ip"
bottom: "label"
top: "loss"
}
I0610 04:53:36.644286 13919 layer_factory.hpp:77] Creating layer mnist
I0610 04:53:36.644841 13919 db_lmdb.cpp:35] Opened lmdb /media/junhui/DATA/caffe_workspace/my_linearReggresion/mnist_test_lmdb
I0610 04:53:36.644881 13919 net.cpp:86] Creating Layer mnist
I0610 04:53:36.644889 13919 net.cpp:382] mnist -> data
I0610 04:53:36.644898 13919 net.cpp:382] mnist -> label
I0610 04:53:36.645038 13919 data_layer.cpp:45] output data size: 100,1,28,28
I0610 04:53:36.646373 13919 net.cpp:124] Setting up mnist
I0610 04:53:36.646389 13919 net.cpp:131] Top shape: 100 1 28 28 (78400)
I0610 04:53:36.646394 13919 net.cpp:131] Top shape: 100 (100)
I0610 04:53:36.646397 13919 net.cpp:139] Memory required for data: 314000
# This layer, label_mnist_1_split, is inserted automatically by Caffe during parsing
I0610 04:53:36.646402 13919 layer_factory.hpp:77] Creating layer label_mnist_1_split
I0610 04:53:36.646409 13919 net.cpp:86] Creating Layer label_mnist_1_split
I0610 04:53:36.646426 13919 net.cpp:408] label_mnist_1_split <- label
# The label is split into two copies: one for the final accuracy, one for the final loss
I0610 04:53:36.646445 13919 net.cpp:382] label_mnist_1_split -> label_mnist_1_split_0
I0610 04:53:36.646454 13919 net.cpp:382] label_mnist_1_split -> label_mnist_1_split_1
I0610 04:53:36.646559 13919 net.cpp:124] Setting up label_mnist_1_split
I0610 04:53:36.646585 13919 net.cpp:131] Top shape: 100 (100)
I0610 04:53:36.646590 13919 net.cpp:131] Top shape: 100 (100)
I0610 04:53:36.646595 13919 net.cpp:139] Memory required for data: 314800
I0610 04:53:36.646598 13919 layer_factory.hpp:77] Creating layer ip
I0610 04:53:36.646606 13919 net.cpp:86] Creating Layer ip
I0610 04:53:36.646611 13919 net.cpp:408] ip <- data
I0610 04:53:36.646617 13919 net.cpp:382] ip -> ip
I0610 04:53:36.646811 13919 net.cpp:124] Setting up ip
I0610 04:53:36.646819 13919 net.cpp:131] Top shape: 100 10 (1000)
I0610 04:53:36.646824 13919 net.cpp:139] Memory required for data: 318800
# This layer, ip_ip_0_split, is likewise inserted automatically by Caffe
I0610 04:53:36.646834 13919 layer_factory.hpp:77] Creating layer ip_ip_0_split
I0610 04:53:36.646840 13919 net.cpp:86] Creating Layer ip_ip_0_split
I0610 04:53:36.646845 13919 net.cpp:408] ip_ip_0_split <- ip
# The ip output is split into two copies: one feeds accuracy, the other feeds loss
I0610 04:53:36.646852 13919 net.cpp:382] ip_ip_0_split -> ip_ip_0_split_0
I0610 04:53:36.646859 13919 net.cpp:382] ip_ip_0_split -> ip_ip_0_split_1
I0610 04:53:36.646891 13919 net.cpp:124] Setting up ip_ip_0_split
I0610 04:53:36.646898 13919 net.cpp:131] Top shape: 100 10 (1000)
I0610 04:53:36.646914 13919 net.cpp:131] Top shape: 100 10 (1000)
I0610 04:53:36.646919 13919 net.cpp:139] Memory required for data: 326800
# The _0 copies feed the accuracy layer
I0610 04:53:36.646940 13919 layer_factory.hpp:77] Creating layer accuracy
I0610 04:53:36.646947 13919 net.cpp:86] Creating Layer accuracy
I0610 04:53:36.646952 13919 net.cpp:408] accuracy <- ip_ip_0_split_0
I0610 04:53:36.646957 13919 net.cpp:408] accuracy <- label_mnist_1_split_0
I0610 04:53:36.646963 13919 net.cpp:382] accuracy -> accuracy
I0610 04:53:36.646972 13919 net.cpp:124] Setting up accuracy
I0610 04:53:36.646977 13919 net.cpp:131] Top shape: (1)
I0610 04:53:36.646981 13919 net.cpp:139] Memory required for data: 326804
# The _1 copies feed the loss layer
I0610 04:53:36.646986 13919 layer_factory.hpp:77] Creating layer loss
I0610 04:53:36.646992 13919 net.cpp:86] Creating Layer loss
I0610 04:53:36.646997 13919 net.cpp:408] loss <- ip_ip_0_split_1
I0610 04:53:36.647002 13919 net.cpp:408] loss <- label_mnist_1_split_1
I0610 04:53:36.647024 13919 net.cpp:382] loss -> loss
I0610 04:53:36.647034 13919 layer_factory.hpp:77] Creating layer loss
I0610 04:53:36.647753 13919 net.cpp:124] Setting up loss
I0610 04:53:36.647764 13919 net.cpp:131] Top shape: (1)
I0610 04:53:36.647770 13919 net.cpp:134] with loss weight 1
I0610 04:53:36.647779 13919 net.cpp:139] Memory required for data: 326808
# Report which layers need backward computation and which do not
I0610 04:53:36.647785 13919 net.cpp:200] loss needs backward computation.
I0610 04:53:36.647792 13919 net.cpp:202] accuracy does not need backward computation.
I0610 04:53:36.647799 13919 net.cpp:200] ip_ip_0_split needs backward computation.
I0610 04:53:36.647804 13919 net.cpp:200] ip needs backward computation.
I0610 04:53:36.647809 13919 net.cpp:202] label_mnist_1_split does not need backward computation.
I0610 04:53:36.647814 13919 net.cpp:202] mnist does not need backward computation.
# The TEST net outputs both accuracy and loss
I0610 04:53:36.647819 13919 net.cpp:244] This network produces output accuracy
I0610 04:53:36.647826 13919 net.cpp:244] This network produces output loss
I0610 04:53:36.647835 13919 net.cpp:257] Network initialization done.
I0610 04:53:36.647861 13919 solver.cpp:57] Solver scaffolding done.
# ~~~~~~~~~~~~~~~~~ Test-net construction ends ~~~~~~~~~~~~~~~~~

# ~~~~~~~~~~~~~~~~~ Training/testing begins ~~~~~~~~~~~~~~~~~
I0610 04:53:36.647935 13919 caffe.cpp:239] Starting Optimization
I0610 04:53:36.647941 13919 solver.cpp:289] Solving lrNet
I0610 04:53:36.647945 13919 solver.cpp:290] Learning Rate Policy: inv
# Iteration 0: run the test net and print its results
I0610 04:53:36.647997 13919 solver.cpp:347] Iteration 0, Testing net (#0)
I0610 04:53:36.648779 13919 blocking_queue.cpp:49] Waiting for data
I0610 04:53:36.698768 13926 data_layer.cpp:73] Restarting data prefetching from start.
I0610 04:53:36.699136 13919 solver.cpp:414] Test net output #0: accuracy = 0.1184
I0610 04:53:36.699177 13919 solver.cpp:414] Test net output #1: loss = 2.31538 (* 1 = 2.31538 loss)
# After 500 training iterations, print test results
I0610 04:53:36.872504 13919 solver.cpp:347] Iteration 500, Testing net (#0)
I0610 04:53:36.922243 13926 data_layer.cpp:73] Restarting data prefetching from start.
I0610 04:53:36.923786 13919 solver.cpp:414] Test net output #0: accuracy = 0.8987
I0610 04:53:36.923828 13919 solver.cpp:414] Test net output #1: loss = 0.378121 (* 1 = 0.378121 loss)
I0610 04:53:37.032925 13919 blocking_queue.cpp:49] Waiting for data
I0610 04:53:37.075943 13925 data_layer.cpp:73] Restarting data prefetching from start.
# After another 500 training iterations, print test results
I0610 04:53:37.098423 13919 solver.cpp:347] Iteration 1000, Testing net (#0)
I0610 04:53:37.149353 13926 data_layer.cpp:73] Restarting data prefetching from start.
I0610 04:53:37.149734 13919 solver.cpp:414] Test net output #0: accuracy = 0.9064
I0610 04:53:37.149776 13919 solver.cpp:414] Test net output #1: loss = 0.3422 (* 1 = 0.3422 loss)
I0610 04:53:37.316992 13919 solver.cpp:347] Iteration 1500, Testing net (#0)
I0610 04:53:37.366453 13926 data_layer.cpp:73] Restarting data prefetching from start.
I0610 04:53:37.366799 13919 solver.cpp:414] Test net output #0: accuracy = 0.912
I0610 04:53:37.366842 13919 solver.cpp:414] Test net output #1: loss = 0.321062 (* 1 = 0.321062 loss)
# Further iteration output omitted...
I0610 04:53:38.582134 13919 blocking_queue.cpp:49] Waiting for data
I0610 04:53:38.680619 13919 solver.cpp:347] Iteration 4500, Testing net (#0)
I0610 04:53:38.731168 13926 data_layer.cpp:73] Restarting data prefetching from start.
I0610 04:53:38.731528 13919 solver.cpp:414] Test net output #0: accuracy = 0.9195
I0610 04:53:38.731568 13919 solver.cpp:414] Test net output #1: loss = 0.292936 (* 1 = 0.292936 loss)
I0610 04:53:38.796696 13925 data_layer.cpp:73] Restarting data prefetching from start.
# Snapshot the model weights and solver state at this point
I0610 04:53:38.909629 13919 solver.cpp:464] Snapshotting to binary proto file my_lr_iter_5000.caffemodel
I0610 04:53:38.910568 13919 sgd_solver.cpp:284] Snapshotting solver state to binary proto file my_lr_iter_5000.solverstate
I0610 04:53:38.910995 13919 solver.cpp:347] Iteration 5000, Testing net (#0)
I0610 04:53:38.961858 13926 data_layer.cpp:73] Restarting data prefetching from start.
I0610 04:53:38.962236 13919 solver.cpp:414] Test net output #0: accuracy = 0.9202
I0610 04:53:38.962280 13919 solver.cpp:414] Test net output #1: loss = 0.289039 (* 1 = 0.289039 loss)
I0610 04:53:38.978883 13919 blocking_queue.cpp:49] Waiting for data
I0610 04:53:39.144309 13919 solver.cpp:347] Iteration 5500, Testing net (#0)
I0610 04:53:39.193158 13926 data_layer.cpp:73] Restarting data prefetching from start.
I0610 04:53:39.193552 13919 solver.cpp:414] Test net output #0: accuracy = 0.92
I0610 04:53:39.193594 13919 solver.cpp:414] Test net output #1: loss = 0.290407 (* 1 = 0.290407 loss)
I0610 04:53:39.237748 13925 data_layer.cpp:73] Restarting data prefetching from start.
# Further iteration output omitted...
I0610 04:53:40.892292 13919 blocking_queue.cpp:49] Waiting for data
I0610 04:53:40.902132 13925 data_layer.cpp:73] Restarting data prefetching from start.
I0610 04:53:40.947559 13919 solver.cpp:347] Iteration 9500, Testing net (#0)
I0610 04:53:40.996812 13926 data_layer.cpp:73] Restarting data prefetching from start.
I0610 04:53:40.997177 13919 solver.cpp:414] Test net output #0: accuracy = 0.9214
I0610 04:53:40.997220 13919 solver.cpp:414] Test net output #1: loss = 0.282299 (* 1 = 0.282299 loss)
# Snapshot the model weights and solver state at this point
I0610 04:53:41.165763 13919 solver.cpp:464] Snapshotting to binary proto file my_lr_iter_10000.caffemodel
I0610 04:53:41.166873 13919 sgd_solver.cpp:284] Snapshotting solver state to binary proto file my_lr_iter_10000.solverstate
I0610 04:53:41.167291 13919 solver.cpp:347] Iteration 10000, Testing net (#0)
I0610 04:53:41.218529 13926 data_layer.cpp:73] Restarting data prefetching from start.
I0610 04:53:41.220217 13919 solver.cpp:414] Test net output #0: accuracy = 0.922
I0610 04:53:41.220261 13919 solver.cpp:414] Test net output #1: loss = 0.281314 (* 1 = 0.281314 loss)
I0610 04:53:41.220270 13919 solver.cpp:332] Optimization Done.
I0610 04:53:41.220276 13919 caffe.cpp:250] Optimization Done.
# ~~~~~~~~~~~~~~~~~ Training/testing ends ~~~~~~~~~~~~~~~~~
# Done

Summary:

  • The overall flow: parse the TRAIN and TEST nets from the network definition file and the solver hyper-parameter file, build the two nets, and then start training.
  • Since every log line carries its source location, you can trace execution in the code. For example, to see how "Memory required for data" accumulates the memory usage, look at net.cpp line 139:
// memory_used_ holds the net's current memory footprint (in elements)
// Iterate over every layer of the net
for (int layer_id = 0; layer_id < param.layer_size(); ++layer_id) {
  ...
  // Iterate over this layer's top blobs
  for (int top_id = 0; top_id < top_vecs_[layer_id].size(); ++top_id) {
    ...
    // Accumulate the element count
    memory_used_ += top_vecs_[layer_id][top_id]->count();
  }
  // Print the running total (converted to bytes) after this layer
  LOG_IF(INFO, Caffe::root_solver())
      << "Memory required for data: " << memory_used_ * sizeof(Dtype);
  ...
}

Questions

  1. Why do some layers of the TEST net also need backward computation?
  2. The run above used the GPU, yet the log never names any .cu file. CUDA kernels were certainly executed; they are presumably just not logged.