2020-06-03

Caffe

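Caffe command-line and Python interfaces

Command-line interface: cmdcaffe

The caffe tool only exists after Caffe has been compiled. It is placed in caffe-ROOT/build/tools, and the commands available in that directory are: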

./caffe train           #train or finetune a model
./caffe test            #score a model
./caffe device_query    #show GPU diagnostic information
./caffe time            #benchmark model execution time

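Training

Caffe supports three ways of training.

Train a model from scratch. This requires the path to the solver .prototxt configuration file, for example:

# Train; the CPU is used by default
./build/tools/caffe train \
-solver examples/mnist/lenet_solver.prototxt
# Train on GPU number 2
./build/tools/caffe train \
-solver examples/mnist/lenet_solver.prototxt \
-gpu 2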

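Resume training from a snapshot. This requires the path to a .solverstate file.

# Pass -snapshot to continue training
./build/tools/caffe train \
-solver examples/mnist/lenet_solver.prototxt \
-snapshot examples/mnist/lenet_iter_5000.solverstate

If the maximum number of iterations set originally is not enough, change max_iter: 10000 in the solver file lenet_solver.prototxt, for example raising it to 20000.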

Fine-tune a pre-trained model (transfer learning). This requires the path to a .caffemodel file.

# Pass -weights to supply the pre-trained model
./build/tools/caffe train \
-solver examples/finetuning_on_flickr_style/solver.prototxt \
-weights models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel

A complete fine-tuning example is in examples/finetuning_on_flickr_style.

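Multi-GPU training

List the GPU IDs to use after -gpu; for example, -gpu 0,1,2,3 trains on 4 GPUs in parallel. With multiple GPUs the same network configuration is replicated on every GPU and each GPU processes its own batch of batch_size samples, so the amount of data processed per iteration overall is batch_size*4.

# Use all available GPU devices
caffe train -solver examples/mnist/lenet_solver.prototxt -gpu all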

Checking the GPU

Use the following command to check that a given GPU works properly:

./build/tools/caffe device_query -gpu 0

It returns the hardware information for GPU 0:

I0603 16:40:28.905443 13455 caffe.cpp:138] Querying GPUs 0
I0603 16:40:28.927069 13455 common.cpp:178] Device id:                     0
I0603 16:40:28.927090 13455 common.cpp:179] Major revision number:         6
I0603 16:40:28.927093 13455 common.cpp:180] Minor revision number:         1
I0603 16:40:28.927096 13455 common.cpp:181] Name:                          GeForce GTX 1050
I0603 16:40:28.927099 13455 common.cpp:182] Total global memory:           2099904512
I0603 16:40:28.927103 13455 common.cpp:183] Total shared memory per block: 49152
I0603 16:40:28.927106 13455 common.cpp:184] Total registers per block:     65536
I0603 16:40:28.927109 13455 common.cpp:185] Warp size:                     32
I0603 16:40:28.927112 13455 common.cpp:186] Maximum memory pitch:          2147483647
I0603 16:40:28.927115 13455 common.cpp:187] Maximum threads per block:     1024
I0603 16:40:28.927119 13455 common.cpp:188] Maximum dimension of block:    1024, 1024, 64
I0603 16:40:28.927122 13455 common.cpp:191] Maximum dimension of grid:     2147483647, 65535, 65535
I0603 16:40:28.927125 13455 common.cpp:194] Clock rate:                    1493000
I0603 16:40:28.927129 13455 common.cpp:195] Total constant memory:         65536
I0603 16:40:28.927151 13455 common.cpp:196] Texture alignment:             512
I0603 16:40:28.927155 13455 common.cpp:197] Concurrent copy and execution: Yes
I0603 16:40:28.927160 13455 common.cpp:199] Number of multiprocessors:     5
I0603 16:40:28.927183 13455 common.cpp:200] Kernel execution timeout:      Yes

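Accuracy testing

Testing reports the loss and accuracy of every batch as well as the overall average loss and accuracy. test runs only the forward computation with no backward pass, i.e. inference rather than training.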

./build/tools/caffe test \
-model examples/mnist/lenet_train_test.prototxt \
-weights examples/mnist/lenet_iter_10000.caffemodel \
-gpu 0 \
-iterations 100

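On the network defined in lenet_train_test.prototxt, with the weights in lenet_iter_10000.caffemodel, this runs 100 iterations over the test samples. batch_size is 100, so iterations × batch_size = 10000, which covers all the test samples. The test data itself is whatever source the TEST-phase data layer in that prototxt points to.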

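Timing

Run ./build/tools/caffe time to benchmark a model; it prints the forward and backward computation time of every layer.

The command below times 10 iterations of LeNet on the CPU (the default is 50 iterations).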

./build/tools/caffe time \
-model examples/mnist/lenet_train_test.prototxt \
-iterations 10

Result:

I0603 17:30:35.768501 15346 caffe.cpp:365] *** Benchmark begins ***
I0603 17:30:35.768518 15346 caffe.cpp:366] Testing for 10 iterations.
I0603 17:30:35.835475 15346 caffe.cpp:394] Iteration: 1 forward-backward time: 66 ms.
I0603 17:30:35.902711 15346 caffe.cpp:394] Iteration: 2 forward-backward time: 67 ms.
I0603 17:30:35.969769 15346 caffe.cpp:394] Iteration: 3 forward-backward time: 67 ms.
I0603 17:30:36.036651 15346 caffe.cpp:394] Iteration: 4 forward-backward time: 66 ms.
I0603 17:30:36.105055 15346 caffe.cpp:394] Iteration: 5 forward-backward time: 68 ms.
I0603 17:30:36.174151 15346 caffe.cpp:394] Iteration: 6 forward-backward time: 69 ms.
I0603 17:30:36.241129 15346 caffe.cpp:394] Iteration: 7 forward-backward time: 66 ms.
I0603 17:30:36.308782 15346 caffe.cpp:394] Iteration: 8 forward-backward time: 67 ms.
I0603 17:30:36.376447 15346 caffe.cpp:394] Iteration: 9 forward-backward time: 67 ms.
I0603 17:30:36.443658 15346 caffe.cpp:394] Iteration: 10 forward-backward time: 67 ms.
I0603 17:30:36.443676 15346 caffe.cpp:397] Average time per layer: 
I0603 17:30:36.443698 15346 caffe.cpp:400]      mnist	forward: 0.015 ms.
I0603 17:30:36.443706 15346 caffe.cpp:403]      mnist	backward: 0.0009 ms.
I0603 17:30:36.443711 15346 caffe.cpp:400]      conv1	forward: 7.4511 ms.
I0603 17:30:36.443714 15346 caffe.cpp:403]      conv1	backward: 7.8538 ms.
I0603 17:30:36.443718 15346 caffe.cpp:400]      pool1	forward: 3.3165 ms.
I0603 17:30:36.443740 15346 caffe.cpp:403]      pool1	backward: 0.5728 ms.
I0603 17:30:36.443745 15346 caffe.cpp:400]      conv2	forward: 12.81 ms.
I0603 17:30:36.443769 15346 caffe.cpp:403]      conv2	backward: 25.1095 ms.
I0603 17:30:36.443774 15346 caffe.cpp:400]      pool2	forward: 1.5992 ms.
I0603 17:30:36.443778 15346 caffe.cpp:403]      pool2	backward: 0.5698 ms.
I0603 17:30:36.443783 15346 caffe.cpp:400]        ip1	forward: 2.6873 ms.
I0603 17:30:36.443787 15346 caffe.cpp:403]        ip1	backward: 4.9053 ms.
I0603 17:30:36.443791 15346 caffe.cpp:400]      relu1	forward: 0.0563 ms.
I0603 17:30:36.443809 15346 caffe.cpp:403]      relu1	backward: 0.0507 ms.
I0603 17:30:36.443814 15346 caffe.cpp:400]        ip2	forward: 0.1712 ms.
I0603 17:30:36.443819 15346 caffe.cpp:403]        ip2	backward: 0.2362 ms.
I0603 17:30:36.443845 15346 caffe.cpp:400]       loss	forward: 0.0529 ms.
I0603 17:30:36.443848 15346 caffe.cpp:403]       loss	backward: 0.0013 ms.
I0603 17:30:36.443868 15346 caffe.cpp:408] Average Forward pass: 28.1725 ms.
I0603 17:30:36.443892 15346 caffe.cpp:410] Average Backward pass: 39.3101 ms.
I0603 17:30:36.443895 15346 caffe.cpp:412] Average Forward-Backward: 67.5 ms.
I0603 17:30:36.443900 15346 caffe.cpp:414] Total Time: 675 ms.
I0603 17:30:36.443918 15346 caffe.cpp:415] *** Benchmark ends ***

Timing 10 iterations on the GPU:

./build/tools/caffe time \
-model examples/mnist/lenet_train_test.prototxt \
-gpu 0 \
-iterations 10

Result:

I0603 17:32:11.830056 15434 caffe.cpp:365] *** Benchmark begins ***
... # omitted
I0603 17:32:11.876488 15434 caffe.cpp:414] Total Time: 43.9143 ms.
I0603 17:32:11.876494 15434 caffe.cpp:415] *** Benchmark ends ***

Timing each layer of an already trained model:

./build/tools/caffe time \
-model examples/mnist/lenet_train_test.prototxt \
-weights examples/mnist/lenet_iter_10000.caffemodel \
-gpu 0 \
-iterations 10

Python interface: pycaffe

The pycaffe interface must be compiled before use; see here.

The IPython notebooks in caffe/examples are worked examples of using pycaffe.

2020-06-03

C++

cpp-lambda function

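Many STL facilities take a binary comparison function that says whether to order from largest to smallest or from smallest to largest, for example sort() and max/min heaps.

This comparison function can be a function pointer or a function object. It can also be a lambda function:

Lambda functions

cout << [](float f) -> int { return abs(f); }(-3.5) << endl;

This prints 3.

The [] is the lambda capture list. It is used as follows:

[ ]: // capture no outer variables
[=]: // capture all outer variables by value
[&]: // capture all outer variables by reference
[x, &y]: // capture x by value and y by reference
[=, &z]: // capture z by reference and everything else by value
[&, x]: // capture x by value and everything else by reference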

Examples:

auto comp = [](const auto& x, const auto& y) { return x.second < y.second; };

sort(vec.begin(), vec.end(), comp);
sort(vec.begin(), vec.end(), [](auto x, auto y){ return x > y; });
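To make the two uses mentioned above concrete, here is a minimal sketch of a lambda comparator passed to sort() and to a heap. The Entry alias and the function names sortByScore and smallestK are made up for illustration; only std::sort and std::priority_queue come from the standard library.

```cpp
#include <algorithm>
#include <cstddef>
#include <queue>
#include <string>
#include <utility>
#include <vector>

using Entry = std::pair<std::string, int>;  // hypothetical (name, score) record

// Sort a copy of the entries by score, smallest first, using a lambda comparator.
std::vector<Entry> sortByScore(std::vector<Entry> entries) {
    auto comp = [](const Entry& x, const Entry& y) { return x.second < y.second; };
    std::sort(entries.begin(), entries.end(), comp);
    return entries;
}

// Return the k smallest values using a min-heap built from a lambda comparator.
// priority_queue keeps the element that compares "largest" on top, so a
// greater-than comparator turns it into a min-heap.
std::vector<int> smallestK(const std::vector<int>& values, std::size_t k) {
    auto greater = [](int a, int b) { return a > b; };
    std::priority_queue<int, std::vector<int>, decltype(greater)> heap(greater);
    for (int v : values) heap.push(v);

    std::vector<int> out;
    while (!heap.empty() && out.size() < k) {
        out.push_back(heap.top());
        heap.pop();
    }
    return out;
}
```

Because a lambda's type has no name, the heap needs decltype(greater) as its third template argument and the lambda object itself as a constructor argument.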

2020-06-03

Utility

Speeding up pip downloads

Add a mirror inside China when downloading with pip:

Tsinghua: https://pypi.tuna.tsinghua.edu.cn/simple

Temporary use:

Pass -i https://pypi.tuna.tsinghua.edu.cn/simple when running pip:

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple protobuf

Permanent change:

On Linux, add the following to ~/.pip/pip.conf (create the directory and file if they do not exist):

[global]
index-url = https://pypi.tuna.tsinghua.edu.cn/simple
[install]
trusted-host = pypi.tuna.tsinghua.edu.cn

2020-06-03

Utility

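Speeding up paper downloads

If you already use a proxy, downloading papers from arXiv works fine. If not, the method below works around slow page loads and downloads.

Solution

Use the Chinese Academy of Sciences arXiv mirror: http://xxx.itp.ac.cn

Usage

Replace the domain in the arXiv link you want to visit, changing https://arxiv.org to http://xxx.itp.ac.cn

For example:

https://arxiv.org/pdf/1608.00367

becomes: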

http://xxx.itp.ac.cn/pdf/1608.00367

2020-06-03

LeetCode

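LeetCode-Symmetric binary tree

Decide whether a binary tree is symmetric. In the example below, the tree on the left is symmetric and the one on the right is not: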

    5               5
   / \             / \
  4   4           5   5
 /\   /\         /\   /
7 8   8 7       5 5  5

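Approach

For the tree on the left, symmetry means the traversal orders root->l->r and root->r->l yield the same node sequence: both give {5,4,7,8,4,8,7}. For the asymmetric tree on the right, however, the two traversal sequences are also identical, so this method does not work, and it needs extra storage as well.

It does work if the null nodes met during traversal are also placed in the sequence. The implementation below instead compares nodes on the fly, which avoids the extra storage.

Implementation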

/**
 * Definition for a binary tree node.
 * struct TreeNode {
 *     int val;
 *     TreeNode *left;
 *     TreeNode *right;
 *     TreeNode() : val(0), left(nullptr), right(nullptr) {}
 *     TreeNode(int x) : val(x), left(nullptr), right(nullptr) {}
 *     TreeNode(int x, TreeNode *left, TreeNode *right) : val(x), left(left), right(right) {}
 * };
 */
class Solution {
public:
    bool isSymmetric(TreeNode* root) {
        return symmetric(root, root);
    }
    
    bool symmetric(TreeNode* root1, TreeNode* root2){
        
        // Both are null:
        if (root1==nullptr && root2==nullptr) return true;
        // Exactly one is null:
        if (root1==nullptr || root2==nullptr) return false;
        // Neither is null, but the values differ
        if (root1->val != root2->val) return false;
        
        // Values are equal:
        // keep checking root1->left vs root2->right && root1->right vs root2->left
        return symmetric(root1->left, root2->right) && 
                symmetric(root1->right, root2->left);
    }
};

Note how these three checks together cover every case:

// Both are null:
if (root1==nullptr && root2==nullptr) return true;
// Exactly one is null:
if (root1==nullptr || root2==nullptr) return false;
// Neither is null, but the values differ
if (root1->val != root2->val) return false;

Neat!
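The alternative mentioned in the approach, recording null nodes in the two traversal sequences, can be sketched as follows. This is only an illustration of that idea, not the solution used above; serialize and isSymmetricBySequence are hypothetical names, and the TreeNode here is a trimmed-down stand-in for the LeetCode definition.

```cpp
#include <utility>
#include <vector>

struct TreeNode {
    int val;
    TreeNode* left;
    TreeNode* right;
    TreeNode(int x, TreeNode* l = nullptr, TreeNode* r = nullptr)
        : val(x), left(l), right(r) {}
};

// Pre-order traversal that also records null children.
// leftFirst selects root->l->r (true) or root->r->l (false).
// Each entry is (isNull, value), so a null can never be mistaken for a value.
void serialize(const TreeNode* node, bool leftFirst,
               std::vector<std::pair<bool, int>>& out) {
    if (node == nullptr) {
        out.push_back({true, 0});
        return;
    }
    out.push_back({false, node->val});
    serialize(leftFirst ? node->left : node->right, leftFirst, out);
    serialize(leftFirst ? node->right : node->left, leftFirst, out);
}

// The tree is symmetric exactly when both sequences are identical.
bool isSymmetricBySequence(const TreeNode* root) {
    std::vector<std::pair<bool, int>> lr, rl;
    serialize(root, true, lr);
    serialize(root, false, rl);
    return lr == rl;
}
```

With the null markers included, the right-hand example tree no longer produces two equal sequences, at the cost of storing two vectors proportional to the size of the tree.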

2020-06-03

LeetCode

LeetCode-Is it a subtree

Given two trees, A and B, decide whether B is a subtree of A.
In the figure below, A is on the left and B is on the right:

    5
   / \
  4   6           4
 / \   \         / \
7   8   9       7   8
   / \
  1   2

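There are two interpretations:

B may be just part of a subtree of A. Under this reading, B in the example above is a subtree of A.
B must be exactly a subtree of A. Under this reading, B in the example above is not a subtree of A.

Approach

The overall approach is the same in both cases:
Step 1: find a node in A that equals the root of B. Traverse A in pre-order; if the root of B differs from the current node of A, examine A's left subtree and right subtree in turn.

Step 2: once a node R equal to the root of B has been found in A, decide whether the subtree of A rooted at R is the same as B. This is where the two interpretations above diverge.

Implementation: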

// Minimal node type assumed by the code below
struct Node {
    int val;
    Node* left;
    Node* right;
};

class Solution {
public:
    // Pre-order traversal to find R
    bool findR(Node* s, Node* t) {
        // Nothing can match if either tree is empty
        if (!s || !t) return false;

        bool res = false;

        if (s->val == t->val) {
            res = isSubTree(s, t);
        }
        if (!res) res = findR(s->left, t);
        if (!res) res = findR(s->right, t);

        return res;
    }

private:
    bool isSubTree(Node* a, Node* b) {
        
        /// Relatively equal: case 1
        if (a==nullptr && b==nullptr) return true;
        if (a==nullptr && b!=nullptr) return false;
        if (a!=nullptr && b==nullptr) return true; 

        /// Absolutely equal: case 2
        //if (a==nullptr && b==nullptr) return true;
        //if ((!a && b) || (a && !b)) return false;  

        if (a->val != b->val) return  false;
        return isSubTree(a->left, b->left) && isSubTree(a->right, b->right);
    }
};

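Notes:

Before accessing any object, check that it is valid.
Most tree problems come down to traversing the tree; the problem above uses a pre-order traversal.
Pay attention to the return value of the recursion; here it is a bool.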
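The two interpretations can also be selected at run time instead of by commenting code in and out. The sketch below is one possible way to do that; the strict flag and the names sameFrom and containsTree are assumptions made for this illustration.

```cpp
struct Node {
    int val;
    Node* left;
    Node* right;
    Node(int v, Node* l = nullptr, Node* r = nullptr) : val(v), left(l), right(r) {}
};

// Compare the tree rooted at a against b.
// strict == false: b may end early (case 1, b is part of a subtree of a).
// strict == true:  a and b must have exactly the same shape (case 2).
bool sameFrom(const Node* a, const Node* b, bool strict) {
    if (b == nullptr) return strict ? a == nullptr : true;
    if (a == nullptr) return false;
    if (a->val != b->val) return false;
    return sameFrom(a->left, b->left, strict) &&
           sameFrom(a->right, b->right, strict);
}

// Pre-order search of a for a node from which b matches.
bool containsTree(const Node* a, const Node* b, bool strict) {
    if (a == nullptr || b == nullptr) return false;
    if (a->val == b->val && sameFrom(a, b, strict)) return true;
    return containsTree(a->left, b, strict) || containsTree(a->right, b, strict);
}
```

For the example trees above, containsTree(A, B, false) holds while containsTree(A, B, true) does not, because node 8 in A has children that B lacks.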

2020-06-02

Caffe

caffe-sigmoid_cross_entropy_loss_layer class

The sigmoid_cross_entropy_loss_layer class

Header: ./include/caffe/layers/sigmoid_cross_entropy_loss_layer.hpp
CPU implementation: ./src/caffe/layers/sigmoid_cross_entropy_loss_layer.cpp
GPU implementation: ./src/caffe/layers/sigmoid_cross_entropy_loss_layer.cu

CPU and GPU implementations of the basic operations it needs:

bottom_diff = sigmoid_output_data - target is implemented as follows:

// CPU
caffe_sub(count, sigmoid_output_data, target, bottom_diff);

// GPU equivalent
caffe_copy(count, sigmoid_output_data, bottom_diff);
caffe_gpu_axpy(count, Dtype(-1), target, bottom_diff);

Multiplying each of the count elements of bottom_diff by loss_weight is implemented as follows:

// CPU
caffe_scal(count, loss_weight, bottom_diff);
// GPU
caffe_gpu_scal(count, loss_weight, bottom_diff);

caffe_scal and caffe_gpu_scal wrap CBLAS and cuBLAS routines respectively. sigmoid_output_data is the result of the forward pass. The two steps above are in fact the backward pass: the result is written to bottom_diff, which is part of a Blob and keeps propagating along with the data flow.

Backward pass

CPU

template <typename Dtype>
void SigmoidCrossEntropyLossLayer<Dtype>::Backward_cpu(
						const vector<Blob<Dtype>*>& top, 
						const vector<bool>& propagate_down,
						const vector<Blob<Dtype>*>& bottom) {
if (propagate_down[1]) {
	LOG(FATAL) << this->type()
			<< " Layer cannot backpropagate to label inputs.";
}
if (propagate_down[0]) {
	// First, compute the diff
	const int count = bottom[0]->count();
	const Dtype* sigmoid_output_data = sigmoid_output_->cpu_data();
	const Dtype* target = bottom[1]->cpu_data();
	Dtype* bottom_diff = bottom[0]->mutable_cpu_diff();
	caffe_sub(count, sigmoid_output_data, target, bottom_diff);
	// Zero out gradient of ignored targets.
	if (has_ignore_label_) {
	for (int i = 0; i < count; ++i) {
			const int target_value = static_cast<int>(target[i]);
			if (target_value == ignore_label_) {
				bottom_diff[i] = 0;
			}
		}
	}
	// Scale down gradient
	Dtype loss_weight = top[0]->cpu_diff()[0] / normalizer_;
	caffe_scal(count, loss_weight, bottom_diff);
	}
}

This runs on the CPU and has nothing to do with the GPU, so the code above contains no gpu_data or gpu_diff.
The main steps are to fetch the data and then apply the operation:

// Fetch the Blob data
const int count = bottom[0]->count();
const Dtype* sigmoid_output_data = sigmoid_output_->cpu_data();
const Dtype* target = bottom[1]->cpu_data();
Dtype* bottom_diff = bottom[0]->mutable_cpu_diff();
// The operation described above
caffe_sub(count, sigmoid_output_data, target, bottom_diff);
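Stripped of Blobs and BLAS calls, the CPU backward pass amounts to three element-wise steps. The sketch below restates them on plain std::vector data so the arithmetic is easy to follow; backwardSketch and its parameters are hypothetical and are not part of Caffe.

```cpp
#include <cstddef>
#include <vector>

// Plain-array restatement of the backward steps:
//   1. diff = sigmoid_out - target          (what caffe_sub does)
//   2. zero the entries whose target equals ignore_label
//   3. multiply every element by loss_weight (what caffe_scal does)
// Assumes sigmoid_out and target have the same length.
std::vector<float> backwardSketch(const std::vector<float>& sigmoid_out,
                                  const std::vector<float>& target,
                                  float loss_weight,
                                  bool has_ignore_label, int ignore_label) {
    std::vector<float> diff(sigmoid_out.size());
    for (std::size_t i = 0; i < diff.size(); ++i) {
        diff[i] = sigmoid_out[i] - target[i];
        if (has_ignore_label && static_cast<int>(target[i]) == ignore_label) {
            diff[i] = 0.0f;
        }
        diff[i] *= loss_weight;
    }
    return diff;
}
```

In the real layer loss_weight is top[0]->cpu_diff()[0] / normalizer_, as the Backward_cpu code above shows; here it is simply passed in.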

GPU

Similar to the CPU version:

template <typename Dtype>
void SigmoidCrossEntropyLossLayer<Dtype>::Backward_gpu(
						const vector<Blob<Dtype>*>& top, 
						const vector<bool>& propagate_down,
						const vector<Blob<Dtype>*>& bottom) {
	if (propagate_down[1]) {
	LOG(FATAL) << this->type()
				<< " Layer cannot backpropagate to label inputs.";
	}
	if (propagate_down[0]) {
		// First, compute the diff
		const int count = bottom[0]->count();
		const Dtype* sigmoid_output_data = sigmoid_output_->gpu_data();
		const Dtype* target = bottom[1]->gpu_data();
		Dtype* bottom_diff = bottom[0]->mutable_gpu_diff();
		caffe_copy(count, sigmoid_output_data, bottom_diff);
		caffe_gpu_axpy(count, Dtype(-1), target, bottom_diff);
		// Zero out gradient of ignored targets.
		if (has_ignore_label_) {
			// NOLINT_NEXT_LINE(whitespace/operators)
			SigmoidCrossEntropyLossIgnoreDiffGPU<Dtype><<<CAFFE_GET_BLOCKS(count),
			CAFFE_CUDA_NUM_THREADS>>>(count, ignore_label_, target, bottom_diff);
		}
		// Scale down gradient
		Dtype loss_weight = top[0]->cpu_diff()[0] / normalizer_;
		caffe_gpu_scal(count, loss_weight, bottom_diff);
	}
}

It also needs a kernel function that zeroes the positions whose gradient should be ignored, with the same meaning as in the CPU version:

template <typename Dtype>
__global__ void SigmoidCrossEntropyLossIgnoreDiffGPU(
						const int count,
						const int ignore_label, 
						const Dtype* target, 
						Dtype* diff) {
		CUDA_KERNEL_LOOP(i, count) {
			const int target_value = static_cast<int>(target[i]);
			if (target_value == ignore_label) {
				diff[i] = 0;
		}
	}
}

Forward pass

Member variables in the header:

/// Pointer to a SigmoidLayer object; maps predictions to probabilities
shared_ptr<SigmoidLayer<Dtype> > sigmoid_layer_;
/// Receives the output of the SigmoidLayer.
shared_ptr<Blob<Dtype> > sigmoid_output_;
/// bottom vector holder to call the underlying SigmoidLayer::Forward
vector<Blob<Dtype>*> sigmoid_bottom_vec_;
/// top vector holder to call the underlying SigmoidLayer::Forward
vector<Blob<Dtype>*> sigmoid_top_vec_;
/// Whether to ignore instances with a certain label.
bool has_ignore_label_;
/// The label indicating that an instance should be ignored.
int ignore_label_;
/// How to normalize the loss.
LossParameter_NormalizationMode normalization_;
Dtype normalizer_;
int outer_num_, inner_num_;

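The forward operation runs first: sigmoid_layer_->Forward(_, _). Its arguments sigmoid_bottom_vec_ and sigmoid_top_vec_ are member variables of this class. Both are prepared in the member function LayerSetUp() in the source; during the forward pass only the former is updated, to point at the current bottom[0].

sigmoid_layer_ is also a member variable, declared as shared_ptr<SigmoidLayer<Dtype> > sigmoid_layer_;. The CPU and GPU implementations follow: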

CPU

template <typename Dtype>
void SigmoidCrossEntropyLossLayer<Dtype>::Forward_cpu(
						const vector<Blob<Dtype>*>& bottom, 
						const vector<Blob<Dtype>*>& top) {
	// The forward pass computes the sigmoid outputs.
	// 1. Run the sigmoid forward pass, then fetch the data
	sigmoid_bottom_vec_[0] = bottom[0];
	sigmoid_layer_->Forward(sigmoid_bottom_vec_, sigmoid_top_vec_);
	// Compute the loss (negative log likelihood)
	// Stable version of loss computation from input data
	const Dtype* input_data = bottom[0]->cpu_data();
	const Dtype* target = bottom[1]->cpu_data();

	// 2. Compute the log-likelihood
	int valid_count = 0;
	Dtype loss = 0;
	for (int i = 0; i < bottom[0]->count(); ++i) {
		const int target_value = static_cast<int>(target[i]);
		if (has_ignore_label_ && target_value == ignore_label_) {
			continue;
		}
		loss -= input_data[i] * (target[i] - (input_data[i] >= 0)) -
			log(1 + exp(input_data[i] - 2 * input_data[i] * (input_data[i] >= 0)));
		++valid_count;
	}
	normalizer_ = get_normalizer(normalization_, valid_count);
	top[0]->mutable_cpu_data()[0] = loss / normalizer_;
}

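Back to sigmoid_layer_->Forward(_, _): the SigmoidLayer class has no Forward() method of its own, so the method must be inherited from a parent class. The source shows the inheritance chain SigmoidLayer -> NeuronLayer -> Layer, so this Forward() is the method of the Layer class; see layer.hpp for details.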
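The loss expression inside the Forward_cpu loop above is a rearranged form of the usual cross-entropy, written so that exp() never receives a large positive argument. The sketch below compares the two forms on scalars; naiveLoss and stableLoss are hypothetical names introduced here, and double is used instead of Dtype for simplicity.

```cpp
#include <cmath>

// Textbook form: -[t*log(p) + (1-t)*log(1-p)] with p = 1/(1+exp(-x)).
// Breaks down for large |x| because p rounds to exactly 0 or 1.
double naiveLoss(double x, double t) {
    const double p = 1.0 / (1.0 + std::exp(-x));
    return -(t * std::log(p) + (1.0 - t) * std::log(1.0 - p));
}

// Rearranged form matching the loop above: exp() only ever sees -|x|.
double stableLoss(double x, double t) {
    const double nonneg = (x >= 0) ? 1.0 : 0.0;
    return -(x * (t - nonneg) -
             std::log(1.0 + std::exp(x - 2.0 * x * nonneg)));
}
```

For moderate inputs the two functions agree up to rounding. For an input such as x = 1000 with t = 0 the textbook form evaluates log(0), while the rearranged form stays finite.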

GPU

Similar to the CPU version, with the CPU for loop replaced by a kernel function:

template <typename Dtype>
__global__ void SigmoidCrossEntropyLossForwardGPU(const int nthreads,
		const Dtype* input_data, const Dtype* target, Dtype* loss,
		const bool has_ignore_label_, const int ignore_label_,
		Dtype* counts) {
	CUDA_KERNEL_LOOP(i, nthreads) {
		const int target_value = static_cast<int>(target[i]);
		if (has_ignore_label_ && target_value == ignore_label_) {
			loss[i] = 0;
			counts[i] = 0;
		} else {
			loss[i] = input_data[i] * (target[i] - (input_data[i] >= 0)) -
				log(1 + exp(input_data[i] - 2 * input_data[i] *
				(input_data[i] >= 0)));
			counts[i] = 1;
		}
	}
}

The forward pass on the GPU:

void SigmoidCrossEntropyLossLayer<Dtype>::Forward_gpu(){...}

The function body is omitted, but one point in the source is worth noting:

// Since this memory is not used for anything, we use it here to avoid having
// to allocate new GPU memory to accumulate intermediate results.
Dtype* loss_data = bottom[0]->mutable_gpu_diff();
Dtype* count_data = bottom[1]->mutable_gpu_diff();
...
...
// Clear scratch memory to prevent interfering with backward (see #6202).
caffe_gpu_set(bottom[0]->count(), Dtype(0), bottom[0]->mutable_gpu_diff());
caffe_gpu_set(bottom[1]->count(), Dtype(0), bottom[1]->mutable_gpu_diff());

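The CPU version has nothing like this. The kernel needs output arrays for the per-element loss and count, but no memory has been allocated for them. To avoid allocating new GPU memory for intermediate results, the code borrows the parts of the Blobs that are unused at that moment (the diff buffers) as scratch space; it only has to clear them again before the function returns.

Other

Besides the methods above, the class has other methods; see the source files for details.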

2020-06-02

Caffe

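caffe-SigmoidLayer class

A first look at the source code, starting with the sigmoid_layer class.

The sigmoid_layer class

Everything that makes up this class:

Header: ./include/caffe/layers/sigmoid_layer.hpp
CPU implementation: ./src/caffe/layers/sigmoid_layer.cpp
GPU implementation: ./src/caffe/layers/sigmoid_layer.cu

The official documentation for this class is here.

The header sigmoid_layer.hpp contains:

The constructor of the class, which derives from NeuronLayer and in turn from Layer: SigmoidLayer()
The function returning the name of this class: type()
The forward declarations for CPU, Forward_cpu(), and GPU, Forward_gpu()
The backward declarations for CPU, Backward_cpu(), and GPU, Backward_gpu()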

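Forward pass

CPU

The forward computation passes the bottom data through the sigmoid function to produce the top data, so its basic operation is sigmoid(). CPU implementation:

template <typename Dtype>
inline Dtype sigmoid(Dtype x) {
	return 0.5 * tanh(0.5 * x) + 0.5;
}

With sigmoid() in place, the forward pass is computed as follows:

template <typename Dtype>
void SigmoidLayer<Dtype>::Forward_cpu(const vector<Blob<Dtype>*>& bottom,
		const vector<Blob<Dtype>*>& top) {
	const Dtype* bottom_data = bottom[0]->cpu_data();
	Dtype* top_data = top[0]->mutable_cpu_data();
	const int count = bottom[0]->count();

	for (int i = 0; i < count; ++i) {
		top_data[i] = sigmoid(bottom_data[i]);
	}
}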

Blob is the smallest data container in Caffe; its definition is covered in the notes on Blob.
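The sigmoid() helper above is written with tanh rather than the more familiar 1/(1+e^{-x}). The two are the same function, since tanh(x/2) = (1 - e^{-x})/(1 + e^{-x}). A small sketch to check this numerically, where sigmoidTanh and sigmoidExp are names chosen for this illustration:

```cpp
#include <cmath>

// Form used in the snippet above.
template <typename Dtype>
Dtype sigmoidTanh(Dtype x) {
    return 0.5 * std::tanh(0.5 * x) + 0.5;
}

// Textbook definition, for comparison.
template <typename Dtype>
Dtype sigmoidExp(Dtype x) {
    return 1.0 / (1.0 + std::exp(-x));
}
```

Evaluating both at a few points, for example -5, 0 and 10, gives values that differ only by floating-point rounding.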

GPU
GPU implementation of sigmoid():

template <typename Dtype>
__global__ void SigmoidForward(const int n, const Dtype* in, Dtype* out) {
	CUDA_KERNEL_LOOP(index, n) {
		out[index] = 0.5 * tanh(0.5 * in[index]) + 0.5;
	}
}

CUDA_KERNEL_LOOP(index, n) computes each thread's index and maps the threads onto the data elements, which gives data parallelism. The macro is defined in include/caffe/util/device_alternate.hpp:

// CUDA: grid stride looping
#define CUDA_KERNEL_LOOP(i, n) \
 	for (int i = blockIdx.x * blockDim.x + threadIdx.x; \
      i < (n); \
      i += blockDim.x * gridDim.x)

This is a general-purpose grid-stride loop; the details are in the notes on CUDA.
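To see why the grid-stride loop touches every element exactly once, it can be simulated on the CPU by iterating over all (block, thread) pairs. This is only a sketch: visitCounts is a hypothetical helper, and gridDim and blockDim are plain integers standing in for the CUDA built-ins.

```cpp
#include <vector>

// CPU simulation of CUDA_KERNEL_LOOP: every (block, thread) pair runs the same
// loop, starting at its global index and stepping by the total thread count.
// Returns how many times each of the n elements was visited.
std::vector<int> visitCounts(int n, int gridDim, int blockDim) {
    std::vector<int> visits(n, 0);
    for (int blockIdx = 0; blockIdx < gridDim; ++blockIdx) {
        for (int threadIdx = 0; threadIdx < blockDim; ++threadIdx) {
            for (int i = blockIdx * blockDim + threadIdx; i < n;
                 i += blockDim * gridDim) {
                ++visits[i];  // on a GPU this would be the per-element work
            }
        }
    }
    return visits;
}
```

Even when there are fewer threads than elements, say 10 elements with 2 blocks of 3 threads, every entry of the result is 1, because each thread strides on to the elements that no other thread starts at.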

Likewise, the forward pass on the GPU:

template <typename Dtype>
void SigmoidLayer<Dtype>::Forward_gpu(const vector<Blob<Dtype>*>& bottom,
		const vector<Blob<Dtype>*>& top) {
	const Dtype* bottom_data = bottom[0]->gpu_data();
	Dtype* top_data = top[0]->mutable_gpu_data();
	const int count = bottom[0]->count();

	SigmoidForward<Dtype><<<CAFFE_GET_BLOCKS(count), CAFFE_CUDA_NUM_THREADS>>>(
	count, bottom_data, top_data);
	CUDA_POST_KERNEL_CHECK;
}

The launch specifies how many threads each block uses, and the number of blocks is computed from that.

Specifically, CAFFE_CUDA_NUM_THREADS=512, so each block starts 512 threads, and CAFFE_GET_BLOCKS(count) is:

inline int CAFFE_GET_BLOCKS(const int N) {
	return (N + 512 - 1) / 512;
}

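For an operator as simple as sigmoid, the GPU implementation is intuitively just the CPU implementation with its innermost loop removed and replaced by a kernel function that runs in parallel.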

Backward pass

CPU

From the sigmoid backward formula, the following is easy to write:

template <typename Dtype>
void SigmoidLayer<Dtype>::Backward_cpu(
							const vector<Blob<Dtype>*>& top,
							const vector<bool>& propagate_down,
							const vector<Blob<Dtype>*>& bottom) {
	if (propagate_down[0]) {
		const Dtype* top_data = top[0]->cpu_data();
		const Dtype* top_diff = top[0]->cpu_diff();
		Dtype* bottom_diff = bottom[0]->mutable_cpu_diff();
		const int count = bottom[0]->count();
		for (int i = 0; i < count; ++i) {
			const Dtype sigmoid_x = top_data[i];
			bottom_diff[i] = top_diff[i] * sigmoid_x * (1. - sigmoid_x);
		}
	}
}

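Here top_diff[i] is the gradient of the loss with respect to this layer's forward output top_data[i], handed down by the layer above.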
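The factor sigmoid_x * (1 - sigmoid_x) in the loop above is the derivative of the sigmoid expressed through its own output. A quick numerical check against a finite difference can be sketched as follows; sigmoidRef, sigmoidBackward and numericBackward are hypothetical scalar helpers, not Caffe functions.

```cpp
#include <cmath>

double sigmoidRef(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// Analytic backward step, as in the loop above:
// bottom_diff = top_diff * y * (1 - y), where y is the forward output.
double sigmoidBackward(double top_diff, double y) {
    return top_diff * y * (1.0 - y);
}

// Central finite difference of top_diff * sigmoid(x), for comparison.
double numericBackward(double top_diff, double x, double eps = 1e-6) {
    return top_diff * (sigmoidRef(x + eps) - sigmoidRef(x - eps)) / (2.0 * eps);
}
```

At a point such as x = 0.7 the two values should agree to many decimal places, which is a cheap way to validate a hand-written backward pass.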

GPU

For the GPU implementation, the innermost loop of the code above only needs to be replaced by a kernel function, so that kernel has to be written first:

template <typename Dtype>
__global__ void SigmoidBackward(const int n, 
							const Dtype* in_diff,
							const Dtype* out_data, 
							Dtype* out_diff) {
	CUDA_KERNEL_LOOP(index, n) {
		const Dtype sigmoid_x = out_data[index];
		out_diff[index] = in_diff[index] * sigmoid_x * (1 - sigmoid_x);
	}
}

Replacing the loop:

template <typename Dtype>
void SigmoidLayer<Dtype>::Backward_gpu(const vector<Blob<Dtype>*>& top,
							const vector<bool>& propagate_down,
							const vector<Blob<Dtype>*>& bottom) {
	if (propagate_down[0]) {
		const Dtype* top_data = top[0]->gpu_data();
		const Dtype* top_diff = top[0]->gpu_diff();
		Dtype* bottom_diff = bottom[0]->mutable_gpu_diff();
		const int count = bottom[0]->count();

		SigmoidBackward<Dtype><<<CAFFE_GET_BLOCKS(count), CAFFE_CUDA_NUM_THREADS>>>(
			count, top_diff, top_data, bottom_diff);
		CUDA_POST_KERNEL_CHECK;
	}
}

All of this is straightforward.

Takeaway
Tip: on Linux, grep can search a whole project for a keyword:

grep -n -H -r "CUDA_KERNEL_LOOP"

2020-03-26

Utility

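Anaconda virtual environments

conda virtual environments

conda lets different projects use different versions of packages, and the packages in different environments do not conflict with each other.
Package versions can also be pinned, which is very convenient.

Common commands: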

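conda env list        # list the existing virtual environments
conda create --name yolo python=3.5  # create the yolo environment and install Python 3.5

conda activate yolo   # enter or switch to yolo
conda deactivate      # leave the current environment
conda info --envs     # also lists the environments

conda search keras   # search for all downloadable versions of keras
conda list -n yolo   # list the packages already in the yolo environment
conda install -n yolo keras==2.1.5  # install a specific package version into the given environment

conda remove -n yolo keras    # remove keras from the yolo environment
conda upgrade -n yolo keras   # upgrade keras in the yolo environment

conda remove -n yolo --all    # delete the whole yolo environment
conda create -n yolov3 --clone yolo   # clone the yolo environment as yolov3

conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
conda config --set show_channel_urls yes   # show channel URLs when searching
conda config --show      # show the configuration, including the mirror channels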

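Troubleshooting

Error:

conda install: Segmentation fault

Cause: a package was downloaded incompletely because of the network or some other problem.

Fix: clear all the incomplete cached files, then install again.

conda clean -a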

2020-03-12

LeetCode

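LeetCode-String wildcard matching

Handle input and output yourself.

Description:

Implement the following 2 wildcards:
1. *: matches 0 or more characters (characters are English letters and the digits 0-9, case-insensitive; the same applies below)
2. ?: matches 1 character

  input: first a string containing wildcards, then the string to be matched. For example:

  te?t*.*
  txt12.xls

  output: the match result, true if it matches and false if it does not. The example above returns false.

Approach:
1. The order of the termination conditions matters.
2. In the if(*str1 == '*') branch there are three recursive match calls; think through the three cases:
  1. a*c, ac. * matches 0 characters
  2. a*c, abc. * matches 1 character
  3. a*c, abbbc. * matches several characters

Implementation: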

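#include<cctype>
#include<iostream>
#include<string>
using namespace std;

// Case-insensitive comparison of two ordinary characters
bool sameChar(char a, char b){
    return tolower(static_cast<unsigned char>(a)) ==
           tolower(static_cast<unsigned char>(b));
}

bool match(const char* str1, const char* str2){
    // Termination: both strings end together, return true
    if(*str1 == '\0'  && *str2 == '\0')
        return true;
    // '*' matches zero, one, or several characters
    if(*str1 == '*'){
        // Nothing left to consume, so '*' can only match zero characters
        if(*str2 == '\0')
            return match(str1+1, str2);
        return match(str1+1, str2) ||    // zero
               match(str1+1, str2+1) ||  // one
               match(str1, str2+1);      // several
    }
    // Only one string has ended (both ending was handled above): return false
    if(*str1 == '\0' || *str2 == '\0')
        return false;
    // '?' always matches, and so do two equal characters; check the next pair
    if(*str1 == '?' || sameChar(*str1, *str2))
        return match(str1+1, str2+1);
    return false;
}

int main(){
    string str1, str2;
    while(cin>>str1>>str2){
        if(match(str1.c_str(), str2.c_str()))
            cout<<"true"<<endl;
        else
            cout<<"false"<<endl;
    }
    return 0;
}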