Caffe: the SigmoidLayer class

A first look at the Caffe source code, starting with the SigmoidLayer class.

The SigmoidLayer class

The class is spread across three files:

Header: ./include/caffe/layers/sigmoid_layer.hpp
CPU implementation: ./src/caffe/layers/sigmoid_layer.cpp
GPU implementation: ./src/caffe/layers/sigmoid_layer.cu

See the official documentation for this class.

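The header sigmoid_layer.hpp declares:

The constructor SigmoidLayer(); the class derives from NeuronLayer, which itself derives from Layer
type(), which returns the name of this layer type
The forward pass: Forward_cpu() for the CPU and Forward_gpu() for the GPU
The backward pass: Backward_cpu() for the CPU and Backward_gpu() for the GPU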
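
The list above can be pictured as a class skeleton. This is only a rough sketch, not a copy of sigmoid_layer.hpp: LayerParameter, Blob and NeuronLayer are empty hypothetical stand-ins so that the snippet compiles without Caffe, and the forward/backward members are left non-virtual here (an assumption made purely so the sketch links without their definitions).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for Caffe types, just enough to compile this sketch.
struct LayerParameter {};
template <typename Dtype> class Blob {};

template <typename Dtype>
class NeuronLayer {  // stand-in for the real base class
 public:
  explicit NeuronLayer(const LayerParameter&) {}
  virtual ~NeuronLayer() {}
  virtual const char* type() const = 0;
};

template <typename Dtype>
class SigmoidLayer : public NeuronLayer<Dtype> {
 public:
  // The constructor only forwards the parameters to the base class.
  explicit SigmoidLayer(const LayerParameter& param)
      : NeuronLayer<Dtype>(param) {}

  // Returns the name of this layer type.
  virtual const char* type() const { return "Sigmoid"; }

 protected:
  // Declarations only; the definitions belong in the .cpp and .cu files.
  // Non-virtual in this sketch so that it links without those definitions.
  void Forward_cpu(const std::vector<Blob<Dtype>*>& bottom,
                   const std::vector<Blob<Dtype>*>& top);
  void Forward_gpu(const std::vector<Blob<Dtype>*>& bottom,
                   const std::vector<Blob<Dtype>*>& top);
  void Backward_cpu(const std::vector<Blob<Dtype>*>& top,
                    const std::vector<bool>& propagate_down,
                    const std::vector<Blob<Dtype>*>& bottom);
  void Backward_gpu(const std::vector<Blob<Dtype>*>& top,
                    const std::vector<bool>& propagate_down,
                    const std::vector<Blob<Dtype>*>& bottom);
};
```

The point of the sketch is the shape of the interface: one constructor, one type() accessor, and a CPU/GPU pair for each direction of the computation.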

Forward pass

CPU

The forward pass applies the sigmoid function to the bottom data to produce the top data, so its basic operation is sigmoid(). The CPU implementation:


template <typename Dtype>
inline Dtype sigmoid(Dtype x) {
	// Equivalent to 1 / (1 + exp(-x)), written in terms of tanh.
	return 0.5 * tanh(0.5 * x) + 0.5;
}
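
The tanh expression may look unfamiliar next to the textbook definition 1 / (1 + exp(-x)), but the two are the same function. A minimal sketch that checks this numerically; sigmoid_exp and max_abs_difference are hypothetical helper names introduced only for this comparison:

```cpp
#include <cassert>
#include <cmath>

// tanh-based form, as in the text.
template <typename Dtype>
inline Dtype sigmoid(Dtype x) {
  return 0.5 * std::tanh(0.5 * x) + 0.5;
}

// Textbook form, used here only for comparison.
template <typename Dtype>
inline Dtype sigmoid_exp(Dtype x) {
  return 1.0 / (1.0 + std::exp(-x));
}

// Largest absolute difference between the two forms on a grid over [-10, 10].
inline double max_abs_difference() {
  double worst = 0.0;
  for (int i = -100; i <= 100; ++i) {
    const double x = 0.1 * i;
    worst = std::fmax(worst, std::fabs(sigmoid(x) - sigmoid_exp(x)));
  }
  return worst;
}
```

Because tanh is bounded in [-1, 1], the tanh form saturates cleanly to 0 or 1 for inputs of large magnitude.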

With sigmoid() in place, the forward pass is computed as follows:

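template <typename Dtype>
void SigmoidLayer<Dtype>::Forward_cpu(const vector<Blob<Dtype>*>& bottom,
		const vector<Blob<Dtype>*>& top) {
	// cpu_data() returns a read-only pointer, so bottom_data must be const.
	const Dtype* bottom_data = bottom[0]->cpu_data();
	Dtype* top_data = top[0]->mutable_cpu_data();
	const int count = bottom[0]->count();

	// Apply sigmoid element-wise.
	for (int i = 0; i < count; ++i) {
		top_data[i] = sigmoid(bottom_data[i]);
	}
}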

Blob is the basic unit of data storage in Caffe; for its definition, see the blog note on Blob.

GPU
The GPU counterpart of sigmoid():

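template <typename Dtype>
__global__ void SigmoidForward(const int n, const Dtype* in, Dtype* out) {
	// Each thread handles the indices assigned to it by the grid-stride loop.
	CUDA_KERNEL_LOOP(index, n) {
		out[index] = 0.5 * tanh(0.5 * in[index]) + 0.5;
	}
}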

Here CUDA_KERNEL_LOOP(index, n) computes each thread's global index and maps threads onto data elements, which gives data parallelism. The macro is defined in include/caffe/util/device_alternate.hpp:

// CUDA: grid stride looping
#define CUDA_KERNEL_LOOP(i, n) \
	for (int i = blockIdx.x * blockDim.x + threadIdx.x; \
		i < (n); \
		i += blockDim.x * gridDim.x)

This is a general-purpose loop; for the details, see the blog note on CUDA.
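
To see why a grid-stride loop covers every element exactly once, even when there are fewer threads than elements, the loop can be emulated on the CPU. This is only an illustrative sketch: grid_stride_visits and all_visited_once are hypothetical helpers, and the loop variables block and thread play the roles of blockIdx.x and threadIdx.x.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts how many times each index in [0, n) is visited when every
// (block, thread) pair runs the same loop header as CUDA_KERNEL_LOOP.
inline std::vector<int> grid_stride_visits(int n, int grid_dim, int block_dim) {
  std::vector<int> visits(n, 0);
  for (int block = 0; block < grid_dim; ++block) {
    for (int thread = 0; thread < block_dim; ++thread) {
      // Start at the thread's global index, then jump by the total
      // number of threads in the grid.
      for (int i = block * block_dim + thread; i < n;
           i += block_dim * grid_dim) {
        ++visits[i];
      }
    }
  }
  return visits;
}

// True if every index was visited exactly once.
inline bool all_visited_once(const std::vector<int>& visits) {
  for (std::size_t i = 0; i < visits.size(); ++i) {
    if (visits[i] != 1) return false;
  }
  return true;
}
```

With 10 elements and a 2 x 3 launch (6 threads), thread 0 handles indices 0 and 6, thread 1 handles 1 and 7, and so on; with more threads than elements, the surplus threads simply skip the loop body.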

Likewise, the GPU forward pass:

template <typename Dtype>
void SigmoidLayer<Dtype>::Forward_gpu(const vector<Blob<Dtype>*>& bottom,
		const vector<Blob<Dtype>*>& top) {
	// gpu_data() returns a read-only device pointer, so bottom_data must be const.
	const Dtype* bottom_data = bottom[0]->gpu_data();
	Dtype* top_data = top[0]->mutable_gpu_data();
	const int count = bottom[0]->count();

	// Launch enough blocks to cover all count elements.
	SigmoidForward<Dtype><<<CAFFE_GET_BLOCKS(count), CAFFE_CUDA_NUM_THREADS>>>(
		count, bottom_data, top_data);
	CUDA_POST_KERNEL_CHECK;
}

The kernel launch specifies the number of threads per block, and the number of blocks is derived from it.

Specifically, CAFFE_CUDA_NUM_THREADS = 512, so each block launches 512 threads, and CAFFE_GET_BLOCKS(count) is:


inline int CAFFE_GET_BLOCKS(const int N) {
	// Ceiling division: enough 512-thread blocks to cover N elements.
	return (N + 512 - 1) / 512;
}
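
A few concrete values make the rounding behaviour clearer. The sketch below repeats the same arithmetic under hypothetical names (kNumThreads, get_blocks, idle_threads) so that it runs without Caffe:

```cpp
#include <cassert>

// Same value as CAFFE_CUDA_NUM_THREADS in the text.
const int kNumThreads = 512;

// Number of blocks needed to cover n elements (ceiling of n / 512).
inline int get_blocks(const int n) {
  return (n + kNumThreads - 1) / kNumThreads;
}

// Threads in the launch that have no element of their own to process.
inline int idle_threads(const int n) {
  return get_blocks(n) * kNumThreads - n;
}
```

For example, 512 elements fit in one block, 513 elements need two, and 1000 elements need two blocks (1024 threads), leaving 24 threads idle. Those surplus threads are harmless because the i < n condition in CUDA_KERNEL_LOOP keeps them out of the loop body.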

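For a simple operator like sigmoid, the GPU implementation intuitively amounts to removing the innermost loop of the CPU implementation and replacing it with a kernel that runs in parallel.

Backward pass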

CPU

From the sigmoid backpropagation formula, the following is easy to write:

template <typename Dtype>
void SigmoidLayer<Dtype>::Backward_cpu(
							const vector<Blob<Dtype>*>& top,
							const vector<bool>& propagate_down,
							const vector<Blob<Dtype>*>& bottom) {
	if (propagate_down[0]) {
		// cpu_data() and cpu_diff() return read-only pointers.
		const Dtype* top_data = top[0]->cpu_data();
		const Dtype* top_diff = top[0]->cpu_diff();
		Dtype* bottom_diff = bottom[0]->mutable_cpu_diff();
		const int count = bottom[0]->count();
		for (int i = 0; i < count; ++i) {
			// top_data already holds sigmoid(x) from the forward pass.
			const Dtype sigmoid_x = top_data[i];
			bottom_diff[i] = top_diff[i] * sigmoid_x * (1. - sigmoid_x);
		}
	}
}

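Here top_diff[i] is the gradient of the loss with respect to the i-th output of the forward pass, top_data[i]. By the chain rule it is multiplied by the local derivative of sigmoid, sigmoid_x * (1 - sigmoid_x).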
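
The backward formula relies on the identity sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)). One way to convince yourself of it is a finite-difference check. The sketch below is a standalone illustration, not part of Caffe; backward_analytic and backward_numeric are hypothetical names, and the step size eps is an arbitrary choice.

```cpp
#include <cassert>
#include <cmath>

inline double sigmoid(double x) {
  return 0.5 * std::tanh(0.5 * x) + 0.5;
}

// Gradient as computed in the backward pass: top_diff * y * (1 - y),
// where y = sigmoid(x) is the forward output.
inline double backward_analytic(double x, double top_diff) {
  const double y = sigmoid(x);
  return top_diff * y * (1.0 - y);
}

// Central finite-difference approximation of the same gradient.
inline double backward_numeric(double x, double top_diff, double eps = 1e-5) {
  return top_diff * (sigmoid(x + eps) - sigmoid(x - eps)) / (2.0 * eps);
}
```

Comparing the two at a handful of points, for instance x = -2, 0 and 1.5, should give agreement to many decimal places. At x = 0 with top_diff = 1 the analytic value is 0.5 * 0.5 = 0.25, the maximum slope of the sigmoid.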

GPU

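For the GPU implementation, the innermost loop of the code above only needs to be replaced with a kernel, so the kernel comes first:

template <typename Dtype>
__global__ void SigmoidBackward(const int n,
							const Dtype* in_diff,
							const Dtype* out_data,
							Dtype* out_diff) {
	CUDA_KERNEL_LOOP(index, n) {
		// out_data holds sigmoid(x) from the forward pass.
		const Dtype sigmoid_x = out_data[index];
		out_diff[index] = in_diff[index] * sigmoid_x * (1 - sigmoid_x);
	}
}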

Then the loop is replaced with a kernel launch:

template <typename Dtype>
void SigmoidLayer<Dtype>::Backward_gpu(const vector<Blob<Dtype>*>& top,
							const vector<bool>& propagate_down,
							const vector<Blob<Dtype>*>& bottom) {
	if (propagate_down[0]) {
		// gpu_data() and gpu_diff() return read-only device pointers.
		const Dtype* top_data = top[0]->gpu_data();
		const Dtype* top_diff = top[0]->gpu_diff();
		Dtype* bottom_diff = bottom[0]->mutable_gpu_diff();
		const int count = bottom[0]->count();

		SigmoidBackward<Dtype><<<CAFFE_GET_BLOCKS(count), CAFFE_CUDA_NUM_THREADS>>>(
			count, top_diff, top_data, bottom_diff);
		CUDA_POST_KERNEL_CHECK;
	}
}

All of this is straightforward.

Take note
Tip: on Linux, the grep command can search an entire project for a keyword:

grep -n -H -r "CUDA_KERNEL_LOOP" .