FPGA-based CNN Accelerators with OpenCL
Convolutional Neural Networks (CNNs) are state-of-the-art neural networks used in many fields such as video analysis, face detection and image classification. Due to their high demands on computational resources and memory bandwidth, CNNs are mainly executed on special accelerator hardware which is more powerful and energy-efficient than general-purpose processors. This paper gives an overview of the use of FPGAs for accelerating computation-intensive CNNs with OpenCL. To this end, two different implementation alternatives are proposed. The first approach is based on nested loops, inspired by the mathematical formula of the multidimensional convolution. The second strategy transforms the computational problem into a matrix multiplication problem on the fly. The description of both approaches is followed by common optimization techniques for FPGA designs based on high-level synthesis (HLS). Afterwards, the proposed implementations are compared to a CNN implementation on an Intel Xeon CPU, which makes it possible to demonstrate the advantages in terms of performance and energy efficiency.
Convolutional Neural Networks (CNNs) are a class of feed-forward artificial neural networks inspired by the visual cortex of the brain. They are commonly used for computer vision applications like face recognition and image classification as well as for a wide range of speech recognition applications. During the last decade, there have been significant accuracy improvements due to enhanced network structures. Some of these improvements are made possible by advances in the field of integrated circuits, which enable the processing of large amounts of data through reduced structure sizes and massive parallelism. Besides that, embedded applications such as Google's reverse image search run on devices with limited computational power and limited battery life, which is another reason why energy-efficient implementations are needed. The training is usually done on high-performance GPUs and other accelerators like Intel Xeon Phi coprocessors [2]. Some vendors also create their own application-specific accelerator processors, like Google's Tensor Processing Unit [4].
Field-programmable gate arrays (FPGAs) have been increasingly used for this purpose for several years. The advantages of using FPGAs are flexibility and energy efficiency. Unlike ASICs, FPGAs can be reprogrammed in the field, and their thousands of computational units together with fast on-chip memory make them well suited for massively parallel computations. The use of high-level languages like OpenCL for development reduces the time to market, so even larger designs can be realized in an acceptable time.
This paper is structured in the following way. Chapter 2 presents related work in this research field. Chapter 3 provides background information on FPGA technology and on deep neural networks with CNN layers. Two different approaches for implementing CNNs on FPGAs are shown in chapter 4; since optimization is a big challenge when designing FPGA accelerators, common optimization techniques are described there as well. Before concluding, the two implementations are compared with a CPU implementation.
2 Related Work
Several accelerators for computation-intensive problems have been proposed over the last decade. Most of the computational effort of CNNs is caused by the convolution operations, which can be implemented either directly with nested loops or as matrix multiplications. Matrix multiplications are used by Zhang et al. [9] and Suda et al. [7]. The nested-loop approach is used by Zhang et al. [8] as a basis for further optimization.
C. Zhang et al. [8] and C. Zhang et al. [10] use on-chip buffers for data reuse in order to optimize the CNN computation and memory accesses. Much research has also been done on optimizing the data representation (Gupta et al. [3]; Courbariaux et al. [1]). Another approach to data type optimization is introduced by Z. Zhang et al. [10]: they reduced the floating-point data size to 9 bits and implemented a quad-multiplier operation on a single DSP element.
While GPUs and other specialized hardware like Intel Xeon Phi processors have been widely used for accelerating computing systems, FPGAs have only recently been integrated into high-performance computing environments. For example, cloud platform providers such as Amazon and Microsoft offer FPGAs to their customers. The use of OpenCL as a hardware description language contributes significantly to this development.
Field-programmable gate arrays are standardized integrated circuits whose functionality can be configured in the field after production. The chip contains standardized digital elements; their configuration and the connections between them are specified during the synthesis process.
The main components of the chip are logic elements, whose structure depends on the specific FPGA type. In general, a logic element consists of one or more lookup tables, each with a register at its output. The lookup table realizes combinational logic functions (e.g. logical AND or OR functions), while the register stores the result for one or more cycles.
Additionally, FPGAs contain configurable components like on-chip memory, routing networks, clock trees and digital signal processing (DSP) elements, which allow higher clock frequencies than similar computational units built from logic elements. On-chip block RAM (BRAM) provides the computational units with full-speed access to data. Most of these blocks have two independent memory ports, which allow two independent read or write operations at the same time.
For accelerator purposes, FPGAs are placed on PCI Express accelerator cards. In addition to the FPGA chip, these cards carry components for the power supply, external memory and other interfaces such as Ethernet.
The design implementation for large FPGAs is nowadays done in high-level synthesis (HLS) languages like SystemC or OpenCL. Thus, the time to market is significantly reduced compared to classical hardware description languages like VHDL or Verilog. The global market is dominated by two big vendors, Intel (previously Altera) and Xilinx, both of which provide their customers with individual toolkits for the development of FPGA designs.
Convolutional Neural Networks (CNNs) are a class of feed-forward artificial neural networks inspired by the visual cortex of the brain. State-of-the-art CNNs consist of one or more convolution layers combined with non-linear activation functions, normalization functions and pooling layers. Normalization functions scale each output by a factor which depends on the neighbors of that specific neuron; the normalization yields a standard normal distribution of the output data. Non-linear activation functions like ReLU f(x) = max(0, x) or the sigmoid g(x) = (1 + exp(-x))^(-1) are added to decouple layers from each other and to introduce non-linearity into the net. Pooling layers are used to reduce the feature dimensions. The last layers of the whole net are usually built as fully-connected layers for classification. In a fully-connected layer, every output of the previous layer is connected to every input of the next layer.
The most critical part of a CNN regarding computational performance is the convolution operation, which contributes most of the computational effort of the neural net. Convolution layers create a set of output feature maps by computing convolutions on the input feature maps. As shown in figure 1, the input feature map (left side) is transformed into one 2D output map (right side) with two K x K filter kernels (middle).
In general, convolution layers generate Nout output feature maps from Nin input feature maps. Each output feature map is the result of a convolution with a Kh x Kw x Nin filter kernel, where Kh is the height and Kw the width of the filter kernel.

[Figure not included in this excerpt]

Figure 1: CNN operation: the scalar product of input and filter yields one element of the output feature map

Typically, the kernel is square such that Kh = Kw = K. Since there are Nout of these kernels, the overall dimension of the filter weights is Kh x Kw x Nin x Nout. The computation of the element at position (xw, xh) in the i-th output map from the j-th input map is shown in equation 1; references to negative indices or to indices outside of the input map return zero. For computing the complete set of output feature maps, the number of nested sums and loops increases from two (equation 1) to six.
[Equation not included in this excerpt]
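Equation 1 itself is not reproduced in this excerpt. From the surrounding description (two nested sums over the kernel height and width, a kernel roughly centered on the output position, and zero for indices outside the input map), a plausible reconstruction of the contribution of the j-th input map to the i-th output map is:

```latex
\mathit{out}_i(x_h, x_w) \mathrel{+}= \sum_{k_h=0}^{K_h-1} \sum_{k_w=0}^{K_w-1}
  \mathit{in}_j\!\left(x_h + k_h - \lfloor K_h/2 \rfloor,\; x_w + k_w - \lfloor K_w/2 \rfloor\right)
  \cdot w_{i,j}(k_h, k_w) \tag{1}
```

The exact index offsets (padding and stride) depend on the layer configuration and may differ in the original equation.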
The implementation of CNNs consumes a considerable amount of computational resources as well as a significant amount of memory. Large CNN architectures like AlexNet have more than 60 million model parameters, which consume about 250 MB of memory assuming a 32-bit data type. Since FPGAs do not provide on-chip memory on this scale, these parameters have to be stored in external memory, so the external memory bandwidth can become a performance bottleneck. The challenge in the implementation is therefore to optimize the flow of data to and from the compute units.
In this chapter, two different approaches for the CNN computation are presented. The first approach uses nested loops and essentially corresponds to a direct implementation of equation 1. The second approach transforms the problem into matrix multiplications on the fly; in this form, standard matrix multiplication architectures can be used for the computation.