Design of parallelized HDWT Hardware Architecture with an Increased Throughput

James Ntaganda
Engineering Faculty Staff, University of Rwanda, College of Science and Technology Kigali, Rwanda
PhD Candidate at Konkuk University, Electronic Engineering Department, Seoul-Korea

ABSTRACT
Discrete Wavelet Transform-based applications such as Multimedia CODECs require intensive computation. With modern technologies, Hardware-Software Co-designing has become a common practice. Such designs are regarded as SoC (System on Chip), where designers dedicate hardware modules for computational-intensive functions and are called routinely by the main program code, which is why they are called hardware accelerators. Comparatively, multiplication operation takes longer, consumes more processing power and utilizes more hardware resources. Keeping other factors constant, any technique that can reduce multiplication operations is considered a better approach in hardware designs. This paper is based on our previous published concept, where we exploited the simplicity of Haar function and proposed a recursive structure of a non-normalized Haar Discrete Transform. We implemented it as a multiplier-less Haar DWT (HDWT) hardware module which can spatially transform 16x16 arrays. However, in our previous work, most of the processing nodes remained idle and exhibited less throughput. To increase parallelization and hence throughput, in this paper we propose "mirror-reflection" approach. This reduces the number of redundant and idle processing nodes in the circuit module and makes the circuitry scalable. With this approach, circuit redundancy was reduced and circuit utilization efficiency increased from 47% to 69%. The scalable hardware Module is implemented using FPGA.

General Terms
Signal processing, Hardware design, Multimedia compression

Keywords
Haar DWT, Image Compression, Hardware Implementation, Mirror-reflection, FPGA

1. INTRODUCTION
Images and video clips are becoming our daily basis for information sharing and storage. Such growing demand, enormous binary information in multimedia content, scarcity of bandwidth and storage memory requirements dictate techniques for video compression [1]. Compression is done by means of trade-off. The video or image quality is traded-off to transmission and storage media gains. However, depending on applications, such a trade-off has to be done by preserving a considerable perceptual video quality for HVS (Human Visual Systems). In most of multimedia compression processes, CODECs will have to involve spatial/temporal Transformations, Quantization and information source coding from encoder side while the reverse is done in decoding process [1],[6],[9]. Discrete Wavelet Transforms have proved to be competitive in MPEG2000 and its latest version such as MJPEG-XR and MJPEG-LS [6]. MJPEG2000 CODEC and its successors are DWT-based CODECs. Contrary to Discrete Cosine Transforms (DCTs), DWT features progressive image transmission by quality or resolution, lossy and lossless compressions, region of interest (ROI) encoding, and good error resilience [10]. Yusong Hu, Viktor K. Prasanna [14] emphasise that MIPEG latest standard series have chosen DWT instead of DCT due to its good energy compaction and inherent multi-resolution image presentation ability. Dragomir El Mezeni et al [6], describe JPEG-XR architecture in detail. Similarly, Christian Perra [7] suggested two architectures of MJPEG2000 and MJPEG-XR, both of which deploy DWT at their transformation stage. Such Transformations require intensive computation, consume more power and occupy larger area on chip in hardware designs. In this paper we use the Haar DWT (HDWT) developed in our previous work [1] and design a "mirror-reflected" structure of non normalised Haar discreet transform. This is done with an aim of increasing the throughput as well as reducing circuitry redundancy and idleness. Thus, we increase circuit utilisation efficiency. The designed circuit structure can process arrays which may represent a video or an image frame micro block. Similar approach can be used in customizing designs to larger or smaller modules depending on the size candidate micro blocks to be processed. The rest of this paper is organized as follows: section two gives a brief review of Haar DWT. Section 3 shows circuitry design details, section four shows more analysis of the circuit and section five presents conclusion.

2. DISCRETE WAVELET TRANSFORM
Wavelets are mathematical tools that can allow a function to be described in terms of a coarse overall shape, plus details that range from broad to narrow [2]. Regardless of whether the function of interest is an image, a curve, or a surface, Wavelets offer an elegant technique for representing the levels of detail present [2]. For any spatial data of NxN dimensions, each of its rows and columns can be represented as a linear combination of averaged and detailed coefficients (matrix entries) [4]. A. Jensen and A. la [4] presents a thorough reading on Ripples in Mathematics the Discrete Wavelet Transform and their hierarchical decompositions. For more topics about Discrete Wavelet Transform [4] avails a good reading.

2.1 Haar Function, its Scaling and Shifting Nature
In Mathematics, Haar function has existed for decades but our major objective in this paper is to use its scaling and shifting nature and design a recursive and flexible structure for multiplier-less operation in multimedia compression. The end result is the structure that can be “mirror-reflected” presented in figures 2 and 3 for flexible hardware module realization.
Using equations (1) and (2), by scaling and shifting, in our previous work [1], we used MATLAB\textsuperscript{TM} simulation to generate Haar wavelet MRA (Multi-resolution Analysis) graphical representation. These graphical representations made it easier to comprehend equations associated with this design.

2.2 Haar Wavelet Transforms Function and its basis

All images or video frames can be spatially represented in form of array vectors to store respective rows and columns of an N x M resolution [1]. In arrays where N=M, the whole image is stored in a square matrix representation. In [1], vector, $I \in \mathbb{R}^{16}$ was represented in its basis functions. We can represent $I(x)$ in terms of Haar wavelet coefficients by equation (3) [1]

$$I(x) = \sum_{i,j \in \mathbb{Z}} <I, \psi_{i,j}> \psi_{i,j} \quad (3)$$

By using equations (1) and (2), $I \in \mathbb{R}^{16}$ can be represented by equation (4) [1]

$$I_0 = I_0 \phi_{0,0} + I_1 \psi_{0,0} + I_2 \psi_{1,0} + I_3 \psi_{1,1} + I_4 \psi_{2,0} + I_5 \psi_{2,1} + I_6 \psi_{3,0} + I_7 \psi_{3,1}$$
$$I_{10} = I_{10} \phi_{0,0} + I_{11} \psi_{0,0} + I_{12} \psi_{1,0} + I_{13} \psi_{1,1} + I_{14} \psi_{2,0} + I_{15} \psi_{2,1} \quad (4)$$

Using the scaling functions and shifting operations [1], we can obtain the wavelet components of $I \in \mathbb{R}^{16}$. It follows that each wavelet coefficient in MRA and orthogonal basis representation is given by equations (5) to (20) in [1], represented as equations (5) and (6) in this paper. In these equations, the scaled and shifted function $\psi_{i,j}$ denotes the unity shifted version of mother wavelet along the intervals. Because of limited space in this paper, only $I_0$ and $I_{15}$ are presented for a quick understanding of the design. For details about derivation and representation of $I_0$ to $I_{15}$, our previous work [1] reserves a better reading.

2.3 Matrix Representation of Transform Equations

Using $I_0$ to $I_{15}$ equivalencies, all equations can be represented in a matrix format as follows.
3. HARDWARE DESIGN

Yusong Hu and Viktor K. Prasanna [15], presented a good design for FPGA. However, in their design, they used two separate circuits to process rows and columns separately. Furthermore, they have used Parameterized Lifting-Based and involves multipliers. M. Nagabushanam [16] modified Distributive Arithmetic (DA) to avoid multipliers but their serialized shifters reduce the throughput of hardware modules which has to be used as hardware accelerator. Parvatham Vijay and Seetharaman Gopalakrishnan [17] presented a novel Approach by FPGA and ASCI. However like the fore mentioned references, their Lifting scheme approach involves alternations of shifters and filter coefficients (floating points) hence not as elegant for hardware implementation, especially in digital circuit designs. Compared to other DWT, Haar is considered the more relaxed in terms of hardware design and exactly reversible without the edge effects faced with other wavelet transforms Khamees Khalaf Hasan [18]. Revisiting the conventional recursive architectures of FFT, DCT and DST, it was indeed possible to explore a better design for a Faster Discrete Wavelet Transform. By re-arranging 16 inputs elements of the candidate array, we were able to construct a recursive structure depicted in figure 1[1]. The major drawback of it is that the circuit processing nodes would remain redundant and hence hardware module remains underutilized. Contrary, by using “mirror-reflection” approach, in this paper, hardware sharing and parallelization is possible, thus making our proposed structure better for multidimensional Image/video processing

3.1 Proposed Faster HDWT Structure

In this paper, we re-arrange the inputs starting with $S$, for rows of a matrix to be processed and $S$, denoted re-arranged input from columns of the same matrix. In both arrangements from the input buffer “n” ranges from 0 to 15 in a 16-array input. By imaginary (virtual) mirror reflection from the central axis of the circuit in figure 1, futures 2 and 3 are re-drawn to represent reflected circuit and overall design respectively. This approach increased circuitry utilization efficiency

<table>
<thead>
<tr>
<th>Table 1. Possible downscaling circuitry combinations</th>
</tr>
</thead>
<tbody>
<tr>
<td>Design</td>
</tr>
<tr>
<td>Possible downscaled circuits</td>
</tr>
</tbody>
</table>

Figure 1: Non-reflected circuit with idle processing nodes [1] from matrix in equation (7)
Figure 2: “Mirror –reflected” circuit with idle processing nodes

Figure 3: Overall circuit design with increased throughput and parallelization structure

Table 2. Circuit node utilization efficiency

<table>
<thead>
<tr>
<th>Design</th>
<th>Total nodes</th>
<th>Idle nodes</th>
<th>Used nodes</th>
<th>Efficiency</th>
</tr>
</thead>
<tbody>
<tr>
<td>Figure 1</td>
<td>64</td>
<td>34</td>
<td>30</td>
<td>47%</td>
</tr>
<tr>
<td>Figure 3</td>
<td>64</td>
<td>20</td>
<td>44</td>
<td>69%</td>
</tr>
</tbody>
</table>

4. CIRCUIT ANALYSIS, PLACE AND ROUTING TO FPGA

From the circuitry setup in figures 3 and 4, U1 represents the processing unit between S nodes and X nodes. U2 is between X and Y, U2 is between Y and Z while U4 is between Z and I as it emerge to the output buffer. It can be seen that as soon as the output of U1 emerges at time t1, the next input from the
buffer can be fed in to the input immediately. In so doing, the throughput can be increased. Contrary to our previous design, both columns and rows can be processed with only a time difference of one cycle. The only additional work in this design is pre and post re-arrangement of array elements to be processed. Judging from figures 1 and 3, it can be shown that the circuit utilization efficiency has increased from 47% to 69%. This is depicted in table 2. It is worthy to mention in this section that circuit in figure 3 also allows the scalability of NxN arrays. For 16x16, there is only one possible combination. On the other hand, if 8x8 array processing is required, there are two possible parallel circuits that can be found within figure 3. This can be done by bypassing S nodes and feeding the inputs to X nodes. Similarly, bypassing S and X nodes, 4 parallel circuitry can be achieved to process a 4x4 array. This is very important in video processing where different micro block sizes are processed separately based on region of interest and pixel homogeneity. These combinations are depicted in table 1.

**Figure 4: Parallel processing per clock cycles**

**Figure 5: Cyclone IV E-EPCE115F29C7 FPGA wire bonding**

5. CONCLUSION

In this paper, flexibility and simplicity of Haar Discrete Transform is exploited to draw a recursive structure for a 16-element array input. Due to redundancy of the processing nodes, a mirror-reflection approach is used. The hardware module is implemented Cyclone IV E-EPCE115F29C7 FPGA. Circuit Redundancy was reduced utilization efficiency increased from 47% to 69% compared to a “non-mirrored” module structure. The module can be used to transform a 16x16 Micro block of a video frame with a high throughput by dual clock design. Furthermore where Scalable micro block processing is required, the input arrays can be down scaled from 16x16 down to 4x4. In the next work, we shall embed this hardware module into SoC systems for more advanced hardware-software co-design.

6. REFERENCES


[14] Zunera Idrees, Eliza Hashemiaghjekandi, Image Compression by Using Haar Wavelet Transform and Singular Value Decomposition, School of computer Science, Physics and mathematics Linnaeus University  


