## ARPN Journal of Engineering and Applied Sciences

©2006-2016 Asian Research Publishing Network (ARPN). All rights reserved.

www.arpnjournals.com

## DESIGN AND PERFORMANCE ANALYSIS OF BCSE ALGORITHM AND HAN CARLSON ADDER BASED MAC UNIT

Oindrila Bhattacharya, T. Ravi and V. Vijayakumar

Department of Electronics and Communication Engineering, Sathyabama University, Chennai, Tamil Nadu, India

E-Mail: optimistic.oindrila@gmail.com

#### ABSTRACT

This paper presents the analysis of a Multiply-Accumulate (MAC) architecture for DSP applications. In VLSI, arithmetic cells such as adders and multipliers are the most commonly used components. A MAC unit consists of a combinational multiplier followed by an adder and an accumulator register that stores the result. Efficient implementation of the MAC unit is crucial in most microprocessors and digital signal processors (DSPs). An efficient constant multiplier architecture based on the vertical-horizontal binary common sub-expression elimination (VHBCSE) algorithm may be used to design an efficient MAC unit. The 4-bit binary common sub-expression elimination (BCSE) algorithm is first applied vertically across adjacent coefficients on the 2-D space of the coefficient matrix, followed by a variable-bit BCSE algorithm applied horizontally within each coefficient. This reduces the average switching activity of the multiplier block. The proposed architecture was applied to a MAC unit, compared against conventional compressor-based MAC units, and applied to DSP applications to check its performance. To speed up the addition, a Han Carlson adder is introduced. Parallel prefix adders provide good results as compared to conventional adders.

Keywords: multiply-accumulate unit, vertical-horizontal binary common sub-expression elimination (VHBCSE) algorithm, Han Carlson adder, binary common sub-expression elimination (BCSE).

#### 1. INTRODUCTION

The Multiply and Accumulate (MAC) unit is the most typical feature that differentiates a DSP from a general-purpose processor. All DSP algorithms require some form of multiplication and accumulation at some stage of their operation. The MAC unit is therefore the most important block in DSP systems. In this work, the 4-bit BCSE algorithm and the Han Carlson adder have been implemented in the MAC unit.

#### 2. MULTIPLY ACCUMULATE CONTROL UNIT

The MAC unit is composed of a multiplier, an adder and an accumulator. The use of hardware description languages (HDLs) in the digital design process is important because of the growing complexity of digital systems. A hardware description language allows a digital system to be designed and debugged at a higher level before conversion to the gate and flip-flop level. The critical operations in digital signal processing (DSP) applications usually involve many multiplications and accumulations. Thus, a high-throughput multiplier accumulator (MAC) is always a key element in achieving high performance in real-time signal processing applications.

Figure-1. Block diagram of the MAC unit.

In the last few years, the major consideration of MAC design has been to enhance its speed, because speed and throughput rate are always the concerns of digital signal processing systems. With the increase in portable electronic products, low power design has also become a major consideration. The MAC unit provides high-speed multiplication with addition. The multiplier is implemented using the VHBCSE algorithm, and the adder is a Han Carlson adder, whose simple layout allows faster operation. The inputs of the MAC are fed to the multiplier block, which performs the multiplication and passes the product to the adder. The adder adds the product to the accumulated value, and the result is stored in the accumulator register.
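
The multiply, add and store sequence described above can be sketched behaviorally in a few lines. This is a minimal Python model, not the authors' Verilog. The class name `MacUnit`, the method `step` and the default 35-bit accumulator width (borrowed from the multiplier output width quoted in Section 4) are assumptions made for illustration.

```python
class MacUnit:
    """Behavioral model of a multiply-accumulate unit (illustrative only)."""

    def __init__(self, acc_bits=35):
        # acc_bits is an assumed register width, not a figure for the accumulator
        # taken from the paper.
        self.mask = (1 << acc_bits) - 1
        self.acc = 0  # accumulator register, cleared at start

    def step(self, x, h):
        """Multiply x by h, add the product to the accumulator, store the result."""
        product = x * h                          # multiplier block
        self.acc = (self.acc + product) & self.mask  # adder + register write
        return self.acc


mac = MacUnit()
mac.step(3, 4)           # accumulator now holds 12
result = mac.step(5, 6)  # accumulator now holds 12 + 30 = 42
```

Masking the sum to the register width mimics the wrap-around of a fixed-width hardware accumulator.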
#### 3. EXISTING MAC UNIT

Vertical and horizontal BCSE are the two types of BCSE used for eliminating the binary common sub-expressions (BCSs) in any BCSE method. Vertical BCSE produces more effective elimination than horizontal BCSE. However, a BCSE algorithm that combines vertical and horizontal BCSE may also be used to design an efficient MAC unit. In this algorithm, a 2-bit vertical BCSE is applied first across adjacent coefficients, followed by horizontal BCSEs that detect and eliminate as many of the BCSs present within each coefficient as possible [4, 7]. The MAC unit has better results than conventional multipliers. Applying the 2-bit vertical BCSE to the filter coefficients to generate the partial products requires one adder of 17 full adder cells. The multiplier then requires 1, 2, 1, 2, and 1 adders of 17-bit, 16-bit, 13-bit, 9-bit, and 5-bit width respectively, a total of 85 full adder cells, to sum up the partial products [6]. Hence, the total requirement of full adder cells amounts to 249.

#### 4. 4-BIT BCSE ALGORITHM AND HAN CARLSON ADDER BASED MAC UNIT

The VHBCSE algorithm based constant multiplier architecture has been coded in the Verilog hardware description language and synthesized using the Xilinx ISE tool. As there is a trade-off between area and delay, slices and LUTs are considered for a fair comparison. The results indicate that the delay of the proposed design is better than that of the earlier reported designs using the 2-bit BCSE and Vedic MAC units [6].

#### A) 4-BIT BCSE algorithm

The designed multiplier takes the input (Xin) and the coefficient (H) as 16-bit values, while the output is assumed to be 35 bits long. In the 2-bit BCSE algorithm, layer 1 considers 2-bit binary common sub-expressions (BCSs) ranging from "00" to "11", which will produce 4 partial products.
Of these four BCSs, however, only the pattern "11" requires an adder (A0) to generate its partial product; the rest are generated by hardwired shifting. For the 4-bit BCSE algorithm, layer 1 contains 4-bit BCSs ranging from "0000" to "1111". Adders are needed for the shift-and-add generation of these sub-expressions: for each pattern between "0000" and "1111", adders add shifted copies of the input according to the value of the 4-bit coefficient segment.

#### i. Partial Product Generator (PPG)

In the BCSE method, a shift-and-add technique is used to generate the partial products, which are summed in the following layers to produce the final multiplication result.

Figure-2. Partial product generator.

#### ii. Control Logic (CL) generator

The control logic generator block takes the multiplexed coefficient (Hm[15:0]) as its input and partitions it into groups of 4 bits each. It produces control signals depending on the value of each 4-bit group [4, 3].

#### iii. Layer-2 addition

The partial products (PP) generated from the groups of 4-bit BCSs are added up to obtain the final multiplication result.

Figure-3. Addition at layer 2.

According to the BCSE algorithm proposed earlier, layer 2 requires four addition operations (A1-A4) to sum up the eight partial products. Instead of adding these partial products directly, layered addition is performed at layer 2 according to the VHBCSE algorithm. These adders (A1-A4) are generated based on the 4-bit BCSE.

#### iv. Layer-4 addition

This block adds the two sums (AS5-AS6) produced by layer 3 to produce the product of the input and the coefficient. The Han Carlson adder performs this final addition and produces the final output [6, 8].

Figure-4. Complete operation of the 4-bit BCSE MAC unit.
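
A software sketch may help make the partial product generation and layered addition concrete. The Python below assumes unsigned 16-bit operands, splits the coefficient into four 4-bit groups, looks up a shift-and-add partial product for each group, and sums the shifted partial products in a pairwise tree. The function names are hypothetical, and the adder tree is a generic one that does not reproduce the exact A0-A4 and AS5-AS6 arrangement of the figures.

```python
def partial_product_table(x):
    """Partial products of x for every 4-bit pattern "0000".."1111",
    built only from shifts and adds (no multiplier)."""
    table = []
    for pattern in range(16):
        pp = 0
        for bit in range(4):
            if (pattern >> bit) & 1:
                pp += x << bit  # add a shifted copy of the input
        table.append(pp)
    return table


def layered_sum(terms):
    """Add the terms pairwise, layer by layer, until one sum remains."""
    while len(terms) > 1:
        paired = [terms[i] + terms[i + 1] for i in range(0, len(terms) - 1, 2)]
        if len(terms) % 2:
            paired.append(terms[-1])  # odd term passes through to the next layer
        terms = paired
    return terms[0]


def bcse_multiply(x, h, coeff_bits=16):
    """Multiply x by the unsigned coefficient h using 4-bit groups of h."""
    table = partial_product_table(x)
    # Control-logic stand-in: each 4-bit group of h selects one partial product.
    groups = [(h >> shift) & 0xF for shift in range(0, coeff_bits, 4)]
    # Weight each selected partial product by the position of its group.
    shifted = [table[g] << (4 * i) for i, g in enumerate(groups)]
    return layered_sum(shifted)


product = bcse_multiply(0xAAAA, 0xFFFF)
```

Because the partial products depend only on the input, the same table can be shared by every coefficient that multiplies that input, which is where the common sub-expression saving comes from.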
#### B) Han Carlson adder unit

In VLSI, binary adders are important elements of processor chips and are used extensively in DSP systems. The ripple carry adder is the first and most fundamental adder capable of performing binary addition. Since its latency is proportional to the length of its input operands, it is not well suited to wide, high-speed addition. To speed up the addition, the carry look-ahead adder was introduced. Parallel prefix adders provide good results compared to these conventional adders. Adders built from large, complex gates are too slow for VLSI, so the design is made more efficient by breaking it into trees of smaller and faster adders that are more readily implemented. For large adders, the delay of passing the carry through the look-ahead stages becomes dominant, and therefore tree adders or parallel prefix adders are used.

High speed adders depend on the previous carry to generate the present sum. Any decrease in adder delay relates directly to an increase in throughput, so it is important to develop addition algorithms that provide high performance while reducing power. Parallel prefix adders are suitable for VLSI implementation since they rely on simple cells with regular connections between them.

The different types of parallel prefix adders available are the Kogge-Stone, Brent-Kung, Sklansky, Han-Carlson, Knowles and Ladner-Fischer adders. Brent-Kung uses the minimal number of computation nodes, which reduces area, but the structure has the maximum depth, which slightly increases latency. Sklansky reduces the delay at the expense of increased fanout. Kogge-Stone achieves high speed and low fanout but produces complex circuitry with more wiring tracks.
Ladner and Fischer introduced a network between Sklansky and Brent-Kung that provides trade-offs between logic levels and fanout. T. Han and D. A. Carlson presented a hybrid construction of a parallel prefix adder that combines two designs: the Kogge-Stone construction, whose best feature is higher speed, and the Brent-Kung construction, whose best feature is a low area requirement.

Figure-5. Stage in performing prefix computation.

A Han-Carlson adder uses fewer prefix operations by adjusting the number of stages taken from the Kogge-Stone and Brent-Kung adders, and thus reduces the area required by the adder circuitry.

Figure-6. Han Carlson adder.

Figure-7. Generalized flow from square box to diamond architecture.

The number of prefix computation stages for the Han-Carlson adder is five, which is one more than the Kogge-Stone design ($\log_2 16 = 4$) for the same word size. However, the number of prefix operations is lower in the Han-Carlson design (32) than in the Kogge-Stone design (49). Thus, the Han-Carlson adder reduces the area used by the adder circuitry in return for one extra stage of delay compared to the Kogge-Stone adder.

#### 5. SIMULATION RESULTS

[Simulation waveform of 26-bit signals, sampled at 3,000,000 ps; image not reproduced.]

Figure-8. Han Carlson adder output.

Figure-9. Four bit BCSE MAC unit.

The proposed MAC unit was applied to the FIR filter architecture.

Figure-10. FIR filter using 4-Bit BCSE MAC unit and Han Carlson adder.
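
To tie the simulated blocks together, the following Python sketch models a Han-Carlson prefix adder at bit level and uses it as the accumulation adder of a direct-form FIR filter. It is an illustrative model under stated assumptions, not the synthesized design: the function names, the 32-bit accumulator width and the use of Python's `*` operator in place of the BCSE multiplier are all assumptions. The adder also counts its prefix operations, so the figure of 32 operations for a 16-bit word quoted in Section 4 can be checked.

```python
def han_carlson_add(a, b, width=16):
    """Bit-level Han-Carlson addition. Returns (sum incl. carry-out, prefix op count)."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]    # bit generate
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]  # bit propagate
    G, P = g[:], p[:]
    ops = 0

    # Stage 1 (Brent-Kung style): each odd bit absorbs its even neighbor.
    for i in range(1, width, 2):
        G[i], P[i] = g[i] | (p[i] & g[i - 1]), p[i] & p[i - 1]
        ops += 1

    # Middle stages (Kogge-Stone style) on odd bit positions only.
    d = 2
    while d < width:
        prev_g, prev_p = G[:], P[:]  # read values from the previous stage
        for i in range(d + 1, width, 2):
            G[i] = prev_g[i] | (prev_p[i] & prev_g[i - d])
            P[i] = prev_p[i] & prev_p[i - d]
            ops += 1
        d *= 2

    # Final stage: each even bit takes the completed carry from the odd bit below.
    for i in range(2, width, 2):
        G[i], P[i] = g[i] | (p[i] & G[i - 1]), p[i] & P[i - 1]
        ops += 1

    carries = [0] + G[:-1]  # carry into bit i is the group generate of bits 0..i-1
    total = sum((p[i] ^ carries[i]) << i for i in range(width))
    return total | (G[width - 1] << width), ops


def fir_filter(samples, coeffs, width=32):
    """Direct-form FIR filter: every output is a run of multiply-accumulate steps."""
    mask = (1 << width) - 1
    outputs = []
    for n in range(len(samples)):
        acc = 0
        for k, h in enumerate(coeffs):
            if n - k >= 0:
                product = samples[n - k] * h  # stand-in for the BCSE multiplier
                acc, _ = han_carlson_add(acc, product, width)
                acc &= mask  # discard the carry-out, as a fixed-width register would
        outputs.append(acc)
    return outputs


total, prefix_ops = han_carlson_add(0xAAAA, 0xFFFF)  # total == 0x1AAA9, prefix_ops == 32
filtered = fir_filter([1, 2, 3, 4], [1, 1])          # [1, 3, 5, 7]
```

For a 16-bit word the model runs one Brent-Kung style stage, three Kogge-Stone style stages and one final stage, which matches the five stages stated for the Han-Carlson adder.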
[Simulation waveform sampled at 2,000,000 ps; image not reproduced.]

Figure-11. 16-bit Vedic MAC unit.

Figure-12. Modified Booth MAC unit.

Table-1. Evaluation of MAC units.

| MAC unit | Delay (ns) | No. of slices | No. of LUTs |
|---|---|---|---|
| Vedic multiplier and Han Carlson adder | 49.349 | 461 | 802 |
| Modified recoded Booth multiplier and Han Carlson adder | 23.683 | 44 | 79 |
| Two bit BCSE multiplier and Han Carlson adder | 43.832 | 406 | 744 |
| Four bit BCSE multiplier and Han Carlson adder | 21.563 | 69 | 121 |

#### 6. CONCLUSIONS

The MAC unit is the key module of any DSP processor, so improving its performance can improve the overall efficiency of the system.
Future work can concentrate on the area efficiency of the proposed MAC unit. This MAC unit can be implemented in a DSP processor to obtain better processor efficiency.

#### REFERENCES

- [1] Chang, Chip-Hong; Gu, Jiangmin; Zhang, Mingyan. 2004. Ultra low-voltage low-power CMOS 4-2 and 5-2 compressors for fast arithmetic circuits. IEEE Transactions on Circuits and Systems I: Regular Papers. 1985-1997.
- [2] Hoang, Tung Thanh; Sjalander, M.; Larsson-Edefors, P. 2010. A High-Speed, Energy-Efficient Two-Cycle Multiply-Accumulate (MAC) Architecture and Its Application to a Double-Throughput MAC Unit. IEEE Transactions on Circuits and Systems I: Regular Papers.
- [3] Chen Ping-hua; Zhao Juan. 2009. High-speed Parallel 32×32-b Multiplier Using a Radix-16 Booth Encoder. Third International Symposium on Intelligent Information Technology Application Workshops (IITAW '09).
- [4] Ravi, T. 2015. Design and performance analysis of ultra low power RISC processor using hybrid drowsy logic in CMOS technologies. International Journal of Applied Engineering Research (IJAER). 10(2): 4287-4296.
- [5] Rajput, R.P.; Swamy, M.N.S. 2012. High Speed Modified Booth Encoder Multiplier for Signed and Unsigned Numbers. UKSim 14th International Conference on Computer Modelling and Simulation (UKSim).
- [6] Ranjith, S.; Ravi, T.; Umarani, P.; Arunya, R. 2014. Design of CNTFET based sequential circuits using fault tolerant reversible logic. International Journal of Applied Engineering Research. 9(24): 25789-25804.
- [7] Jaina, D.; Sethi, K.; Panda, R. 2011. Vedic Mathematics Based Multiply Accumulate Unit. International Conference on Computational Intelligence and Communication Networks (CICN).
- [8] Aliparast, Peiman; Koozehkanani, Ziaadin D.; Nazari, Farhad. 2013. An Ultra High Speed Digital 4-2 Compressor in 65-nm CMOS. International Journal of Computer Theory and Engineering. 5(4).
- [9] Weste, N.; Harris, David. 2008. CMOS VLSI Design: A Circuits and System Perspective. Pearson Education.
- [10] Chandra Mohan, U. 2003. Low Power Area Efficient Digital Counters. Proceedings of the 7th VLSI Design and Test Workshops (VDAT).
- [11] Narendra, C. P.; Ravi Kumar, K. M. 2014. Efficient Comparator based Sum of Absolute Differences Architecture for Digital Image Processing Applications. International Journal of Computer Applications. Foundation of Computer Science, New York, USA.