16×16 Bits 4 stage Pipelined Multiplier

This tutorial is dedicated to Mr. Patrick Leung. It was him who taught me the very basics of VHDL and the fundamentals of digital design. It was with him that I found my passion in VHDL, digital design and engineering in general. You will be always remembered, Mr. Leung!

This tutorial is about implementing a 4 stage pipelined 16×16 bits multiplier, instead of using one of the FPGA multipliers. It takes 4 clock cycles to yield a product from the very instant the multiplier inputs are latched by the D-flip flop.

Algorithm Implementation

These are the stages needed for implementing this 4 stages multiplier:

  1. We break the 16 bits of the multiplier and multiplicand into 4 bits nibbles.
  2. At the first clock cycles we generate all partial products.
  3. In the second clock cycle we sum all products generated from same multiplier nibble
  4. At the second stage we sum the first two partial products, and the last two partial products.
  5. at the fourth cycle, we sum up the two partial sums that we generated in third clock cycle.

The power point below summaries the algorithm steps described above. It was takes from Mr. Leung lectures slides in summer 2010.

multiplierFigure1. summary of the 4 stage 16×16 pipelined multiplier, from Mr.Leung lectures

VHDL code

--Author: F. Mabrouk

LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.numeric_std.ALL;
USE ieee.std_logic_unsigned.ALL;
 
ENTITY mult_system_4_stage IS PORT (
  														
  clk	   :   IN  std_logic;							-- Multiplier clock
  en       :   IN  std_logic;							-- multiplier clocking enable,active high 
  port_A   :   IN  std_logic_vector(15 downto 0); -- multiplicant input
  port_B   :   IN  std_logic_vector(15 downto 0); -- multiplier input
  port_out :   OUT std_logic_vector(31 downto 0)); -- product output
 
END mult_system_4_stage;

ARCHITECTURE assign3 OF mult_system_4_stage IS

SIGNAL SignFlag_0,SignFlag_1,SignFlag_2,SignFlag_3:std_logic; --define the type of input signflag

SIGNAL Reg_A: std_logic_vector(15 downto 0); -- data from input portA is stored in Reg_A
SIGNAL Reg_B: std_logic_vector(15 downto 0); -- data from input portB is stored in Reg_b

SIGNAL Reg1_A: std_logic_vector(15 downto 0); -- data from Reg_A is stored in Reg1_A afer checking 15th bit and taking 2's compliment
SIGNAL Reg1_B: std_logic_vector(15 downto 0); -- data from Reg_B is stored in Reg1_b afer checking 15th bit and taking 2's compliment


--Each generated product term is stored in a register    
SIGNAL P0_Reg2, P1_Reg2,  P2_Reg2, P3_Reg2   : std_logic_vector(7 downto 0);        
SIGNAL P4_Reg2, P5_Reg2,  P6_Reg2, P7_Reg2   : std_logic_vector(7 downto 0); 
SIGNAL P8_Reg2, P9_Reg2,  P10_Reg2, P11_Reg2 : std_logic_vector(7 downto 0); 
SIGNAL P12_Reg2,P13_Reg2, P14_Reg2,P15_Reg2  : std_logic_vector(7 downto 0);  

--Sums are stored in registers3
SIGNAL Sum0_Reg3: std_logic_vector(19 downto 0);
SIGNAL Sum1_Reg3: std_logic_vector(19 downto 0);
SIGNAL Sum2_Reg3: std_logic_vector(19 downto 0);
SIGNAL Sum3_Reg3: std_logic_vector(19 downto 0);

BEGIN

MULTIPLIER: process(clk,en)

BEGIN

if ((clk'event and clk = '1') and (en = '1')) then

Reg_A<=port_A;
Reg_B<=port_B;

--Store value of portA input if positive, else take 2's complement of portA
-- and store it in Reg1_A register
if(Reg_A(15) = '1') then 
Reg1_A<=(not Reg_A) + 1;
else
Reg1_A<=Reg_A;
end if;

--Store value of portB input if positive, else take 2's complement of portB
-- and store it in Reg1_B register
if(Reg_B(15) = '1') then 
Reg1_B<=(not Reg_B) + 1;
else
Reg1_B<=Reg_B;
end if;

--Determine if product result will be positive or negative
if (((Reg_A(15) = '1') and (Reg_B(15) = '1') )OR ((Reg_A(15) = '0') and (Reg_B(15) = '0')))then 
SignFlag_0 <= '0';
else
SignFlag_0 <= '1';
end if;

--**** partial products are generated and stored in registers2  *******************************
P0_Reg2<=Reg1_A (3  downto 0)*Reg1_B(3   downto 0); -- Multiplication Product of A0B0
P1_Reg2<=Reg1_A (3  downto 0)*Reg1_B(7   downto 4); -- Multiplication Product of A0B1 
P2_Reg2<=Reg1_A (3  downto 0)*Reg1_B(11  downto 8); -- Multiplication Product of A0B2
P3_Reg2<=Reg1_A (3  downto 0)*Reg1_B(15  downto 12); -- Multiplication Product of A0B3
P4_Reg2<=Reg1_A (7  downto 4)*Reg1_B(3   downto 0); -- Multiplication Product of A1B0
P5_Reg2<=Reg1_A (7  downto 4)*Reg1_B(7   downto 4); -- Multiplication Product of A1B1
P6_Reg2<=Reg1_A (7  downto 4)*Reg1_B(11  downto 8); -- Multiplication Product of A1B2
P7_Reg2<=Reg1_A (7  downto 4)*Reg1_B(15  downto 12);-- Multiplication Product of A1B3
P8_Reg2<=Reg1_A (11 downto 8)*Reg1_B(3   downto 0); -- Multiplication Product of A2B0
P9_Reg2<=Reg1_A (11 downto 8)*Reg1_B(7   downto 4); -- Multiplication Product of A2B1  
P10_Reg2<=Reg1_A(11 downto 8)*Reg1_B(11  downto 8); -- Multiplication Product of A2B2
P11_Reg2<=Reg1_A(11 downto 8)*Reg1_B(15  downto 12); -- Multiplication Product of A2B3
P12_Reg2<=Reg1_A(15 downto 12)*Reg1_B(3  downto 0);  -- Multiplication Product of A3B0
P13_Reg2<=Reg1_A(15 downto 12)*Reg1_B(7  downto 4);  -- Multiplication Product of A3B1
P14_Reg2<=Reg1_A(15 downto 12)*Reg1_B(11 downto 8);  -- Multiplication Product of A3B2
P15_Reg2<=Reg1_A(15 downto 12)*Reg1_B(15 downto 12); -- Multiplication Product of A3B3

SignFlag_1<=SignFlag_0;            -- here we propagate the number value sign 

--Storing partial sums in Register 3, 20 bit adder	
Sum0_Reg3<= ("000000000000"&P0_Reg2)+("00000000"&P1_Reg2&"0000")+
	   ("0000"&P2_Reg2&"00000000")+(P3_Reg2&"000000000000");
             
Sum1_Reg3<= ("000000000000"&P4_Reg2)+("00000000"&P5_Reg2&"0000")+
           ("0000"&P6_Reg2&"00000000")+(P7_Reg2&"000000000000");

Sum2_Reg3<= ("000000000000"&P8_Reg2)+("00000000"&P9_Reg2&"0000")+
	    ("0000"&P10_Reg2&"00000000")+(P11_Reg2&"000000000000");

Sum3_Reg3<= ("000000000000"&P12_Reg2)+("00000000"&P13_Reg2&"0000")+
	    ("0000"&P14_Reg2&"00000000")+(P15_Reg2&"000000000000");

SignFlag_2<=SignFlag_1;    -- here we propagate the number value sign 

--final sum--
if(SignFlag_2='1') then
port_out<=(NOT (("00000000"&(("0000"&Sum0_Reg3) + (Sum1_Reg3&"0000")))+
		((("0000"&Sum2_Reg3) + (Sum3_Reg3&"0000"))&"00000000")) + 1);
else
port_out<=(("00000000"&(("0000"&Sum0_Reg3) + (Sum1_Reg3&"0000")))+
		((("0000"&Sum2_Reg3) + (Sum3_Reg3&"0000"))&"00000000"));
end if;

END IF;	
END PROCESS MULTIPLIER;

END assign3;	

Vivado Simulation

The code above was simulated with a test bench and it shows indeed that each multiplication takes 4 clock cycles from the very instant an input is latched (at rising edge.) You can verify these by looking a the timings below.

testbenchFigure2. Simulation results generated on Xilinix Vivado

 

If you like this tutorial, give us a like below and if you have any questions or feedback, feel free to drop us a message in the comment section.

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.