High Speed IIR Filter Design With DSP Slices

In this post, I will talk about how to optimize the speed of the IIR filter designed in the previous post. This new 6th-order IIR filter can run at a clock speed of 400 MHz by using DSP slices.

The filter is structured as three cascaded biquad sections, optimized for low latency and high throughput by leveraging the FPGA's built-in DSP blocks for multiply-accumulate operations and applying deep pipelining techniques.

IIR difference equation

1. Direct‑form difference equation (6th order)

y[n] = Σ_{k=0}^{6} b_k x[n-k] − Σ_{k=1}^{6} a_k y[n-k]

2. Transfer function

H(z)=Y(z)/X(z)=…

3. Cascade of three biquads

Product of three biquad sections

4. Per‑biquad difference equation

Per‑biquad difference equation
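
For reference, the standard forms that follow from the direct-form equation in step 1 can be written out as below. This is a sketch under the usual normalization a_0 = 1; the section index i and the per-section coefficient names b_{k,i} and a_{k,i} are notation introduced here for illustration, not taken from the original figures.

```latex
% Assumes a_0 = 1; the index i and the b_{k,i}, a_{k,i} names are illustrative.

% 2. Transfer function of the 6th-order filter
H(z) = \frac{Y(z)}{X(z)} = \frac{\sum_{k=0}^{6} b_k z^{-k}}{1 + \sum_{k=1}^{6} a_k z^{-k}}

% 3. The same H(z) factored into three second-order (biquad) sections
H(z) = \prod_{i=1}^{3} H_i(z), \qquad
H_i(z) = \frac{b_{0,i} + b_{1,i} z^{-1} + b_{2,i} z^{-2}}{1 + a_{1,i} z^{-1} + a_{2,i} z^{-2}}

% 4. Difference equation of one biquad section
y_i[n] = b_{0,i}\,x_i[n] + b_{1,i}\,x_i[n-1] + b_{2,i}\,x_i[n-2]
         - \left( a_{1,i}\,y_i[n-1] + a_{2,i}\,y_i[n-2] \right)
```

The grouping in the last line mirrors the VHDL further down: the b terms accumulate in BiQuad_mul_x, the a terms accumulate in BiQuad_mul_y, and Adder1 presumably forms the difference of the two sums.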

Implementation

As per the biquad difference equation above, each section needs five multiply-add operations plus one subtraction. To perform these operations efficiently, I used the Xilinx MultAdd and Adder IP cores. I will explain later how these IP cores are configured so that Vivado maps them to DSP48 slices. The VHDL code below shows the implementation of a single IIR biquad. Each biquad uses 5 DSP slices for the multiply-add operations and one more DSP slice for the subtraction. For a sixth-order IIR filter, we need a total of 18 DSP slices (6 DSP slices per biquad).

Note that each multiply-add operation takes three clock cycles, and each addition/subtraction takes two clock cycles. In total, a single biquad computation takes 21 clock cycles, which includes the clocks needed for capturing the input data and shifting the incoming samples. With a system clock of 400 MHz (a 2.5 ns period), that is 52.5 ns per biquad, as seen in the image below. For three biquads, it takes 63 clock cycles.

Figure 1: Time it takes to process one sample per biquad
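
The 21-cycle figure can be cross-checked against the stage flags and wait counters in the process listed next. The tally below is my own reading of that code (one clock per load stage, four clocks per MultAdd wait, three clocks for the Adder wait), not a breakdown given by the timing capture itself.

```latex
\begin{aligned}
N_{\text{biquad}} &= \underbrace{1}_{\text{capture } x[n]}
  + \underbrace{1 + 4}_{\text{MultAdd1/2}}
  + \underbrace{1 + 4}_{\text{MultAdd3/4}}
  + \underbrace{1 + 4}_{\text{MultAdd5}}
  + \underbrace{1 + 3}_{\text{Adder1}}
  + \underbrace{1}_{\text{scale, shift}} = 21 \\
t_{\text{biquad}} &= 21 \times \frac{1}{400\ \text{MHz}} = 21 \times 2.5\ \text{ns} = 52.5\ \text{ns} \\
N_{\text{total}} &= 3 \times 21 = 63 \;\Rightarrow\; t_{\text{total}} = 63 \times 2.5\ \text{ns} = 157.5\ \text{ns}
\end{aligned}
```

On this reading, if a section only accepts a new sample after finishing the previous one, the sample rate per section is bounded by roughly 400 MHz / 21, or about 19 MS/s.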

-- Process for updating the filter with pipelined calculations
     BiQuad_PROCESS : process(clk, resetn)
     
     variable BiQuad_WaitClkCounter1 : integer range 0 to 4 := 0;  -- wait counter for MultAdd1/MultAdd2 results (counts to 4)
     variable BiQuad_WaitClkCounter2 : integer range 0 to 4 := 0;  -- wait counter for MultAdd3/MultAdd4 results (counts to 4)
     variable BiQuad_WaitClkCounter3 : integer range 0 to 4 := 0;  -- wait counter for MultAdd5 result (counts to 4)
     variable BiQuad_WaitClkCounter4 : integer range 0 to 3 := 0;  -- wait counter for Adder1 result (counts to 3)
     begin
         if resetn = '0' then
         
         -- Reset input and output registers to 0
         BiQuad_x_reg <= (others => (others => '0'));
         BiQuad_y_reg <= (others => (others => '0'));
           
         BiQuad_mul_x <= (others => (others => '0'));
         BiQuad_mul_y <= (others => (others => '0'));
  
         BiQuad_Sum_x <= (others => (others => '0'));
         BiQuad_Sum_y <= (others => (others => '0'));
         BiQuad_sample_valid_out <= '0';
         BiQuad_Output_NotShifted <= (others => '0');
         BiQuad_busy <= '0';
            
         MultAdd1_Subtract_sig <= '0';
         MultAdd2_Subtract_sig <= '0';
         MultAdd3_Subtract_sig <= '0';
         MultAdd4_Subtract_sig <= '0';
         MultAdd5_Subtract_sig <= '0';
         
         MultAdd1_sclr_sig <= '0';
         MultAdd2_sclr_sig <= '0';
         MultAdd3_sclr_sig <= '0';
         MultAdd4_sclr_sig <= '0';
         MultAdd5_sclr_sig <= '0';
         --BiQuad complete stages flags
         BiQuad_DoneStage1 <= '0';
         BiQuad_DoneStage2 <= '0';
         BiQuad_DoneStage3 <= '0';
         BiQuad_DoneStage4 <= '0';
         BiQuad_DoneStage5 <= '0';
         BiQuad_DoneStage6 <= '0';
         BiQuad_DoneStage7 <= '0';
         BiQuad_DoneStage8 <= '0';
         BiQuad_DoneStage9 <= '0';
         BiQuad_DoneStage10 <= '0';
         BiQuad_DoneStage11 <= '0';
         
         
         BiQuad_WaitClkCounter1 := 0; 
         BiQuad_WaitClkCounter2 := 0;
         BiQuad_WaitClkCounter3 := 0;
         BiQuad_WaitClkCounter4 := 0;
         
         elsif rising_edge(clk) then 
        
        ------------------BIQUAD LOGIC -------------------------
 
             ----BiQuad stage1 is started  
             if sample_valid_in = '1' then
                 BiQuad_x_reg(0) <=  signed(x_in); 
                 BiQuad_busy  <= '1';
                 BiQuad_DoneStage2 <= '1';   
             end if; 
                    
             ----BiQuad stage2 is started 
             if(BiQuad_DoneStage2 = '1') then 
                 -- Implement the equations below:
                 -- BiQuad_mul_x(0)= BiQuad_x_reg(0) * BiQuad_B_Coef(0)
                 MultAdd1_ce_sig <='1'; 
                 MultAdd1_A_In <= std_logic_vector(BiQuad_x_reg(0));
                 MultAdd1_B_In <= std_logic_vector(BiQuad_B0_Coef);  
                 MultAdd1_C_In <= (others => '0');
                 MultAdd1_Subtract_sig <= '0';
                 -- BiQuad_mul_y(1)= BiQuad_y_reg(1) * BiQuad_A_Coef(1)
                 MultAdd2_ce_sig <='1'; 
                 MultAdd2_A_In <= std_logic_vector(BiQuad_y_reg(1)); 
                 MultAdd2_B_In <= std_logic_vector(BiQuad_A1_Coef); 
                 MultAdd2_C_In <= (others => '0');
                 MultAdd2_Subtract_sig <= '0';
                 BiQuad_DoneStage2 <= '0';   --reset BiQuad stage2 flag
                 BiQuad_DoneStage3 <= '1';  
              end if; 
                  
                 --BiQuad stage3 is started  
                 if(BiQuad_DoneStage3 ='1') then 
                     BiQuad_WaitClkCounter1 :=  BiQuad_WaitClkCounter1 +1;   --start counter
                     if(BiQuad_WaitClkCounter1 = 4) then 
                         BiQuad_mul_x(0)  <= signed(MultAdd1_P_Out);
                         BiQuad_mul_y(1)  <= signed(MultAdd2_P_Out);
                 
                         BiQuad_WaitClkCounter1 := 0; 
                         BiQuad_DoneStage3 <='0'; 
                         BiQuad_DoneStage4 <='1';    --BiQuad stage3 is completed 
                     end if; 
                 end if;
                 
                 --BiQuad stage4 is started  
                 if(BiQuad_DoneStage4 ='1') then 
                     -- BiQuad_mul_x(1)= BiQuad_x_reg(1) * BiQuad_B_Coef(1) + BiQuad_mul_x(0)
                     MultAdd3_ce_sig <='1'; 
                     MultAdd3_A_In <= std_logic_vector(BiQuad_x_reg(1)); 
                     MultAdd3_B_In <= std_logic_vector(BiQuad_B1_Coef);
                     MultAdd3_C_In <= std_logic_vector(BiQuad_mul_x(0)); 
                     MultAdd3_Subtract_sig <= '0';
                     -- BiQuad_mul_y(2)= BiQuad_y_reg(2) * BiQuad_A_Coef(2) + BiQuad_mul_y(1)
                     MultAdd4_ce_sig <='1'; 
                     MultAdd4_A_In <= std_logic_vector(BiQuad_y_reg(2));  
                     MultAdd4_B_In <= std_logic_vector(BiQuad_A2_Coef);
                     MultAdd4_C_In <=std_logic_vector(BiQuad_mul_y(1));
                     MultAdd4_Subtract_sig <= '0';
                   
                     BiQuad_DoneStage4 <='0'; 
                     BiQuad_DoneStage5 <='1';    --BiQuad stage4 is completed 
                    
                 end if; 
                -- BiQuad stage5 is started  
                 if(BiQuad_DoneStage5 ='1') then             
                     BiQuad_WaitClkCounter2 :=  BiQuad_WaitClkCounter2 +1;   --start counter
                     if(BiQuad_WaitClkCounter2 = 4) then  
                         BiQuad_mul_x(1)<= signed(MultAdd3_P_Out);
                         BiQuad_mul_y(2) <= signed(MultAdd4_P_Out);
                         
                         BiQuad_WaitClkCounter2 := 0; 
                         BiQuad_DoneStage5 <='0'; 
                         BiQuad_DoneStage6 <='1';    --BiQuad stage5 is completed 
                     end if; 
                 end if; 
                 
                 --BiQuad stage6 is started 
                 if(BiQuad_DoneStage6 ='1') then  
                             
                     -- BiQuad_mul_x(2)= BiQuad_x_reg(2) * BiQuad_B_Coef(2) + BiQuad_mul_x(1)
                     MultAdd5_ce_sig <='1'; 
                     MultAdd5_A_In <= std_logic_vector(BiQuad_x_reg(2));  
                     MultAdd5_B_In <= std_logic_vector(BiQuad_B2_Coef);
                     MultAdd5_C_In <= std_logic_vector(BiQuad_mul_x(1));      
                     MultAdd5_Subtract_sig <= '0';
                   
                     BiQuad_DoneStage6 <='0'; 
                     BiQuad_DoneStage7 <='1';   
                 end if; 
               
                 ---BiQuad stage7 is started
                 if(BiQuad_DoneStage7 ='1') then
                     BiQuad_WaitClkCounter3 :=  BiQuad_WaitClkCounter3 +1;   --start counter
                     if(BiQuad_WaitClkCounter3 = 4) then 
                         BiQuad_mul_x(2)<= signed(MultAdd5_P_Out);
                         MultAdd5_ce_sig <='0'; 
             
                         BiQuad_WaitClkCounter3 :=0; 
                         BiQuad_DoneStage7 <='0'; 
                         BiQuad_DoneStage8 <='1';   --BiQuad stage7 is completed 
                     end if;    
                 end if; 
             
                 ---BiQuad stage8 is started    
                 if(BiQuad_DoneStage8 ='1') then
                     Adder1_A_In <= std_logic_vector(BiQuad_mul_x(2)); 
                     Adder1_B_In<=std_logic_vector(BiQuad_mul_y(2)); 
                     BiQuad_DoneStage8 <='0';
                     BiQuad_DoneStage9 <='1';
                 end if; 
              
                 ---BiQuad stage9 is started 
                 if(BiQuad_DoneStage9 ='1') then 
                     BiQuad_WaitClkCounter4 :=  BiQuad_WaitClkCounter4 +1;   
                     if(BiQuad_WaitClkCounter4 = 3) then 
                         BiQuad_Output_NotShifted <=  signed(Adder1_S_out);                       
 
                         BiQuad_WaitClkCounter4 :=0; 
                         BiQuad_DoneStage9 <='0';
                         BiQuad_DoneStage10 <='1';
                     end if;
                 end if; 
                 --BiQuad stage10 is started 
                 if(BiQuad_DoneStage10 ='1') then  
                     BiQuad_y_reg(1) <= resize(shift_right(BiQuad_Output_NotShifted,SCALE),X_WIDTH);   --scale down by 2^24
                     BiQuad_Output_Shifted <= resize(shift_right(BiQuad_Output_NotShifted,SCALE),X_WIDTH);   --BiQuad_Output_Shifted;
                     BiQuad_y_reg(2) <= BiQuad_y_reg(1);
               
                 for i in 1 to 2 loop
                     BiQuad_x_reg(i) <= BiQuad_x_reg(i-1);
                 end loop;
                
                 BiQuad_sample_valid_out <= '1';
 
                 BiQuad_busy  <= '0';
  
                 BiQuad_DoneStage10 <='0';
                 BiQuad_DoneStage11 <='1';
 
                 end if; 
              
                 ---BiQuad stage11 is started 
                 if(BiQuad_DoneStage11 ='1') then 
                     BiQuad_sample_valid_out <= '0';
                     BiQuad_busy  <= '0';
                     BiQuad_DoneStage11 <='0'; 
                 end if; 
         end if; 
        
end process;
-- Saturate the scaled result to the 16-bit output range
BiQuad_out  <= BiQuad_Output_Shifted(15 downto 0) when (BiQuad_Output_Shifted >= -32768)
                                             AND (BiQuad_Output_Shifted <= 32767) else
                                             to_signed(-32768,16) when (BiQuad_Output_Shifted < -32768) else
                                             to_signed(32767,16);
sample_valid_out <= BiQuad_sample_valid_out; 
end Behavioral;
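
The output clamp at the end of the architecture could also be written as a reusable function. The following is a minimal sketch, assuming the value to clamp is a signed vector; the package name saturate_pkg and the function saturate are hypothetical and not part of the original design.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

package saturate_pkg is
    -- Hypothetical helper: clamp a signed value into out_width bits
    -- instead of letting it wrap around when it is truncated.
    function saturate(value : signed; out_width : positive) return signed;
end package saturate_pkg;

package body saturate_pkg is
    function saturate(value : signed; out_width : positive) return signed is
        variable max_val : signed(out_width - 1 downto 0) := (others => '1');
        variable min_val : signed(out_width - 1 downto 0) := (others => '0');
    begin
        max_val(out_width - 1) := '0';  -- 0111...1, largest positive value
        min_val(out_width - 1) := '1';  -- 1000...0, most negative value

        if value'length <= out_width then
            -- Nothing can overflow: sign-extend (or pass through) the input
            return resize(value, out_width);
        elsif value > max_val then
            return max_val;
        elsif value < min_val then
            return min_val;
        else
            -- In range: the upper bits are all copies of the sign bit,
            -- so truncating with resize keeps the value unchanged
            return resize(value, out_width);
        end if;
    end function saturate;
end package body saturate_pkg;
```

With this package in scope, and assuming BiQuad_out is declared as signed(15 downto 0), the concurrent assignment could shrink to BiQuad_out <= saturate(BiQuad_Output_Shifted, 16);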

MultAdd and Adder IP Cores Configurations

Below are screenshots showing how the MultAdd and Adder IP cores can be configured.

Figure 2: MultAdd IP core configuration
Figure 3: Adder IP core configuration (1 of 2)
Figure 4: Adder IP core configuration (2 of 2)
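
To make the three-cycle multiply-add concrete, here is a minimal behavioral sketch of a P = A*B + C block with three register stages. It is a hand-written stand-in, not the MultAdd IP core itself: the entity name mult_add_pipe, the port names, and the default widths are all assumptions for illustration. Vivado will typically map a registered multiply-add of this shape onto a DSP48 slice, but the utilization report is the only reliable confirmation.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical stand-in for a multiply-add block: p = a * b + c,
-- valid three enabled clock edges after the operands are applied.
entity mult_add_pipe is
    generic (
        A_WIDTH : positive := 27;  -- assumed coefficient width
        B_WIDTH : positive := 16;  -- assumed sample width
        P_WIDTH : positive := 48   -- assumed accumulator width
    );
    port (
        clk : in  std_logic;
        ce  : in  std_logic;
        a   : in  signed(A_WIDTH - 1 downto 0);
        b   : in  signed(B_WIDTH - 1 downto 0);
        c   : in  signed(P_WIDTH - 1 downto 0);
        p   : out signed(P_WIDTH - 1 downto 0)
    );
end entity mult_add_pipe;

architecture rtl of mult_add_pipe is
    signal a_reg : signed(A_WIDTH - 1 downto 0) := (others => '0');
    signal b_reg : signed(B_WIDTH - 1 downto 0) := (others => '0');
    signal c_reg : signed(P_WIDTH - 1 downto 0) := (others => '0');
    signal c_dly : signed(P_WIDTH - 1 downto 0) := (others => '0');
    signal m_reg : signed(A_WIDTH + B_WIDTH - 1 downto 0) := (others => '0');
    signal p_reg : signed(P_WIDTH - 1 downto 0) := (others => '0');
begin
    process (clk)
    begin
        if rising_edge(clk) then
            if ce = '1' then
                -- Stage 1: register the operands
                a_reg <= a;
                b_reg <= b;
                c_reg <= c;
                -- Stage 2: register the full-width product, delay c to match
                m_reg <= a_reg * b_reg;
                c_dly <= c_reg;
                -- Stage 3: add the delayed c to the sign-extended product
                p_reg <= resize(m_reg, P_WIDTH) + c_dly;
            end if;
        end if;
    end process;

    p <= p_reg;
end architecture rtl;
```

Registering the operands, the product, and the sum separately is what lets each stage meet timing at a high clock rate; the cost is the fixed latency that the wait counters in the biquad process have to cover.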

The image below shows the output of the filter: the signal Dac_Out_Sig is the filtered 10 kHz signal sent to a DAC. Notice that the input signal is captured with some noise, but after the filter the high-frequency noise is removed.

Figure 5: Real-time data captured with the ILA showing the 10 kHz signal at the input and output of the filter

Timing Summary Report

The screenshot below shows the timing report generated by Vivado for this IIR design. The design was implemented on the Zynq UltraScale+ ZCU102 development board, which has this FPGA part number: xczu9eg-ffvb1156-1-e.

Figure 6: Timing report

The image below shows the utilization report for this IIR filter implementation. It uses 18 DSP slices, 6 per biquad: 5 for the multiply-add operations and one for the subtraction, as per the Direct Form I equation for an IIR filter.

Figure 7: Resource utilization

Source Code

If anyone is interested in the source code, feel free to message me in the comments.