Clock Domain Crossing Violations in FPGA, Part 2

Clock Domain Crossing — CDC — is one of those topics that looks simple on paper and then quietly destroys your hardware bring-up. You read about two-flop synchronizers, nod along, and move on. Then you build a real multi-clock design and spend days chasing bugs that never showed up in simulation.

This article walks through three CDC scenarios using a real FIR filter project as the backdrop. The first is a hypothetical — a very easy mistake to make that clearly illustrates the core problem. The second and third are real bugs encountered during development, both caused by the same class of thinking error. By the end, the pattern behind all three should be clear.

Description of the Two Project Designs

The first design is a real-time FIR low-pass filter on a Xilinx FPGA. A DDS (Direct Digital Synthesis) core generates a swept-frequency cosine signal, which is decimated down to a 500 kHz sample rate and fed into a 25 kHz FIR filter. The filtered output is written to an AD5445 DAC so the filter’s frequency response can be observed in the time domain on an oscilloscope and in the frequency domain on a spectrum analyzer.

The second design is similar to the first, except that the DDS IP core is replaced with an ADC1173 ADC that samples the output of an external frequency generator, which is used to perform a manual frequency sweep. The ADC samples at 10 MHz, and the sampled data is decimated to 500 kHz before it is sent to the FIR filter.

A PLL generates three clocks from the board input. The 200 MHz clock drives the DDS engine, the FIR core, and the DAC state machine. The 10 MHz clock goes out to the ADC interface pin.

clk200MHz  →  DDS engine · FIR core · DAC state machine
clk10MHz   →  ADC sampling clock
clk100MHz  →  General logic (reserved)

The signal flow through the design looks like this:

  DDS Compiler (200 MHz) in the first design / ADC (10 MHz) in the second design
      │
  Decimator (÷400 from 200 MHz, ÷20 from 10 MHz)  →  500 kHz sample rate
      │
  ── potential CDC boundaries ──
      │
  FIR Low-Pass Filter  (Xilinx AXI-Stream IP · 200 MHz)
      │
  DAC State Machine  →  AD5445 DAC  →  Oscilloscope/spectrum analyzer

Scenario 1: Fast Data, Slow Sampler, No Handshake

Hypothetical · Multi-bit bus · No handshake

Illustrative Scenario
This is a hypothetical case designed to illustrate a common CDC mistake clearly. It is easy to imagine arriving at this design and not immediately seeing the problem.

Imagine the DDS compiler produces an 8-bit sample on every cycle of its 200 MHz clock. A separate process running at 10 MHz decimates this output by capturing one sample every 20 of its own clock cycles, producing a 500 kHz data stream that is passed up to the top level for filtering. No handshake signal is added. The assumption is that since the 10 MHz process is doing the decimation, everything downstream will naturally be in sync.

-- 200 MHz domain: DDS output changes every cycle
DDS_data_out <= dds_compiler_output;   -- 8-bit, 200 MHz
 
-- 10 MHz process: decimate by 20 to produce 500 kHz output
process(clk10MHz)
begin
    if rising_edge(clk10MHz) then
        if decim_cnt = 19 then
            decim_cnt     <= 0;
            decimated_out <= DDS_data_out;  -- ❌ 8-bit bus, no sync
        else
            decim_cnt <= decim_cnt + 1;
        end if;
    end if;
end process;

Why this fails

The 200 MHz clock and the 10 MHz clock are two different clock domains. Even though the 10 MHz clock is slower, it is not synchronous with the 200 MHz clock — there is no guaranteed phase relationship between their rising edges. The 10 MHz process samples the 8-bit bus at whatever point in time its own clock edge happens to fall.

The Problem
An 8-bit value transitioning from, say, 0x7F to 0x80 requires all 8 bits to change simultaneously. In practice each bit reaches the capturing register with a slightly different delay, so if the 10 MHz clock edge falls while the bus is mid-transition, individual bits get captured in different states. The result is a corrupted code: neither 0x7F nor 0x80, but an arbitrary mix of old and new bits, which in this example could be any value from 0x00 to 0xFF. This is the classic multi-bit CDC failure.

The failure is intermittent and unpredictable. Most of the time the clock edge falls well away from a transition and the sample is correct. Occasionally it falls close to a transition and a corrupted value slips through. In simulation, the testbench typically aligns clocks conveniently and the bug never appears. On hardware, the phase offset between the two clock domains is determined by PLL initialisation and routing — neither of which you control.

Without any handshake signal, there is a second problem layered on top: the receiving side has no way of knowing when a new valid sample has actually arrived. It is blindly grabbing values off the bus with no guarantee that what it captured is a freshly decimated sample rather than a stale one.

The Fix
Never pass a multi-bit bus directly across a clock domain boundary without a handshake. Register the data in the source domain, toggle a single-bit flag on each new sample, synchronize that flag through a two-flop synchronizer into the destination domain, and use an XOR edge detector to reconstruct a valid strobe.

-- Source domain: register data and toggle a flag once per new sample
process(clk_source)
begin
    if rising_edge(clk_source) then
        data_reg    <= new_sample;
        data_toggle <= not data_toggle;  -- single bit, safe to sync
    end if;
end process;

-- Destination domain: two-flop sync + XOR edge detect
process(clk_dest)
begin
    if rising_edge(clk_dest) then
        sync1      <= data_toggle;
        sync2      <= sync1;
        sync2_prev <= sync2;

        -- One valid pulse per sample, provided new samples arrive
        -- several clk_dest cycles apart
        data_valid <= sync2 xor sync2_prev;

        if data_valid = '1' then
            captured_data <= data_reg;  -- stable by now
        end if;
    end if;
end process;
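
The two-flop stage used in the fix above can be factored into a small reusable entity. The following is a minimal sketch of one way to do that, not the project's actual code: the entity name sync_2ff and its port and signal names are hypothetical. ASYNC_REG is a Xilinx synthesis attribute; other vendors use different mechanisms for the same purpose.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical reusable two-flop synchronizer for ONE bit.
-- Never instantiate one of these per bit of a multi-bit bus.
entity sync_2ff is
    port (
        clk_dest : in  std_logic;  -- destination-domain clock
        d_async  : in  std_logic;  -- single bit from another clock domain
        q_sync   : out std_logic   -- same bit, resynchronized to clk_dest
    );
end entity sync_2ff;

architecture rtl of sync_2ff is
    signal meta_ff : std_logic := '0';  -- first stage, may go metastable
    signal sync_ff : std_logic := '0';  -- second stage, settled value

    -- Xilinx-specific: marks both flops as a synchronizer chain so the
    -- tools keep them and place them close together.
    attribute ASYNC_REG : string;
    attribute ASYNC_REG of meta_ff : signal is "TRUE";
    attribute ASYNC_REG of sync_ff : signal is "TRUE";
begin
    process (clk_dest)
    begin
        if rising_edge(clk_dest) then
            meta_ff <= d_async;
            sync_ff <= meta_ff;
        end if;
    end process;

    q_sync <= sync_ff;
end architecture rtl;
```

In the fix above, data_toggle would drive d_async and q_sync would take the place of sync2. The XOR edge detector and the data capture stay in the destination-domain process.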

Scenario 2: A 5 ns Pulse Trying to Talk to a 200 MHz World

Real Bug · Pulse too narrow to be caught

Back to the actual project. The FIR IP core has an AXI-Stream interface. To tell it a new sample is ready, s_axis_data_tvalid must be asserted high. The approach taken was to generate a pulse from a counter running off the 10 MHz clock to produce a 500 kHz enable rate. To avoid holding tvalid high for too long, the pulse was shortened — down to one cycle of the base logic. That made it 5 ns wide.

The FIR core runs at 200 MHz. Its clock period is also 5 ns.

-- 10 MHz counter generating a 500 kHz pulse
process(clk10MHz)
begin
    if rising_edge(clk10MHz) then
        if counter = 19 then
            counter <= 0;
            tvalid_pulse <= '1';   -- ❌ async to 200 MHz FIR (shortening to 5 ns not shown)
        else
            counter <= counter + 1;
            tvalid_pulse <= '0';
        end if;
    end if;
end process;

-- Passed directly to FIR core running at 200 MHz
s_axis_data_tvalid <= tvalid_pulse;  -- ❌ crossing clock domains

The Problem
A 5 ns pulse generated in the 10 MHz domain arrives asynchronously at the 200 MHz FIR core. It can land at any phase relative to the 200 MHz clock edge — it might straddle an edge and get captured, or fall entirely between two consecutive edges and be completely invisible. In practice the FIR core was receiving samples sporadically and unpredictably.

This is exactly the kind of bug that sends you down the wrong path. You check the DDS output, verify the filter coefficients, re-examine the decimation logic — and everything looks correct. The problem is not in any of those places. It is at the boundary, in a signal that is five nanoseconds wide.

The Fix
Replace the pulse with a toggle in the source domain. A toggle holds its value until the next sample arrives, 2 µs later at 500 kHz, which is 400 cycles of the 200 MHz clock, so it cannot be missed regardless of phase alignment. In the 200 MHz destination domain, a two-flop synchronizer and XOR edge detector reconstruct a clean single-cycle valid pulse.

-- DDS_Sig_Gen: toggle on each new decimated sample
if decim_cnt = 399 then
    decim_cnt         <= (others => '0');
    ADC_Data_In_sync1 <= DDS_data_tdata;
    adc_toggle        <= not adc_toggle;  -- holds until the next sample
else
    decim_cnt         <= decim_cnt + 1;
end if;

ADC_data_valid <= adc_toggle;

-- Top level: two-flop synchronizer + XOR edge detector (200 MHz)
-- ADC_data_valid_Sig is assumed to be wired to the ADC_data_valid port above
if rising_edge(clk200MHz) then
    adc_sync1      <= ADC_data_valid_Sig;
    adc_sync2      <= adc_sync1;
    adc_sync2_prev <= adc_sync2;
end if;

-- Fires for exactly one 200 MHz cycle per new sample
FIR_sample_valid_in <= adc_sync2 xor adc_sync2_prev;
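
One way to build confidence in the toggle scheme before going to hardware is a testbench whose clocks drift against each other instead of staying edge-aligned. The following is a minimal self-checking sketch under stated assumptions. All names (toggle_cdc_tb, clk_slow, clk_fast, N_EVENTS) are hypothetical, the slow clock period is deliberately set to 100.13 ns so that its edges slide across the 5 ns grid of the fast clock, and the simulator time resolution is assumed to be 1 ps or finer. RTL simulation does not model metastability, so this only checks that no event is lost or duplicated.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity toggle_cdc_tb is
end entity toggle_cdc_tb;

architecture sim of toggle_cdc_tb is
    constant T_FAST   : time    := 5 ns;       -- 200 MHz destination clock
    constant T_SLOW   : time    := 100.13 ns;  -- ~10 MHz source, off-nominal on purpose
    constant N_EVENTS : natural := 100;        -- number of toggles to send

    signal clk_fast   : std_logic := '0';
    signal clk_slow   : std_logic := '0';
    signal done       : boolean   := false;
    signal toggle     : std_logic := '0';
    signal sync1      : std_logic := '0';
    signal sync2      : std_logic := '0';
    signal sync2_prev : std_logic := '0';
    signal sent       : natural   := 0;
    signal received   : natural   := 0;
begin
    -- Free-running clocks that stop once the check has finished
    clk_fast <= not clk_fast after T_FAST / 2 when not done else '0';
    clk_slow <= not clk_slow after T_SLOW / 2 when not done else '0';

    -- Source domain: toggle once every 20 slow cycles (about 500 kHz)
    source : process
        variable cnt : natural := 0;
    begin
        wait until rising_edge(clk_slow);
        if cnt = 19 then
            cnt := 0;
            if sent < N_EVENTS then
                toggle <= not toggle;
                sent   <= sent + 1;
            end if;
        else
            cnt := cnt + 1;
        end if;
    end process source;

    -- Destination domain: two-flop synchronizer + XOR edge detector
    dest : process (clk_fast)
    begin
        if rising_edge(clk_fast) then
            sync1      <= toggle;
            sync2      <= sync1;
            sync2_prev <= sync2;
            if (sync2 xor sync2_prev) = '1' then
                received <= received + 1;
            end if;
        end if;
    end process dest;

    -- Every toggle sent must produce exactly one detected edge
    checker : process
    begin
        wait until sent = N_EVENTS;
        wait for 10 * T_FAST;  -- let the last toggle propagate
        assert received = N_EVENTS
            report "toggle events were lost or duplicated"
            severity failure;
        report "all toggle events crossed exactly once";
        done <= true;
        wait;
    end process checker;
end architecture sim;
```

Swapping the toggle for a pulse that is one fast-clock period wide, driven from the slow domain, should make the same check fail for some phase offsets, which is the behaviour described in this scenario.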

Scenario 3: A 10 MHz State Machine Waiting for a 5 ns Signal

Real Bug · State machine too slow · Unrelated pulse generator

The DAC side had its own problem, and it was the same fundamental mistake in a different form.

When the FIR core produces a valid output sample it asserts m_axis_data_tvalid for exactly one 200 MHz clock cycle — 5 ns. The DAC state machine needed to see this signal to know when to latch the output and write it to the AD5445. The state machine process was clocked at 10 MHz, sampling its inputs once every 100 ns.

-- DAC state machine running at 10 MHz
process(clk10MHz, reset)
begin
    if rising_edge(clk10MHz) then
        case state is
            when 0 =>
                -- ❌ FIR_data_out_valid is a 5ns pulse from 200MHz domain
                -- 10MHz process checks every 100ns — almost never sees it
                if FIR_data_out_valid = '1' then
                    state <= 1;
                end if;
            when others =>
                null;  -- remaining states omitted
        end case;
    end if;
end process;

The Problem
A 5 ns pulse has roughly a 1-in-20 chance of overlapping a 10 MHz rising edge at all. In practice the DAC state machine almost never saw FIR_data_out_valid assert. It sat in its idle state indefinitely, waiting for a trigger that effectively never arrived. The DAC output was stuck.

On top of this, the DAC state machine had its own independently generated 500 kHz pulse — from a separate counter, also only 5 ns wide — to pace the DAC write cycle. Two separate timing problems were stacked on top of each other:

→ The FIR valid pulse was too short to be reliably seen by the 10 MHz process

→ The DAC write timing was driven by an independent counter with no link to actual filter output availability

The Fix
Move the DAC state machine into the same 200 MHz clock domain as the FIR core. Since both now share the same clock, FIR_data_out_valid is sampled on every rising edge — no pulse width concerns, no missed triggers.

-- DAC state machine moved to 200 MHz — same domain as FIR core
process(clk200MHz, reset)
begin
    if rising_edge(clk200MHz) then
        case state is
            when 0 =>
                -- ✓ safely sampled on every cycle, same domain
                if FIR_data_out_valid = '1' then
                    ChipSeclect_n_sig  <= '0';
                    WriteEnable_IntSig <= '0';
                    FIR_OutScaled      <= FIR_tdata_out(31 downto 20);
                    state              <= 1;
                end if;
            when others =>
                null;  -- remaining states omitted
        end case;
    end if;
end process;
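
Moving the state machine is the fix that was used here. Where a consumer has to stay in the slower domain, the toggle technique from Scenario 2 can in principle be applied in the fast-to-slow direction. The sketch below illustrates that idea only; it is not part of the project, and the entity name fast_pulse_to_slow and all port names are hypothetical. It assumes pulses in the fast domain are spaced at least a few clk_slow cycles apart (at 500 kHz and 10 MHz the spacing is 20 slow cycles), and that any accompanying data is registered and held stable in the fast domain until the slow side has captured it.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical helper: carries a one-cycle pulse from a fast clock domain
-- into a slower one by converting it to a toggle first.
entity fast_pulse_to_slow is
    port (
        clk_fast   : in  std_logic;  -- e.g. 200 MHz
        pulse_fast : in  std_logic;  -- one-cycle pulse in the clk_fast domain
        clk_slow   : in  std_logic;  -- e.g. 10 MHz
        pulse_slow : out std_logic   -- one-cycle pulse in the clk_slow domain
    );
end entity fast_pulse_to_slow;

architecture rtl of fast_pulse_to_slow is
    signal toggle_fast : std_logic := '0';
    signal sync1       : std_logic := '0';
    signal sync2       : std_logic := '0';
    signal sync2_prev  : std_logic := '0';
begin
    -- Fast domain: each pulse flips the toggle, which then holds its level
    process (clk_fast)
    begin
        if rising_edge(clk_fast) then
            if pulse_fast = '1' then
                toggle_fast <= not toggle_fast;
            end if;
        end if;
    end process;

    -- Slow domain: two-flop synchronizer plus one history register
    process (clk_slow)
    begin
        if rising_edge(clk_slow) then
            sync1      <= toggle_fast;
            sync2      <= sync1;
            sync2_prev <= sync2;
        end if;
    end process;

    -- High for exactly one clk_slow cycle per toggle
    pulse_slow <= sync2 xor sync2_prev;
end architecture rtl;
```

The trade-off against the fix above is added latency of two to three slow clock cycles and one more crossing to constrain and verify, which is why keeping producer and consumer in one domain is usually the simpler choice.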

The Common Thread

All three scenarios share the same root cause: not tracking which clock domain owns a signal at every point in the design. Clock frequencies get treated as labels — “this is a 200 MHz signal, that block runs at 200 MHz, they must be compatible” — rather than as physical signals with real phase relationships and real timing constraints.

CDC bugs are particularly hard to find because they are intermittent. A 5 ns pulse might be caught nine times out of ten depending on how a testbench aligns its clocks. On real hardware, the phase offset is determined by PLL initialisation, routing delays, and operating temperature — none of which are under your control.

The practical rules these scenarios illustrate:

→ A toggle outlasts a pulse. For signalling events across clock boundaries, toggle a bit rather than generating a pulse. A toggle holds its state until the next event, which gives the other side time to see it. A pulse can vanish entirely between two clock edges.

→ Two-flop synchronizers are for single bits. A multi-bit bus needs the data to be stable at the moment of sampling. Use a toggle handshake to guarantee that stability before capturing the bus.

→ Match your state machine clock to the signals it reacts to. A process clocked at 10 MHz cannot reliably catch a signal that pulses for 5 ns. If the signal lives in the 200 MHz domain, either the state machine needs to live there too or the event must cross as a synchronized toggle.

→ Same frequency does not mean same domain. Two 200 MHz clocks count as one domain only if they derive from the same source and the paths between them are covered by timing analysis. Clocks from independent sources are asynchronous even when their nominal frequencies match.

→ Do not trust simulation alone for CDC. Run report_cdc in Vivado or a dedicated CDC tool. Simulation can hide these bugs; a structural CDC check flags the unsynchronized crossings directly.

CDC does not require exotic hardware or unusual conditions to cause failures. It requires two clocks and a wire between them — which describes almost every non-trivial FPGA design ever built.