FIR Filter Architectures
Finite impulse response (FIR) filters are fundamental building blocks in digital signal processing, valued for their guaranteed stability and ability to achieve exactly linear phase. While the mathematical description of an FIR filter is straightforward (a weighted sum of delayed input samples), the hardware or software architecture used to realize this computation profoundly affects performance, resource utilization, power consumption, and maximum achievable throughput. Understanding the diverse architectures available for FIR implementation enables designers to select and optimize solutions matching their specific requirements.
From the simple direct form suitable for software implementation to highly parallel systolic arrays capable of processing gigasamples per second, FIR architectures span an enormous range of complexity and capability. Advanced techniques like distributed arithmetic eliminate explicit multiplications through clever use of precomputed lookup tables, while canonical signed digit representations minimize the number of additions required. Polyphase decomposition restructures filters for efficient multirate processing, and folded architectures trade throughput for reduced hardware resources. This comprehensive exploration of FIR filter architectures provides the foundation for effective filter implementation across all application domains.
Direct Form Architectures
The direct form architecture implements the FIR convolution equation in its most straightforward manner, computing the output as a weighted sum of current and past input samples. This fundamental structure serves as the reference against which other architectures are compared and provides the basis for understanding more sophisticated implementations.
Direct Form I Structure
The direct form I FIR filter consists of a tapped delay line followed by coefficient multipliers and a summing tree or chain. For an N-tap filter, N-1 delay elements hold the most recent input samples, N multipliers apply the filter coefficients, and an adder combines the products to produce the output. Each clock cycle, a new input sample enters the delay line, the oldest sample exits, and the convolution sum is computed.
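As a concrete reference, the sketch below models this structure in Python; the deque-based delay line and the function name are illustrative choices, not any particular product's implementation.

```python
# Minimal direct form I sketch: y[n] = sum over k of h[k] * x[n-k]
from collections import deque

def direct_form_fir(coeffs, samples):
    n = len(coeffs)
    delay_line = deque([0] * n, maxlen=n)   # tapped delay line, newest sample first
    outputs = []
    for x in samples:
        delay_line.appendleft(x)            # new sample enters, oldest sample exits
        outputs.append(sum(c * s for c, s in zip(coeffs, delay_line)))
    return outputs
```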
The delay line implementation significantly impacts performance and resource utilization. A shift register approach physically moves data through cascaded registers at each clock cycle, requiring N-1 write operations per sample. Circular buffer implementations instead use a pointer that advances through a fixed memory, avoiding data movement but requiring address computation. The choice depends on the implementation technology, with shift registers favoring FPGAs and circular buffers favoring processors with addressing hardware.
The coefficient multipliers dominate resource utilization in most direct form implementations. Each multiplier must compute the product of a sample (typically 12 to 24 bits) with a coefficient (often 16 to 32 bits), requiring significant silicon area or computational cycles. For fixed-coefficient filters, the multipliers can be optimized for the specific coefficient values, potentially reducing complexity compared to general-purpose multipliers.
The accumulation of multiplier outputs presents its own challenges. A simple sequential addition chain introduces significant latency and limits clock frequency. Tree structures reduce latency to logarithmic in the number of taps but increase routing complexity. Pipelined adder trees achieve high throughput by inserting registers between adder stages, trading latency for clock speed. The accumulator word width must accommodate the worst-case sum without overflow, typically requiring several extra bits beyond the input and coefficient precision.
Direct Form II Structure
Although direct form II is primarily associated with IIR filters, transposing the FIR direct form yields a closely related structure with different properties. In this arrangement, the input sample is multiplied by all coefficients simultaneously, and the products are accumulated through a chain of delays and additions. Each delay element holds a partial sum rather than an input sample, with consecutive stages adding successive coefficient products.
This transposed structure offers several advantages for high-speed implementation. The critical path from input to output contains only a single multiplier and adder, enabling higher clock frequencies than the direct form I structure where the critical path may span multiple additions. The fanout of the input signal to all multipliers can be pipelined, further improving timing.
The transposed form also simplifies coefficient updates in adaptive filter applications. Since each coefficient multiplies only the current input sample, changing a coefficient immediately affects the output without requiring the delay line to flush. This property is valuable in applications requiring rapid coefficient adaptation.
However, the transposed structure requires all N multiplications to occur within a single clock cycle, demanding N parallel multipliers. This parallelism may be advantageous or disadvantageous depending on available resources. The structure also requires careful attention to accumulator precision, as each stage adds a new product to an increasingly large running sum.
Linear Phase Exploitation
FIR filters with symmetric coefficients, those designed for linear phase response, offer significant optimization opportunities. When coefficients satisfy h[k] = h[N-1-k], pairs of multiplications involve identical coefficients applied to different samples. By adding the corresponding input samples before multiplication, the number of multiplications reduces by nearly half.
For an N-tap symmetric filter, only about half as many multipliers are needed: (N+1)/2 for odd N and N/2 for even N. The architecture first adds pairs of samples from opposite ends of the delay line, then multiplies these sums by the shared coefficients. For odd-length filters, the center coefficient multiplies a single sample without pairing. This optimization approximately halves the multiplier count without affecting the filter response.
The pre-addition of sample pairs requires careful attention to word width. The sum of two samples requires one additional bit compared to individual samples, which propagates through the multiplication and accumulation stages. The overall savings in multipliers generally outweigh the modest increase in adder width, making symmetric exploitation nearly universal in linear phase FIR implementations.
Antisymmetric filters, where h[k] = -h[N-1-k], enjoy similar optimization through subtraction rather than addition of sample pairs. These filters, used for differentiators and Hilbert transformers, achieve the same multiplier reduction with minor architectural modifications.
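A minimal Python sketch of this folding, assuming one window of samples is available (window[k] holds x[n-k]); for the antisymmetric case the pre-addition becomes a subtraction. The function name is illustrative.

```python
def symmetric_fir_output(coeffs, window):
    """One output of a linear-phase FIR with h[k] == h[N-1-k]; window[k] is x[n-k]."""
    n = len(coeffs)
    acc = 0
    for k in range(n // 2):
        acc += coeffs[k] * (window[k] + window[n - 1 - k])   # pre-add, then one multiply per pair
    if n % 2:
        acc += coeffs[n // 2] * window[n // 2]               # unpaired center tap (odd length)
    return acc
```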
Pipelining and Retiming
Pipelining inserts registers into the combinational paths of a filter, breaking long paths into shorter segments that can operate at higher clock frequencies. For direct form FIR filters, pipeline registers are typically placed after multipliers and between adder stages. The number of pipeline stages determines both the maximum clock frequency and the filter latency.
Retiming redistributes existing registers within a circuit to balance path delays without changing functionality. Applied to FIR filters, retiming can move registers from the delay line into the datapath, converting a direct form structure into a pipelined architecture. Automatic retiming tools can optimize register placement for maximum clock frequency, though manual analysis often achieves better results for regular structures like FIR filters.
The latency introduced by pipelining must be considered in system design. A fully pipelined N-tap filter with P pipeline stages introduces a latency of P clock cycles beyond the inherent filter delay. For real-time systems with tight latency constraints, the designer must balance throughput improvements against the additional delay.
Cut-set retiming, a specific retiming technique, moves registers across feed-forward cut-sets in the datapath graph. This approach systematically transforms a direct form structure into a pipelined architecture, with each cut-set crossing adding one pipeline stage. The regularity of FIR structures makes them particularly amenable to cut-set retiming optimization.
Transposed Form Architecture
The transposed form FIR architecture reverses the signal flow of the direct form, creating a structure with fundamentally different characteristics. Rather than delaying input samples and multiplying by coefficients, the transposed form multiplies the input by all coefficients simultaneously and accumulates the results through a chain of delay elements.
Structure and Operation
In the transposed form, each filter coefficient multiplies the current input sample, producing N products simultaneously for an N-tap filter. These products feed into a cascade of adders, with delay elements between stages. The first product enters the cascade directly, while subsequent stages receive new products plus delayed sums from previous stages. The final stage output is the filter result.
The signal flow processes data in the opposite direction from direct form. Where direct form delays input samples and sums weighted current values, transposed form delays partial sums and adds weighted input values. Mathematically, both produce identical outputs, as they implement the same convolution equation with rearranged computation order.
The critical timing path in transposed form contains just one multiplier and one adder, significantly shorter than the multi-adder path potentially present in direct form accumulation. This inherent pipelining makes transposed form attractive for high-frequency implementations without explicit pipeline register insertion. The structure naturally achieves throughput that direct form would require multiple pipeline stages to match.
Each delay element in the transposed structure holds a partial sum rather than an input sample. These partial sums accumulate products from multiple input samples, requiring word widths adequate to prevent overflow. The accumulation grows progressively through the chain, with early stages needing fewer bits than later stages. Tapered word widths can reduce register resources while maintaining accuracy.
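The following sketch, with illustrative names, models one clock cycle of this structure per loop iteration: the input is broadcast to every coefficient, and each register is updated with its neighbor's previous partial sum plus a new product.

```python
def transposed_fir(coeffs, samples):
    """Transposed-form FIR: delay elements hold partial sums rather than samples."""
    n = len(coeffs)
    regs = [0] * n                                     # regs[0] feeds the output adder
    outputs = []
    for x in samples:
        y = coeffs[0] * x + regs[0]                    # critical path: one multiply, one add
        for k in range(n - 1):                         # update the partial-sum chain
            regs[k] = coeffs[k + 1] * x + regs[k + 1]  # reads the old value of regs[k+1]
        outputs.append(y)
    return outputs
```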
Broadcast Input Distribution
The transposed form broadcasts the input sample to all coefficient multipliers simultaneously. This high fanout can limit clock frequency if the input distribution network introduces significant delay. Careful physical design and potential registering of the broadcast signal ensure that input distribution does not become the critical path.
Modern FPGAs and ASICs include dedicated routing resources that efficiently distribute high-fanout signals. Clock networks, designed for similar broadcast requirements, provide a model for input distribution in transposed form filters. Leveraging these resources enables high-frequency operation despite the inherent fanout requirements.
The simultaneous availability of the input at all multipliers enables single-cycle coefficient application. Unlike direct form where coefficients multiply samples at different ages, transposed form applies all coefficients to the same input sample. This property simplifies certain adaptive filter implementations where coefficient changes should take effect immediately.
Comparison with Direct Form
The transposed form offers inherently higher clock frequencies due to its shorter critical path. Without explicit pipelining, transposed form typically operates faster than direct form. This advantage is most pronounced for filters with many taps, where direct form accumulation paths become increasingly long.
Resource utilization differs subtly between forms. Both require the same number of multipliers and delay elements, but the delay element contents differ in word width. Direct form delays store input samples at input precision, while transposed form delays store partial sums requiring extended precision. For filters with high precision requirements, this difference can make transposed form more resource-intensive.
Symmetric coefficient exploitation is more complex in transposed form. The straightforward pre-addition of samples in direct form becomes post-addition of partial results in transposed form, requiring restructuring of the accumulation chain. While symmetric optimization remains possible, the implementation is less intuitive than in direct form.
Initial state and reset behavior also differ. Direct form requires initializing the delay line with appropriate values (typically zero), while transposed form requires initializing the partial sum registers. The transposed form may require more clock cycles to flush previous state when processing discontinuous data blocks.
Systolic Array Architectures
Systolic arrays organize computation as rhythmic data flow through regular arrays of processing elements, named by analogy with the heart's systole, the rhythmic pulse that drives blood through the body. Applied to FIR filters, systolic architectures achieve extreme parallelism and throughput through carefully orchestrated data movement and local computation.
Systolic Array Principles
A systolic array consists of identical processing elements (PEs) connected in a regular pattern, typically linear or two-dimensional. Data flows through the array in synchronized waves, with each PE performing a simple operation on data it receives and passing results to neighbors. The regular structure and local connectivity enable efficient implementation and high-speed operation.
Key characteristics of systolic arrays include modularity, where the same PE design replicates throughout the array; locality, where each PE communicates only with immediate neighbors; and pipelining, where data continuously flows through the array without global synchronization. These properties enable scaling to large array sizes while maintaining timing closure.
For FIR filtering, systolic arrays can process one sample per clock cycle regardless of filter length, achieving throughput limited only by clock frequency rather than computational resources. The parallelism is inherent in the architecture, not requiring explicit parallelization of the algorithm.
Weight-Stationary Systolic FIR
In the weight-stationary systolic architecture, filter coefficients remain fixed in the processing elements while input samples flow through the array. Each PE contains one coefficient value and performs a multiply-accumulate operation: multiplying its coefficient by the arriving sample and adding to the accumulated result from the previous PE.
Input samples enter the array one per clock cycle at the first PE and propagate through the array, visiting each coefficient in sequence. Partial sums flow in the same direction, accumulating products as they pass through successive PEs. The final PE outputs the complete filter result for each input sample.
The weight-stationary architecture requires N clock cycles for a sample to traverse an N-tap filter, introducing latency equal to the filter length. However, new samples enter every clock cycle, maintaining throughput of one sample per clock. The array processes N samples simultaneously, each at a different stage of filtering.
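A cycle-accurate behavioral sketch of one such arrangement appears below; the names and register placement are illustrative, and real designs vary in how many sample registers each PE uses. Here two sample registers per PE keep the flowing samples aligned with the traveling partial sums, so one result emerges per clock once the pipeline fills.

```python
def systolic_fir(coeffs, samples):
    """Weight-stationary systolic FIR: coefficients stay put, samples and sums flow."""
    n = len(coeffs)
    xa = [0] * n                  # first sample register in each PE
    xb = [0] * n                  # second sample register (alignment delay)
    s = [0] * n                   # partial-sum register at each PE output
    raw = []
    for x_in in list(samples) + [0] * (n + 2):       # extra zeros flush the pipeline
        # synchronous update: every next value is computed from current register contents
        s_next = [(s[k - 1] if k else 0) + coeffs[k] * xb[k] for k in range(n)]
        xb_next = list(xa)
        xa_next = [x_in] + [xb[k - 1] for k in range(1, n)]
        xa, xb, s = xa_next, xb_next, s_next
        raw.append(s[-1])         # last PE emits one value per clock
    return raw[n + 1:]            # discard pipeline fill; raw[n+1] is the first true output
```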
Coefficient updates require loading new values into all PEs, a process that may interrupt continuous filtering. For adaptive filters requiring frequent coefficient changes, the weight-stationary architecture may need additional mechanisms for background coefficient loading. The architecture is most suitable for fixed-coefficient filters or those with infrequent adaptation.
Output-Stationary Systolic FIR
The output-stationary systolic architecture broadcasts input samples to all PEs simultaneously while partial results remain stationary until complete. Each PE handles one coefficient, receiving the broadcast input, multiplying by its coefficient, and accumulating into a local sum. After N input samples, each PE contains a complete output sample.
This architecture inverts the data flow pattern of weight-stationary design. Rather than samples flowing through the array, samples arrive everywhere simultaneously while results accumulate locally. The broadcast requirement limits scalability but simplifies the accumulation structure.
Output-stationary designs achieve lower latency than weight-stationary for small filters, as the first output appears after receiving N inputs rather than after N propagation delays. For real-time applications with strict latency requirements, this characteristic may be decisive.
The broadcast input distribution represents the main scaling challenge. High fanout to many PEs can limit clock frequency and consume routing resources. Hybrid approaches distribute input through a tree structure, balancing broadcast efficiency against pure output-stationary operation.
Two-Dimensional Systolic Arrays
Two-dimensional systolic arrays extend the linear concept to process multiple data streams or implement multiple filters simultaneously. Samples can flow in one dimension while coefficients or partial results flow in another, enabling complex filtering operations within a single array structure.
One application of 2D systolic arrays is parallel filtering of multiple input channels. Each row of PEs implements one filter while columns handle different channels. The regular structure efficiently utilizes FPGA or ASIC resources while achieving throughput proportional to array size.
Matrix-vector multiplication, which generalizes FIR filtering, maps naturally to 2D systolic arrays. This formulation enables efficient implementation of multiple filters with different coefficients operating on the same input, as required in filter bank applications. The 2D structure processes all filters simultaneously with single-sample latency.
The interconnect complexity of 2D arrays can challenge physical implementation. Regular mesh connections between neighboring PEs must handle the required data widths while meeting timing constraints. Modern FPGAs include dedicated routing for such regular structures, making 2D systolic implementations increasingly practical.
Systolic Array Efficiency
Systolic arrays achieve high computational efficiency by keeping all PEs active on every clock cycle. Unlike sequential implementations where hardware utilization varies with algorithm phase, systolic arrays maintain constant utilization once the pipeline fills. This efficiency translates to superior throughput per unit of hardware resources.
Power consumption in systolic arrays benefits from local data movement. Rather than global buses transferring data across the chip, systolic architectures move data only between adjacent PEs. This locality reduces capacitive loading and switching activity, improving energy efficiency compared to architectures with longer data paths.
The regular structure of systolic arrays simplifies timing analysis and physical design. Identical paths between PEs enable straightforward timing closure, while the repetitive layout can leverage array-based placement tools. These characteristics reduce design effort and improve design quality compared to irregular architectures.
Scalability is inherent in systolic design. Adding more PEs increases filter length or parallel channels without architectural changes. The same PE design works for small and large arrays, enabling design reuse across product families with different performance requirements.
Distributed Arithmetic
Distributed arithmetic (DA) transforms the multiplication and accumulation operations of FIR filtering into table lookups and additions, completely eliminating explicit multipliers. This technique precomputes all possible partial products and stores them in lookup tables, replacing runtime multiplication with memory access. For fixed-coefficient filters, distributed arithmetic can dramatically reduce resource utilization and power consumption.
Fundamental Concept
Consider the inner product of coefficient vector h with input vector x, the fundamental FIR computation. Expressing each input sample in binary as a sum of weighted bits, the inner product becomes a sum of inner products of the coefficient vector with single-bit vectors. Each single-bit inner product selects certain coefficients (those corresponding to 1 bits) for summation.
For N coefficients and B-bit input samples, there are 2^N possible single-bit inner products, corresponding to all possible combinations of which coefficients to include. By precomputing these sums and storing them in a lookup table, the inner product reduces to B table lookups (one per bit position) and a weighted sum of the lookup results.
The lookup table address is formed by collecting one bit from each input sample. For bit position b, the address concatenates the bth bit of all N samples, forming an N-bit address that indexes into the 2^N-entry table. The table output is the sum of coefficients corresponding to 1 bits in the address.
The weighted accumulation of lookup results accounts for the binary weights of each bit position. Results from higher-order bits contribute more heavily than lower-order bits, with bit position b weighted by 2^b. This accumulation typically proceeds serially, one bit position per clock cycle, with shift-and-add operations combining the partial results.
Serial Distributed Arithmetic
The serial distributed arithmetic architecture processes one bit position per clock cycle, requiring B cycles to complete one filter output. Each cycle, a new table lookup occurs using the current bit slice across all input samples, and the result is shifted and accumulated with previous partial results.
Processing bit slices from the MSB down to the LSB, the accumulator shifts left (doubles) before adding each new lookup result, implementing the binary weighting. For two's complement inputs, the lookup result for the sign-bit slice is subtracted rather than added, accounting for the negative weight of the sign bit. This serial accumulation produces the correct inner product after processing all bit positions.
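A sketch under these assumptions (samples are B-bit two's complement values that already fit the stated width; the helper names are illustrative):

```python
def build_da_table(coeffs):
    """Precompute all 2^N coefficient subset sums; address bit i selects coeffs[i]."""
    n = len(coeffs)
    return [sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
            for addr in range(1 << n)]

def serial_da_output(samples, table, bits):
    """Inner product via one table lookup per bit slice, MSB slice processed first."""
    acc = 0
    for b in range(bits - 1, -1, -1):
        addr = 0
        for i, x in enumerate(samples):      # collect bit b of every sample
            addr |= ((x >> b) & 1) << i
        acc <<= 1                            # shift left (double) before adding
        if b == bits - 1:
            acc -= table[addr]               # sign-bit slice carries negative weight
        else:
            acc += table[addr]
    return acc
```

Under these assumptions, serial_da_output(window, build_da_table(coeffs), 16) reproduces the inner product of the coefficients with a 16-bit sample window using lookups and shift-and-add operations only.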
Resource utilization in serial DA includes one lookup table of size 2^N entries and one accumulator with shift capability. The lookup table size grows exponentially with filter length, potentially becoming impractical for long filters. Table partitioning techniques address this limitation by decomposing the filter into smaller sections, each with its own table.
Throughput of serial DA is limited to one output per B clock cycles, slower than direct form implementations. This trade-off favors applications where the sample rate is relatively low, the filter is short enough for reasonable table sizes, and multiplier resources are scarce or expensive. Audio processing often fits these criteria well.
Parallel Distributed Arithmetic
Parallel DA performs all bit position lookups simultaneously, producing filter outputs in a single clock cycle. This requires B parallel lookup tables, each addressed by a different bit slice of the input samples. The results are weighted by their bit positions and summed to produce the final output.
The weighted summation in parallel DA uses a combination of shifters and an adder tree. Each lookup result shifts left by its bit position (0 for LSB through B-1 for MSB) before entering the adder tree. The sign bit path includes negation before addition. The complete structure produces throughput equal to one sample per clock cycle.
Resource requirements multiply compared to serial DA: B tables of 2^N entries each, plus the shifting and addition logic. This substantial resource investment is justified when high throughput is essential and multipliers are the limiting resource. Modern FPGAs with abundant block RAM make parallel DA increasingly practical.
Hybrid serial-parallel implementations offer intermediate trade-offs. Processing K bits per cycle requires K tables and B/K cycles per output, where K divides B. The designer can select K to balance throughput requirements against available memory resources.
Table Partitioning
For filters with many taps, the 2^N table size becomes prohibitive. Partitioning the filter into smaller groups addresses this limitation by using multiple smaller tables whose results are summed. For a 64-tap filter partitioned into 8 groups of 8 taps each, eight 256-entry tables replace one impractical 2^64-entry table.
Partitioning introduces additional adders to sum the partial results from each table. The total computation becomes K table lookups plus K-1 additions per bit position, where K is the number of partitions. The additional additions represent modest overhead compared to the dramatic table size reduction.
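A sketch of the table construction under partitioning (the group size and names are illustrative); each bit slice then requires one lookup per group plus K-1 additions of the group results.

```python
def partitioned_da_tables(coeffs, group_size):
    """One subset-sum table per coefficient group, e.g. eight 256-entry tables for 64 taps."""
    tables = []
    for start in range(0, len(coeffs), group_size):
        group = coeffs[start:start + group_size]
        tables.append([sum(c for i, c in enumerate(group) if (addr >> i) & 1)
                       for addr in range(1 << len(group))])
    return tables
```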
Optimal partition size balances table size against adder overhead. Very small partitions use minimal memory but require many additions. Very large partitions minimize additions but may exceed available memory. Analysis for specific target architectures determines the optimal partition size.
Symmetric coefficient filters offer partitioning advantages. By adding symmetric sample pairs before table lookup, an N-tap symmetric filter behaves like an (N+1)/2-tap filter for table sizing purposes. This effectively halves the apparent filter length, reducing table requirements or enabling larger partitions.
Offset Binary Coding
Two's complement representation complicates distributed arithmetic due to the negative weight of the sign bit. Offset binary coding transforms inputs to eliminate negative weights, simplifying the accumulation logic. The transformation adds a fixed offset to convert two's complement values to unsigned values.
With offset binary coding, all bit positions have positive weights, and the accumulation uses only additions without the sign bit subtraction. A final correction subtracts the offset contribution from the result. This correction is a constant for fixed-coefficient filters and can be precomputed.
The table contents also require adjustment for offset binary coding. Rather than storing coefficient subset sums directly, the tables store values adjusted for the offset transformation. This adjustment occurs during table generation, not at runtime, adding no computational overhead.
Offset binary coding simplifies hardware at the cost of increased numerical range. The offset expands the effective value range, requiring additional bits in the accumulator. For most filters, this overhead is minor compared to the simplification of accumulation logic.
Canonical Signed Digit Representation
Canonical signed digit (CSD) representation expresses numbers using digits from the set {-1, 0, 1}, guaranteeing that no two consecutive digits are non-zero. This sparse representation minimizes the number of non-zero digits and consequently the number of additions required to implement multiplication by a constant. For fixed-coefficient FIR filters, CSD representation can substantially reduce computational complexity.
CSD Fundamentals
Any integer can be represented in CSD form with digits that are never adjacent when non-zero. This property arises from the recoding rule that replaces runs of ones, converting a binary sequence such as "011" (value 3) into the signed digits 1, 0, -1 (value 4 - 1 = 3). The recoding reduces the number of non-zero digits, often significantly.
On average, only about one-third of CSD digits are non-zero, compared with one-half for standard binary. A 16-bit coefficient typically has about 5-6 non-zero CSD digits versus 8 non-zero binary bits. Each non-zero digit requires one addition or subtraction, so CSD reduces the additions needed to implement coefficient multiplication.
The CSD form of any number is unique, unlike other signed digit representations. This uniqueness simplifies tools and analysis, as there is exactly one optimal CSD encoding for each coefficient value. Conversion algorithms efficiently transform binary or decimal coefficients to CSD form.
For FIR implementation, each coefficient multiplication becomes a sequence of shift-and-add operations corresponding to the non-zero CSD digits. A digit value of +1 at position k contributes (input shifted left by k); a digit value of -1 contributes the negative of this shifted value. Summing all non-zero digit contributions yields the product.
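The recoding and the resulting shift-and-add multiplication can be sketched as follows (a standard modulo-4 recoding; the helper names are illustrative):

```python
def to_csd(value):
    """CSD digits, least significant first, each in {-1, 0, +1}, no two adjacent non-zero."""
    digits = []
    while value != 0:
        if value & 1:
            d = 2 - (value % 4)   # +1 if value = 1 (mod 4), -1 if value = 3 (mod 4)
            value -= d
        else:
            d = 0
        digits.append(d)
        value //= 2
    return digits

def csd_multiply(x, digits):
    """Multiply x by the constant encoded in CSD digits using only shifts, adds, subtracts."""
    return sum(d * (x << k) for k, d in enumerate(digits) if d)
```

For example, to_csd(7) returns [-1, 0, 0, 1] (least significant digit first, that is, 8 - 1), so multiplying by 7 costs a single subtraction: (x << 3) - x.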
Multiplierless Architectures
CSD enables FIR filters without dedicated multipliers, using only shifters and adders. Each coefficient multiplication decomposes into additions of shifted input values, determined by the CSD representation. The regular structure of FIR filtering provides opportunities to share intermediate shift-and-add operations across coefficients.
Consider a coefficient with CSD representation having non-zero digits at positions 0, 3, and 7. The multiplication is (input) + (input shifted left 3) + (input shifted left 7). With proper sign handling for -1 digits, any coefficient multiplication reduces to a small number of additions of power-of-two multiples of the input.
Adder graphs represent the network of additions required to implement all coefficient multiplications. Graph optimization techniques find minimal adder graphs that compute all required constant multiplications with shared intermediate results. A coefficient set with common factors or similar CSD patterns benefits most from such optimization.
The complexity of multiplierless FIR implementation depends on the coefficient values. Coefficients with sparse CSD representation (few non-zero digits) require fewer additions. Filter design can incorporate CSD-friendliness as an optimization criterion, selecting coefficients that meet frequency specifications while minimizing implementation complexity.
Common Subexpression Elimination
Multiple coefficients may share common shift-and-add computations, enabling reuse of intermediate results. Common subexpression elimination (CSE) identifies these shared operations and computes them once for use by multiple coefficients. This optimization can dramatically reduce the total adder count.
For example, if two coefficients both require (input + input shifted left 2), this common subexpression can be computed once and reused. The optimization extends to complex patterns involving multiple levels of addition and subtraction, potentially spanning many coefficients.
Algorithms for CSE in constant multiplication range from simple pattern matching to sophisticated graph optimization. Optimal CSE is computationally difficult for large coefficient sets, but heuristic algorithms achieve near-optimal results efficiently. Commercial and academic tools automate CSE for practical filter implementations.
The benefits of CSE are most pronounced for filters with many taps and coefficients with similar magnitudes. High-precision coefficients with many non-zero CSD digits provide more opportunities for sharing than low-precision coefficients with few non-zero digits.
CSD versus Other Representations
CSD is one of several signed digit representations used for constant multiplication. Minimum signed digit (MSD) representations achieve the same minimal count of non-zero digits as CSD but are not unique, so a given number may have several equally sparse encodings. The choice between CSD and other representations involves trade-offs between optimization potential and algorithmic complexity.
For an individual coefficient, CSD already achieves the minimum number of non-zero digits. For coefficient sets where sharing is possible, alternative MSD encodings might enable better overall sharing patterns. Advanced optimization considers the entire coefficient set rather than optimizing each coefficient individually.
Comparison with standard binary representation highlights CSD advantages. Binary representations of coefficients with many consecutive 1 bits are particularly inefficient; CSD converts these to sparser representations with alternating signs. For example, binary "0111" (value 7) becomes the CSD digits 1, 0, 0, -1 (value 8 - 1 = 7), reducing three non-zero digits to two.
Implementation complexity of CSD arithmetic is similar to binary arithmetic. Subtraction replaces addition for -1 digits, but the operations have identical timing and resource requirements in most implementations. The reduction in operation count directly translates to resource and power savings.
Polyphase Decomposition
Polyphase decomposition restructures an FIR filter into parallel subfilters operating at a fraction of the original rate. This powerful technique is fundamental to efficient multirate filter implementation, enabling decimation and interpolation filters that compute only the outputs actually needed. Beyond multirate applications, polyphase structures enable parallel processing architectures and efficient implementations of filter banks.
Polyphase Components
A polyphase decomposition of an N-tap FIR filter with M phases creates M subfilters, each containing N/M coefficients (assuming N is divisible by M). The subfilters, called polyphase components, contain the original coefficients subsampled by factor M with different starting positions.
Mathematically, if h[n] represents the original impulse response, the kth polyphase component contains coefficients h[k], h[k+M], h[k+2M], and so on. The original filter output can be reconstructed by upsampling each polyphase output by M and summing with appropriate delays.
The decomposition exposes parallelism inherent in the filter structure. Each polyphase component operates on a decimated version of the input, processing every Mth sample. All M components can operate simultaneously on different subsets of the input samples, enabling parallel implementation.
For analysis, the Z-transform of each polyphase component relates to the original filter transfer function through the identity H(z) = sum over k from 0 to M-1 of z^(-k) E_k(z^M), where E_k(z) is the kth polyphase component. This mathematical framework enables systematic derivation of polyphase structures for various filtering configurations.
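A tiny sketch of the decomposition itself (illustrative names):

```python
def polyphase_components(coeffs, m):
    """Component k holds h[k], h[k+m], h[k+2m], ...  for k = 0 .. m-1."""
    return [list(coeffs[k::m]) for k in range(m)]

# e.g. polyphase_components([h0, h1, h2, h3, h4, h5], 2) -> [[h0, h2, h4], [h1, h3, h5]]
```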
Decimation Filters
Decimation by factor M requires an anti-aliasing lowpass filter followed by downsampling. Direct implementation computes a filter output for every input sample and then discards M-1 out of every M, wasting computational resources on unused results. Polyphase restructuring computes only the retained outputs, reducing computation by factor M.
The polyphase decimation filter uses M subfilters corresponding to the M phases between retained outputs. Each phase processes a different alignment of input samples relative to the output timing. Only one phase is active at any output instant, but the phases cycle through their contributions to successive outputs.
Implementation typically uses a commutator structure where the input sample stream distributes to the polyphase components, which share common input history through a rotating pointer or switch. The active component for each output depends on the current decimation phase, with the M components time-multiplexing their outputs.
Resource sharing across polyphase components is natural in decimation filters. Since only one component produces output at any instant, the M components can share a single set of multipliers and accumulators. This sharing reduces hardware to approximately 1/M of a direct implementation while maintaining the same throughput.
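A behavioral sketch of polyphase decimation, assuming NumPy is available and padding the inputs so the phase bookkeeping stays simple (names are illustrative):

```python
import numpy as np

def polyphase_decimate(coeffs, samples, m):
    """Equivalent to filtering with h and keeping every mth output, without computing the rest."""
    h = np.asarray(coeffs, dtype=float)
    x = np.asarray(samples, dtype=float)
    h = np.concatenate([h, np.zeros((-len(h)) % m)])    # pad so every phase gets coefficients
    x = np.concatenate([x, np.zeros((-len(x)) % m)])
    n_out = len(x) // m
    y = np.zeros(n_out)
    for k in range(m):
        e_k = h[k::m]                                   # kth polyphase component of h
        if k == 0:
            x_k = x[0::m]                               # x_k[i] = x[i*m - k]
        else:
            x_k = np.concatenate([[0.0], x[m - k::m]])
        y += np.convolve(x_k, e_k)[:n_out]              # every subfilter runs at the low rate
    return y                                            # matches np.convolve(h, x)[::m][:n_out]
```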
Interpolation Filters
Interpolation by factor L inserts L-1 zeros between consecutive input samples, then lowpass filters to remove spectral images. Direct implementation of the filter multiplies many coefficients by the inserted zeros, wasting computation. Polyphase restructuring avoids these zero-multiplications, reducing computation by factor L.
The polyphase interpolation filter decomposes into L subfilters, each producing one of the L output samples per input sample. Each subfilter operates at the input rate on the original (non-zero-inserted) samples, computing outputs that interleave to form the complete interpolated sequence.
The coefficient assignment to polyphase components follows from the positions of non-zero input samples relative to each output. For output position k, the relevant coefficients are those at positions k, k+L, k+2L, etc. in the original filter, which form the kth polyphase component.
Implementation can parallel-compute all L outputs simultaneously for maximum throughput, or time-multiplex a single subfilter structure across the L phases for minimum hardware. The choice depends on throughput requirements and available resources. Either approach eliminates zero-multiplication entirely.
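A matching sketch for interpolation, under the same assumptions as the decimation sketch (NumPy available, illustrative names):

```python
import numpy as np

def polyphase_interpolate(coeffs, samples, l):
    """Equivalent to zero-stuffing by l and filtering with h, skipping the zero multiplies."""
    h = np.asarray(coeffs, dtype=float)
    x = np.asarray(samples, dtype=float)
    h = np.concatenate([h, np.zeros((-len(h)) % l)])   # pad so every phase gets coefficients
    y = np.zeros(len(x) * l)
    for p in range(l):
        e_p = h[p::l]                                  # pth polyphase component of h
        y[p::l] = np.convolve(x, e_p)[:len(x)]         # each subfilter runs at the input rate
    return y     # matches the first len(x)*l outputs of filtering the zero-stuffed input
```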
Parallel Processing Applications
Polyphase decomposition enables parallel FIR processing beyond multirate applications. By decomposing a filter into M polyphase branches processing interleaved samples, the filter processes M samples per clock cycle while each branch operates at 1/M the sample rate. This parallelism increases throughput without requiring faster clock rates.
The parallel polyphase structure requires M instances of each polyphase component, processing M interleaved sample streams simultaneously. Sample routing at input distributes successive samples to different branches, while output routing combines the branch outputs in correct sequence.
This approach is valuable when single-rate implementations cannot meet throughput requirements within clock frequency constraints. Rather than pursuing aggressive timing optimization of a single datapath, the designer can use multiple moderate-speed datapaths operating in parallel.
Filter banks naturally exploit polyphase structure. Analysis filter banks decompose signals into frequency bands using modulated versions of a prototype filter, which shares a common polyphase representation. Efficient filter bank implementations compute the polyphase filtering once and combine results through a transform operation like DFT or DCT.
Folded Architectures
Folded architectures trade reduced hardware resources for lower throughput by time-multiplexing operations across fewer processing elements than the filter order. Where a fully parallel N-tap filter uses N multipliers operating simultaneously, a folded architecture might use only one multiplier performing N operations sequentially. This resource reduction is valuable when hardware is expensive or limited but throughput requirements are modest.
Folding Transformation
Folding transforms a parallel implementation into a serial or partially serial architecture by scheduling multiple operations from the original design onto shared hardware. The folding factor F determines how many original operations share each hardware instance: a folding factor of F reduces the multiplier count by F while reducing throughput by the same factor.
The folding transformation requires careful scheduling to ensure data dependencies are respected. Each operation must occur only after its input data is available, which may require additional delay elements to align data with scheduled operations. Systematic folding algorithms determine valid schedules and required delays.
Register sharing accompanies hardware folding. Where the original design used dedicated registers for each delay element, the folded design shares register resources through time-multiplexed access. The number of physical registers may be less than the filter order, with register addressing determined by the folding schedule.
Folding is the inverse of unfolding, the transformation that replicates hardware for higher throughput. Where unfolding increases hardware to raise throughput, folding decreases hardware while accepting lower throughput. The folded and unfolded architectures are functionally equivalent, differing only in their resource/throughput trade-offs.
Sequential Architecture
The most aggressive folding produces a fully sequential architecture with a single multiply-accumulate unit processing one tap per clock cycle. An N-tap filter produces one output every N clock cycles, achieving throughput of 1/N samples per clock. This minimum-hardware implementation is suitable for low-sample-rate applications where silicon area or power dominates over throughput.
The sequential architecture consists of a single multiplier, an accumulator, coefficient memory, and sample memory. Each clock cycle, one sample and one coefficient are read from their respective memories, multiplied, and accumulated. After N cycles, the accumulator contains the filter output, which is then captured and the accumulator reset for the next output.
Control logic sequences through the N tap positions, generating memory addresses for samples and coefficients. The sample memory operates as a circular buffer, with appropriate pointer management to access samples in the correct order. Coefficient memory is typically ROM for fixed filters or RAM for adaptive filters.
The sequential architecture minimizes multiplier hardware but requires N times more memory bandwidth than a parallel implementation. Sample memory must provide a new sample each clock cycle, while coefficient memory similarly provides a new coefficient each cycle. Memory access patterns are regular and predictable, enabling efficient implementation.
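A behavioral sketch of this sequential schedule, with the circular-buffer addressing standing in for the address generator (names are illustrative):

```python
def sequential_fir(coeffs, samples):
    """One multiply-accumulate unit, one tap per clock: n cycles per output sample."""
    n = len(coeffs)
    sample_mem = [0] * n                     # circular buffer holding the last n samples
    write_ptr = 0
    outputs = []
    for x in samples:
        sample_mem[write_ptr] = x            # store the newest sample
        acc = 0
        for k in range(n):                   # n clock cycles of the folded schedule
            addr = (write_ptr - k) % n       # walk backwards through the circular buffer
            acc += coeffs[k] * sample_mem[addr]
        outputs.append(acc)                  # capture the output; accumulator resets next pass
        write_ptr = (write_ptr + 1) % n
    return outputs
```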
Semi-Parallel Architectures
Intermediate folding factors produce semi-parallel architectures balancing resource utilization against throughput. For a folding factor F applied to an N-tap filter, the architecture uses N/F multipliers and produces one output every F clock cycles. Semi-parallel architectures offer flexible resource/throughput trade-offs.
The semi-parallel structure groups taps into N/F sections, with one multiplier handling each section. Within a section, F taps time-multiplex onto the shared multiplier over F clock cycles. Partial sums from each section accumulate to produce the final output.
Register allocation in semi-parallel architectures requires analysis of data lifetimes and access patterns. Some registers hold sample values across multiple clock cycles while multiplexing occurs; others hold partial sums during the F-cycle computation period. Optimal register allocation minimizes total storage while meeting all data flow requirements.
Semi-parallel architectures particularly suit FPGA implementation where DSP blocks (multipliers) are the constrained resource. By folding to match available DSP blocks, the designer maximizes filter capability within the FPGA's resources. The flexibility of programmable logic easily accommodates the control logic for various folding factors.
Memory-Based Architectures
Memory-based FIR architectures store both samples and coefficients in RAM, using memory bandwidth rather than dedicated multipliers as the primary resource. These architectures suit platforms with abundant memory but limited multiplier availability, including some microprocessors and specialized digital signal processors.
A simple memory-based architecture uses dual-port RAM: one port reads samples while the other reads coefficients. A single multiply-accumulate unit processes the streams, producing one output per N clock cycles. Memory bandwidth of two reads per cycle supports this computation without contention.
Multi-bank memory organizations increase parallelism. With K sample memory banks and K coefficient memory banks, K multiply-accumulate operations occur each cycle. The N taps distribute across K processors, each handling N/K taps, reducing the output computation time to N/K cycles.
Address generation for memory-based architectures manages circular buffer access and coefficient sequencing. Dedicated address generation units or specialized addressing modes in DSP processors efficiently compute the required addresses. The regular access patterns of FIR filtering are well-matched to hardware address generators.
Comparison and Selection
Selecting the appropriate FIR architecture requires evaluating multiple factors including throughput requirements, resource constraints, power consumption, latency tolerance, and the target implementation platform. Each architecture offers distinct advantages that make it optimal for particular combinations of requirements.
Throughput Considerations
Systolic arrays and fully parallel transposed structures achieve the highest throughput, processing one sample per clock cycle regardless of filter length. These architectures suit applications like radar processing, software-defined radio, and high-speed communications where sample rates exceed hundreds of megasamples per second.
Semi-parallel and folded architectures scale throughput with available hardware. The designer can match throughput to requirements by selecting an appropriate folding factor. These architectures are ideal when throughput requirements are known and resources should not be over-provisioned.
Sequential architectures offer minimum throughput at minimum hardware cost. Applications with sample rates of kilohertz to low megahertz often find sequential implementations adequate. Audio processing, sensor interfaces, and control systems frequently use sequential FIR architectures.
Distributed arithmetic throughput depends on the serial versus parallel implementation choice. Parallel DA achieves one sample per clock; serial DA requires B cycles for B-bit inputs. The architecture's throughput can be tuned through the degree of bit-parallelism implemented.
Resource Considerations
Multiplier count is often the primary resource concern. Systolic and parallel direct forms require N multipliers for an N-tap filter. Symmetric filters reduce this to (N+1)/2. Folded architectures reduce multipliers by the folding factor. Distributed arithmetic and CSD-based designs eliminate dedicated multipliers entirely.
Memory requirements vary significantly across architectures. DA requires lookup tables growing exponentially with partition size. Memory-based folded architectures need RAM for samples and coefficients. Systolic and direct form structures use distributed register storage proportional to filter length.
Adder resources are substantial in CSD-based and DA implementations. CSD requires adders for each non-zero digit across all coefficients, though common subexpression elimination reduces this count. DA requires adders for accumulating lookup table outputs and combining partitioned results.
Control complexity increases with folding and time-multiplexing. Sequential and semi-parallel architectures need counters, address generators, and sequencing logic absent from purely parallel implementations. This overhead is modest but should be considered in resource accounting.
Power Considerations
Dynamic power consumption relates to switching activity, which depends on architecture and data characteristics. Architectures with shorter data paths and local communication patterns, like systolic arrays, tend toward lower power than those requiring global data distribution.
DA architectures avoid multiplier power by substituting memory accesses. Memory power consumption depends on access frequency and memory technology. For some implementations, DA achieves significant power savings; for others, the memory power may exceed the saved multiplier power.
Folded architectures can reduce power through resource reduction, but increased clock rate for equivalent throughput may offset savings. Analysis must consider both static (leakage) and dynamic (switching) power contributions across the complete architecture.
Clock gating and power gating opportunities exist in architectures with conditional operation. Folded architectures with partial activity can gate unused resources. Adaptive filters might gate coefficient update logic during steady-state operation. Exploiting these opportunities requires careful architectural design.
Platform Considerations
FPGA implementations benefit from architectures matching the available resources. Modern FPGAs include DSP blocks optimized for multiply-accumulate operations, making direct form and semi-parallel architectures natural fits. Abundant block RAM favors DA implementations. The specific FPGA family's resource mix guides architecture selection.
ASIC implementations can optimize any architecture to the application requirements. Custom multipliers, memories, and datapaths achieve higher performance and lower power than general-purpose building blocks. The ASIC's performance and cost requirements determine the appropriate parallelism level.
Software implementations on general-purpose processors favor architectures matching the instruction set and memory hierarchy. SIMD instructions accelerate parallel operations across multiple samples or coefficients. Cache-friendly access patterns improve memory system efficiency. Processor-specific optimization may favor different architectures on different targets.
DSP processors with specialized instructions for filtering often implement direct form or transposed structures efficiently. Circular addressing modes support delay line management. Multiply-accumulate instructions directly implement the core FIR operation. Software for these processors typically uses straightforward direct form implementations.
Summary
FIR filter architectures span a broad spectrum from simple direct form structures to sophisticated systolic arrays and multiplierless implementations. The direct form provides the conceptual foundation, with the transposed form offering improved timing characteristics for high-speed implementation. Linear phase filters benefit from symmetric coefficient exploitation in either form.
Systolic arrays achieve maximum throughput through regular structures of identical processing elements, with data flowing rhythmically through the array. Weight-stationary, output-stationary, and two-dimensional configurations address different application requirements. The inherent parallelism and local connectivity of systolic designs enable extreme performance.
Distributed arithmetic and canonical signed digit representations eliminate explicit multipliers through algorithmic transformation. DA trades multiplications for table lookups, while CSD reduces multiplications to sequences of additions. Both techniques require fixed coefficients but can dramatically reduce resource requirements for appropriate applications.
Polyphase decomposition restructures filters for efficient multirate processing and parallel implementation. The transformation exposes inherent parallelism that decimation and interpolation filters exploit to avoid computing discarded outputs. Parallel processing applications use polyphase structures to increase throughput beyond single-clock-cycle limitations.
Folded architectures reduce hardware through time-multiplexing, accepting lower throughput in exchange for smaller implementations. From fully sequential to semi-parallel configurations, folding enables resource/throughput trade-offs matching application requirements. Memory-based variants suit platforms with abundant storage but limited multiplier resources.
Architecture selection requires evaluating throughput, resources, power, and platform characteristics together. No single architecture is optimal for all applications; each offers advantages in specific contexts. Understanding the full range of FIR architectures enables designers to select and optimize implementations achieving the best balance for their particular requirements.