VLSI 2011

 

 

 

1.A High-Resolution Time-to-Digital Converter on FPGA Using Dynamic Reconfiguration

     A high-resolution high-precision time-to-digital converter (TDC) architecture is presented for implementation on field-programmable gate arrays (FPGAs) supporting dynamic reconfiguration. The proposed architecture relies on multiple parallel high-resolution delay lines implemented by the programmable interconnection points within the routing switch fabric. These delay lines feature a 1-ps resolution over a range of 3 ns. A calibration process is proposed to take process-voltage-temperature variations, as well as clock skew, into account. A TDC with a 50-ps resolution and precision as high as 35 ps has been implemented on a Virtex-II Pro FPGA. Results show that the proposed architecture and calibration process can be used to achieve resolutions as fine as 10 ps.

 

 

 

2.High-Efficiency Processing Schedule for Parallel Turbo Decoders Using QPP Interleaver

   This paper presents a high-efficiency parallel architecture for a turbo decoder using a quadratic permutation polynomial (QPP) interleaver. Conventionally, two half-iterations for different component codewords alternate during the decoding flow. Due to the initialization calculation and pipeline delays in every half-iteration, the functional units in turbo decoders will be idle for several cycles. This inactive period will degrade throughput, especially for small blocks or high parallelism. To resolve this issue, we impose several constraints on the QPP interleaver and rearrange the processing schedule so that the following half-iteration can be executed before the completion of the current half-iteration. This eliminates the idle cycles and increases the efficiency of the functional units. Based on this modified schedule with 100% efficiency, a parallel turbo decoder which contains 32 radix- SISO decoders is implemented with 90 nm technology to achieve 1.4 Gb/s while decoding size-4096 blocks for 8 iterations.
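A minimal sketch of how a QPP interleaver maps addresses, and of the contention-free property that lets several SISO decoders access separate memory banks in parallel. The block size and coefficients (K = 40, f1 = 3, f2 = 10) are illustrative assumptions taken from the smallest LTE block size, not the paper's configuration, and the additional constraints the paper imposes on the interleaver are not modeled here.

```python
# Sketch of a quadratic permutation polynomial (QPP) interleaver.
# K, F1 and F2 are illustrative assumptions, not values from the paper.

def qpp_interleave(i, k, f1, f2):
    """Return the interleaved address pi(i) = (f1*i + f2*i^2) mod k."""
    return (f1 * i + f2 * i * i) % k

def is_permutation(k, f1, f2):
    """Check that pi visits every address 0..k-1 exactly once."""
    return sorted(qpp_interleave(i, k, f1, f2) for i in range(k)) == list(range(k))

def is_contention_free(k, f1, f2, parallelism):
    """Check that parallel decoders never address the same memory bank.

    Decoder t processes address i + t*window at step i; the bank index of
    an interleaved address is address // window.
    """
    window = k // parallelism
    for i in range(window):
        banks = {qpp_interleave(i + t * window, k, f1, f2) // window
                 for t in range(parallelism)}
        if len(banks) != parallelism:
            return False
    return True

K, F1, F2 = 40, 3, 10
print(is_permutation(K, F1, F2))
print(is_contention_free(K, F1, F2, 4))
```

Because every decoder lands in a different bank at every step, no arbitration or buffering is needed between the decoders and the memories, which is what makes high parallelism practical.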

 

 

 

3.A Reduced-Complexity Architecture for LDPC Layered Decoding Schemes

Abstract—

  A reduced-complexity low density parity check (LDPC) layered decoding architecture is proposed using an offset permutation scheme in the switch networks. This method requires only one shuffle network, rather than the two shuffle networks which are used in conventional designs. In addition, we use a block parallel decoding scheme by suitably mapping between required memory banks and processing units in order to increase the decoding throughput. The proposed architecture is realized for a 672-bit, rate-1/2 irregular LDPC code on a Xilinx Virtex-4 FPGA device. The design achieves an information throughput of 822 Mb/s at a clock speed of 335 MHz with a maximum of 8 iterations.

 

 

 

 

 

 

4.High-Performance and Compact Architecture for Regular Expression Matching on FPGA*

Abstract

   We present the design, implementation and evaluation of a high-performance architecture for regular expression matching (REM) on field-programmable gate array (FPGA). Each regular expression (regex) is first parsed into a concise token list representation, then compiled to a modular nondeterministic finite automaton (RE-NFA) using a modified version of the McNaughton-Yamada algorithm. The RE-NFA can be mapped directly onto a compact register-transfer level (RTL) circuit. A number of optimizations are applied to improve the circuit performance: (1) spatial stacking is used to construct an REM circuit processing m ≥ 1 input characters per clock cycle; (2) single-character constrained repetitions are matched efficiently by parallel shift-register lookup tables; (3) complex character classes are matched by a BRAM-based classifier shared across regexes; (4) a multi-pipeline architecture is used to organize a large number of RE-NFAs into priority groups to limit the I/O size of the circuit. We implemented 2,630 unique PCRE regexes from Snort rules (February 2010) in the proposed REM architecture. Based on the place-and-route results from Xilinx ISE 11.1 targeting Virtex5 LX-220 FPGAs, the proposed REM architecture achieved up to 11 Gbps concurrent throughput for various regex sets and up to 2.67x the throughput efficiency of other state-of-the-art designs.
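A small sketch of how a position-based NFA can be stepped one input character at a time, which is the software analogue of one flip-flop per state in an FPGA circuit. The automaton for the regex a(b|c)*d is written by hand here; the tables and the function name are hypothetical and do not reproduce the paper's RE-NFA compiler or any of its optimizations.

```python
# Hand-built position NFA for the regex a(b|c)*d, stepped once per character.
# Each position stands in for one state register in a hardware implementation.
# All tables below are illustrative assumptions.

SYMBOL = {1: "a", 2: "b", 3: "c", 4: "d"}   # character matched by each position
FIRST = {1}                                  # positions reachable from the start
FOLLOW = {1: {2, 3, 4}, 2: {2, 3, 4}, 3: {2, 3, 4}, 4: set()}
ACCEPT = {4}

def match_ends(text):
    """Return the indices in `text` at which a match of a(b|c)*d ends."""
    active = set()
    ends = []
    for index, ch in enumerate(text):
        # The start state is always active, so a match may begin anywhere.
        candidates = set(FIRST)
        for state in active:
            candidates |= FOLLOW[state]
        # A position becomes active only if its character equals the input.
        active = {p for p in candidates if SYMBOL[p] == ch}
        if active & ACCEPT:
            ends.append(index)
    return ends

print(match_ends("xxabcbd"))
```

All active states are updated together on every character, so the work per input character is fixed regardless of how many alternatives the regex contains.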

 

 

 

 

5.The Effect of Multi-Bit Correlation on the Design of Field-Programmable Gate Array Routing Resources

Abstract—

     As the logic capacity of field-programmable gate arrays (FPGAs) increases, they are being increasingly used to implement large arithmetic-intensive applications. Large arithmetic-intensive applications often contain a large proportion of datapath circuits. Since datapath circuits are designed to process multiple-bit-wide data, FPGAs implementing these circuits often have to transport a large amount of multiple-bit-wide signals from one computing element (such as a logic block, a DSP block, or a multi-bit addressable memory cell) to another. In this work, we investigate the area efficiency of FPGA routing resources for transporting multiple-bit-wide signals. It is shown that, for datapath circuits, the switch patterns used by the conventional routing architecture, which uniformly distribute routing switches across the routing tracks, are inefficient for connecting the computing elements to their tracks. The more efficient multi-bit aware patterns, which contain a densely populated single-bit region and a sparsely populated multi-bit region, can be effectively used to reduce the routing area of FPGAs for implementing arithmetic-intensive applications by 6%–10%. It is also shown that the further sharing of configuration memory among the switches within the multi-bit aware patterns does not significantly increase their area efficiency, since datapath circuits typically contain a mixture of multi-bit and single-bit signals: while configuration memory sharing can substantially increase the area efficiency of routing resources for transporting multi-bit signals, it also significantly reduces their ability to transport single-bit signals. More importantly, configuration memory sharing can significantly reduce the effectiveness of the enhanced multi-bit aware patterns, that is, patterns that incorporate both multi-bit aware and single-bit oriented switches within a single region in order to increase its ability to transport both single-bit and multi-bit signals.

 

 

 

6.Time Multiplexed VLSI Architecture for Real-Time Barrel Distortion Correction in Video-Endoscopic Images

Abstract

  A low-cost VLSI implementation of real-time correction of barrel distortion for video-endoscopic images is presented in this paper. The correcting mathematical model is based on least-squares estimation. To decrease the computing complexity, we use an odd-order polynomial to approximate the back-mapping expansion polynomial. By algebraic transformation, the approximated polynomial becomes a monomial form which can be solved by Horner's algorithm. With the iterative characteristic of Horner's algorithm, the hardware cost and memory requirement can be reduced by a time-multiplexed design. In addition, a simplified architecture of the linear interpolation is used to further reduce computing resources and silicon area. The VLSI architecture of this work contains 13.9-K gates by using a 0.18-μm CMOS process. Compared with some existing distortion correction techniques, this work reduces hardware cost by at least 69% and memory requirement by at least 75%.
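The reason Horner's algorithm suits a time-multiplexed design can be seen in a short sketch: an odd-order polynomial in the radius r is a polynomial in r² multiplied by r, and each Horner iteration is the same multiply-add, so one shared multiplier and adder can be reused every cycle. The coefficient values and function names below are hypothetical, chosen only to illustrate the evaluation order.

```python
# Sketch: evaluating an odd-order polynomial with Horner's rule.
# p(r) = c0*r + c1*r^3 + c2*r^5 + ... = r * (c0 + u*(c1 + u*(c2 + ...))), u = r^2.
# Coefficients used in the example are illustrative assumptions.

def odd_poly_horner(r, coeffs):
    """Evaluate coeffs[0]*r + coeffs[1]*r**3 + coeffs[2]*r**5 + ... iteratively."""
    u = r * r
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * u + c        # the same multiply-add is reused each iteration
    return acc * r

def odd_poly_direct(r, coeffs):
    """Reference evaluation that computes every power explicitly."""
    return sum(c * r ** (2 * k + 1) for k, c in enumerate(coeffs))

coeffs = [1.0, 0.2, 0.05]
print(odd_poly_horner(0.7, coeffs), odd_poly_direct(0.7, coeffs))
```

The Horner form needs one multiplication per coefficient plus two extra (for r² and the final scaling), whereas the direct form recomputes growing powers of r for each term.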

 

 

 

7.The ARPA-MT Embedded SMT Processor and Its RTOS Hardware Accelerator

Abstract

       The high-level modeling and parameterization capabilities of current hardware description languages, as well as the huge integration capacity and flexibility provided by modern field-programmable gate arrays (FPGAs), open the way to designing processors tuned to given applications and favoring specific properties. This paper presents the Advanced Real-time Processor Architecture (ARPA) MultiThreaded processor (ARPA-MT), a customizable, synthesizable, and time-predictable processor model optimized for multitasking real-time embedded systems, which efficiently exploits modern FPGA technology. A fundamental processor component is the ARPA operating system (OS) coprocessor designed for hardware implementation of the basic real-time OS management functions, such as timing, task scheduling, synchronization and switching, efficient interrupt handling, and verification of the timing constraints. The hardware implementation of these functions allows executing them faster and more predictably, reducing the OS overhead, and improving its determinism. The performance evaluation has shown reductions of one to two orders of magnitude in the execution time of some functions of a real-time executive, in comparison with an analogous software implementation.

 

 

8.Iris Biometrics for Embedded Systems

Abstract—

   In many applications user authentication has to be carried out by portable devices. Usually these devices are personal tokens carried by users, which have many constraints regarding their computational performance, occupied area, and power consumption. These kinds of devices must deal with such constraints, while also maintaining high performance rates in the authentication process. This paper provides solutions to designing such personal tokens where biometric authentication is required. In this paper, iris biometrics have been chosen to be implemented due to the low error rates and the robustness their algorithms provide. Several design alternatives are presented, and their analyses are reported. With these results, most of the needs required for the development of an innovative identification product are covered. Results indicate that the architectures proposed herein are faster (up to 20 times), and are capable of obtaining error rates equivalent to those based on computer solutions. Simultaneously, the security and cost for large quantities are also improved.

 

 

 

 

9. Channel Estimator and Aliasing Canceller for Equalizing and Decoding Non-Cyclic Prefixed Single-Carrier Block Transmission via MIMO-OFDM Modem

Abstract—

 Without a cyclic prefix (CP), most single-carrier (SC) transmissions cannot adopt a frequency-domain equalizer (FDE) directly. This work utilizes a frequency-domain channel estimator (FD-CE) and a decision-feedback aliasing canceller (DF-AC) to produce a single-FFT SC-FDE. In this way, non-CP single-carrier block transmission (SCBT) can be decoded using the sphere decoder of MIMO-OFDM modems to support multimode operation and backward compatibility under an acceptable complexity in IEEE 802.11 very high throughput (VHT). An N-point FFT is sufficient to measure channel frequency responses (CFR) from L-sample preambles (L≤N/2). Then, M-bit block codes (M≤N) are decodable in the frequency domain with the help of the DF-AC. Simulations and measurements imply that this work can ensure adequate performance, even when no CP is present to protect against the distortions of multipath propagation.

 

 

 

 

10.Raising FPGA Logic Density Through Synthesis-Inspired Architecture

Abstract—

  We leverage properties of the logic synthesis netlist to define both a new field-programmable gate-array (FPGA) logic element (function generator) architecture and an associated technology mapping algorithm that together provide improved logic density. We demonstrate that an “extended” logic element with slightly modified k-input lookup tables (LUTs) achieves much of the benefit of an architecture with (k+1)-input LUTs, while consuming silicon area close to a k-LUT (a k-LUT requires half the area of a (k+1)-LUT). We introduce the notion of “non-inverting paths” in a circuit’s AND-inverter graph (AIG) and show their utility in mapping into the proposed logic element architectures. We propose a general family of logic element architectures, and present results showing that they offer a variety of area/performance tradeoffs. One of our key results demonstrates that while circuits mapped to a traditional 5-LUT architecture need 15% more LUTs and have 14% more depth than a 6-LUT architecture, our extended 5-LUT architecture requires only 7% more LUTs and 5% more depth than 6-LUTs, on average. Nearly all of the depth reduction associated with moving from k-input to (k+1)-input LUTs can be achieved with considerably less area using extended k-LUTs. We further show that 6-LUT optimal mapping depths can be achieved with a small fraction of the LUTs in hardware being 6-LUTs and the remainder being extended 5-LUTs, suggesting that a heterogeneous logic block architecture may prove to be advantageous.

 

 

 

 

11.High Throughput DA-Based DCT With High Accuracy Error-Compensated Adder Tree

 

Abstract—

  In this brief, by operating the shifting and addition in parallel, an error-compensated adder-tree (ECAT) is proposed to deal with the truncation errors and to achieve low-error and high-throughput discrete cosine transform (DCT) design. Instead of the 12 bits used in previous works, 9-bit distributed arithmetic-precision is chosen for this work so as to meet peak-signal-to-noise-ratio (PSNR) requirements. Thus, an area-efficient DCT core is implemented to achieve 1 Gpels/s throughput rate with gate counts of 22.2 K for the PSNR requirements outlined in the previous works.
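To make the role of shifting and addition in distributed arithmetic (DA) concrete, here is a minimal bit-serial sketch: an inner product with fixed coefficients is computed from a precomputed lookup table, one input bit plane at a time, using only lookups, shifts and adds. The function names, coefficients and word length are hypothetical; the paper's error-compensated adder tree, which performs the shifts and additions in parallel and handles truncation error, is not modeled.

```python
# Sketch of distributed arithmetic for a fixed-coefficient inner product.
# Names, coefficients and word length are illustrative assumptions.

def build_da_lut(coeffs):
    """Precompute the sum of the coefficients selected by each address pattern."""
    n = len(coeffs)
    return [sum(c for j, c in enumerate(coeffs) if (addr >> j) & 1)
            for addr in range(1 << n)]

def da_inner_product(coeffs, samples, bits):
    """Compute sum(coeffs[j] * samples[j]) without any multiplications.

    `samples` are signed integers that fit in `bits`-bit two's complement.
    """
    lut = build_da_lut(coeffs)
    mask = (1 << bits) - 1
    words = [s & mask for s in samples]          # two's-complement encodings
    acc = 0
    for b in range(bits):
        addr = 0
        for j, w in enumerate(words):
            addr |= ((w >> b) & 1) << j          # gather bit b of every sample
        partial = lut[addr] << b                 # weight the lookup by 2^b
        # The most significant bit has negative weight in two's complement.
        acc += -partial if b == bits - 1 else partial
    return acc

print(da_inner_product([3, -5, 7, 2], [10, -4, 0, -128], 8))
```

Reducing the DA precision from 12 to 9 bits, as the abstract describes, shortens the bit-plane loop and narrows the adders, which is where truncation error enters and why compensation is needed.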

 

 

 

12.Optimal Generation of Space–Time Trellis Codes via Coset Partitioning

Abstract

   Criteria for designing good space–time trellis codes (STTCs) have been developed in previous publications. However, the computation of the best STTCs is time consuming, because a long exhaustive or systematic computing search is required, particularly for a high number of states and/or transmit antennas. To reduce the search time, an efficient method must be employed to generate the STTCs with the best performance. In this paper, a technique called coset partitioning is proposed to easily and efficiently design optimal 2^n-phase-shift keying (PSK) STTCs with any number of transmit antennas. Coset partitioning is an improved extension to multiple-input–multiple-output (MIMO) systems of the set partitioning proposed by Ungerboeck. This extension is based on the lattice and coset Calderbank approach. With this method, optimal blocks of the generator matrix are obtained for 4-PSK and 8-PSK codes. These optimal blocks lead to the generation of the STTCs with the best Euclidean distances between the codewords. Thus, new codes are proposed with three to six transmit antennas for 4-PSK modulation and with three and four transmit antennas for 8-PSK modulation. These new codes outperform the corresponding best known codes. In addition, the first 4-PSK STTCs with seven and eight transmit antennas and the first 8-PSK STTCs with five and six transmit antennas are given, and their performance is evaluated by simulation.

 

 

 

13.Optimizing Floating Point Units in Hybrid FPGAs

Abstract—

    This paper introduces a methodology to optimize coarse-grained floating point units (FPUs) in a hybrid field-programmable gate array (FPGA), where the FPU consists of a number of interconnected floating point adders/subtracters (FAs), multipliers (FMs), and wordblocks (WBs). The wordblocks include registers and lookup tables (LUTs) which can implement fixed point operations efficiently. We employ common subgraph extraction to determine the best mix of blocks within an FPU and study the area, speed and utilization tradeoff over a set of floating point benchmark circuits. We then explore the system impact of FPU density and flexibility in terms of area, speed, and routing resources. Finally, we derive an optimized coarse-grained FPU by considering both architectural and system-level issues. This proposed methodology can be used to evaluate a variety of FPU architecture optimizations. The results for the selected FPU architecture optimization show that although high density FPUs are slower, they have the advantages of improved area, area-delay product, and throughput.

 

 

 

 

 

14.Reduced-Complexity Decoder Architecture for Non-Binary LDPC Codes

 

Abstract—

 Non-binary low-density parity-check (NB-LDPC) codes can achieve better error-correcting performance than binary LDPC codes when the code length is moderate, at the cost of higher decoding complexity. The high complexity is mainly caused by the complicated computations in the check node processing and the large memory requirement. In this paper, a novel check node processing scheme and corresponding VLSI architectures are proposed for the Min-max NB-LDPC decoding algorithm. The proposed scheme first sorts out a limited number of the most reliable variable-to-check (v-to-c) messages; then the check-to-variable (c-to-v) messages to all connected variable nodes are derived independently from the sorted messages without noticeable performance loss. Compared to the previous iterative forward-backward check node processing, the proposed scheme not only significantly reduces the computation complexity, but also eliminates the memory required for storing the intermediate messages generated from the forward and backward processes. Inspired by this novel c-to-v message computation method, we propose to store the most reliable v-to-c messages as “compressed” c-to-v messages. The c-to-v messages will be recovered from the compressed format when needed. Accordingly, the memory requirement of the overall decoder can be substantially reduced. Compared to the previous Min-max decoder architecture, the proposed design for a (837, 726) code over GF(2^5) can achieve the same throughput with only 46% of the area.

 

 

 

15. A Hardware Implementation of a Run-Time Scheduler for Reconfigurable Systems

Abstract—

 New generation embedded systems demand high performance, efficiency, and flexibility. Reconfigurable hardware can provide all these features. However, the costly reconfiguration process and the lack of management support have prevented a broader use of these resources. To solve these issues we have developed a scheduler that deals with task-graphs at run-time, steering their execution in the reconfigurable resources while carrying out both prefetch and replacement techniques that cooperate to hide most of the reconfiguration delays. In our scheduling environment, task-graphs are analyzed at design-time to extract useful information. This information is used at run-time to obtain near-optimal schedules, escaping from local-optimum decisions, while only carrying out simple computations. Moreover, we have developed a hardware implementation of the scheduler that applies all the optimization techniques while introducing a delay of only a few clock cycles. In the experiments our scheduler clearly outperforms conventional run-time schedulers based on as-soon-as-possible techniques. In addition, our replacement policy, specially designed for reconfigurable systems, achieves almost optimal results both regarding reuse and performance.

 

 

 

 

16.FPGA Logic Block to Implement Resource-Efficient Multiplexers

 

Abstract—

 This paper presents an input multiplexer structure for implementing multiplexers from user designs in FPGA CLBs (configurable logic blocks). Input multiplexers providing the LUT (look-up table) data input signals are modified to function not just based on values stored in configuration memory cells, but also under the control of signals from a user’s circuit. Thus, the proposed input multiplexers are much more flexible than traditional input multiplexers. Compared with an FPGA using traditional input multiplexers, the FPGA using the proposed input multiplexers requires fewer LUTs, consumes less power, and achieves a higher operating clock frequency for user circuits that include a large number of multiplexers.

 

 

 

17.VLSI Implementation of a Mixed Bio-signal Lossless Data Compressor for Portable Brain-Heart Monitoring Systems

 

  This paper presents a highly integrated VLSI implementation of a mixed bio-signal lossless data compressor capable of handling multichannel electroencephalogram (EEG), electrocardiogram (ECG) and diffuse optical tomography (DOT) bio-signal data for reduced storage and communication bandwidth requirements in portable, wireless brain-heart monitoring systems for use in the hospital or home care setting.

 

 

 

18.Efficient Multi-Input/Multi-Output VLSI Architecture for Two-Dimensional Lifting-Based Discrete Wavelet Transform

 

Abstract—

   This brief paper proposes an efficient multi-input/multi-output VLSI architecture (MIMOA) for two-dimensional lifting-based discrete wavelet transform (DWT). The novelty is the simplicity and generality of constructing the MIMOA, which is a high-speed architecture with computing time as low as N^2/M for an N × N image, where M is the throughput rate, with a controlled increase in hardware cost.
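For readers unfamiliar with lifting, the sketch below shows one level of a one-dimensional lifting transform as a predict step followed by an update step, together with its inverse. The reversible 5/3 integer filter and the boundary handling are assumptions chosen for brevity; the abstract does not state which wavelet filter the architecture targets, and the two-dimensional transform would apply such a step along rows and then columns.

```python
# Sketch of one level of a 1-D lifting wavelet transform.
# The reversible 5/3 filter and edge handling are illustrative assumptions.

def lift53_forward(x):
    """Split an even-length integer list into smooth (s) and detail (d) halves."""
    n = len(x) // 2
    even, odd = x[0::2], x[1::2]
    # Predict: detail = odd sample minus the floor average of its even neighbours.
    d = [odd[i] - ((even[i] + even[min(i + 1, n - 1)]) >> 1) for i in range(n)]
    # Update: smooth = even sample plus a rounded quarter of the nearby details.
    s = [even[i] + ((d[max(i - 1, 0)] + d[i] + 2) >> 2) for i in range(n)]
    return s, d

def lift53_inverse(s, d):
    """Undo lift53_forward by running the two lifting steps in reverse order."""
    n = len(s)
    even = [s[i] - ((d[max(i - 1, 0)] + d[i] + 2) >> 2) for i in range(n)]
    odd = [d[i] + ((even[i] + even[min(i + 1, n - 1)]) >> 1) for i in range(n)]
    x = [0] * (2 * n)
    x[0::2], x[1::2] = even, odd
    return x

signal = [5, 7, 3, 0, -2, 9, 4, 4]
smooth, detail = lift53_forward(signal)
print(lift53_inverse(smooth, detail) == signal)
```

Each lifting step only adds to or subtracts from one half of the samples using the other half, so the inverse is obtained by repeating the same expressions with the signs flipped, and several samples can be processed per cycle by replicating the two steps.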

 

 

19.Stochastic Analysis of the Normalized Subband Adaptive Filter Algorithm

Abstract—

  This paper studies the statistical behavior of the normalized subband adaptive filtering (NSAF) algorithm. An accurate statistical model of the NSAF algorithm is obtained. In the derivation, we focus on Gaussian correlated input signals. By assuming that the analysis filter bank is paraunitary and taking into account the full band adaptation mechanism of the NSAF, expressions for the first and the second moments of the adaptive filter weights are derived without invoking the slow adaptation assumption. In the derivations, several hyperelliptic integrals appear. To tackle those integrals induced by Gaussian correlated inputs, we first give a solution by resorting to the adaptive Lobatto quadrature. By invoking the averaging principle, two other approximation methods, the chi-square method and the partial fraction expansion method, are presented to approximate the statistical model as well. Monte Carlo (MC) simulation results corroborate our predictions. The Lobatto quadrature method achieves a good agreement with the MC simulation results, even for a relatively large step size. Compared with the chi-square method and the partial fraction expansion method, the Lobatto quadrature method gives better performance in terms of predicting the mean square error when the length of the adaptive filters is small to medium. The chi-square approximation method and the partial fraction expansion method give a satisfactory performance with a relatively low computational complexity when the filter length is large.

 

 

 

20.IR-Drop Aware Clustering Technique for Robust Power Grid in FPGAs

Abstract—

    IR-drop management in the power supply network of a chip is one of the critical design challenges in nanometer VLSI circuits. Techniques developed for application-specific integrated circuits cannot be directly applied to IR-drop management in field-programmable gate arrays (FPGAs) because of the programmable nature of FPGAs. This paper proposes a novel clustering technique for improving the supply voltage profile in the power grid of FPGAs. The proposed clustering technique not only improves the minimum voltage at any node in the circuit, but also reduces the variance in supply voltage across the nodes in the power grid. Results indicate that a reduction of up to 36% in IR-drop and 27% in spatial VDD variation can be achieved using the proposed clustering technique.

 

 

 

 

21.Interpolation-Based QR Decomposition and Channel Estimation Processor for MIMO-OFDM System

 

Abstract—

  This paper presents a modified interpolation-based QR decomposition algorithm for grouped-ordering multiple-input multiple-output (MIMO) orthogonal frequency division multiplexing (OFDM) systems. Based on the original research that integrates the calculations of the frequency-domain channel estimation and the QR decomposition for the MIMO-OFDM system, this study proposes a modified algorithm with a scalable property that reduces the power consumption of interpolation-based QR decomposition in the variable-rank MIMO scheme. Furthermore, we also develop the general equations and a timing scheduling method for the hardware design of the proposed QR decomposition processor for the higher-dimension MIMO system. Based on the proposed algorithm, a configurable interpolation-based QR decomposition and channel estimation processor was designed and implemented using a 90-nm one-poly nine-metal CMOS technology. The processor supports 2x2, 2x4 and 4x4 QR-based MIMO detection for the 3GPP-LTE MIMO-OFDM system and achieves a throughput of 35.16 MQRD/s at its maximum clock rate of 140.65 MHz.

 

 

 

22.Transmit Processing Techniques Based on Switched Interleaving and Limited Feedback for Interference Mitigation in Multiantenna MC-CDMA Systems

Abstract

   In this paper, we propose transmit processing techniques based on novel switched interleaving, chipwise linear precoding, and limited feedback for both downlink and uplink multicarrier code-division multiple access (MC-CDMA) multiple-antenna systems. We develop transceiver structures with switched interleaving, linear precoding, and detectors for both uplink and downlink using limited-feedback techniques. In the proposed schemes, a set of possible chip interleavers is constructed and prestored at both the base station (BS) and mobile stations (MSs). For the downlink, a new hybrid transmit processing technique based on switched interleaving and chipwise precoding is proposed to suppress the multiuser interference. The BS and MSs are also equipped with another codebook of quantized downlink channel-state information (CSI). Each MS quantizes its own downlink CSI and feeds the index back to the BS through a low-rate feedback channel. Then, the selection function at the BS determines the optimum interleaver based on the CSI of all users. Moreover, a transmit processing technique for the uplink of multiple-antenna MC-CDMA systems, which requires a very low rate of feedback information, is also proposed. Codebook design methods for both interleavers and quantized CSI are also proposed. Simulation results show that the performance of the proposed techniques is significantly better than prior art.

 

 

 

23.Iterative Multiuser Detectors for Spatial–Frequency–Time-Domain Spread Multi-Carrier DS-CDMA Systems

Abstract—

 This paper presents three low-complexity, yet effective, multiuser detectors (MUDs) for the uplink of spatial–frequency–time-domain (SFT-domain) spread multicarrier (MC) direct-sequence code-division multiple-access (DS-CDMA) systems. Each MUD first converts a received signal into the corresponding format. It then detects the transmitted symbols iteratively by using a one-domain (1D) minimum-mean-square-error (MMSE) detector and a two-domain (2D) MMSE detector alternately. The weights of the 1D detector are updated to minimize the mean-square error (MSE) of the detection output by setting the weights of the 2D detector at the values determined in the previous iteration, and vice versa. Performance analyses of the bit-error-rate (BER) bound, the convergence, and the required computational complexity are conducted to assess the feasibility of the proposed MUDs. Finally, the results of computer simulations demonstrate that the performance of the proposed schemes is close to that of the joint SFT-domain MMSE MUD in most scenarios but with lower computational overheads.

 

 

 

24.Transmit Precoding for Flat-Fading MIMO Multiuser Systems With Maximum Ratio Combining Receivers

Abstract

   We examine the application of transmit precoding in multiuser multi-input–multi-output (MIMO) communication systems with maximum ratio combining (MRC) receivers. In many multiuser applications, the maximum-likelihood or minimum mean-square error (MMSE) receivers can be prohibitive to implement due to their high implementation complexity. We examine the performance of the system with simple MRC receivers and carefully selected precoders, which are designed at the transmitter side to compensate for the lack of high-complexity receivers. We examine the sum MSE minimization and signal-to-interference-plus-noise ratio (SINR) balancing frameworks for the selection of precoders. The performance of the two frameworks with MRC receivers is compared between themselves as well as with their counterparts implementing MMSE receivers. It has been observed that, with a proper selection of precoders, the SINR balancing framework with simple MRC receivers has little performance loss in comparison with the MMSE receivers.
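As a reminder of why MRC receivers are considered simple, the following sketch combines the signals of several receive antennas by weighting each with the conjugate of its channel gain. It covers only a single stream over a flat-fading channel; the function names and channel values are hypothetical, and the multiuser precoder design studied in the paper is not represented.

```python
# Sketch of a maximum ratio combining (MRC) receiver for one symbol stream.
# Channel gains and function names are illustrative assumptions.

def mrc_combine(h, r):
    """Weight each antenna by conj(h) and normalise by the total channel gain."""
    gain = sum(abs(hk) ** 2 for hk in h)
    return sum(hk.conjugate() * rk for hk, rk in zip(h, r)) / gain

def mrc_output_snr(h, noise_var):
    """Post-combining SNR for a unit-power symbol: the sum of per-antenna SNRs."""
    return sum(abs(hk) ** 2 for hk in h) / noise_var

h = [1 + 1j, 0.5 - 0.2j, -0.3j]          # flat-fading gains, one per antenna
s = (1 - 1j) / 2 ** 0.5                  # transmitted unit-power symbol
r = [hk * s for hk in h]                 # noise-free received samples
print(abs(mrc_combine(h, r) - s) < 1e-12)
```

The receiver needs only one complex multiply per antenna and no matrix inversion, which is the complexity advantage over MMSE or maximum-likelihood detection that the abstract refers to.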

 

 

 

 

 

 

 

 

25.Generalized Space-Time Shift Keying Designed for Flexible Diversity-, Multiplexing- and Complexity-Tradeoffs

 

Abstract

 In this paper, motivated by the recent concept of Spatial Modulation (SM), we propose a novel Generalized Space-Time Shift Keying (G-STSK) architecture, which acts as a unified Multiple-Input Multiple-Output (MIMO) framework. More specifically, our G-STSK scheme is based on the rationale that 𝑃 out of 𝑄 dispersion matrices are selected and linearly combined in conjunction with the classic PSK/QAM modulation, where activating 𝑃 out of 𝑄 dispersion matrices provides an implicit means of conveying information bits in addition to the classic modem. Due to its substantial flexibility, our G-STSK framework includes diverse MIMO arrangements, such as SM, Space-Shift Keying (SSK), Linear Dispersion Codes (LDCs), Space-Time Block Codes (STBCs) and the Bell Labs Layered Space-Time (BLAST) scheme. Hence it has the potential of subsuming all of them, when flexibly adapting a set of system parameters. Moreover, we also derive the Discrete-input Continuous-output Memoryless Channel (DCMC) capacity for our G-STSK scheme, which serves as the unified capacity limit, hence quantifying the capacity of the class of MIMO arrangements. Furthermore, EXtrinsic Information Transfer (EXIT) chart analysis is used for designing our G-STSK scheme and for characterizing its iterative decoding convergence.

 

 

 

 

26.Efficient Iterative Techniques for Soft Decision Decoding of Reed-Solomon Codes

 

Abstract

  Two new iterative soft decision decoding methods for Reed-Solomon (RS) codes are proposed. These methods are based on bit level belief propagation (BP) decoding. In order to make BP decoding effective for RS codes, we use an extended binary parity check matrix with a lower density and reduced number of 4-cycles compared to the original binary parity check matrix of the code. In the first proposed method, we take advantage of the cyclic structure of RS codes. Based on this property, we can apply the belief propagation algorithm on any cyclically shifted version of the received symbols with the same binary parity check matrix. For each shifted version of received symbols, the distribution of reliability values will change and deterministic errors can be avoided. This method results in considerable performance improvement of RS codes compared to hard decision decoding. The performance is also superior to some popular soft decision decoding methods. The second method is based on information correction in BP decoding: the least reliable bits are determined and, by changing their channel information, the convergence of the decoder is improved. Compared to the first method, this method needs fewer BP iterations (lower complexity) but its performance is not as good.

 

 

 

27.Computing Floating-Point Square Roots via Bivariate Polynomial Evaluation

 

Abstract—

   In this paper, we show how to reduce the computation of correctly rounded square roots of binary floating-point data to the fixed-point evaluation of some particular integer polynomials in two variables. By designing parallel and accurate evaluation schemes for such bivariate polynomials, we show further that this approach allows for high instruction-level parallelism (ILP) exposure, and thus, potentially low-latency implementations. Then, as an illustration, we detail a C implementation of our method in the case of IEEE 754-2008 binary32 floating-point data (formerly called single precision in the 1985 version of the IEEE 754 standard). This software implementation, which assumes 32-bit unsigned integer arithmetic only, is almost complete in the sense that it supports special operands, subnormal numbers, and all rounding-direction attributes, but not exception handling (that is, status flags are not set). Finally, we have carried out experiments with this implementation on the ST231, an integer processor from the STMicroelectronics’ ST200 family, using the ST200 family VLIW compiler. The results obtained demonstrate the practical interest of our approach in that context: for all rounding-direction attributes, the generated assembly code is optimally scheduled and has indeed low latency (23 cycles).

 

 

 

28.Systematic Design of RSA Processors Based on High-Radix Montgomery Multipliers

 

Abstract—

 This paper presents a systematic design approach to provide optimized Rivest–Shamir–Adleman (RSA) processors based on high-radix Montgomery multipliers satisfying various user requirements, such as circuit area, operating time, and resistance against side-channel attacks. In order to address the tradeoff between performance and resistance, we apply four types of exponentiation algorithms: two variants of the binary method with/without the Chinese Remainder Theorem (CRT). We also introduce three multiplier-based datapath architectures using different intermediate data forms: 1) single form, 2) semi carry-save form, and 3) carry-save form, and combine them with a wide variety of arithmetic components. Their radices are parameterized from 2^8 to 2^128. A total of 242 datapaths for 1024-bit RSA processors were obtained for each radix. The potential of the proposed approach is demonstrated through an experimental synthesis of all possible processors with a 90-nm CMOS standard cell library. As a result, designs ranging from the smallest, at 861 gates with 118.47 ms/RSA, to the fastest, at 0.67 ms/RSA with 153 862 gates, were obtained. In addition, the use of the CRT technique reduced the RSA operation time of the fastest design to 0.24 ms. Even when we employed the exponentiation algorithm resistant to typical side-channel attacks, the fastest design could perform the RSA operation in less than 1.0 ms.
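As a rough software illustration of the arithmetic this abstract refers to, the following Python sketch performs modular exponentiation with Montgomery multiplication and the left-to-right binary method. It is a whole-operand version (one reduction with r = 2^k), not the word-serial high-radix datapaths the paper designs, and the plain binary method shown here is not resistant to side-channel attacks. The function names are hypothetical, and the modulus is assumed to be odd and greater than 1.

```python
def montgomery_params(n: int, k: int):
    """Return (r, n_prime) with r = 2**k and n * n_prime == -1 (mod r)."""
    r = 1 << k
    n_prime = (-pow(n, -1, r)) % r  # modular inverse needs Python 3.8+ and odd n
    return r, n_prime


def mont_mul(a: int, b: int, n: int, k: int, n_prime: int) -> int:
    """Montgomery product a * b * r^-1 mod n, for 0 <= a, b < n < r = 2**k."""
    r_mask = (1 << k) - 1
    t = a * b
    m = ((t & r_mask) * n_prime) & r_mask  # chosen so that t + m*n is divisible by r
    u = (t + m * n) >> k                   # exact division by r
    return u - n if u >= n else u          # u < 2n, so one subtraction suffices


def mont_pow(base: int, exp: int, n: int) -> int:
    """Compute base**exp mod n with the left-to-right binary method."""
    k = n.bit_length()
    r, n_prime = montgomery_params(n, k)
    x = (base % n) * r % n   # base in Montgomery form
    acc = r % n              # 1 in Montgomery form
    for bit in bin(exp)[2:]:
        acc = mont_mul(acc, acc, n, k, n_prime)      # square on every bit
        if bit == "1":
            acc = mont_mul(acc, x, n, k, n_prime)    # multiply on set bits
    return mont_mul(acc, 1, n, k, n_prime)           # leave Montgomery form


if __name__ == "__main__":
    print(mont_pow(4, 13, 497) == pow(4, 13, 497))
```

The point of the Montgomery form is that the reduction uses only masking and shifting by k bits, which is why it maps well to hardware multipliers; a high-radix design applies the same idea one word at a time.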

 

 

29.State Metric Compression Techniques for Turbo Decoder Architectures

Abstract—

 This paper proposes to compress state metrics in turbo decoder architectures to reduce the decoder area. Two techniques are proposed: the first is based on non-uniform quantization and the second on the Walsh–Hadamard transform followed by non-uniform quantization. The non-uniform quantization technique reduces the state metric memory area by about 50% compared with architectures where state metric compression is not performed, at the expense of slightly raising the error-correcting performance floor. On the other hand, the Walsh–Hadamard transform based solution offers a good tradeoff between performance loss and memory complexity reduction, reaching in the best case a 20% gain with respect to other approaches. Both solutions show lower power consumption than architectures previously proposed to compress state metrics.
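For readers unfamiliar with the transform named above, here is a minimal Python sketch of the unnormalized fast Walsh–Hadamard transform, which uses only additions and subtractions. It shows the transform step only; the paper's non-uniform quantizer and its bit widths are not given in the abstract and are not reproduced here. The function name `fwht` and the sample vector are illustrative assumptions.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform in natural (Hadamard) order."""
    out = list(values)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        # Butterfly stage: combine elements that are h positions apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


if __name__ == "__main__":
    metrics = [1, 0, 1, 0, 0, 1, 1, 0]   # stand-in for 8 state metrics
    print(fwht(metrics))                  # [4, 2, 0, -2, 0, 2, 0, 2]
```

Applying `fwht` twice returns the input scaled by its length, so the same butterfly network can serve for both compression and reconstruction.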

 

 

 

 

 

30.A Low-Power 64-point Pipeline FFT/IFFT Processor for OFDM Applications

 

Abstract—

 4G and other wireless systems are currently hot topics of research and development in the communication field. Broadband wireless systems based on orthogonal frequency division multiplexing (OFDM) often require an inverse fast Fourier transform (IFFT) to produce multiple subcarriers. In this paper, we present the efficient implementation of a pipeline FFT/IFFT processor for OFDM applications. Our design adopts a single-path delay feedback style as the proposed hardware architecture. To eliminate the read-only memories (ROMs) used to store the twiddle factors, the proposed architecture applies a reconfigurable complex multiplier and bit-parallel multipliers to achieve a ROM-less FFT/IFFT processor, thus consuming lower power than existing works. The design uses about 33.6K gates, and its power consumption is about 9.8 mW at 20 MHz.

 

 

 

31.An Average-Performance-Oriented Subthreshold Processor Self-Timed by Memory Read Completion

Abstract

 A self-timed subthreshold processor was developed in a 65-nm complementary metal–oxide–semiconductor process. This four-stage reduced instruction set computer processor synchronously operates with the memory read completion signal produced in 8.5-kb instruction and 2-kb data memories of subthreshold 10T static random-access memory. Measurement results show that the processor correctly functions from 0.56 to 0.36 V with a self-timed clock and achieves minimum energy per cycle of 3.47 pJ/cycle at 0.46-V supply voltage with 1.76-MHz average frequency. Compared with conventional synchronous operation with guardbanding, the proposed self-timed operation reduces the execution time of SHA-1 by 82% at 0.4-V supply voltage and saves energy by 40% to attain 1-MHz operation.

 

 

 

 

32.Exploring Area and Delay Tradeoffs in FPGAs With Architecture and Automated Transistor Design

 

Abstract—

 Field-programmable gate arrays (FPGAs) are used in a variety of markets that have differing cost, performance and power consumption requirements. While it would be ideal to serve all these markets with a single FPGA family, the diversity in the needs of these markets means that generally more than one family is appropriate. Consequently, FPGA vendors have moved to provide a diverse set of families that sit at different points in the area-speed-power design space. This paper aims to understand the circuit and architectural design attributes of FPGAs that enable tradeoffs between area and speed, and to determine the magnitude of the possible tradeoffs. This will be useful for architects seeking to determine the number of device families in a suite of offerings, as well as the changes to make between families. We explore a broad range of architectures and circuit designs and develop a transistor sizing tool that automatically optimizes each design. In this paper, we describe this tool and demonstrate that it achieves results that are comparable to past work but with vastly less effort. We then use the designs produced by the tool to explore the range of tradeoffs possible. We find that through architecture and transistor sizing changes it is possible to usefully vary the area of an FPGA by a factor of 2.0 and the performance of an FPGA by a factor of 2.1. We also observe that the range of area and delay tradeoffs possible by varying only the transistor sizing of a single architecture is larger than the ranges observed in past architectural experiments. In addition to transistor size, we note that LUT size is one of the most useful parameters for trading off area and delay.

 

 

33.Low power field programmable gate array implementation of fast digital signal processing algorithms: characterisation and manipulation of data locality

Abstract:

   Dynamic power consumption is very dependent on interconnect, so clever mapping of digital signal processing algorithms to parallelised realisations with data locality is vital. This is a particular problem for fast algorithm implementations where typically, designers will have sacrificed circuit structure for efficiency in software implementation. This study outlines an approach for reducing the dynamic power consumption of a class of fast algorithms by minimising the index space separation; this allows the generation of field programmable gate array (FPGA) implementations with reduced power consumption. It is shown how a 50% reduction in relative index space separation results in a measured power gain of 36 and 37% over a Cooley–Tukey Fast Fourier Transform (FFT)-based solution for both actual power measurements for a Xilinx Virtex-II FPGA implementation and circuit measurements for a Xilinx Virtex-5 implementation. The authors show the generality of the approach by applying it to a number of other fast algorithms namely the discrete cosine, the discrete Hartley and the Walsh–Hadamard transforms.

 

 

IEEE 2010

34.FPGA Implementation of Digital Up/Down Convertor for WCDMA System

Abstract-

   In this paper, we present an FPGA implementation of a digital down convertor (DDC) and a digital up convertor (DUC) for a single carrier WCDMA system. The DDC and DUC are complex in nature. The implementation of the DDC is simple because it does not require mixers or filters. Xilinx System Generator and Xilinx ISE are used to develop the hardware circuit for the FPGA. Both circuits are verified on the Virtex-4 FPGA.

35.Low-Complexity Viterbi Decoder for Space-Time Trellis Codes

Abstract—

  Space-time trellis code (STTC) has been widely applied to coded multiple-input multiple-output (MIMO) systems because of its gains in coding and diversity; however, its great decoding complexity makes it less promising in chip realization compared to the space-time block code (STBC). The complexity of STTC decoding lies in the branch metric calculation in the Viterbi algorithm and increases significantly along with the number of antennas and the modulation order. Consequently, a low-complexity algorithm to mitigate the computational burden is proposed. The results show that more than 70%, 78%, and 83% of the computational complexity is reduced for 2x2, 3x3, and 4x4 MIMO configurations, respectively. Based on the proposed algorithm, a reconfigurable MISO STTC Viterbi decoder is designed and implemented using 0.18-µm 1P6M CMOS technology. The decoder achieves 11.14 Mbps, 8.36 Mbps, and 5.75 Mbps for 4-PSK, 8-PSK, and 16-QAM modulations, respectively.

 

 

36.An Energy Efficient Layered Decoding Architecture for LDPC Decoder

 

Abstract—

  A low-density parity-check (LDPC) decoder requires a large amount of memory access, which leads to high energy consumption. To reduce the energy consumption of the LDPC decoder, a memory-bypassing scheme has been proposed for the layered decoding architecture; it reduces the amount of access to the memory storing the soft posterior reliability values. In this work, we present a scheme that achieves the optimal reduction of memory access for the memory-bypassing scheme. The amount of achievable memory bypassing depends on the decoding order of the layers. We formulate the problem of finding the optimal decoding order and propose an algorithm to obtain the optimal solution. We also present the corresponding architecture, which combines some of the memory components and results in a reduction of memory area. The proposed decoder was implemented in a TSMC 0.18 µm CMOS process. Experimental results show that for an LDPC decoder targeting the IEEE 802.11n specification, the amount of memory access can be reduced by 12.9–19.3% compared with the state-of-the-art design. At the same time, a 95.6%–100% hardware utilization rate is achieved.

 

 

37.Improving FPGA Performance for Carry-Save Arithmetic

 

Abstract—

 The selective use of carry-save arithmetic, where appropriate, can accelerate a variety of arithmetic-dominated circuits. Carry-save arithmetic occurs naturally in a variety of DSP applications, and further opportunities to exploit it can be exposed through systematic data flow transformations that can be applied by a hardware compiler. Field-programmable gate arrays (FPGAs), however, are not particularly well suited to carry-save arithmetic. To address this concern, we introduce the “field programmable counter array” (FPCA), an accelerator for carry-save arithmetic intended for integration into an FPGA as an alternative to DSP blocks. In addition to multiplication and multiply accumulation, the FPCA can accelerate more general carry-save operations, such as multi-input addition (e.g., add k>2 integers) and multipliers that have been fused with other adders. Our experiments show that the FPCA accelerates a wider variety of applications than DSP blocks and improves performance, area utilization, and energy consumption compared with soft FPGA logic.
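To make the term concrete, the following Python sketch models carry-save addition on non-negative integers: a 3:2 compressor reduces three operands to a sum word and a carry word without propagating carries, and only the final two words go through an ordinary carry-propagate addition. This is a behavioural illustration, not a model of the FPCA; the function names are hypothetical.

```python
def carry_save_add(x: int, y: int, z: int):
    """3:2 compressor: return (sum_word, carry_word) with sum + carry == x + y + z."""
    sum_word = x ^ y ^ z                               # per-bit sum, no carry chain
    carry_word = ((x & y) | (x & z) | (y & z)) << 1    # per-bit majority, shifted left
    return sum_word, carry_word


def multi_input_add(operands):
    """Add k non-negative integers using repeated 3:2 compression."""
    ops = list(operands)
    while len(ops) > 2:
        s, c = carry_save_add(ops[0], ops[1], ops[2])
        ops = ops[3:] + [s, c]     # three operands become two each step
    return sum(ops)                # single carry-propagate addition at the end


if __name__ == "__main__":
    print(carry_save_add(5, 6, 7))            # (4, 14)
    print(multi_input_add([3, 5, 7, 9, 11]))  # 35
```

Each compression step has constant depth regardless of word length, which is the property that makes multi-input addition and fused multiply-add attractive in carry-save form.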

 

 

38.Parallel Interleavers Through Optimized Memory Address Remapping

Abstract—

   This work presents mathematical models and collision-free exchange rules for a parallel interleaver, from which it develops an optimized memory address remapping (OPMM) scheme that enables a classic interleaver to be exchanged for a parallel interleaver readily and efficiently. Both analytic and experimental results demonstrate that the rate of annealing achieved using the OPMM approach is much faster than that achieved using the traditional memory address remapping (MM) method.

 

 

39.Design and Implementation of Low-Power ANSI S1.11 Filter Bank for Digital Hearing Aids

Abstract—

    Because it closely matches the frequency characteristics of the human ear, the ANSI S1.11 1/3-octave filter bank is popular in acoustic applications, such as acoustic analyzers and equalizers. It is also desirable in hearing aids because the well-known hearing aid prescription formula, NAL-NL1, prescribes its gains at ANSI 1/3-octave frequencies. However, its high computational complexity limits its usage in hearing aids, where power consumption is a critical concern. To address this issue, a low-power design and implementation of the ANSI S1.11 filter bank for digital hearing aids is presented. We first develop a complexity-effective multirate FIR filter bank algorithm. A systematic coefficient design flow is then elaborated for the proposed filter bank to minimize the order of its FIR filters. In an 18-band digital hearing aid with a 24-kHz sampling rate, the proposed algorithm saves about 96% of the multiplications and additions compared with a straightforward FIR filter bank. Moreover, various low-power VLSI design techniques are investigated in detail and applied to our design. The proposed complexity-effective ANSI S1.11 FIR filter bank has been implemented in TSMC 0.13-µm CMOS technology with an area-efficient architecture. The test chip consumes only 87 µW, which is 30%–79% of that of the others available in the literature. The proposed low-power ANSI 1/3-octave filter bank can precisely apply the gains prescribed by the NAL-NL1 formula for hearing-impaired people.
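As background for the 1/3-octave frequencies mentioned above, this Python sketch computes band centre and edge frequencies from the usual ANSI S1.11 relation f_m = 1000 * G^((x - 30) / 3), with G = 2 for the base-2 system or G = 10^(3/10) for the base-10 system. The band range 22 to 39 used in the example is an assumption, chosen only because it gives 18 bands below the 12-kHz Nyquist limit of a 24-kHz sampling rate; the abstract does not state which bands the design covers. The function name is hypothetical.

```python
def third_octave_band(x: int, base2: bool = True):
    """Return (lower_edge, centre, upper_edge) in Hz for 1/3-octave band number x."""
    G = 2.0 if base2 else 10 ** 0.3          # octave ratio: base-2 or base-10 system
    centre = 1000.0 * G ** ((x - 30) / 3)    # band 30 is centred on 1 kHz
    half_band = G ** (1 / 6)                 # edges lie one sixth of an octave away
    return centre / half_band, centre, centre * half_band


if __name__ == "__main__":
    # Assumed band range (not stated in the abstract): 18 bands, numbers 22..39.
    for band in range(22, 40):
        low, mid, high = third_octave_band(band)
        print(f"band {band}: {low:8.1f} Hz  {mid:8.1f} Hz  {high:8.1f} Hz")
```

Each band is one third of an octave wide, so the ratio of upper to lower edge is G^(1/3) for every band, and the bands tile the spectrum without gaps.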

 

 

40.IEC Control Specification to HDL Synthesis: Considerations for Implementing PLC on FPGA and Scope for Research

 

Abstract

  Today’s machine automation systems demand better throughput, faster response, built-in safety features and high-speed communication support, besides satisfying the IEC 61131-3 control specification. MEMS sensors and actuators, along with increased control logic complexity, are stretching the limits of conventional programmable logic controllers (PLCs) generally used for industrial and high-end application control. This is because PLCs are implemented using sequential controllers. Throughput, response time, complex operation and flexible expansion are in such cases limited by the typical fetch, decode and execute cycle. Field-programmable gate arrays (FPGAs), which architecturally can satisfy these requirements, are proposed in the literature as prospective devices to implement PLCs. However, these devices have not been commonly accepted by the control and automation industry. One of the important factors is the difficulty of understanding the FPGA design flow and its design specification standard. Currently, there is no proven interface available in the market to bridge the IEC control specification and the FPGA design specification. The authors of this paper have considered developing such an interface to open up the benefits of VLSI design to the control and automation domain. The issues and considerations for developing such an interface are discussed in this paper. As this is multidisciplinary work, and not much work has been done in this area, possible research opportunities are discussed along with validation platform considerations.

 

 

41.Reducing SRAM Power Using Fine-Grained Wordline Pulsewidth Control

Abstract—

     Embedded SRAM dominates modern SoCs, and there is a strong demand for SRAM with lower power consumption while achieving high performance and high density. However, the large increase of process variations in advanced CMOS technologies is considered one of the biggest challenges for SRAM designers. In the presence of large process variations, SRAMs are expected to consume larger power to ensure correct read operations and meet yield targets. In this paper, we propose a new architecture that significantly reduces the array switching power for SRAM. The proposed architecture combines built-in self-test and digitally controlled delay elements to reduce the wordline pulsewidth for memories while ensuring correct read operations, hence reducing the switching power. Monte Carlo simulations using a 1-Mb SRAM macro in an industrial 45-nm technology are used to verify the power saving for the proposed architecture. For a 48-Mb memory density, a 27% reduction in array switching power can be achieved for a read access yield target of 95%. In addition, the proposed system can provide larger power saving as process variations increase, which makes it an attractive solution for 45-nm-and-below technologies.