
VLSI
2011
1. A High-Resolution Time-to-Digital Converter on FPGA Using Dynamic Reconfiguration
A high-resolution high-precision time-to-digital converter (TDC)
architecture is presented for implementation on field-programmable gate arrays
(FPGAs) supporting dynamic reconfiguration. The proposed architecture relies on
multiple parallel high-resolution delay lines implemented by the programmable
interconnection points within the routing switch fabric. These delay lines
feature a 1-ps resolution over a range of 3 ns. A calibration process is
proposed to take process-voltage-temperature variations, as well as clock skew,
into account. A TDC with a 50-ps resolution and precision as high as 35 ps has
been implemented on a Virtex-II Pro FPGA. Results show that the proposed
architecture and calibration process can be used to achieve resolutions as fine
as 10 ps.
2. High-Efficiency Processing Schedule for Parallel Turbo Decoders Using QPP Interleaver
This paper presents a high-efficiency parallel
architecture for a turbo decoder using a quadratic permutation polynomial (QPP)
interleaver. Conventionally, two half-iterations for different component
codewords alternate during the decoding flow. Due to the initialization
calculation and pipeline delays in every half-iteration, the functional units
in turbo decoders will be idle for several cycles. This inactive period will
degrade throughput, especially for small blocks or high parallelism. To resolve
this issue, we impose several constraints on the QPP interleaver and rearrange
the processing schedule; then the following half-iteration can be executed
before the completion of the current half-iteration. Thus, it can eliminate the
idle cycles and increase the efficiency of functional units. Based on this
modified schedule with 100% efficiency, a parallel turbo decoder which contains
32 radix- SISO decoders is implemented with 90 nm technology to achieve 1.4 Gb/s
while decoding size-4096 blocks for 8 iterations.
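To make the interleaver concrete, the following minimal sketch computes a QPP permutation and checks the contention-free property that lets parallel SISO units read from separate memory banks. The parameters K = 40, f1 = 3, f2 = 10 and the degree of parallelism are illustrative assumptions rather than the size-4096 configuration of the paper, and the function names are hypothetical. The sketch does not model the modified processing schedule itself.

```python
def qpp_interleave(K, f1, f2):
    """Return the QPP permutation pi(i) = (f1*i + f2*i^2) mod K."""
    return [(f1 * i + f2 * i * i) % K for i in range(K)]


def is_contention_free(pi, P):
    """Check that P parallel windows never access the same memory bank.

    The block is split into P windows of length W = K / P. At step j,
    window t reads address pi[j + t*W], which is stored in bank pi[...] // W.
    """
    K = len(pi)
    W = K // P
    for j in range(W):
        banks = {pi[j + t * W] // W for t in range(P)}
        if len(banks) != P:
            return False
    return True


# Assumed example parameters (a small block keeps the output readable).
K, F1, F2 = 40, 3, 10
PI = qpp_interleave(K, F1, F2)
```

With these assumed parameters, `is_contention_free(PI, 4)` can be used to confirm that four windows of length 10 never collide on a bank.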
3. A Reduced-Complexity Architecture for LDPC Layered Decoding Schemes
Abstract—
A reduced-complexity low-density parity-check (LDPC)
layered decoding architecture is proposed using an offset permutation scheme in
the switch networks. This method requires only one shuffle network, rather than
the two shuffle networks which are used in conventional designs. In addition,
we use a block parallel decoding scheme by suitably mapping between required
memory banks and processing units in order to increase the decoding throughput.
The proposed architecture is realized for a 672-bit, rate-1/2 irregular LDPC
code on a Xilinx Virtex-4 FPGA device. The design achieves an information
throughput of 822 Mb/s at a clock speed of 335 MHz with a maximum of 8
iterations.
4. High-Performance and Compact Architecture for Regular Expression Matching on FPGA
Abstract—
We present the design, implementation and
evaluation of a high-performance architecture for regular expression matching (REM)
on field-programmable gate array (FPGA). Each regular expression (regex) is
first parsed into a concise token list representation, then compiled to a
modular nondeterministic finite automaton (RE-NFA) using a modified version of
the McNaughton-Yamada algorithm. The RE-NFA can be mapped directly onto a
compact register-transfer level (RTL) circuit. A number of optimizations are
applied to improve the circuit performance: (1) spatial stacking is used to
construct an REM circuit processing m ≥ 1 input characters per clock cycle; (2)
single-character constrained repetitions are matched efficiently by parallel
shift-register lookup tables; (3) complex character classes are matched by a
BRAM-based classifier shared across regexes; (4) a multi-pipeline architecture
is used to organize a large number of RE-NFAs into priority groups to limit the
I/O size of the circuit. We implemented 2,630 unique PCRE regexes from Snort rules
(February 2010) in the proposed REM architecture. Based on the place-and-route
results from Xilinx ISE 11.1 targeting Virtex5 LX-220 FPGAs, the proposed REM
architecture achieved up to 11 Gbps concurrent throughput for various regex sets
and up to 2.67x the throughput efficiency of other state-of-the-art designs.
5. The Effect of Multi-Bit Correlation on the Design of Field-Programmable Gate
Array Routing Resources
Abstract—
As the logic capacity of field-programmable gate arrays
(FPGAs) increases, they are being increasingly used to implement large
arithmetic-intensive applications. Large arithmetic-intensive applications
often contain a large proportion of datapath circuits. Since datapath circuits
are designed to process multiple-bit-wide data, FPGAs implementing these
circuits often have to transport a large number of multiple-bit-wide signals
from one computing element (such as a logic block, a DSP block, or a multi-bit
addressable memory cell) to another. In this work, we investigate the area
efficiency of FPGA routing resources for transporting multiple-bit-wide
signals. It is shown that, for datapath circuits, the switch patterns used by
the conventional routing architecture, which uniformly distribute routing
switches across the routing tracks, are inefficient for connecting the
computing elements to their tracks. The more efficient multi-bit aware patterns,
which contain a densely populated single-bit region and a sparsely populated
multi-bit region, can be effectively used to reduce the routing area of FPGAs
for implementing arithmetic-intensive applications by 6%–10%. It is also shown
that the further sharing of configuration memory among the switches within the
multi-bit aware patterns does not significantly increase their area efficiency
since datapath circuits typically contain a mixture of multi-bit and single-bit
signals—while configuration memory sharing can substantially increase the area
efficiency of routing resources for transporting multi-bit signals, it also
significantly reduces their ability for transporting single-bit signals. More
importantly, configuration memory sharing can significantly reduce the
effectiveness of the enhanced multi-bit aware patterns—patterns that
incorporate both multi-bit aware and single-bit oriented switches within a
single region in order to increase its ability for transporting both single-bit
and multi-bit signals.
6. Time Multiplexed VLSI Architecture for Real-Time Barrel Distortion Correction
in Video-Endoscopic Images
Abstract—
A low-cost VLSI implementation of real-time correction of barrel distortion for
video-endoscopic images is presented in this paper. The correcting mathematical
model is based on least-squares estimation. To decrease the computing
complexity, we use an odd-order polynomial to approximate the back-mapping
expansion polynomial. By algebraic transformation, the approximated polynomial
becomes a monomial form which can be solved by Horner’s algorithm. With the
iterative characteristic of Horner’s algorithm, the hardware cost and memory
requirement can be reduced by a time-multiplexed design. In addition, a
simplified architecture for the linear interpolation is used to further reduce
computing resources and silicon area. The VLSI architecture of this work
contains 13.9 K gates in a 0.18-μm CMOS process. Compared with some
existing distortion correction techniques, this work reduces hardware cost by
at least 69% and memory requirement by 75%.
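As an illustration of why Horner's algorithm suits a time-multiplexed datapath, the sketch below evaluates an odd-order polynomial in the radius r with one multiply-add per coefficient. The function name and the coefficient values in the usage note are hypothetical; the paper's actual back-mapping coefficients come from its least-squares model and are not reproduced here.

```python
def correct_radius(r, coeffs):
    """Evaluate r * (c0 + c1*r^2 + c2*r^4 + ...) with Horner's rule.

    An odd-order polynomial in r factors into r times a polynomial in
    s = r^2, so a single multiply-accumulate unit can be reused at
    every step instead of instantiating one multiplier per term.
    """
    s = r * r
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * s + c  # one multiply-add per coefficient
    return r * acc
```

For example, with the made-up coefficients `[1.0, 2.0, 3.0]` and `r = 2.0`, the loop computes 3, then 3*4 + 2 = 14, then 14*4 + 1 = 57, and the function returns 2 * 57 = 114.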
7. The ARPA-MT Embedded SMT Processor and Its RTOS Hardware Accelerator
Abstract—
The high-level modeling and parameterization capabilities of current hardware
description languages, as well as the huge integration capacity and flexibility
provided by modern field-programmable gate arrays (FPGAs), open the way to
designing processors tuned to given applications and favoring specific
properties. This paper presents the Advanced Real-time Processor Architecture
(ARPA) MultiThreaded processor, a customizable, synthesizable, and
time-predictable processor model optimized for multitasking real-time embedded
systems, which efficiently exploits modern
FPGA technology. A fundamental processor component is the ARPA operating
system (OS) coprocessor designed for hardware implementation of the
basic real-time OS management functions, such as timing, task scheduling,
synchronization and switching, efficient interrupt handling, and verification
of the timing constraints. The hardware implementation of these functions allows
executing them faster and more predictably, reducing the OS overhead, and
improving its determinism. The performance evaluation has shown reductions of
one to two orders of magnitude in the execution time of some functions of a
real-time executive, in comparison with an analogous software implementation.
8. Iris Biometrics for Embedded Systems
Abstract—
In many applications user authentication has to be carried
out by portable devices. Usually these devices are personal tokens carried by
users, which have many constraints regarding their computational performance,
occupied area, and power consumption. These kinds of devices must deal with
such constraints, while also maintaining high performance rates in the
authentication process. This paper provides solutions to designing such
personal tokens where biometric authentication is required. In this paper, iris
biometrics have been chosen to be implemented due to the low error rates and
the robustness their algorithms provide. Several design alternatives are
presented, and their analyses are reported. With these results, most of the
needs required for the development of an innovative identification product are
covered. Results indicate that the architectures proposed herein are faster (up
to 20 times), and are capable of obtaining error rates equivalent to those
based on computer solutions. Simultaneously, the security and cost for large
quantities are also improved.
9. Channel Estimator and Aliasing Canceller for Equalizing and Decoding Non-Cyclic
Prefixed Single-Carrier Block Transmission via MIMO-OFDM Modem
Abstract—
Without a cyclic prefix (CP), most single-carrier
(SC) transmissions cannot adopt a frequency-domain equalizer (FDE) directly.
This work utilizes a frequency-domain channel estimator (FD-CE) and a
decision-feedback aliasing canceller (DF-AC) to produce a single-FFT SC-FDE. In
this way, non-CP single-carrier block transmission (SCBT) can be decoded using
the sphere decoder of MIMO-OFDM modems to support multimode operation and
backward compatibility at an acceptable complexity in IEEE 802.11 very high
throughput (VHT). An N-point FFT is sufficient to measure channel frequency
responses (CFRs) from L-sample preambles (L ≤ N/2). M-bit block codes (M ≤ N)
are then decodable in the frequency domain with the help of the DF-AC.
Simulations and measurements indicate that this work can ensure adequate
performance even when no CP is present to guard against the distortions of
multipath propagation.
10. Raising FPGA Logic Density Through Synthesis-Inspired Architecture
Abstract—
We leverage properties of the logic synthesis
netlist to define both a new field-programmable gate-array (FPGA) logic element
(function generator) architecture and an associated technology mapping
algorithm that together provide improved logic density. We demonstrate that an
“extended” logic element with slightly modified k-input lookup tables (LUTs)
achieves much of the benefit of an architecture with (k+1)-input LUTs, while
consuming silicon area close to a k-LUT (a k-LUT requires half the area of a (k+1)-LUT).
We introduce the notion of “non-inverting paths” in a circuit’s AND-inverter
graph (AIG) and show their utility in mapping into the proposed logic element architectures.
We propose a general family of logic element architectures, and present results
showing that they offer a variety of area/performance tradeoffs. One of our key
results demonstrates that while circuits mapped to a traditional 5-LUT
architecture need 15% more LUTs and have 14% more depth than a 6-LUT architecture,
our extended 5-LUT architecture requires only 7% more LUTs and 5% more depth
than 6-LUTs, on average. Nearly all of the depth reduction associated with
moving from k-input to k+1-input LUTs can be achieved with considerably less
area using extended k-LUTs. We further show that 6-LUT optimal mapping depths
can be achieved with a small fraction of the LUTs in hardware being 6-LUTs and
the remainder being extended 5-LUTs, suggesting that a heterogeneous logic
block architecture may prove to be advantageous.
11. High Throughput DA-Based DCT With High Accuracy Error-Compensated Adder Tree
Abstract—
In this brief, by operating the shifting and
addition in parallel, an error-compensated adder-tree (ECAT) is proposed to
deal with the truncation errors and to achieve low-error and high-throughput
discrete cosine transform (DCT) design. Instead of the 12 bits used in previous
works, 9-bit distributed-arithmetic precision is chosen for this work so as to
meet peak-signal-to-noise-ratio (PSNR) requirements. Thus, an area-efficient DCT
core is implemented to achieve 1 Gpels/s throughput rate with gate counts of
22.2 K for the PSNR requirements outlined in the previous works.
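For readers unfamiliar with distributed arithmetic (DA), the sketch below shows the basic shift-and-add idea behind a DA inner product: partial sums of the coefficients are precomputed in a lookup table, and the input samples are consumed one bit-plane at a time. It is a simplified illustration that assumes unsigned samples and integer coefficients, with hypothetical function names; it does not model the truncation or the error-compensated adder tree (ECAT) proposed in the brief.

```python
def build_da_lut(coeffs):
    """Precompute every partial sum of the coefficients (2^n entries)."""
    n = len(coeffs)
    return [
        sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(1 << n)
    ]


def da_inner_product(coeffs, samples, bits):
    """Compute sum(coeffs[k] * samples[k]) one input bit-plane at a time.

    Assumes unsigned samples that fit in `bits` bits and integer coefficients.
    """
    lut = build_da_lut(coeffs)
    acc = 0
    for b in range(bits):
        # Gather bit b of every sample into a LUT address.
        addr = 0
        for k, x in enumerate(samples):
            addr |= ((x >> b) & 1) << k
        # Shift the looked-up partial sum by its bit weight and accumulate.
        acc += lut[addr] << b
    return acc
```

In this serial form the shifts and additions happen one after another; the brief's contribution is to perform them in parallel in an adder tree while compensating for the bits that are truncated.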
12. Optimal Generation of Space–Time Trellis Codes via Coset Partitioning
Abstract—
Criteria for designing good space–time trellis codes (STTCs) have been
developed in previous publications. However, the computation of the best STTCs
is time consuming, because a long exhaustive or systematic computing search is
required, particularly for a high number of states and/or transmit antennas. To
reduce the search time, an efficient method must be employed to generate the
STTCs with the best performance. In this paper, a technique called coset
partitioning is proposed to easily and efficiently design optimal 2^n-phase-shift
keying (PSK) STTCs with any number of transmit antennas. Coset partitioning is
an improved extension to multiple-input–multiple-output (MIMO) systems of the
set partitioning proposed by Ungerboeck. This extension is based on the lattice
and coset Calderbank approach. With this method, optimal blocks of the
generator matrix are obtained for 4-PSK and 8-PSK codes. These optimal blocks
lead to the generation of the STTCs with the best Euclidean distances between
the codewords. Thus, new codes are proposed with three to six transmit antennas
for 4-PSK modulation and with three and four transmit antennas for 8-PSK
modulation. These new codes outperform the corresponding best known codes. In
addition, the first 4-PSK STTCs with seven and eight transmit antennas and the first
8-PSK STTCs with five and six transmit antennas are given, and their
performance is evaluated by simulation.
13. Optimizing Floating Point Units in Hybrid FPGAs
Abstract—
This paper introduces a methodology to optimize coarse-grained floating
point units (FPUs) in a hybrid field-programmable gate array (FPGA), where the
FPU consists of a number of interconnected floating point adders/subtracters
(FAs), multipliers (FMs), and wordblocks (WBs). The wordblocks include
registers and lookup tables (LUTs) which can implement fixed point operations
efficiently. We employ common subgraph extraction to determine the best mix of
blocks within an FPU and study the area, speed and utilization tradeoff over a
set of floating point benchmark circuits. We then explore the system impact of
FPU density and flexibility in terms of area, speed, and routing resources.
Finally, we derive an optimized coarse-grained FPU by considering both
architectural and system-level issues. This proposed methodology can be used to
evaluate a variety of FPU architecture optimizations. The results for the
selected FPU architecture optimization show that although high density FPUs are
slower, they have the advantages of improved area, area-delay product, and
throughput.
14. Reduced-Complexity Decoder Architecture for Non-Binary LDPC Codes
Abstract—
Non-binary low-density parity-check (NB-LDPC) codes
can achieve better error-correcting performance than binary LDPC codes when the
code length is moderate, at the cost of higher decoding complexity. The high
complexity is mainly caused by the complicated computations in the check node processing
and the large memory requirement. In this paper, a novel check node processing
scheme and corresponding VLSI architectures are proposed for the Min-max
NB-LDPC decoding algorithm. The proposed scheme first sorts out a limited
number of the most reliable variable-to-check (v-to-c) messages, then the
check-to-variable (c-to-v) messages to all connected variable nodes are derived
independently from the sorted messages without noticeable performance loss.
Compared to the previous iterative forward-backward check node processing, the
proposed scheme not only significantly reduces the computation complexity, but also eliminates
the memory required for storing the intermediate messages generated from the
forward and backward processes. Inspired by this novel c-to-v message
computation method, we propose to store the most reliable v-to-c messages as
“compressed” c-to-v messages. The c-to-v messages will be recovered from the
compressed format when needed. Accordingly, the memory requirement of the
overall decoder can be substantially reduced. Compared to the previous Min-max
decoder architecture, the proposed design for a (837, 726) code over GF(2^5)
can achieve the same throughput with only 46% of the area.
15. A Hardware Implementation of a Run-Time Scheduler for Reconfigurable Systems
Abstract—
New generation embedded systems demand high performance,
efficiency, and flexibility. Reconfigurable hardware can provide all these
features. However, the costly reconfiguration process and the lack of
management support have prevented a broader use of these resources. To solve
these issues we have developed a scheduler that deals with task-graphs at
run-time, steering their execution in the reconfigurable resources while carrying
out both prefetch and replacement techniques that cooperate to hide most of the
reconfiguration delays. In our scheduling environment, task-graphs are analyzed
at design-time to extract useful information. This information is used at
run-time to obtain near-optimal schedules, escaping from local-optimum
decisions, while only carrying out simple computations. Moreover, we have
developed a hardware implementation of the scheduler that applies all the optimization
techniques while introducing a delay of only a few clock cycles. In the
experiments our scheduler clearly outperforms conventional run-time schedulers
based on as-soon-as-possible techniques. In addition, our replacement policy,
specially designed for reconfigurable systems, achieves almost optimal results
both regarding reuse and performance.
16. FPGA Logic Block to Implement Resource-Efficient Multiplexers
Abstract—
This paper presents an input multiplexer structure
for implementing multiplexers from user designs in FPGA CLBs (configurable
logic blocks). Input multiplexers providing the LUT (look-up table) data input signals
are modified to function not just based on values stored in configuration memory
cells, but also under the control of signals from a user’s circuit. Thus, the
proposed input multiplexers are much more flexible than traditional input
multiplexers. Compared with an FPGA using traditional input multiplexers, the
FPGA using the proposed input multiplexers requires fewer LUTs, consumes less
power, and achieves a higher operating clock frequency for user circuits that
include a large number of multiplexers.
17. VLSI Implementation of a Mixed Bio-signal Lossless Data Compressor for
Portable Brain-Heart Monitoring Systems
This paper presents a highly integrated VLSI implementation of a mixed
bio-signal lossless
data compressor capable of handling multichannel electroencephalogram (EEG),
electrocardiogram (ECG) and diffuse optical tomography (DOT) bio-signal data
for reduced storage and communication bandwidth requirements in portable,
wireless brain-heart monitoring systems for use in the hospital or home care
setting.
18. Efficient Multi-Input/Multi-Output VLSI Architecture for Two-Dimensional
Lifting-Based Discrete Wavelet Transform
Abstract—
This brief paper proposes an efficient
multi-input/multi-output VLSI architecture (MIMOA) for two-dimensional
lifting-based discrete wavelet transform (DWT). The novelty lies in the simplicity
and generality of constructing the MIMOA, which is a high-speed architecture with
computing time as low as N^2/M for an N x N image with a controlled increase in
hardware cost, where M is the throughput rate.
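The abstract does not say which wavelet filter the architecture targets. As a hedged illustration of what a lifting-based DWT computes, the sketch below implements one level of the reversible integer 5/3 lifting transform in one dimension, with symmetric boundary extension. The choice of the 5/3 filter, the even-length input, and the function names are all assumptions made for this example; a 2-D transform would apply the same steps along rows and then columns.

```python
def lift_53_forward(x):
    """One level of the integer 5/3 lifting transform on an even-length list."""
    half = len(x) // 2
    even = x[0::2]
    odd = x[1::2]
    # Predict step: detail = odd sample minus the mean of its even neighbours.
    d = [
        odd[i] - ((even[i] + even[min(i + 1, half - 1)]) >> 1)
        for i in range(half)
    ]
    # Update step: smooth = even sample plus a quarter of neighbouring details.
    s = [
        even[i] + ((d[max(i - 1, 0)] + d[i] + 2) >> 2)
        for i in range(half)
    ]
    return s, d


def lift_53_inverse(s, d):
    """Undo lift_53_forward by running the lifting steps in reverse order."""
    half = len(s)
    even = [
        s[i] - ((d[max(i - 1, 0)] + d[i] + 2) >> 2)
        for i in range(half)
    ]
    odd = [
        d[i] + ((even[i] + even[min(i + 1, half - 1)]) >> 1)
        for i in range(half)
    ]
    x = [0] * (2 * half)
    x[0::2] = even
    x[1::2] = odd
    return x
```

Because each lifting step is undone exactly by subtracting what was added, the inverse reconstructs the input without loss, which is one reason lifting schemes map well to compact hardware.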
19. Stochastic Analysis of the Normalized Subband Adaptive Filter Algorithm
Abstract—
This paper studies the statistical behavior
of the normalized subband adaptive filtering (NSAF) algorithm. An accurate statistical
model of the NSAF algorithm is obtained. In the derivation, we focus on
Gaussian correlated input signals. By assuming that the analysis filter bank is
paraunitary and taking into account the full band adaptation mechanism of the
NSAF, expressions for the first and the second moments of the adaptive filter weights
are derived without invoking the slow adaptation assumption. In the
derivations, several hyperelliptic integrals appear. To tackle those integrals
induced by Gaussian correlated inputs, we first give a solution by resorting to
the adaptive Lobatto quadrature. By invoking the averaging principle, two other
approximation methods, the chi-square method and the partial fraction
expansion method, are presented to approximate the statistical model as
well. Monte Carlo (MC) simulation results corroborate our predictions. The
Lobatto quadrature method achieves a good agreement with the MC simulation
results, even for a relatively large step size. Compared with the chi-square
method and the partial fraction expansion method, the Lobatto quadrature method
gives better performance in terms of predicting the mean square error when the
length of the adaptive filters is small to medium. The chi-square approximation
method and the partial fraction expansion method give a satisfactory
performance with a relatively low computational complexity when the filter
length is large.
20. IR-Drop Aware Clustering Technique for Robust Power Grid in FPGAs
Abstract—
IR-drop management in the power supply network of a chip is one of the
critical design challenges in nanometer VLSI circuits. Techniques developed for
application-specific integrated circuits cannot be directly applied to IR-drop
management in field-programmable gate arrays (FPGAs) because of the
programmable nature of FPGAs. This paper proposes a novel clustering technique
for improving the supply voltage profile in the power grid of FPGAs. The proposed
clustering technique not only improves the minimum voltage at any node in the
circuit, but also reduces the variance in supply voltage across the nodes in the
power grid. Results indicate that a reduction of up to 36% in IR-drop and 27%
in spatial VDD variation can be achieved using the proposed
clustering technique.
21. Interpolation-Based QR Decomposition and Channel Estimation Processor for
MIMO-OFDM System
Abstract—
This paper presents a modified interpolation-based
QR decomposition algorithm for grouped-ordering multiple-input multiple-output
(MIMO) orthogonal frequency division multiplexing (OFDM) systems. Based
on the original research that integrates the calculations of the
frequency-domain channel estimation and the QR decomposition for the MIMO-OFDM system,
this study proposes a modified algorithm that possesses a scalable property to
save power consumption for interpolation-based QR decomposition in the
variable-rank MIMO scheme. Furthermore, we also develop the general equations
and a timing scheduling method for the hardware design of the proposed QR
decomposition processor for the higher-dimension MIMO system. Based on the
proposed algorithm, a configurable interpolation-based QR decomposition and
channel estimation processor was designed and implemented using a 90-nm
one-poly nine-metal CMOS technology. The processor supports 2x2, 2x4 and 4x4
QR-based MIMO detection for the 3GPP-LTE MIMO-OFDM system and achieves the
throughput of 35.16 MQRD/s at its maximum clock rate of 140.65 MHz.
22. Transmit Processing Techniques Based on Switched Interleaving and Limited
Feedback for Interference Mitigation in Multiantenna MC-CDMA Systems
Abstract—
In this paper, we propose transmit processing techniques based on novel switched
interleaving, chipwise linear precoding, and limited feedback for both downlink
and uplink multicarrier code-division multiple access (MC-CDMA)
multiple-antenna systems. We develop transceiver
structures with switched interleaving, linear precoding, and detectors for both
uplink and downlink using limited-feedback techniques. In the proposed schemes,
a set of possible chip interleavers is constructed and prestored at both the
base station (BS) and mobile stations (MSs). For the downlink, a new hybrid
transmit processing technique based on switched interleaving and chipwise
precoding is proposed to suppress the multiuser interference. The BS and MSs are
also equipped with another codebook of quantized downlink channel-state
information (CSI). Each MS quantizes its own downlink CSI and feeds the index
back to the BS through a low-rate feedback channel. Then, the selection
function at the BS determines the optimum interleaver based on the CSI of all users.
Moreover, a transmit processing technique for the uplink of multiple-antenna
MC-CDMA systems, which requires a very low rate of feedback information, is
also proposed. Codebook design methods for both interleavers and quantized CSI
are also proposed. Simulation results show that the performance of the proposed
techniques is significantly better than prior art.
23. Iterative Multiuser Detectors for Spatial–Frequency–Time-Domain Spread
Multi-Carrier DS-CDMA Systems
Abstract—
This paper presents three low-complexity, yet effective,
multiuser detectors (MUDs) for the uplink of spatial–frequency–time-domain
(SFT-domain) spread multicarrier (MC) direct-sequence code-division
multiple-access (DS-CDMA) systems. Each MUD first converts a received signal
into the corresponding format. It then detects the transmitted symbols
iteratively by using a one-domain (1D) minimum-mean-square-error (MMSE) detector
and a two-domain (2D) MMSE detector alternately. The weights of the 1D detector
are updated to minimize the mean-square error (MSE) of the detection output by
setting the weights of the 2D detector at the values determined in the previous
iteration, and vice versa. Performance analyses of the bit-error-rate (BER)
bound, the convergence, and the required computational complexity are conducted
to assess the feasibility of the proposed MUDs. Finally, the results of
computer simulations demonstrate that the performance of the proposed schemes
is close to that of the joint SFT-domain MMSE MUD in most scenarios but with
lower computational overheads.
24. Transmit Precoding for Flat-Fading MIMO Multiuser Systems With Maximum Ratio
Combining Receivers
Abstract—
We examine the application of transmit precoding in multiuser
multi-input–multi-output (MIMO) communication systems with maximum ratio
combining (MRC) receivers. In
many multiuser applications, the maximum-likelihood or minimum mean-square
error (MMSE) receivers can be prohibitive to implement due to their high
implementation complexity. We examine the performance of the system with simple
MRC receivers and carefully selected precoders, which are designed at the
transmitter side to compensate for the lack of high-complexity receivers. We
examine the sum MSE minimization and signal-to-interference-plus-noise ratio
(SINR) balancing frameworks for the selection of precoders. The performance of
the two frameworks with MRC receivers is compared with each other as well as with
their counterparts implementing MMSE receivers. It has been observed that the
SINR balancing framework with simple MRC receivers has little performance loss
in comparison with the MMSE receivers with a proper selection of precoders.
25. Generalized Space-Time Shift Keying Designed for Flexible Diversity-,
Multiplexing- and Complexity-Tradeoffs
Abstract—
In this paper, motivated by the recent concept
of Spatial Modulation (SM), we propose a novel Generalized Space-Time Shift
Keying (G-STSK) architecture, which acts as a unified Multiple-Input
Multiple-Output (MIMO) framework. More specifically, our G-STSK scheme is based
on the rationale that 𝑃 out of 𝑄 dispersion
matrices are selected and linearly combined in conjunction with the classic
PSK/QAM modulation, where activating 𝑃 out of 𝑄 dispersion
matrices provides an implicit means of conveying information bits in addition
to the classic modem. Due to its substantial flexibility, our G-STSK framework
includes diverse MIMO arrangements, such as SM, Space-Shift Keying (SSK),
Linear Dispersion Codes (LDCs), Space-Time Block Codes (STBCs) and Bell Labs’
Layered Space-Time (BLAST) scheme. Hence it has the potential of subsuming all
of them, when flexibly adapting a set of system parameters. Moreover, we also
derive the Discrete-input Continuous-output Memoryless Channel (DCMC) capacity
for our G-STSK scheme, which serves as the unified capacity limit, hence
quantifying the capacity of the class of MIMO arrangements. Furthermore,
EXtrinsic Information Transfer (EXIT) chart analysis is used for designing our
G-STSK scheme and for characterizing its iterative decoding convergence.
26. Efficient Iterative Techniques for Soft Decision Decoding of Reed-Solomon Codes
Abstract—
Two new iterative soft decision decoding methods for Reed-Solomon (RS) codes are
proposed. These methods are based on bit level belief propagation (BP)
decoding. In order to make BP decoding effective for RS codes, we use an
extended binary parity check matrix with a lower density and reduced number of 4-cycles
compared to the original binary parity check matrix of the code. In the first
proposed method, we take advantage of the cyclic structure of RS codes. Based
on this property, we can apply the belief propagation algorithm on any
cyclically shifted version of the received symbols with the same binary parity
check matrix. For each shifted version of received symbols, the distribution of
reliability values will change and deterministic errors can be avoided. This
method results in considerable performance improvement of RS codes compared to
hard decision decoding. The performance is also superior to some popular soft
decision decoding methods. The second method is based on information correction
in BP decoding: we determine the least reliable bits and, by changing
their channel information, improve the convergence of the decoder. Compared
to the first method, this method needs fewer BP iterations (lower complexity), but
its performance is not as good.
27. Computing Floating-Point Square Roots via Bivariate Polynomial Evaluation
Abstract—
In this paper, we show how to reduce the
computation of correctly rounded square roots of binary floating-point data to
the fixed-point evaluation of some particular integer polynomials in two
variables. By designing parallel and accurate evaluation schemes for such
bivariate polynomials, we show further that this approach allows for high
instruction-level parallelism (ILP) exposure, and thus, potentially low-latency
implementations. Then, as an illustration, we detail a C implementation of our
method in the case of IEEE 754-2008 binary32 floating-point data (formerly
called single precision in the 1985 version of the IEEE 754 standard). This
software implementation, which assumes 32-bit unsigned integer arithmetic only,
is almost complete in the sense that it supports special operands, subnormal
numbers, and all rounding-direction attributes, but not exception handling
(that is, status flags are not set). Finally, we have carried out experiments
with this implementation on the ST231, an integer processor from the
STMicroelectronics’ ST200 family, using the ST200 family VLIW compiler. The
results obtained demonstrate the practical interest of our approach in that context:
for all rounding-direction attributes, the generated assembly code is optimally
scheduled and has indeed low latency (23 cycles).
28.Systematic Design of RSA Processors Based on High-Radix
Montgomery Multipliers
Abstract—
This paper presents a systematic design approach to
provide the optimized Rivest–Shamir–Adleman (RSA) processors based on
high-radix Montgomery multipliers satisfying various user requirements, such as
circuit area, operating time, and resistance against side-channel attacks. In
order to address the tradeoff between performance and resistance, we apply
four types of exponentiation algorithms: two variants of the binary method
with/without Chinese Remainder Theorem (CRT). We also introduce three
multiplier-based datapath architectures using different intermediate data
forms: 1) single form, 2) semi carry-save form, and 3) carry-save form, and
combine them with a wide variety of arithmetic components. Their radices are
parameterized from 2^8 to 2^128. A total of 242 datapaths
for 1024-bit RSA processors were obtained for each radix. The potential of the
proposed approach is demonstrated through an experimental synthesis of all
possible processors with a 90-nm CMOS standard cell library. As a result, designs
ranging from the smallest, at 861 gates with 118.47 ms/RSA, to the fastest, at 0.67
ms/RSA with 153 862 gates, were obtained. In addition, the use of the CRT
technique reduced the RSA operation time of the fastest design to 0.24 ms. Even
if we employed the exponentiation algorithm resistant to typical side-channel
attacks, the fastest design can perform the RSA operation in less than 1.0 ms.
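As a rough illustration of the arithmetic that such datapaths build on, the following is a minimal Python sketch of word-serial (high-radix) Montgomery multiplication with radix 2^k. The function names (montgomery_multiply, mod_mul) and the small parameters are illustrative assumptions and are not taken from the paper; the sketch says nothing about the paper's datapath forms or side-channel countermeasures.

```python
def montgomery_multiply(a, b, n, k, num_words):
    """Return a * b * R^-1 mod n, where R = 2^(k * num_words).

    Assumes n is odd, a < n, and b < n <= R. One k-bit word of b is
    consumed per iteration, which is what "radix 2^k" means here.
    """
    radix = 1 << k
    mask = radix - 1
    n_prime = (-pow(n, -1, radix)) % radix  # -n^-1 mod 2^k (Python 3.8+)
    t = 0
    for i in range(num_words):
        b_i = (b >> (k * i)) & mask        # i-th radix-2^k digit of b
        t += a * b_i
        q = ((t & mask) * n_prime) & mask  # chosen so t + q*n is divisible by 2^k
        t = (t + q * n) >> k               # exact division by the radix
    if t >= n:                             # single final correction step
        t -= n
    return t


def mod_mul(a, b, n, k, num_words):
    """Plain a * b mod n computed with two Montgomery multiplications."""
    r2 = pow(1 << (k * num_words), 2, n)                 # R^2 mod n
    a_mont = montgomery_multiply(a, r2, n, k, num_words)  # a * R mod n
    return montgomery_multiply(a_mont, b, n, k, num_words)
```

Increasing k reduces the number of loop iterations per multiplication at the cost of wider multipliers, which is the kind of area/time tradeoff a radix-parameterized design explores.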
29.State Metric Compression Techniques for Turbo
Decoder Architectures
Abstract—
This paper proposes to compress state metrics in turbo
decoder architectures to reduce the decoder area. Two techniques are proposed:
the first is based on non-uniform quantization and the second on the
Walsh–Hadamard transform followed by non-uniform quantization. The non-uniform quantization
technique reduces the state metric memory area by about 50% compared with
architectures where state metric compression is not performed, at the expense
of slightly increasing the error correcting performance floor. On the other
hand, the Walsh–Hadamard transform-based solution offers a good tradeoff between
performance loss and memory complexity reduction, reaching in the best
case a 20% gain with respect to other approaches. Both solutions show lower
power consumption than architectures previously proposed to compress state
metrics.
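To make the second technique more concrete, here is a minimal Python sketch of a Walsh–Hadamard transform followed by non-uniform quantization, applied to a vector of state metrics. The level table LEVELS and the helper names (fwht, quantize, compress, decompress) are hypothetical choices for illustration only; the paper's actual quantization levels and bit widths are not given in this abstract.

```python
import bisect


def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform (length must be a power of two)."""
    out = list(values)
    h = 1
    while h < len(out):
        for start in range(0, len(out), 2 * h):
            for j in range(start, start + h):
                x, y = out[j], out[j + h]
                out[j], out[j + h] = x + y, x - y  # butterfly
        h *= 2
    return out


# Hypothetical non-uniform levels: fine steps near zero, coarse steps far from it.
LEVELS = [-64, -32, -16, -8, -4, -2, -1, 0, 1, 2, 4, 8, 16, 32, 64]


def quantize(x):
    """Return the index of the level nearest to x (4 bits suffice for 15 levels)."""
    i = bisect.bisect_left(LEVELS, x)
    candidates = [c for c in (i - 1, i) if 0 <= c < len(LEVELS)]
    return min(candidates, key=lambda c: abs(LEVELS[c] - x))


def compress(metrics):
    """Transform the state metrics, then store only the level indices."""
    return [quantize(c) for c in fwht(metrics)]


def decompress(indices):
    """Look up the levels and invert the transform (the WHT is its own inverse up to 1/n)."""
    coeffs = [LEVELS[i] for i in indices]
    return [v / len(coeffs) for v in fwht(coeffs)]
```

The reconstruction is lossy: the stored indices need fewer bits than the original metrics, and the quantization error is what shows up as a performance loss in the decoder.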
30.A Low-Power
64-point Pipeline FFT/IFFT Processor for OFDM Applications
Abstract—
4G and other wireless systems are currently
hot topics of research and development in the communication field. Broadband
wireless systems based on orthogonal frequency division multiplexing (OFDM)
often require an inverse fast Fourier transform (IFFT) to produce multiple
subcarriers. In this paper, we present the efficient implementation of a
pipeline FFT/IFFT processor for OFDM applications. Our design adopts a
single-path delay feedback style as the proposed hardware architecture. To
eliminate the read-only memories (ROM’s) used to store the twiddle factors, the
proposed architecture applies a reconfigurable complex multiplier and
bit-parallel multipliers to achieve a ROM-less FFT/IFFT processor, thus
consuming lower power than the existing works. The design uses about 33.6K
gates, and its power consumption is about 9.8 mW at 20 MHz.
31.An
Average-Performance-Oriented Subthreshold Processor Self-Timed by Memory Read
Completion
Abstract—
A self-timed subthreshold processor was developed in a 65-nm complementary
metal–oxide–semiconductor process. This four-stage reduced instruction set
computer processor synchronously operates with the memory read completion
signal produced in 8.5-kb instruction and 2-kb data memories of subthreshold 10T
static random-access memory. Measurement results show that the processor
correctly functions from 0.56 to 0.36 V with a self-timed clock and achieves
minimum energy per cycle of 3.47 pJ/cycle at 0.46-V supply voltage with
1.76-MHz average frequency. Compared with conventional synchronous operation
with guardbanding, the proposed self-timed operation
reduces the execution time of SHA-1 by 82% at 0.4-V supply voltage and saves energy
by 40% to attain 1-MHz operation.
32.Exploring Area and Delay Tradeoffs in FPGAs With Architecture
and Automated Transistor Design
Abstract—
Field-programmable gate arrays (FPGAs) are
used in a variety of markets that have differing cost, performance and power
consumption requirements. While it would be ideal to serve all these markets
with a single FPGA family, the diversity in the needs of these markets means
that generally more than one family is appropriate. Consequently, FPGA vendors
have moved to provide a diverse set of families that sit at different points in
the area-speed-power design space. This paper aims to understand the circuit
and architectural design attributes of FPGAs that enable tradeoffs between area
and speed, and to determine the magnitude of the possible tradeoffs. This will
be useful for architects seeking to determine the number of device families in
a suite of offerings, as well as the changes to make between families. We
explore a broad range of architectures and circuit designs and develop a
transistor sizing tool that automatically optimizes each design. In this paper,
we describe this tool and demonstrate that it achieves results that are
comparable to past work but with vastly less effort. We then use the designs
produced by the tool to explore the range of tradeoffs possible. We find that
through architecture and transistor sizing changes it is possible to usefully
vary the area of an FPGA by a factor of 2.0 and the performance of an FPGA by a
factor of 2.1. We also observe that the range of area and delay tradeoffs possible
by varying only the transistor sizing of a single architecture is larger than
the ranges observed in past architectural experiments. In addition to
transistor size, we note that LUT size is one of the most useful parameters for
trading off area and delay.
33.Low power field programmable gate array implementation
of fast digital signal processing algorithms: characterisation and manipulation
of data locality
Abstract:
Dynamic power consumption is very dependent
on interconnect, so clever mapping of digital signal processing algorithms to
parallelised realisations with data locality is vital. This is a particular
problem for fast algorithm implementations where typically, designers will have
sacrificed circuit structure for efficiency in software implementation. This
study outlines an approach for reducing the dynamic power consumption of a
class of fast algorithms by minimising the index space separation; this allows
the generation of field programmable gate array (FPGA) implementations with
reduced power consumption. It is shown how a 50% reduction in relative index
space separation results in power gains of 36% and 37% over a
Cooley–Tukey Fast Fourier Transform (FFT)-based solution, in both actual power
measurements for a Xilinx Virtex-II FPGA implementation and circuit
measurements for a Xilinx Virtex-5 implementation. The authors show the generality
of the approach by applying it to a number of other fast algorithms, namely the
discrete cosine, the discrete Hartley and the Walsh–Hadamard transforms.
IEEE 2010
34.FPGA Implementation of Digital Up/Down Convertor for WCDMA System
Abstract-
In this paper, we present an FPGA implementation of a digital down convertor (DDC)
and a digital up convertor (DUC) for a single-carrier WCDMA system. The DDC and
DUC are complex in nature. The implementation of the DDC is simple because
it does not require mixers or filters. Xilinx System Generator and Xilinx ISE
are used to develop the hardware circuit for the FPGA. Both circuits are
verified on the Virtex-4 FPGA.
35.Low-Complexity
Viterbi Decoder for Space-Time Trellis Codes
Abstract—
Space-time trellis code (STTC) has been widely applied
to coded multiple-input multiple-output (MIMO) systems because of its gains in
coding and diversity; however, its great decoding complexity makes it less
promising in chip realization compared to the space-time block code (STBC). The
complexity of STTC decoding lies in the branch metric calculation in the Viterbi
algorithm and increases significantly along with the number of antennas and the
modulation order. Consequently, a low-complexity algorithm to mitigate the
computational burden is proposed. The results show that more than 70%, 78%, and
83% of the computational complexity is reduced for 2x2, 3x3, and 4x4 MIMO configurations, respectively. Based
on the proposed algorithm, a reconfigurable MISO STTC Viterbi decoder is
designed and implemented using 0.18-µm 1P6M CMOS technology. The decoder
achieves 11.14 Mbps, 8.36 Mbps, and 5.75 Mbps for 4-PSK, 8-PSK, and 16-QAM
modulations, respectively.
36.An
Energy Efficient Layered Decoding Architecture for LDPC Decoder
Abstract—
Low-density parity-check (LDPC) decoders require a large
amount of memory access, which leads to high energy consumption. To reduce the
energy consumption of the LDPC decoder, a memory-bypassing scheme has been
proposed for the layered decoding architecture, which reduces the amount of
access to the memory storing the soft posterior reliability values. In this work,
we present a scheme that achieves the optimal reduction of memory access for
the memory-bypassing scheme. The amount of achievable memory bypassing depends
on the decoding order of the layers. We formulate the problem of finding the
optimal decoding order and propose an algorithm to obtain the optimal solution. We
also present the corresponding architecture, which combines some of the memory
components and results in a reduction of memory area. The proposed decoder was
implemented in a TSMC 0.18 µm CMOS process. Experimental results show that for an LDPC
decoder targeting the IEEE 802.11n specification, the amount of memory access
can be reduced by 12.9–19.3% compared with the state-of-the-art design.
At the same time, a 95.6%–100% hardware utilization rate is achieved.
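The dependence of memory bypassing on layer order can be illustrated with a small Python sketch. It rests on a simplifying assumption that is not stated in the abstract: each layer is modelled as the set of block columns it touches, and the bypass opportunity between two consecutive layers is taken to be the number of columns they share. The exhaustive search and the names bypass_count and best_order are illustrative and are not the paper's algorithm.

```python
from itertools import permutations


def bypass_count(order, layers):
    """Total columns shared by cyclically consecutive layers in the given order.

    Assumed cost model: a value used by two back-to-back layers can be
    forwarded directly instead of being written to and re-read from memory.
    """
    total = 0
    for i, layer in enumerate(order):
        next_layer = order[(i + 1) % len(order)]  # wraps into the next iteration
        total += len(layers[layer] & layers[next_layer])
    return total


def best_order(layers):
    """Exhaustive search over layer orders; only practical for a handful of layers."""
    rest = range(1, len(layers))
    # Layer 0 is fixed first because the schedule is cyclic across iterations.
    candidates = ((0,) + p for p in permutations(rest))
    return max(candidates, key=lambda order: bypass_count(order, layers))
```

For example, with the toy layers [{0, 1, 2}, {3, 4, 5}, {1, 2, 6}, {4, 5, 7}], the natural order (0, 1, 2, 3) shares no columns between neighbours, while a reordered schedule can share four. Realistic code sizes need a smarter search than enumeration, since the number of orders grows factorially with the number of layers.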
37.Improving
FPGA Performance for Carry-Save Arithmetic
Abstract—
The selective use of carry-save arithmetic, where appropriate,
can accelerate a variety of arithmetic-dominated circuits. Carry-save
arithmetic occurs naturally in a variety of DSP applications, and further
opportunities to exploit it can be exposed through systematic data flow
transformations that can be applied by a hardware compiler. Field-programmable
gate arrays (FPGAs), however, are not particularly well suited to carry-save arithmetic.
To address this concern, we introduce the “field programmable counter array”
(FPCA), an accelerator for carry-save arithmetic intended for integration into
an FPGA as an alternative to DSP blocks. In addition to multiplication and multiply
accumulation, the FPCA can accelerate more general carry-save operations, such
as multi-input addition (e.g., add k>2 integers) and multipliers that have been
fused with other adders. Our experiments show that the FPCA accelerates a wider
variety of applications than DSP blocks and improves performance, area
utilization, and energy consumption compared with soft FPGA logic.
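For readers unfamiliar with carry-save arithmetic, the following minimal Python sketch shows the basic idea behind multi-input addition: each stage reduces three operands to a sum word and a carry word without propagating carries, and a single carry-propagate addition is performed only at the end. The function names are illustrative, and the sketch models the arithmetic only, not the FPCA hardware.

```python
def carry_save_add(x, y, z):
    """3:2 compressor: reduce three non-negative integers to a (sum, carry) pair.

    Each bit position is handled independently, so no carry ripples
    across the word; the invariant is sum + carry == x + y + z.
    """
    partial_sum = x ^ y ^ z                      # bitwise sum without carries
    carry = ((x & y) | (x & z) | (y & z)) << 1   # majority bits, shifted into place
    return partial_sum, carry


def multi_input_add(operands):
    """Add k > 2 non-negative integers with carry-save stages and one final add."""
    s, c = 0, 0
    for value in operands:
        s, c = carry_save_add(s, c, value)
    return s + c  # the only carry-propagate addition
```

In hardware the final addition is the only step whose delay grows with the word length, which is why deferring it pays off when many operands are summed.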
38.Parallel
Interleavers Through Optimized Memory Address Remapping
Abstract—
This work presents mathematical models and collision-free
exchange rules for a parallel interleaver, with which it develops an
optimized memory address remapping (OPMM) scheme that enables a classic
interleaver to be exchanged for a parallel interleaver readily and efficiently.
Both analytic and experimental results demonstrate that the rate of annealing
achieved using the OPMM approach is much faster than that achieved using the
traditional memory address remapping (MM) method.
39.Design and Implementation of Low-Power ANSI S1.11
Filter Bank for Digital Hearing Aids
Abstract—
Because it matches the frequency characteristics of the human ear well, the ANSI
S1.11 1/3-octave filter bank is popular in acoustic applications, such as
acoustic analyzers and equalizers. It is also desirable in hearing aids because
the well-known hearing aid prescription formula, NAL-NL1, prescribes its gains at
ANSI 1/3-octave frequencies. However, its high computational complexity limits
its usage in hearing aids, in which power consumption is a critical concern. To address
this issue, a low-power design and implementation of an ANSI S1.11 filter bank for
digital hearing aids is presented. We first develop a complexity-effective
multirate FIR filter bank algorithm. A systematic coefficient design flow
is then elaborated for the proposed filter bank to minimize the order of its FIR
filters. In an 18-band digital hearing aid with a 24-kHz sampling rate,
the proposed algorithm saves about 96% of the multiplications and additions
compared with a straightforward FIR filter bank. Moreover, various
low-power VLSI design techniques are investigated in detail and applied to our
design. The proposed complexity-effective ANSI S1.11 FIR filter bank has been
implemented in TSMC 0.13-µm CMOS technology with an area-efficient
architecture. The test chip consumes only 87 µW, which is 30%–79% of that of the
others available in the literature. The proposed low-power ANSI 1/3-octave filter bank
can precisely apply the prescribed gains obtained by the
NAL-NL1 prescription formula for hearing-impaired people.
40.IEC Control Specification to HDL Synthesis: Considerations
for Implementing PLC on FPGA and Scope for Research
Abstract –
Today’s machine automation systems demand
better throughput, faster response, built-in safety features and high-speed
communication support, besides satisfying the IEC 61131-3 control specification.
MEMS sensors and actuators, along with increased control logic complexity,
are stretching the limits of conventional programmable logic controllers (PLCs)
generally used for industrial and high-end application control. This is because
PLCs are implemented using sequential controllers. Throughput, response time, complex
operation and flexible expansion in such cases are limited by the typical fetch,
decode and execute cycle. Field-programmable gate arrays (FPGAs), which
architecturally can satisfy these requirements, are proposed in the literature as
prospective devices to implement PLCs. However, these devices have not been
commonly accepted by the control and automation industry. One of the important
factors is the difficulty of understanding the FPGA design flow and its design specification
standard. Currently, there is no proven interface available in the market to
bridge the IEC control specification and the FPGA design specification. The authors
of this paper have considered developing such an interface to open up
the benefits of VLSI design to the control and automation domain. The issues and
considerations for developing such an interface are discussed in this paper. As
this is multidisciplinary work, and not much work has been done in this area,
possible research opportunities are discussed along with validation platform
considerations.
41.Reducing
SRAM Power Using Fine-Grained Wordline Pulsewidth Control
Abstract—
Embedded SRAM dominates modern SoCs, and there is a
strong demand for SRAM with lower power consumption while achieving high
performance and high density. However, the large increase of process variations
in advanced CMOS technologies is considered one of the biggest challenges for
SRAM designers. In the presence of large process variations, SRAMs are expected
to consume larger power to ensure correct read operations and meet yield
targets. In this paper, we propose a new architecture that significantly
reduces the array switching power for SRAM. The proposed architecture combines
built-in self-test and digitally controlled delay elements to reduce the
wordline pulsewidth for memories while ensuring correct read operations, hence
reducing the switching power. Monte Carlo simulations using a 1-Mb SRAM macro
in an industrial 45-nm technology are used to verify the power saving for the
proposed architecture. For a 48-Mb memory density, a 27% reduction in array
switching power can be achieved for a read access yield target of 95%. In
addition, the proposed system can provide larger power saving as process
variations increase, which makes it an attractive solution for 45-nm-and-below technologies.