A low power memoryless ROM design architecture for a direct digital frequency synthesizer

Salah ALKURWY¹,*, Sawal ALI², Shabiul ISLAM³, Faizul IDROS⁴

¹Department of Electronics Engineering, College of Engineering, Diyala University, Diyala, Iraq
²Department of Electrical, Electronics and System Engineering, Faculty of Engineering, National University of Malaysia, Bangi, Malaysia
³Institute of Microengineering and Nanoelectronics (IMEN), National University of Malaysia, Bangi, Malaysia
⁴Faculty of Engineering, MARA University of Technology, Shah Alam, Malaysia

Abstract: This paper presents a novel, memoryless, read-only memory (ROM) design architecture for a direct digital frequency synthesizer (DDFS). A pipelining technique is used to increase the phase accumulator (PA) throughput. However, this technique increases the number of registers as the number of pipeline stages increases. The shifted clocking technique is used to reduce the number of pipelined PA registers. The wave symmetry technique is applied so that only the first quarter ($0: \pi/2$) of the sine wave is stored. The ROM is partitioned into three four-bit sub-ROMs based on the angular decomposition technique and trigonometric identities. A novel memoryless ROM design technique that replaces the conventional ROM is proposed and implemented in a 24-bit DDFS system. Replacing the conventional 12-bit ROM with the memoryless sub-ROM circuits reduces power consumption and area. As a result, compared to the conventional ROM circuit, the area and dynamic power are reduced by 15% and 14.8%, respectively.

Key words: Phase accumulator, carry look-ahead adder, direct digital frequency synthesizer, read-only memory

1. Introduction
The sinusoidal waveform generated by a direct digital frequency synthesizer (DDFS) has many advantages compared to the analogue phase-locked loop. These advantages include high-frequency resolution, high-speed frequency channels, and high spectral purity. The first direct digital system, designed by Tierney [1], consisted of a phase accumulator (PA) that generated ($0 - 2\pi$) digital phase values. A read-only memory (ROM) or phase-to-amplitude converter (PAC) is used to provide the sinusoidal amplitude waveform. Then a digital-to-analogue converter (DAC) with a low-pass filter is used to generate the analogue sinusoidal signal.
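
The signal chain described above (PA followed by a phase-to-amplitude converter) can be illustrated with a small behavioural model. This is only a sketch under stated assumptions: the function name `ddfs_samples`, the signed amplitude scaling, and the use of `math.sin` in place of a ROM are illustrative choices, not the architecture proposed in this paper.

```python
import math

def ddfs_samples(fcw, n_bits, count, amp_bits=12):
    """Behavioural DDFS model: an n_bits phase accumulator drives a sine PAC."""
    modulus = 1 << n_bits                # the accumulator wraps at 2^N
    scale = (1 << (amp_bits - 1)) - 1    # assumed signed full-scale amplitude
    phase = 0
    samples = []
    for _ in range(count):
        # Phase-to-amplitude conversion (an ideal sine stands in for the ROM)
        samples.append(round(scale * math.sin(2 * math.pi * phase / modulus)))
        # Phase accumulation: add the frequency control word on every clock
        phase = (phase + fcw) % modulus
    return samples

# With an 8-bit accumulator and FCW = 64, one period lasts 256 / 64 = 4 clocks
samples = ddfs_samples(fcw=64, n_bits=8, count=8)
```

A larger FCW advances the phase faster and therefore yields a higher output frequency, while a wider accumulator gives a finer frequency step.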

Several techniques have been used to improve PA performance. Achieving high resolution and high throughput in a DDFS system requires a high-speed PA. The PA consists of D flip-flop (DFF) registers and adders, so using fast adders together with pipelining techniques, as in [2,3], provides the required speed. Several works have proposed pipelined designs to improve the speed and reduce the complexity of the PA [4]. However, this technique has the disadvantage of increasing the number of registers as the number of pipeline stages increases. Therefore, the shifted clock technique was proposed in [5] to reduce the number of repetitive registers while preserving high-speed operation.

*Correspondence: salahalkurwy@engineering.uodiyala.edu.iq

ROM implementation uses an angular decomposition technique, based on the segmentation of the quarter phase angle \((0: \pi/2)\) into small blocks, namely coarse and fine ROMs. Trigonometric identities and approximations are used to reduce the data stored in these blocks. A \((0: \pi/2)\) phase angle is obtained by summing the coarse and fine phase angles. Two segmented angles, coarse (A) and fine (B), based on trigonometric identities with simple multiplication, were used by Sunderland [6] and Hutchinson [7] to provide the quarter sine amplitude waveform.

In this research, we propose a DDFS system with a PA based on a carry look-ahead adder (CLA). The shifted clocking technique is used in the PA design to reduce the number of registers. Memoryless logic-gate blocks, designed on the principle of the seven-segment display, are used instead of conventional sub-ROM blocks.

2. Phase accumulator architecture

A pipelining technique is used to increase the throughput of the output frequency. Preskewing D flip-flop (DFF) registers synchronize the frequency control word (FCW) input with the carry input in each pipeline stage. However, the number of registers grows as the number of pipeline stages increases, as shown in Figure 1, which leads to higher power consumption. Therefore, the shifted clocking technique is used to reduce the number of registers while preserving high-speed operation.

The shifted clocking technique uses DFFs to connect each row of pipeline stages. These DFFs are clocked by the pipelined pulses with one clock cycle (Figure 2) and control the FCW input registers in the stages.

Figure 3 shows the new architecture of the 24-bit PA with the CLA and the shifted clock technique. Considering that \(N\) is the PA input width in bits and that the PA is partitioned into \(L\) stages with \(R\) flip-flop registers in each stage (Figure 1), the number of preskewing DFF registers \(R\) can be expressed as follows:

\[ R = \left\lceil \frac{N(L + 1)}{2} \right\rceil \] (1)

Applying the shifted clocking method to the proposed design, the number of preskewing registers \(R\) can be expressed as follows:

\[ R = N + L \] (2)

As a result, the number of preskewing registers R decreases by 43.7\% (Figure 3).
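
As a quick numerical check of Eqs. (1) and (2), the register counts can be evaluated directly. The sketch below assumes \(N = 24\) and \(L = 3\) (three eight-bit stages, which is inferred from the eight-bit CLA and is not stated explicitly in the text); the function names are illustrative.

```python
import math

def preskew_registers_conventional(n_bits, stages):
    # Eq. (1): preskewing registers of the conventional pipelined PA
    return math.ceil(n_bits * (stages + 1) / 2)

def preskew_registers_shifted_clock(n_bits, stages):
    # Eq. (2): preskewing registers with the shifted clocking technique
    return n_bits + stages

N, L = 24, 3  # assumed values: 24-bit PA split into three stages
conventional = preskew_registers_conventional(N, L)
shifted = preskew_registers_shifted_clock(N, L)
reduction_percent = 100 * (conventional - shifted) / conventional
print(conventional, shifted, reduction_percent)
```

Under these assumptions the counts are 48 and 27 registers, a reduction of 43.75%, which agrees with the percentage quoted above.
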

2.1. The proposed carry look-ahead adder

The adder is the key element of the PA; a fast adder therefore improves the performance of the accumulator. In this PA design, the basic CLA concept is as follows: the carry-generate function produces a carry-out from two input bits when both inputs are equal to 1, regardless of the input carry. The carry-propagate function is associated with the propagation of the carry input to the carry-out [8]. Therefore, the carry-out of the CLA used in each pipeline stage can be quickly determined as 0 or 1 at each stage, and the result is obtained more rapidly. The carry-out function of stage N is obtained from the following equation:

\[ C_{N+1} = g_N + p_N C_N, \] (3)

where \( g_N = x_N y_N \) and \( p_N = x_N \oplus y_N \) are the carry-generate and carry-propagate functions, respectively, and \( x_N \) and \( y_N \) are the input bits.

The conventional eight-bit CLA logic circuit requires 80 logic gates, whereas the proposed adder needs only 47 logic gates. The critical path of the proposed adder is 7 gate delays, as shown in Figure 4.

\[ C_{out} = g_7 + p_7(g_6 + p_6(g_5 + p_5 g_4 + p_5 p_4(g_3 + p_3(g_2 + p_2(g_1 + p_1 g_0 + p_1 p_0 C_{in}))))) \] (4)

Figure 4. Block circuit diagram of eight-bit carry look-ahead adder.
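
The carry recurrence and the factored eight-bit carry-out expression can be cross-checked in software. The following is a minimal sketch, assuming the propagate signal is the XOR of the two input bits; the function names are illustrative and the code models only the logic, not the gate-level circuit of Figure 4.

```python
def carry_out_recurrence(x, y, c_in, width=8):
    """Apply C_{n+1} = g_n + p_n C_n bit by bit, from bit 0 upwards."""
    carry = c_in
    for n in range(width):
        x_n = (x >> n) & 1
        y_n = (y >> n) & 1
        g_n = x_n & y_n          # carry generate
        p_n = x_n ^ y_n          # carry propagate (assumed XOR form)
        carry = g_n | (p_n & carry)
    return carry

def carry_out_factored(x, y, c_in):
    """Evaluate the factored eight-bit carry-out expression directly."""
    g = [((x >> n) & 1) & ((y >> n) & 1) for n in range(8)]
    p = [((x >> n) & 1) ^ ((y >> n) & 1) for n in range(8)]
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c_in)
    c4 = g[3] | (p[3] & (g[2] | (p[2] & c2)))
    c6 = g[5] | (p[5] & g[4]) | (p[5] & p[4] & c4)
    return g[7] | (p[7] & (g[6] | (p[6] & c6)))

# Example: 200 + 100 overflows eight bits, so the carry-out is 1
example_carry = carry_out_factored(200, 100, 0)
```

Both functions should agree with the ninth bit of the integer sum `x + y + c_in` for every eight-bit operand pair.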

3. ROM look-up table design

The ROM LUT stores the amplitude values that map the phase output to the sine wave amplitude. To achieve a high-accuracy sine waveform for the DDFS, a large ROM is required. Reducing the ROM area while maintaining high performance is always the designer's goal. A simple technique for reducing ROM size is the quarter-wave symmetry technique, in which one quarter (0: \( \pi / 2 \)) of the sine waveform is stored in the ROM, and the two most significant bits (MSBs) of the PA output are used to reconstruct the full sine wave. The phase output is used in the first and third quadrants, whereas the inverted values of the phase output are needed for the second and fourth quadrants. These inverted values are obtained using the two's complement method when the phase is between ($\pi$: $2\pi$). To meet the design goal of saving power and area, a one-half least significant bit (LSB) offset is added to the stored memory addresses of the sub-ROMs. With this offset, the two's complement full adder hardware is removed from the proposed design. The angular decomposition technique, based on trigonometric identities, is used in the proposed design to reduce ROM size. The ROM is partitioned into three four-bit sub-ROMs, namely A, B, and C, such that $A < (\pi/2)$, $B < (\pi/2) (1/2^A)$, and $C < (\pi/2) (1/2^{A+B})$.
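
The quarter-wave symmetry idea can be sketched in a few lines. The example below is an illustration under assumptions, not the proposed circuit: it uses a hypothetical 6-bit quarter table (`ADDR_BITS`), floating-point amplitudes, and a half-LSB phase offset so that mirroring the address in the second and fourth quadrants is a plain bitwise inversion.

```python
import math

ADDR_BITS = 6                      # hypothetical quarter-table address width
ADDR_MASK = (1 << ADDR_BITS) - 1

# Quarter-wave table sampled at (k + 1/2) LSB, covering (0 : pi/2)
QUARTER = [math.sin(math.pi / 2 * (k + 0.5) / 2**ADDR_BITS)
           for k in range(2**ADDR_BITS)]

def sine_from_quarter(phase):
    """Rebuild a full-period sine sample from the quarter table.

    phase is an unsigned (ADDR_BITS + 2)-bit word spanning (0 : 2*pi).
    """
    negate = (phase >> (ADDR_BITS + 1)) & 1   # MSB: second half of the period
    mirror = (phase >> ADDR_BITS) & 1         # next bit: quadrants 2 and 4
    addr = phase & ADDR_MASK
    if mirror:
        addr ^= ADDR_MASK                     # inverted address, no adder needed
    value = QUARTER[addr]
    return -value if negate else value

def reference_sine(phase):
    """Directly computed sine at the same half-LSB-offset phase."""
    return math.sin(2 * math.pi * (phase + 0.5) / 2**(ADDR_BITS + 2))
```

Because of the half-LSB offset, the mirrored address `2^ADDR_BITS - 1 - addr` lands exactly on a stored sample, which is why an inversion suffices in this sketch.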

The sine wave function, based on the trigonometric relation and the relative sizes of $A$, $B$, and $C$, may be approximated as follows [6]:

$$
\sin(A + B + C) \approx \sin(A + B) + \cos A \sin C \approx \sin A + \cos A \sin B + \cos A \sin C
$$

(5)
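
The size of the approximation error in Eq. (5) can be explored numerically. The sketch below assumes three four-bit indices mapped to angles of weight \(1/2^4\), \(1/2^8\), and \(1/2^{12}\) of a quarter period, and it uses floating-point arithmetic, so it ignores any quantization of the stored values; the function names are illustrative.

```python
import math

QUARTER_PERIOD = math.pi / 2

def segment_angles(a, b, c):
    """Map three four-bit indices to the coarse, fine, and finest angles."""
    return (QUARTER_PERIOD * a / 2**4,
            QUARTER_PERIOD * b / 2**8,
            QUARTER_PERIOD * c / 2**12)

def approx_sine(a, b, c):
    # Right-hand side of Eq. (5)
    A, B, C = segment_angles(a, b, c)
    return math.sin(A) + math.cos(A) * math.sin(B) + math.cos(A) * math.sin(C)

def exact_sine(a, b, c):
    return math.sin(sum(segment_angles(a, b, c)))

# Largest absolute deviation over all 2^12 quarter-wave phase values
max_error = max(abs(approx_sine(a, b, c) - exact_sine(a, b, c))
                for a in range(16) for b in range(16) for c in range(16))
```

The deviation grows with the coarse angle, since the neglected terms are weighted by \(\sin A\).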

$\cos A$ values can be obtained by connecting the complement of $\sin A$ values and logic high (Vcc) to the XOR logic gates input [9]. In this way, only one addressing sub-ROM is needed for $\sin A$ and $\cos A$ values.

The proposed ROM LUT design with an angular decomposition technique and three four-bit ROMs requires only 368 DFF storage bits ($176 + 128 + 64$ DFF bits for ROMs A, B, and C, respectively), giving a compression ratio of 534.2:1 [9]. Two adders, two multipliers, and XOR gates are used as additional hardware to complete the ROM circuit design.
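
The storage figure and the compression ratio can be reproduced with simple arithmetic. This sketch rests on two assumptions that are not spelled out in the text: output word widths of 11, 8, and 4 bits for sub-ROMs A, B, and C, and an uncompressed baseline of a 14-bit phase address with 12-bit amplitude words.

```python
ENTRIES = 2**4                           # each sub-ROM has a four-bit address
WORD_BITS = {"A": 11, "B": 8, "C": 4}    # assumed output word widths

sub_rom_bits = {name: ENTRIES * width for name, width in WORD_BITS.items()}
total_bits = sum(sub_rom_bits.values())

# Assumed baseline: full-period LUT, 14-bit address, 12-bit amplitude
uncompressed_bits = 2**14 * 12
compression_ratio = uncompressed_bits / total_bits
print(sub_rom_bits, total_bits, compression_ratio)
```

With these assumptions the total is 368 bits and the ratio is roughly 534:1.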

The proposed compressed ROM LUT consists of three segments of four-bit sub-ROM blocks (A, B, and C). The required stored values in these sub-ROMs are calculated as follows:

$$
\sin A = (2^{A+B+C} - 1) \sin \left( \frac{\pi}{2} \left[ \frac{0 : (2^A - 1)}{2^A} \right] \right) \Rightarrow (2^{12} - 1) \sin \left( \frac{\pi}{2} \left[ \frac{0 : 15}{16} \right] \right) 
$$

(6)

$$
\sin B = (2^{A+B+C} - 1) \sin \left( \frac{\pi}{2} \left[ \frac{0 : (2^B - 1)}{2^B} \right] \right) \left( \frac{1}{2^A} \right) 
$$

$$
\Rightarrow (2^{12} - 1) \sin \left( \frac{\pi}{2} \left[ \frac{0 : 15}{16} \right] \right) \left( \frac{1}{16} \right) 
$$

(7)

$$
\sin C = (2^{A+B+C} - 1) \sin \left( \frac{\pi}{2} \left[ \frac{0 : (2^C - 1)}{2^C} \right] \right) \left( \frac{1}{2^{A+B}} \right) 
$$

$$
\Rightarrow (2^{12} - 1) \sin \left( \frac{\pi}{2} \left[ \frac{0 : 15}{16} \right] \right) \left( \frac{1}{256} \right) 
$$

(8)
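
The stored values defined by Eqs. (6)–(8) can be tabulated with a short script. This is a sketch only: truncation to an integer with `math.floor` is an assumption (the text does not state the rounding rule), and the helper name `sub_rom_values` is illustrative.

```python
import math

FULL_SCALE = 2**12 - 1   # (2^(A+B+C) - 1) with A = B = C = 4 bits

def sub_rom_values(divisor):
    """Sixteen stored values: FULL_SCALE * sin(pi/2 * k/16) / divisor."""
    return [math.floor(FULL_SCALE * math.sin(math.pi / 2 * k / 16) / divisor)
            for k in range(16)]

SIN_A = sub_rom_values(1)     # Eq. (6)
SIN_B = sub_rom_values(16)    # Eq. (7): scaled by 1 / 2^A
SIN_C = sub_rom_values(256)   # Eq. (8): scaled by 1 / 2^(A+B)

for name, table in (("sin A", SIN_A), ("sin B", SIN_B), ("sin C", SIN_C)):
    print(name, table)
```

Under the truncation assumption, the largest sin B and sin C entries fit in eight and four bits, respectively.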

The calculated values of $\sin(A), \sin(B),$ and $\sin(C)$ show that each $\sin C$ value is limited between two $\sin B$ series values. Similarly, $\sin B$ values are limited between two $\sin A$ series.

3.1. Proposed memoryless circuit design

The basic seven-segment display operates as follows: a BCD counter provides the numbers (0–9) and is used to select the seven-column line output. The display contains seven columns (a, b, c, d, e, f, and g) that represent all the segments. For instance, the decimal number one can be represented by two segments (b and c). Therefore, segments b and c will be logic 1, and the rest will be logic 0 (0110000).

The suggested memoryless ROM circuit is designed on the same principle as the seven-segment display. The generated values of the \( (0 : \pi/2) \) sine wave (based on Eqs. (6)–(8)) are used as the counter of the combinational logic circuit and are listed as binary digits in rows \( (0 : 2^4 - 1) \). The required ROM output bits and the listed columns in lines \( (X_n, X_{n-1}, \ldots, X_0) \) are used to create the logic circuits, according to the rows of binary digits, as shown in Figure 5.

<table>
<thead>
<tr>
<th>Xn</th>
<th>Xn-1</th>
<th>X3</th>
<th>X2</th>
<th>X1</th>
<th>X0</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>2</td>
<td></td>
<td></td>
<td>1</td>
<td></td>
<td></td>
</tr>
<tr>
<td>3</td>
<td></td>
<td></td>
<td>1</td>
<td></td>
<td></td>
</tr>
<tr>
<td>4</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>0</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>1</td>
</tr>
</tbody>
</table>

\[ A_{10} = X_0 + X_1X_2 \] (9)

\[ A_9 = X_1\overline{X}_2 + X_0X_3 + X_2(X_0 + \overline{X}_1X_3), \] (10)

where \( X_3 : X_0 \) are the required \( \sin A \) output bits along with the listed columns in lines.

A Karnaugh map is used to simplify the created logic circuits. The number of created logic circuits must equal the number of listed columns. These circuits provide the desired digital value of the suggested \( (0: \pi/2) \) sine wave with each clock pulse. Therefore, no memory storage elements (registers and multiplexers) are needed to store the ROM values.
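
The idea of replacing a stored table with one logic function per output bit can be sketched in software. The example below is an illustration under assumptions: it builds an unsimplified sum-of-minterms expression for each output bit of a 16-entry table (no Karnaugh-map simplification is performed), and the table contents are computed from Eq. (6) with truncation, which is an assumed rounding rule.

```python
import math

ADDR_BITS = 4
OUT_BITS = 12

# Assumed example table: sin A values from Eq. (6), truncated to integers
TABLE = [math.floor((2**12 - 1) * math.sin(math.pi / 2 * k / 16))
         for k in range(2**ADDR_BITS)]

def minterms_for_bit(table, bit):
    """Addresses at which the given output bit equals 1."""
    return [addr for addr, value in enumerate(table) if (value >> bit) & 1]

def eval_minterm(addr_bits, minterm):
    """AND of four literals: X_i where the minterm bit is 1, NOT X_i otherwise."""
    result = 1
    for i in range(ADDR_BITS):
        literal = addr_bits[i] if (minterm >> i) & 1 else 1 - addr_bits[i]
        result &= literal
    return result

def memoryless_lookup(x, minterm_lists):
    """Compute the table output from logic alone (OR of ANDs per output bit)."""
    addr_bits = [(x >> i) & 1 for i in range(ADDR_BITS)]
    value = 0
    for bit, minterms in enumerate(minterm_lists):
        out = 0
        for m in minterms:
            out |= eval_minterm(addr_bits, m)
        value |= out << bit
    return value

MINTERMS = [minterms_for_bit(TABLE, b) for b in range(OUT_BITS)]
```

After the minterm lists are derived, `memoryless_lookup` never reads `TABLE`, which mirrors how combinational logic can stand in for stored words; a real design would first minimize each function.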

The same principle is used for the remaining \( \sin A \) equations. The \( (B_7 : B_0) \) \( \sin B \) and the \( (C_3 : C_0) \) \( \sin C \) equations are generated similarly.

The created memoryless ROM blocks of \( \sin (A, B, \text{and } C) \) replace the three conventional sub-ROM blocks and are implemented within the proposed compressed ROM design, as shown in Figure 6.

4. Hardware implementation of high-speed DDFS

The proposed DDFS design system (consisting of a 24-bit PA and the memoryless circuit) was coded in Verilog hardware description language (HDL) and successfully simulated in ALTERA Quartus II software as a system with an operation speed of 144.22 MHz. Then it was programmed on a Cyclone III FPGA kit board. The programmed FPGA kit board was connected to a high-speed DAC circuit through a high-speed mezzanine card (HSMC) that provides dual 14-bit D/A conversion at 250 mega-samples per second (MSPS), and the output was verified with waveform and spectrum analyzers. The measured frequency waveform and spectrum are consistent with the expected results.

Figure 6. Application of novel memoryless ROM blocks of sin A, sin B, and sin C in the proposed 12-bit compressed ROM.

The output frequency ($f_{out}$) is calculated as $f_{out} = \frac{FCW \times f_{clk}}{2^N}$, where $N = 24$ (for the 24-bit PA), $FCW = \mathrm{0EFFFF}$ (in hexadecimal), and $f_{clk} = 125$ MHz (the clock frequency of the Cyclone III FPGA kit board). The calculated output frequency is $f_{out} = \frac{\mathrm{0EFFFF}_{16} \times 125 \times 10^6}{2^{24}} = 7.324$ MHz. The measured output frequency of the sine wave signal, shown in Figure 7, is 7.35 MHz, which closely matches the calculated value.
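
The output frequency calculation can be reproduced as follows. This is a minimal sketch; the function name is illustrative, and the relative deviation is computed against the 7.35 MHz reading quoted above.

```python
def ddfs_output_frequency(fcw, f_clk_hz, n_bits):
    """f_out = FCW * f_clk / 2^N for an N-bit phase accumulator."""
    return fcw * f_clk_hz / 2**n_bits

F_CLK_HZ = 125e6
N_BITS = 24
FCW = 0x0EFFFF

f_out_hz = ddfs_output_frequency(FCW, F_CLK_HZ, N_BITS)
frequency_step_hz = F_CLK_HZ / 2**N_BITS          # smallest tuning step (FCW = 1)
deviation_percent = 100 * (7.35e6 - f_out_hz) / f_out_hz
print(f_out_hz, frequency_step_hz, deviation_percent)
```

The computed value is about 7.324 MHz, and the measured 7.35 MHz differs from it by well under 1%.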

Figure 7. Measured image of the sine wave signal for high-speed DDFS.

The measured signal-to-noise ratio (SNR) of the DDFS output waveform is 68 dB, as shown in Figure 8, which is 4 dB less than the average value for a 12-bit ROM DDFS output (72 dB). This degradation stems from the angular decomposition technique and the wire connection between the FPGA board and the spectrum analyzer.

The novel approach of designing the sub-ROM blocks, based on memoryless ROM, was realized as an ASIC using Synopsys software. The comparison results show area reductions of 69%, 66.5%, and 51.5% for the sin \(A\), sin \(B\), and sin \(C\) blocks, respectively. Replacing the conventional sub-ROM blocks with the memoryless ROM blocks of \(\sin(A, B, C)\) reduces the number of slices and the area by up to 22% and 15%, respectively. In terms of power consumption, the proposed architecture reduces cell leakage power by 18% and dynamic power by 14.8%.

The comparison between the conventional and memoryless ROM blocks of \(\sin(A, B, C)\) in terms of number of cells, number of nets, and area is shown in Table 1.

Table 1. Conventional versus memoryless ROM comparison.

<table>
<thead>
<tr>
<th>Block</th>
<th>ROM type</th>
<th>Cells</th>
<th>Nets</th>
<th>Area (µm²)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Sine A</td>
<td>Conventional ROM</td>
<td>77</td>
<td>83</td>
<td></td>
</tr>
<tr>
<td></td>
<td>Memoryless ROM</td>
<td>40</td>
<td>44</td>
<td></td>
</tr>
<tr>
<td></td>
<td>Benefit reduction</td>
<td>48%</td>
<td>47%</td>
<td>69%</td>
</tr>
<tr>
<td>Sine B</td>
<td>Conventional ROM</td>
<td>49</td>
<td>55</td>
<td></td>
</tr>
<tr>
<td></td>
<td>Memoryless ROM</td>
<td>31</td>
<td>35</td>
<td></td>
</tr>
<tr>
<td></td>
<td>Benefit reduction</td>
<td>36.7%</td>
<td>36.3%</td>
<td>66.5%</td>
</tr>
<tr>
<td>Sine C</td>
<td>Conventional ROM</td>
<td>27</td>
<td>33</td>
<td></td>
</tr>
<tr>
<td></td>
<td>Memoryless ROM</td>
<td>21</td>
<td>25</td>
<td></td>
</tr>
<tr>
<td></td>
<td>Benefit reduction</td>
<td>22%</td>
<td>24%</td>
<td>51.1%</td>
</tr>
</tbody>
</table>

Table 1 shows that the novel memoryless ROM has a smaller area than the conventional sub-ROM by 69%, 66.5%, and 51.5%, and a lower number of cells by 48%, 36.7%, and 22%, for the sin (A, B, and C) blocks, respectively. The power comparison results show that the dynamic and cell leakage power of the memoryless ROM are lower than those of the conventional sub-ROM by 14.8% and 18%, respectively, as shown in Table 2.
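
The area and power percentages can be recomputed from the absolute values reported in Table 2. The sketch below covers only the area, cell leakage power, and dynamic power columns; the dictionary keys and the function name are illustrative.

```python
# (conventional ROM, memoryless ROM) pairs taken from Table 2
TABLE_2 = {
    "area_um2": (19589.16, 16658.61),
    "cell_leakage_nW": (917.9225, 753.3794),
    "dynamic_uW": (176.0596, 150.0412),
}

def reduction_percent(conventional, memoryless):
    """Relative saving of the memoryless ROM with respect to the conventional ROM."""
    return 100 * (conventional - memoryless) / conventional

reductions = {name: reduction_percent(*pair) for name, pair in TABLE_2.items()}
for name, value in reductions.items():
    print(f"{name}: {value:.2f}%")
```

The recomputed savings round to the 15%, 18%, and 14.8% figures given for area, cell leakage power, and dynamic power.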

The comparison between ROM size of the present and previous approaches of DDFS works is shown in Table 3. The comparison was based on the amplitude phase bit, ROM technique, ROM size, spurious-free dynamic range (SFDR), and additional hardware components. Quarter-wave symmetry and compressed ROM techniques were applied to all the listed ROM designs in the table.

Table 2. Conventional versus memoryless ROM comparison for the 12-bit compressed ROM.

<table>
<thead>
<tr>
<th>12-bit compressed ROM</th>
<th># Cells</th>
<th>Area ($\mu m^2$)</th>
<th>Cell leakage power (nW)</th>
<th>Dynamic Power ($\mu W$)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Conventional ROM</td>
<td>610</td>
<td>19,589.16</td>
<td>917.9225</td>
<td>176.0596</td>
</tr>
<tr>
<td>Memoryless ROM</td>
<td>520</td>
<td>16,658.61</td>
<td>753.3794</td>
<td>150.0412</td>
</tr>
<tr>
<td>Benefit reduction</td>
<td>22%</td>
<td>15%</td>
<td>18%</td>
<td>14.8%</td>
</tr>
</tbody>
</table>

Table 3. Comparison between present ROM size and previous approaches of DDFS works.

<table>
<thead>
<tr>
<th>Ref.</th>
<th>PA (bit)</th>
<th>Ampl. phase (bit)</th>
<th>ROM technique</th>
<th>ROM size (bit)</th>
<th>SFDR</th>
<th>Additional hardware</th>
</tr>
</thead>
<tbody>
<tr>
<td>[6]</td>
<td>20</td>
<td>12</td>
<td>Angular decomposition</td>
<td>3328</td>
<td>72</td>
<td>2 adders + 2 multipliers</td>
</tr>
<tr>
<td>[10]</td>
<td>28</td>
<td>12</td>
<td>Angular decomposition</td>
<td>832</td>
<td>84</td>
<td>2 adders/subtractors + 2 multipliers</td>
</tr>
<tr>
<td>[11]</td>
<td>32</td>
<td>12</td>
<td>Quad line approximation (QLA)</td>
<td>2176</td>
<td>NA</td>
<td>5 adders + 1 MUX + Complementer</td>
</tr>
<tr>
<td>[12]</td>
<td>24</td>
<td>12</td>
<td>Piecewise-polynomial approximation</td>
<td>480</td>
<td>83.6</td>
<td>Adders tree + multipliers</td>
</tr>
<tr>
<td>[14]</td>
<td>32</td>
<td>12</td>
<td>ROM-less</td>
<td>NA</td>
<td>42</td>
<td>DAC (LTC 2624)</td>
</tr>
<tr>
<td>[15]</td>
<td>32</td>
<td>18</td>
<td>Parabolic polynomial interpolation</td>
<td>NA</td>
<td>68.6</td>
<td>2 multipliers + 2 adders + 4 Mux + shift selector</td>
</tr>
<tr>
<td>[16]</td>
<td>32</td>
<td>12</td>
<td>Angular decomposition</td>
<td>368</td>
<td>68</td>
<td>2 multipliers + 2 adders + 3 DFF registers</td>
</tr>
<tr>
<td>[17]</td>
<td>24</td>
<td>10</td>
<td>Complementary dual-phase latch</td>
<td>NA</td>
<td>45.3</td>
<td>2 ROM + 3 level AND gate in each stage</td>
</tr>
<tr>
<td>[18]</td>
<td>32</td>
<td>9</td>
<td>Piecewise linear interpolation</td>
<td>NA</td>
<td>46</td>
<td>8 × 6 and 8 × 3 ROMs + thermo decoder</td>
</tr>
<tr>
<td>This work</td>
<td>24</td>
<td>12</td>
<td>Angular decomposition</td>
<td>0</td>
<td>68</td>
<td>2 multipliers + 2 adders + 3 DFF Reg. + 161 logic gates</td>
</tr>
</tbody>
</table>

*In [16] and the present work, the measured DDFS output waveform is in signal-to-noise ratio (SNR).

The results show that the proposed memoryless ROM has only 161 logic gates (76, 57, and 28 logic gates for sin($A, B, C$), respectively). The proposed work uses a 24-bit pipelined PA to achieve high frequency resolution. Quarter-wave symmetry and angular decomposition techniques are used to reduce ROM size.

The novel memoryless ROM is used to replace the conventional sub-ROM blocks for the 12-bit PAC. This technique removes all the registers and multiplexers required by the sub-ROM circuits and replaces them with a few AND, OR, and XOR logic gates.

The average sub-ROM size is reduced by 65.15%, and the complete PAC area is reduced by 15% for the compressed 12-bit PAC. Moreover, the dynamic and cell leakage power are reduced by 14.8% and 18%, respectively.

5. Conclusion

This paper presented a memoryless ROM design architecture for DDFS. Replacing the conventional sub-ROM blocks of the developed 12-bit compressed ROM with the sin ($A, B, C$) memoryless ROM saves 22% of cells, 15% of area, 14.8% of dynamic power, and 18% of cell leakage power. The presented 24-bit DDFS design system was coded in Verilog HDL, programmed on a Cyclone III FPGA kit board, and connected to a DAC. The complete DDFS system was designed and verified with waveform and spectrum analyzers. The improved performance of the PA and the reduced ROM size give the present DDFS design more flexibility for application in wireless communication systems.

References