## Design of Energy/Quality Scalable Hardware By Runtime Voltage Scaling and Back Biasing

## Daniele Jahier Pagliari

EDA Group<br>Politecnico di Torino<br>Torino, Italy

2nd IWES
September 8th, 2017, Rome, Italy

## The EDA Group

- Electronic Design Automation
- 7 Faculty members
  - Enrico Macii, Massimo Poncino, Alberto Macii, Andrea Acquaviva
  - Elisa Ficarra, Andrea Calimera, Santa Di Cataldo, Sara Vinco
- 4 post-doc researchers
- 10+ Ph.D. students and Research Assistants
- Three main areas of research:
  - EDA (energy efficiency, EES, etc.)
  - Technologies for Smart Cities (Buildings, Districts, etc.)
  - Bioinformatics
- Strong record of EU-funded projects
  - 30+ in the last 10 years.


## Outline

- Introduction
- Background and Motivation
- Dynamic $\mathrm{V}_{\mathrm{DD}} / \mathrm{V}_{\mathrm{BB}} /$ Accuracy Tuning
- Experimental Results
- Conclusions and Future Work


## Introduction

- IoT device trends:
  - Performance demands
  - Energy budget
- Many emerging applications are error tolerant (or error resilient):
  - Recognition, Mining and Synthesis (RMS) domains

Approximate Paradigm: trade off energy consumption against output quality by leveraging the application's error tolerance.

- Two main approaches:
  - Design-time Approximations
  - Quality-Configurable Systems (QCS)


## Background - Functional Units

1. Approximate circuits:
   - Mostly adders and multipliers
   - Kyaw, Goh and Yeo, EDSSC'10; Huang, Lach and Robins, DAC'12; Farshchi, Saeed and Fakhraie, CADS'13; Jiang, Han and Lombardi, GLSVLSI'15; Bhardwaj, Mane and Henkel, ISQED'15; etc.
2. Approximate synthesis:
   - Generalization of the previous techniques to any netlist
   - Shin and Gupta, ATS'08; Venkataramani et al., DAC'12; Miao, Gerstlauer and Orshansky, ICCAD'13; Jahier Pagliari et al., ICCD'15; etc.
3. Quality-configurable circuit architectures:
   - Arithmetic units
     - De la Guia Solaz, Han and Conway, IEEE TCAS'11; Kahng and Kang, DAC'12; Ye et al., ICCAD'13; Liu, Han and Lombardi, DATE'14; etc.
   - Voltage-scalable meta-functions
     - Mohapatra, Chippa, Raghunathan and Roy, DATE'11
4. Dynamic Voltage and Accuracy Scaling (DVAS):
   - Uses technological knobs only (no design modifications)
   - Moons and Verhelst, ISLPED'15; Moons et al., ISSCC'17; etc.


## Background - DVAS

- Input dynamic range (bit-width) $\downarrow$
- Switching activity $\downarrow$
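The link between bit-width and switching activity can be illustrated with a minimal sketch. It assumes that accuracy is reduced by zeroing the least significant bits of each 16-bit operand; the helper names `truncate_lsbs` and `count_toggles` and the sample operand stream are hypothetical and not taken from the slides.

```python
def truncate_lsbs(value: int, active_bits: int, width: int = 16) -> int:
    """Keep the `active_bits` most significant bits of a `width`-bit word, zeroing the rest."""
    if not 0 <= active_bits <= width:
        raise ValueError("active_bits must be in [0, width]")
    mask = ((1 << active_bits) - 1) << (width - active_bits)
    return value & mask


def count_toggles(samples, active_bits, width=16):
    """Count bit flips between consecutive operands after truncation."""
    truncated = [truncate_lsbs(s, active_bits, width) for s in samples]
    return sum(bin(a ^ b).count("1") for a, b in zip(truncated, truncated[1:]))


# Hypothetical operand stream (illustration only).
samples = [0x1234, 0xEDCB, 0x1234, 0xEDCB]
full_toggles = count_toggles(samples, 16)     # all 16 bits active
reduced_toggles = count_toggles(samples, 8)   # only the 8 MSBs active
```

With fewer active input bits, fewer input nodes can toggle, which is the activity reduction that DVAS exploits before any supply voltage is changed.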

## Motivation

- Main limitation of DVAS: the "Wall-of-Slack" phenomenon:
  - Synthesis optimizes long paths for timing and short ones for area and power
  - Most paths become "almost critical"
  - When $\mathrm{V}_{\mathrm{DD}}$ is scaled, the number of usable bits decreases rapidly
- Example:
  - Booth multiplier endpoint histogram.

Useful bit-width configurations require $\mathrm{V}_{\mathrm{DD}} \cong \mathrm{V}_{\mathrm{DD}, \mathrm{NOM}}$.
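The effect of the "Wall-of-Slack" on usable accuracy can be sketched numerically. The snippet below is an illustration under stated assumptions: all delays, the clock period and the delay scaling factor are hypothetical numbers, and `max_usable_bits` is a made-up helper rather than part of any tool.

```python
def max_usable_bits(delay_at_nominal, delay_scale, clock_period):
    """Largest bit-width whose longest active path still meets timing.

    delay_at_nominal: dict mapping bit-width -> longest active path delay [ns]
                      at nominal VDD (hypothetical values).
    delay_scale:      delay multiplier at the scaled VDD (> 1 when VDD is lowered).
    clock_period:     clock period [ns].
    """
    feasible = [bits for bits, delay in delay_at_nominal.items()
                if delay * delay_scale <= clock_period]
    return max(feasible, default=0)


CLOCK_NS = 0.80

# "Wall-of-Slack": even reduced bit-widths have paths close to the clock period.
wall = {4: 0.70, 8: 0.74, 12: 0.77, 16: 0.80}
# Idealized circuit: path delay shrinks markedly with bit-width.
spread = {4: 0.30, 8: 0.50, 12: 0.65, 16: 0.80}

SCALE = 1.10  # hypothetical 10% slowdown caused by a lower VDD
bits_wall = max_usable_bits(wall, SCALE, CLOCK_NS)
bits_spread = max_usable_bits(spread, SCALE, CLOCK_NS)
```

In this toy setting the same slowdown leaves far fewer usable bits for the `wall` profile than for the `spread` profile, which mirrors the rapid accuracy drop described in the bullets above.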

## Motivation (cont'd)

## Countering the "Wall-of-Slack":

- Solution 1: modify synthesis constraints.
  - Overhead in area and power at maximum accuracy.
- Solution 2: finer-grain power/delay tuning.
  - Key: in reduced-accuracy "modes", not all paths of the circuit require the same "speed".



## Fine-grain power/delay tuning

- Possible solution: multiple $\mathrm{V}_{\mathrm{DD}}$
  - Requires level shifters
  - Excessive power overheads for a single functional unit (FU)
- Our solution: combine DVAS with FDSOI's back bias
  - Fine-grain threshold voltage $\left(\mathrm{V}_{\mathrm{th}}\right)$ tuning in addition to $\mathrm{V}_{\mathrm{DD}}$ assignment


## Advantages:

- Fine-grain speed/power control
- $\mathrm{V}_{\mathrm{DD}}$ possibly shared with other FUs
- No level shifters; only well-insulation trenches are needed (area overhead only)


## Dynamic VDD/VBB/Accuracy Tuning

## Issue with $\mathrm{V}_{\mathrm{BB}}$ assignment:

- An independent $\mathrm{V}_{\mathrm{BB}}$ cannot be applied to each cell
- Partitioning into $\mathrm{V}_{\mathrm{BB}}$ domains is required

## Proposed partitioning: Regular Tiling

Figure: original placement vs. placement with $\mathrm{V}_{\mathrm{BB}}$ domains separated by insulation trenches.

## Pros:

- Regularity of design
- Easy to incorporate into the EDA flow
- Minimal displacement of cells

Minimal timing, area and power overheads at maximum accuracy.
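As a rough illustration of regular tiling, the sketch below assigns each placed cell to a rectangular tile of a `cols` x `rows` grid based on its coordinates. It is a simplified model under stated assumptions: the function `assign_domains`, the cell names and the die dimensions are hypothetical, and the real flow must also insert insulation trenches and legalize the placement.

```python
def assign_domains(cells, die_width, die_height, cols, rows):
    """Map each cell name to the (col, row) tile that contains its placement point.

    cells: dict mapping cell name -> (x, y) placement coordinates,
           with 0 <= x <= die_width and 0 <= y <= die_height.
    """
    tile_w = die_width / cols
    tile_h = die_height / rows
    domains = {}
    for name, (x, y) in cells.items():
        # Clamp so that cells lying exactly on the top/right die edge
        # fall into the last tile instead of a non-existent one.
        col = min(int(x // tile_w), cols - 1)
        row = min(int(y // tile_h), rows - 1)
        domains[name] = (col, row)
    return domains


# Hypothetical placement on a 50 x 50 die split into 2 x 2 domains.
cells = {"u1": (5.0, 5.0), "u2": (45.0, 10.0), "u3": (10.0, 45.0), "u4": (50.0, 50.0)}
domains = assign_domains(cells, die_width=50.0, die_height=50.0, cols=2, rows=2)
```

Because tile boundaries depend only on the grid and not on the netlist, cells keep their original positions apart from the small shifts needed to make room for the trenches, which is consistent with the "minimal displacement" advantage listed above.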

## Experimental Results

## Designs:

- Booth multiplier
- FFT Butterfly unit
- 30-tap FIR filter
- 16-bit fixed-point implementations

| Design | Area $\left[\mathrm{mm}^{2}\right]$ | Clock Freq. $[\mathrm{GHz}]$ | \# of $\mathrm{V}_{\mathrm{BB}}$ Domains |
| :---: | :---: | :---: | :---: |
| Booth | $2.59 \times 10^{-3}$ | 1.25 | $2 \times 2$ |
| Butterfly | $7.71 \times 10^{-3}$ | 1.00 | $3 \times 3$ |
| FIR | $9.10 \times 10^{-3}$ | 0.75 | $3 \times 3$ |

## Operating Conditions:

- $\mathrm{V}_{\mathrm{DD}} \in \{0.6 \mathrm{~V}, 0.7 \mathrm{~V}, \ldots, 1.0 \mathrm{~V}\}$
- Forward BB: $\mathrm{V}_{\mathrm{BB}}= \pm 1.1 \mathrm{~V}$ (N-Well/P-Well)


## Comparison with DVAS

## Booth Multiplier

- Plots: minimum-power configuration for each accuracy
  - Combining (global) $\mathrm{V}_{\mathrm{DD}}$ scaling and fine-grain back biasing
- Comparison:
  - DVAS with no back biasing (NoBB)
  - DVAS with forward back biasing (FBB) in the entire circuit
- 32.7\% saving w.r.t. DVAS @ 10-bit accuracy!


## Comparison with DVAS

## FIR Filter

- "Wall-of-Slack" clearly visible
- Maximum DVAS + FBB accuracy (without timing violations):
  - 15-bit @ 0.9 V
  - Only 4-bit @ 0.8 V!
- 39.9\% saving w.r.t. DVAS @ 10-bit accuracy!



## Comparison with DVAS

## FFT Butterfly

- Large number of $\mathrm{V}_{\mathrm{BB}}$ domains $(3 \times 3)$ compared to the relatively small circuit area
  - Power overheads are more significant
- Also, the "Wall-of-Slack" is less visible (the circuit is probably under-constrained)
- Still a 16.5\% saving w.r.t. DVAS @ 8-bit!


## Impact of $\mathrm{V}_{\mathrm{BB}}$ Domains

- Number of $\mathrm{V}_{\mathrm{BB}}$ domains vs power saving (Booth Mul.):

- Number of $\mathrm{V}_{\mathrm{BB}}$ domains vs overheads (Booth Mul.):



## Conclusions and Future Work

## Conclusions:

- Back bias is an effective knob for fine-grain delay/power tuning in quality-configurable functional units.
- Combined with global $\mathrm{V}_{\mathrm{DD}}$ scaling, this method overcomes the limitations of DVAS by countering the "Wall-of-Slack" phenomenon.
- First-ever application of back biasing to Quality-Configurable Systems (to our knowledge).


## Future Developments:

- Devise a method for runtime update of $\mathrm{V}_{\mathrm{BB}}$ domain configurations depending on operating conditions (PVT, aging, etc.)
- Investigate alternative partitioning techniques (irregular tiling).



## Implementation Flow

## 1. Implementation Phase:

- Partition the circuit into $\mathrm{V}_{\mathrm{BB}}$ domains using regular tiling.
- Incremental placement:
  - Insert well taps.
  - Fix possible constraint violations due to cell displacement.

## 2. Analysis Phase:

- Exhaustive exploration of all possible configurations of accuracy, $\mathrm{V}_{\mathrm{BB}}$, and $\mathrm{V}_{\mathrm{DD}}$
  - Static timing analysis (STA) to prune infeasible configurations (timing violations)
  - Power analysis on feasible configurations
- Complexity:
  - Many configurations (thousands), but fast analysis.
  - Feasible for fewer than 10-15 $\mathrm{V}_{\mathrm{BB}}$ domains
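The analysis phase can be sketched as a brute-force search. The code below is a minimal illustration, not the actual flow: it assumes each $\mathrm{V}_{\mathrm{BB}}$ domain is either forward-biased or unbiased, and the callbacks `meets_timing` and `power` are hypothetical stand-ins for STA and power analysis runs. The two `toy_*` models are invented purely so that the example executes.

```python
from itertools import product


def explore(accuracies, vdds, n_domains, meets_timing, power):
    """Return {accuracy: (min_power, vdd, bias)} over all timing-feasible configurations.

    bias is a tuple of booleans, one per VBB domain (True = forward-biased).
    """
    best = {}
    bias_patterns = list(product((False, True), repeat=n_domains))
    for acc, vdd, bias in product(accuracies, vdds, bias_patterns):
        if not meets_timing(acc, vdd, bias):   # prune, as STA would
            continue
        p = power(acc, vdd, bias)              # analyze feasible configurations only
        if acc not in best or p < best[acc][0]:
            best[acc] = (p, vdd, bias)
    return best


def n_configs(n_acc, n_vdd, n_domains):
    """Size of the search space under the binary-bias assumption."""
    return n_acc * n_vdd * 2 ** n_domains


# Toy models (hypothetical, integer arithmetic; VDD expressed in mV).
def toy_meets_timing(acc, vdd, bias):
    return vdd // 100 + sum(bias) >= acc // 2 + 2


def toy_power(acc, vdd, bias):
    return vdd * vdd * (10 + sum(bias)) * acc


best = explore(range(1, 17), (600, 700, 800, 900, 1000), 4, toy_meets_timing, toy_power)
```

Under the same binary-bias assumption, `n_configs(16, 5, 4)` gives 1280 configurations for a 2 x 2 tiling and `n_configs(16, 5, 9)` gives 40960 for a 3 x 3 tiling. The count doubles with every added domain, which is why exhaustive search only stays practical for a limited number of domains.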


