Design of a Radiation Tolerant Computing System Based on a Many-Core FPGA Architecture

Presenter: Dr. Brock J. LaMeres
Authors: Dr. Brock J. LaMeres, Erwin Dunbar, Pat Kujawa, David Racek, Anthony Thomason, Colin Tilleman and Clint Gauer

Department of Electrical and Computer Engineering
Montana State University
Bozeman, MT
Acknowledgements

• This work was supported by:

Montana Space Grant Consortium
http://spacegrant.montana.edu

NASA Exploration Systems Mission Directorate
“Higher Education Program”
http://education.ksc.nasa.gov/esmdspacegrant/

• Special thanks to our project mentors from NASA’s
  *Radiation-Hardened Electronics for Space Environments (RHESE) Project*

  **Dr. Robert E. Ray**
  Marshall Space Flight Center
  Reconfigurable Computing Task

  **Dr. Andrew S. Keys**
  Marshall Space Flight Center
  RHESE Project Manager

  **Dr. Michael A. Johnson**
  Goddard Space Flight Center
  High Performance Processor Task
Motivation

- Radiation has a detrimental effect on electronics in space environments.
- The root cause is the creation of electron/hole pairs when radiation strikes the semiconductor portion of the device and ionizes the material.

Types

- **Alpha particles** (Terrestrial, from packaging/doping)
- **Neutrons** (Terrestrial, secondary effect from Galactic Cosmic Rays entering atmosphere)
- **Heavy ions** (Aerospace, direct ionization)
- **Protons** (Aerospace, secondary effect)
Motivation

- Two types of failure mechanisms are induced by radiation

  1) Total Ionizing Dose (TID)
     - The cumulative, long term ionizing damage to the device materials
     - Caused by low energy protons & electrons

  2) Single Event Effects (SEE)
     - Transient spikes caused by Heavy Ions and protons
     - Can be both destructive & non-destructive
Motivation (TID)

1) Total Ionizing Dose (TID)

- As the electrons and holes try to recombine, they experience different mobility rates ($\mu_n > \mu_p$)
- Over time, the ionized particles can get trapped in the oxide or substrate of the device prior to recombination
- This can lead to:
  - Threshold Shifting
  - Leakage Current
  - Timing Skew
2) Single Event Effects (SEEs)

- Transient voltage/current induced in devices
- This can lead to both Non-Destructive and Destructive effects

**Non-Destructive**

<table>
<thead>
<tr>
<th>Effect</th>
<th>Behavior</th>
</tr>
</thead>
<tbody>
<tr>
<td>Single Event Transient (SET)</td>
<td>A transient spike of voltage/current noise that can cause gate switching</td>
</tr>
<tr>
<td>Single Event Upset (SEU)</td>
<td>A transient captured in a storage device (FF/RAM) as a state change</td>
</tr>
<tr>
<td>Multi-Bit Upset (MBU)</td>
<td>Multiple, simultaneous SEUs</td>
</tr>
</tbody>
</table>

**Destructive**

<table>
<thead>
<tr>
<th>Effect</th>
<th>Behavior</th>
</tr>
</thead>
<tbody>
<tr>
<td>Single Event Latchup (SEL)</td>
<td>Transient biases the parasitic bipolar SCR in CMOS, causing latchup</td>
</tr>
<tr>
<td>Single Event Burnout (SEB)</td>
<td>Transient causes the device to draw high current, which damages the part</td>
</tr>
<tr>
<td>Single Event Gate Rupture (SEGR)</td>
<td>The energy is enough to damage the gate oxide</td>
</tr>
</tbody>
</table>
Mitigation of TIDs

1) Current Mitigation Techniques (TID)

- Parts can be “hardened” to TID through:
  - layout techniques (sizing of $Q_{\text{crit}}$, enclosed layout)
  - substrate doping
  - redundant circuitry

- Parts are specified in terms of:
  - “the amount of energy that can be tolerated by ionizing particles before the part performance is out of spec”
  - units are given in krad (Si), typically 300krad+

- Shielding Does Help
  - low energy protons/electrons can be stopped at the expense of weight
Mitigation of SEEs

2) Current Mitigation Techniques (SEEs)

- Triple Modular Redundancy (TMR)

- Reboot/Recovery Sequences

- Shielding **Does NOT** eliminate all SEEs
  - impractical to shield against high energy particles and Heavy Ions due to necessary mass
Drawback of Mitigation

- **Radiation Hardening = Slower Performance**
  - All TID mitigation techniques lead to slower performance

- TID mitigation **DOES NOT** prevent SEEs

---

FPGAs & Radiation

- **Radiation Mitigation in FPGAs**
  - RAM based FPGAs are traditionally *soft* to radiation
  - Fuse-based FPGAs provide some hardness, but give up the flexibility of real-time programmability

- **Exploiting Reconfiguration**
  - The flexibility of FPGAs enables novel approaches to radiation-tolerant computing
    
    *ex)* Dynamic TMR, spatial avoidance of TID failures
  
  - The flexibility of FPGAs is attractive to weight constrained Aerospace applications
    
    *ex)* Reduction of flight spares, internal spare circuitry
FPGAs as a Solution?

- Field Programmable Gate Arrays

- FPGAs have followed Moore’s Law and now yield comparable processing power to ASICs
Many-Core Architecture

- Radiation Tolerance Through Architecture
  - Redundant, Homogeneous, Soft Processors
  - At Any Given Time, 3 are configured in *Triple Modular Redundancy* (TMR)
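The voting step behind TMR can be illustrated with a minimal sketch. It assumes a bitwise 2-of-3 majority vote over the three processor outputs; the function names `majority_vote` and `find_faulty` are hypothetical and not taken from the presented design.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority of three processor outputs."""
    return (a & b) | (a & c) | (b & c)


def find_faulty(outputs):
    """Return the index of the single output that disagrees with the vote, or None."""
    voted = majority_vote(*outputs)
    mismatches = [i for i, value in enumerate(outputs) if value != voted]
    return mismatches[0] if len(mismatches) == 1 else None


# Illustrative values: processor 1 suffers a single bit flip (0x0EE -> 0x0EF).
outputs = [0x0EE, 0x0EF, 0x0EE]
voted = majority_vote(*outputs)   # the flipped bit is outvoted
faulty = find_faulty(outputs)     # index of the disagreeing processor
print(hex(voted), faulty)
```

A single upset is masked because the two matching outputs outvote it, and the disagreeing index tells the recovery logic which processor to repair.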
Many-Core Architecture

- Types of Radiation Faults Seen in FPGAs

1) Soft Faults
   - SEUs that can be recovered from using a reset

2) Medium Severity Faults
   - SEUs in configuration memory, which can only be recovered using reconfiguration

3) Hard Faults
   - Damage to part of the chip due to TID or Displacement Damage
Many-Core Architecture

- **Fault Recovery Procedures**

<table>
<thead>
<tr>
<th>Fault Type</th>
<th>Recovery Action</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Soft Faults</strong></td>
<td>
<ul>
<li>TMR voter detects fault</li>
<li>The 2 good processors complete the current task</li>
<li>The 2 good processors offload variable data</li>
<li>All 3 processors are reset</li>
<li>All 3 processors are re-initialized with variable data</li>
<li>All 3 processors resume operation in TMR</li>
</ul>
</td>
</tr>
<tr>
<td><strong>Medium Faults</strong></td>
<td>
<ul>
<li>Same general procedure, <em>except</em> the bad processor is <strong>partially reconfigured</strong> to reset its configuration RAM</li>
</ul>
</td>
</tr>
<tr>
<td><strong>Hard Faults</strong></td>
<td>
<ul>
<li>A spare processor is brought online to complete TMR</li>
<li>Bad processor is flagged as “DO NOT USE”</li>
</ul>
</td>
</tr>
</tbody>
</table>
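The recovery actions above can be sketched as a small software model. This is only an illustration: the class `ManyCoreSystem`, the `Fault` enum, and the exact ordering of steps for medium and hard faults are assumptions, and the real system implements this logic in FPGA hardware.

```python
from enum import Enum


class Fault(Enum):
    SOFT = "soft"      # recoverable with a reset
    MEDIUM = "medium"  # needs partial reconfiguration
    HARD = "hard"      # permanent damage, processor is retired


class ManyCoreSystem:
    """Toy model: 3 active processors in TMR plus a pool of spares."""

    def __init__(self, total=64):
        self.active = [0, 1, 2]
        self.spares = list(range(3, total))
        self.do_not_use = set()

    def recover(self, fault, bad):
        """Return the ordered recovery steps for a fault on processor `bad`."""
        if bad not in self.active:
            raise ValueError(f"processor {bad} is not in the active triad")
        steps = ["voter detects fault",
                 "good processors complete current task",
                 "good processors offload variable data"]
        if fault is Fault.HARD:
            if not self.spares:
                raise RuntimeError("no spare processors left")
            spare = self.spares.pop(0)  # assumed policy: lowest-numbered spare
            self.do_not_use.add(bad)
            self.active[self.active.index(bad)] = spare
            steps.append(f"flag processor {bad} as DO NOT USE, bring spare {spare} online")
        elif fault is Fault.MEDIUM:
            steps.append(f"partially reconfigure processor {bad}")
        steps += ["reset all 3 processors",
                  "re-initialize with variable data",
                  "resume TMR"]
        return steps


system = ManyCoreSystem()
system.recover(Fault.HARD, bad=1)
print(sorted(system.active), system.do_not_use)
```

In this model a hard fault on processor 1 retires it and promotes the first spare, so processors 0, 2, and 3 form the new TMR group.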
Many-Core Architecture

- **Advantages of this Approach**

  1) SEUs mitigated using traditional TMR

  2) Partial Reconfiguration technique increases *hardness* of RAM-based FPGAs

  3) Spatial avoidance of damaged regions of the FPGA extends system lifetime

  4) Logical approach can be applied to RHBD FPGA fabrics (*SIRF*, etc…) for increased radiation immunity
System Prototyping

- Many-Core Computing Architecture
  - 64 picoBlaze processors (3+61) implemented on a Virtex-5 FX50
  - The computer system controls basic peripherals
  - A push button is used to mimic soft SEUs
  - A PC GUI is created to inject hard failures
  - HyperTerminal is used to mimic medium severity faults requiring partial reconfiguration
  - Xilinx ChipScope used to monitor processor operation on all 64 processors

PC GUI to induce hard failures

ML507 V5 platform with 64 picoBlaze uPs

ChipScope Internal Logic Analyzer
System Demonstration

- **Initial Operation**
  - Processors 0, 1, and 2 are active (blue) and operating in TMR
  - Processors 3-63 provide 61 spare *picoBlaze* processors (gray)

ChipScope shows uP 0, 1, and 2 are running in synch with no faults

(GUI indicates uP 0, 1, and 2 are active (blue))

(showing address lines between uP and memory for all 64 processors)
System Demonstration

- **Soft Fault Recovery**
  - Processors 0, 1, and 2 are active (blue) operating in TMR
  - Processor 0 undergoes a soft fault and then recovers and resynchronizes

(ChipScope waveform of the address buses Address 0 through Address 9; the three TMR processors show activity while the remaining buses read 000)

System initialized and running normally in TMR mode.

Processor 0 has been corrupted by an SEU. The TMR detects the failure.

Processor 0 brought back into synch with other two processors.

GUI indicates uP 0, 1, and 2 are active (blue)
System Demonstration

- **Hard Fault Recovery**
  - Processor 1 undergoes a hard fault (induced by GUI, red)
  - The system shuts down uP #1 and brings spare processor uP #3 into TMR

  ![Waveform Diagram](image)
  - Processor 1 has a hard fault, so it is shut down
  - Spare processor 3 is brought online, resynchronized, and reinitialized to form TMR
  - GUI indicates uP 1 is in hard fault (red). uP 0,2,3 form TMR (blue).
System Demonstration

- **Multiple Hard Faults**
  - Multiple hard faults are present
  - uPs 1, 6, and 12 form TMR
System Demonstration

- **Medium Severity Fault Recovery (PR)**
  - An initial hard failure can be *repaired* by going back to the affected processor and reconfiguring it.
  - This handles the situation where an SEU occurred in the configuration RAM
  - For this type of fault, a simple reset will not recover the processor
  - **BUT**
    - the processor hardware is still usable.

- Logistics: a MicroBlaze soft processor is used to read the PR bit streams through the SystemACE and write to the ICAP port of the Virtex-5.
Timing/Area Impact

- **Soft Fault Recovery** (reset, reload variable information)

  **Timing Overhead**
  
<table>
<thead>
<tr>
<th>Activity</th>
<th>Clocks</th>
</tr>
</thead>
<tbody>
<tr>
<td>TMR interrupt</td>
<td>2</td>
</tr>
<tr>
<td>Reset</td>
<td>2</td>
</tr>
<tr>
<td>Read variable data from good processors</td>
<td>128</td>
</tr>
<tr>
<td>Write variable data to reset processor</td>
<td>128</td>
</tr>
<tr>
<td><strong>Total</strong></td>
<td><strong>260</strong></td>
</tr>
</tbody>
</table>

  (2 clks/inst, 64 bytes of RAM)

Total: 260 clocks = **2.6 us** (100 MHz V5 Clock)
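The arithmetic behind the total can be checked with a short sketch. It assumes that moving each of the 64 bytes costs one 2-clock instruction, which is how the 128-clock figures appear to be derived; the variable names are illustrative.

```python
CLOCK_HZ = 100e6          # 100 MHz Virtex-5 clock, from the slide
RAM_BYTES = 64            # variable data per processor, from the slide
CLKS_PER_INSTRUCTION = 2  # from the slide; one instruction per byte is an assumption

overhead_clocks = {
    "TMR interrupt": 2,
    "Reset": 2,
    "Read variable data": RAM_BYTES * CLKS_PER_INSTRUCTION,
    "Write variable data": RAM_BYTES * CLKS_PER_INSTRUCTION,
}

total_clocks = sum(overhead_clocks.values())
total_us = total_clocks / CLOCK_HZ * 1e6  # convert seconds to microseconds
print(total_clocks, round(total_us, 2))
```

The data transfers account for 256 of the 260 clocks, so the overhead scales almost linearly with the amount of variable data that must be copied.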
Partial Reconfiguration Constraints

- For our V5, the smallest quantum that can be partially reconfigured is 20 CLBs
  - 1 CLB contains 2 slices
  - 1 slice contains:
    - four LUTs
    - four storage elements
    - wide-function multiplexers
    - carry logic

- If you use BRAM in your design, 4 BRAMs must be partially reconfigured together

- Care must be given to placing circuitry within the smallest partially reconfigured tile

- Bus Macros are used to provide fixed routing channels between tiles.
PR of a *picoBlaze* Core

**Physical *picoBlaze* resource estimation:**

<table>
<thead>
<tr>
<th>Site Type</th>
<th>Available</th>
<th>Required</th>
<th>% Util</th>
</tr>
</thead>
<tbody>
<tr>
<td>LUT</td>
<td>320</td>
<td>163</td>
<td>50.94%</td>
</tr>
<tr>
<td>FF</td>
<td>320</td>
<td>76</td>
<td>23.75%</td>
</tr>
<tr>
<td>SLICEL</td>
<td>60</td>
<td>35</td>
<td>58.33%</td>
</tr>
<tr>
<td>SLICEM</td>
<td>20</td>
<td>12</td>
<td>60.00%</td>
</tr>
<tr>
<td>RAMBFIFO36</td>
<td>4</td>
<td>1</td>
<td>25.00%</td>
</tr>
</tbody>
</table>

- 24 CLBs, 1 BRAM

**PR region resource use:**
- 2 columns of 20 CLBs
- 1 column of BRAM
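The "Available" figures in the table above appear to follow from the PR region geometry (2 columns of 20 CLBs, 2 slices per CLB, four LUTs and four storage elements per slice). The sketch below reproduces that arithmetic; the helper `region_capacity` is hypothetical.

```python
SLICES_PER_CLB = 2
LUTS_PER_SLICE = 4
FFS_PER_SLICE = 4
CLBS_PER_COLUMN = 20  # smallest reconfigurable quantum on this Virtex-5


def region_capacity(clb_columns):
    """Logic resources contained in a PR region of `clb_columns` CLB columns."""
    clbs = clb_columns * CLBS_PER_COLUMN
    slices = clbs * SLICES_PER_CLB
    return {
        "CLB": clbs,
        "slice": slices,
        "LUT": slices * LUTS_PER_SLICE,
        "FF": slices * FFS_PER_SLICE,
    }


capacity = region_capacity(2)
required = {"LUT": 163, "FF": 76}  # picoBlaze requirements from the table
utilization = {k: round(100 * required[k] / capacity[k], 2) for k in required}
print(capacity, utilization)
```

Two columns give 40 CLBs, 80 slices (matching 60 SLICEL + 20 SLICEM), and 320 LUTs and FFs, which is consistent with the utilization percentages listed.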

**Bitstream file size (LX50T):**
- Partial bitstream for one PicoBlaze: 31.2 KB
- Full bitstream: 1,716 KB

**Reconfiguration time:**
- Roughly 200 clks/Byte (measured)
- Measured time: **66 ms** (100 MHz clk)
- Using MicroBlaze driven ICAP processor
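As a rough cross-check, the measured reconfiguration time can be estimated from the bitstream size and the per-byte cost. This sketch assumes 1 KB = 1024 bytes and that the 200 clks/Byte rate holds across the whole transfer; the full-bitstream figure is an extrapolation at the same rate, not a measured value.

```python
PARTIAL_BITSTREAM_KB = 31.2  # one PicoBlaze PR region, from the slide
FULL_BITSTREAM_KB = 1716     # full device bitstream, from the slide
CLKS_PER_BYTE = 200          # roughly, as measured
CLOCK_HZ = 100e6             # 100 MHz clock
BYTES_PER_KB = 1024          # assumption; the slide does not say 1000 or 1024


def reconfig_seconds(size_kb):
    """Estimated ICAP write time for a bitstream of `size_kb` kilobytes."""
    return size_kb * BYTES_PER_KB * CLKS_PER_BYTE / CLOCK_HZ


partial_ms = reconfig_seconds(PARTIAL_BITSTREAM_KB) * 1e3
full_s = reconfig_seconds(FULL_BITSTREAM_KB)
print(round(partial_ms, 1), round(full_s, 2))
```

The estimate lands a few milliseconds under the measured 66 ms, which is plausible given that 200 clks/Byte is only approximate. At the same rate, a full reconfiguration would take seconds rather than milliseconds, which illustrates why partial reconfiguration is attractive for fault recovery.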

A single PicoBlaze PR region

Smallest *picoBlaze* PR Tile = 40 CLB + 4 BRAM
Future Work

- *microBlaze* Soft Processor

Shuttle Processor Board

Virtex-5
Future Work

Questions?