#### **2010 IEEE Aerospace Conference**

Big Sky, MT, March 11, 2010

Session#: 7.04 Reconfigurable Computing System Technologies

Pres #: 7.0401, Paper ID: 1079 Rm: Elbow 2, Time: 4:30pm

# **Spatial Avoidance of Hardware Faults using FPGA Partial Reconfiguration of Tile-Based Soft Processors**

**Authors:** Clint Gauer, Brock J. LaMeres & David Racek

Department of Electrical and Computer Engineering

Montana State University

**Presenter:** Brock J. LaMeres









### Acknowledgements

• This work was supported by:



Montana Space Grant Consortium (NASA EPSCoR)

http://spacegrant.montana.edu



NASA Exploration Systems Mission Directorate "Higher Education Program"

http://education.ksc.nasa.gov/esmdspacegrant/

• Special thanks to our project mentors from NASA's Advanced Avionics & Processor Systems (AAPS) Project

**Dr. Robert E. Ray**, Marshall Space Flight Center
Reconfigurable Computing Task

**Dr. Andrew S. Keys**, Marshall Space Flight Center
AAPS Project Manager

**Dr. Michael A. Johnson**, Goddard Space Flight Center
High Performance Processor Task



### **Motivation**

- Radiation has a detrimental effect on electronics in space environments.
- The root cause is the creation of electron/hole pairs as radiation strikes the semiconductor portion of the device and ionizes the material.





#### **Types**

- alpha particles (Terrestrial, from packaging/doping)

- Neutrons (Terrestrial, secondary effect from Galactic Cosmic Rays entering atmosphere)

- Heavy ions (Aerospace, direct ionization)

- Protons (Aerospace, secondary effect)



### **Motivation**

- Two types of failure mechanisms are induced by radiation
  - 1) Total Ionizing Dose (TID)
    - The cumulative, long-term ionizing damage to the device materials
    - Caused by low-energy protons & electrons
  - 2) Single Event Effects (SEE)
    - Transient spikes caused by heavy ions and protons
    - Can be both destructive & non-destructive



## **Motivation (TID)**

#### 1) Total Ionizing Dose (TID)

- As the electrons and holes try to recombine, they experience different mobilities ($\mu_n > \mu_p$)
- Over time, the ionized particles can get trapped in the oxide or substrate of the device prior to recombination



- This can lead to:
  - Threshold Shifting
  - Leakage Current
  - Timing Skew



### **Motivation (SEEs)**

#### 2) Single Event Effects (SEEs)

- Transient voltage/current induced in devices
- This can lead to both Non-Destructive and Destructive effects



| Non-Destructive                          | Behavior                                                              |
|------------------------------------------|-----------------------------------------------------------------------|
| Single Event Transient (SET)             | A transient spike of voltage/current noise; can cause gate switching  |
| Single Event Upset (SEU)                 | A transient captured in a storage device (FF/RAM) as a state change   |
| Single Event Functional Interrupt (SEFI) | A fault that cannot be recovered from using a reset                   |
| Multi-Bit Upsets (MBU)                   | Multiple, simultaneous SEUs                                           |

| Destructive                      | Behavior                                                                  |
|----------------------------------|---------------------------------------------------------------------------|
| Single Event Latchup (SEL)       | Transient biases the parasitic bipolar SCR in CMOS, causing latchup       |
| Single Event Burnout (SEB)       | Transient causes the device to draw high current, which damages the part  |
| Single Event Gate Rupture (SEGR) | The energy is enough to damage the gate oxide                             |



### **Mitigation of TIDs**

#### 1) Current Mitigation Techniques (TID)

- Parts can be "hardened" to TID through:
  - layout techniques (sizing of Q<sub>crit</sub>, enclosed layout)
  - guard rings
  - substrate doping
  - redundant circuitry
- Parts are specified in terms of:
  - "the amount of energy that can be tolerated by ionizing particles before the part performance is out of spec"
  - units are given in krad (Si), typically 300 krad+
- Shielding Does Help
  - low energy protons/electrons can be stopped at the expense of weight



### **Mitigation of SEEs**

#### 2) Current Mitigation Techniques (SEEs)

- Triple Modular Redundancy (TMR)

- Reboot/Recovery Sequences



- Shielding <u>Does NOT</u> eliminate all SEEs
  - impractical to shield against high energy particles and Heavy Ions due to necessary mass
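
To make the TMR item above concrete, here is a minimal Python sketch of a bitwise 2-of-3 majority voter. The function names `majority_vote` and `faulty_module` are illustrative and do not come from the slides, and a voter in the prototype would presumably be FPGA logic, not software.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority of three redundant module outputs."""
    return (a & b) | (a & c) | (b & c)


def faulty_module(a: int, b: int, c: int):
    """Return the index (0, 1, or 2) of the single module that disagrees
    with the voted result, or None if all agree or no single culprit exists."""
    voted = majority_vote(a, b, c)
    mismatches = [i for i, out in enumerate((a, b, c)) if out != voted]
    return mismatches[0] if len(mismatches) == 1 else None


if __name__ == "__main__":
    # Module 2 has one flipped bit; the voter masks it and identifies the module.
    print(hex(majority_vote(0x5A, 0x5A, 0x5B)), faulty_module(0x5A, 0x5A, 0x5B))
```

The voter masks a single upset module but cannot by itself repair it, which is why a reboot or recovery sequence is listed alongside TMR.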



### **Drawback of Mitigation**

- Radiation Hardening = Slower Performance
  - All TID mitigation techniques lead to slower performance



- TID mitigation **DOES NOT** prevent SEEs



### **FPGAs & Radiation**

#### • Radiation Mitigation in FPGAs

- RAM-based FPGAs are traditionally *soft* to radiation
- Fuse-based FPGAs provide some hardness, but give up the flexibility of real-time programmability



#### • Exploiting Reconfiguration

- The flexibility of FPGAs enables novel approaches to radiation-tolerant computing
  - ex) Dynamic TMR, Spatial Avoidance of TID failures
- The flexibility of FPGAs is attractive to weight-constrained aerospace applications
  - ex) Reduction of flight spares, internal spare circuitry



### FPGAs as a Solution?

#### • Field Programmable Gate Arrays





#### • Radiation Tolerance Through Architecture

- Redundant, Homogeneous, Soft Processors





### Types of Radiation Faults Seen in FPGAs



#### 1) Soft (SEU, SET)

- SEUs that can be recovered from using a reset

#### 2) Medium (SEFI)

- SEUs in configuration memory, which can only be recovered using partial reconfiguration

#### 3) Hard (TID / Displacement Damage)

- Damage to part of the chip due to TID or Displacement Damage



#### • Fault Recovery Procedures

#### Fault Type Recovery Action

#### **Soft Faults**

- TMR Voter detects fault
- The 2 good processors complete the current task
- The 2 good processors offload variable data
- All 3 processors are reset
- All 3 processors re-initialized with variable data
- All 3 processors resume operation in TMR

#### **Medium Faults**

- Same general procedure, *except* the bad processor is partially reconfigured to reset its configuration RAM



#### **Hard Faults**

- A spare processor is brought online to complete TMR
- The bad processor is flagged as "DO NOT USE"
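
The three recovery procedures above can be sketched as a small bookkeeping model in Python. This is only an illustration under assumptions: the class `TileManager`, the enum `Fault`, and the log strings are hypothetical names, spares are taken in ascending order, and the resynchronization step after a hard fault is assumed to match the soft-fault sequence (the slides do not say).

```python
from enum import Enum


class Fault(Enum):
    SOFT = "soft"      # SEU/SET: recoverable with a reset
    MEDIUM = "medium"  # SEFI: configuration RAM upset, needs partial reconfiguration
    HARD = "hard"      # TID / displacement damage: tile is permanently retired


class TileManager:
    """Tracks which tiles form the TMR triplet, which are spare, and which are retired."""

    def __init__(self, n_tiles: int = 16):
        self.active = [0, 1, 2]
        self.spares = list(range(3, n_tiles))
        self.do_not_use = set()
        self.log = []

    def recover(self, tile: int, fault: Fault) -> None:
        if tile not in self.active:
            raise ValueError(f"tile {tile} is not in the active TMR set")
        self.log.append(f"voter flags tile {tile}; good processors offload variable data")
        if fault is Fault.MEDIUM:
            self.log.append(f"partially reconfigure tile {tile}")
        elif fault is Fault.HARD:
            if not self.spares:
                raise RuntimeError("no spare tiles left")
            spare = self.spares.pop(0)  # assumption: lowest-numbered spare first
            self.active[self.active.index(tile)] = spare
            self.do_not_use.add(tile)
            self.log.append(f"tile {tile} flagged DO NOT USE; spare {spare} brought online")
        # Assumed common tail for every fault type.
        self.log.append("reset all 3 processors, reload variable data, resume TMR")


if __name__ == "__main__":
    mgr = TileManager()
    mgr.recover(2, Fault.HARD)
    print(mgr.active, mgr.do_not_use)
```

Under these assumptions, hard faults on tiles 0, 1, 2, 4, 6, and 7 in that order leave tiles 3, 5, and 8 forming the TMR set, which is one sequence that would produce the multiple-hard-fault configuration shown later in the prototype.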





- Advantages of this Approach
  - 1) SEUs mitigated using traditional TMR
  - 2) Partial Reconfiguration technique increases hardness of RAM-based FPGAs
  - 3) Spatial avoidance of damaged regions of the FPGA extends system lifetime
  - 4) Logical approach can be applied to RHBD FPGA fabrics (*SIRF*, etc.) for increased radiation immunity







# **System Prototyping**

#### • Many-Core Computing Architecture

- 16 picoBlaze Processors (3+13) implemented on a Virtex-5 LX50
- The computer system controls basic peripherals
- A push button is used to mimic soft SEUs
- A PC GUI is created to inject hard failures
- HyperTerminal is used to mimic medium severity faults requiring partial reconfiguration
- Xilinx ChipScope used to monitor processor operation on all 16 processors





#### Normal Operation

- Processors **0, 1, and 2** are active (blue) and operating in TMR
- Processors **3-15** provide spare *picoBlaze* processors (gray)



(showing address lines between uP and memory for all 16 processors)



#### • Soft Fault Recovery

- Processors **0**, **1**, and **2** are active (blue) operating in TMR
- Processor **0** undergoes a soft fault, then recovers and resynchronizes





#### Hard Fault Recovery

- Processor 2 undergoes a hard fault (induced by GUI, shown in red)
- The system shuts down uP #2 and brings spare processor uP #3 into TMR





#### Multiple Hard Faults

- Multiple hard faults are present
- uPs 3, 5, and 8 form TMR





### Timing/Area Impact

• Soft Fault Recovery (reset, reload variable information)

#### **Timing Overhead**

- TMR interrupt: 2 clocks
- Reset: 2 clocks
- Read variable data from good processors: 128 clocks (2 clks/inst, 64 bytes of RAM)
- Write variable data to reset processor: 128 clocks (2 clks/inst, 64 bytes of RAM)

**Total: 260 clocks** = **2.6 µs** (100 MHz V5 clock)
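
The total above follows directly from the per-step counts. The short Python sketch below reproduces the arithmetic, assuming one instruction per byte of RAM transferred, which is what "2 clks/inst, 64 bytes" implies; the variable names are illustrative.

```python
CLOCK_HZ = 100e6     # 100 MHz Virtex-5 clock
RAM_BYTES = 64       # variable data per processor
CLKS_PER_INST = 2    # picoBlaze instruction time, as stated on the slide

# Clock cycles for each step of the soft-fault recovery sequence.
steps = {
    "tmr_interrupt": 2,
    "reset": 2,
    "read_variable_data": RAM_BYTES * CLKS_PER_INST,   # 128 clocks
    "write_variable_data": RAM_BYTES * CLKS_PER_INST,  # 128 clocks
}

total_clocks = sum(steps.values())
recovery_time_us = total_clocks / CLOCK_HZ * 1e6

if __name__ == "__main__":
    print(f"{total_clocks} clocks = {recovery_time_us:.1f} us")
```

The two data transfers account for 256 of the 260 clocks, so the recovery time scales almost linearly with the amount of variable data.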



#### Medium Severity Fault Recovery (SEFI)

- An initial hard failure can be *repaired* by going back to the affected processor and reconfiguring it.
- This handles the situation where an SEU occurred in the configuration RAM
- For this type of fault, a simple reset will not recover the processor

#### BUT

the processor hardware is still usable.





- Medium Severity Fault Recovery (SEFI on uP #0)
  - Repairing Processor 0 using Partial Reconfiguration





- Medium Severity Fault Recovery (SEFI on uP #1)
  - Repairing Processor 1 using Partial Reconfiguration



ICAP address x00018780 corresponds to partial reconfiguration of Tile 1



### **Partial Reconfiguration Constraints**

- For our V5, the smallest quantum that can be partially reconfigured is 20 CLBs
  - 1 CLB contains 2 Slices
  - 1 Slice contains:
    - four LUTs
    - four storage elements
    - wide-function multiplexers
    - carry logic
- If you use BRAM in your design, <u>4 BRAMs</u> must be partially reconfigured together
- Care must be given to placing circuitry within the smallest partially reconfigured tile



Bus Macros are used to provide fixed routing channels between tiles.



# PR of a picoBlaze Core

#### Physical *picoBlaze* resource estimation:

| Site Type  | Available | Required | % Util |
|------------|-----------|----------|--------|
| LUT        | 320       | 163      | 50.94  |
| FF         | 320       | 76       | 23.75  |
| SLICEL     | 60        | 35       | 58.33  |
| SLICEM     | 20        | 12       | 60.00  |
| RAMBFIFO36 | 4         | 1        | 25.00  |

- 24 CLBs, 1 BRAM

#### PR region resource use:

- 2 columns of 20 CLBs
- 1 column of BRAM

Smallest picoBlaze PR Tile = 40 CLBs + 4 BRAMs
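
The tile size can be cross-checked against the resource table with a few lines of Python. This is a rough sketch that assumes slices pack perfectly into CLBs and that regions grow in whole columns of 20 CLBs; the function names are illustrative.

```python
import math

SLICES_PER_CLB = 2
LUTS_PER_SLICE = 4
FFS_PER_SLICE = 4
CLBS_PER_COLUMN = 20  # smallest partially reconfigurable quantum on the V5


def columns_needed(required_slices: int) -> int:
    """CLB columns needed to hold the given number of slices."""
    clbs = math.ceil(required_slices / SLICES_PER_CLB)
    return math.ceil(clbs / CLBS_PER_COLUMN)


def tile_capacity(clb_columns: int) -> dict:
    """Resources available in a PR region of the given number of CLB columns."""
    clbs = clb_columns * CLBS_PER_COLUMN
    slices = clbs * SLICES_PER_CLB
    return {
        "CLB": clbs,
        "slices": slices,
        "LUT": slices * LUTS_PER_SLICE,
        "FF": slices * FFS_PER_SLICE,
    }


if __name__ == "__main__":
    cols = columns_needed(35 + 12)       # SLICEL + SLICEM required by one picoBlaze
    cap = tile_capacity(cols)
    print(cols, cap)
    print(f"LUT utilization: {163 / cap['LUT'] * 100:.2f}%")
```

The 47 required slices round up to 24 CLBs, which exceeds one 20-CLB column, so two columns (40 CLBs, 320 LUTs, 320 FFs) are needed. This matches the "Available" column of the table.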

#### **Bitstream file size (LX50T):**

- Partial bitstream for one PicoBlaze: 31.2 KB
- Full bitstream: 1,716 KB

#### **Reconfiguration time:**

- Roughly 200 clks/byte (measured)
- Measured time: 66 ms (100 MHz clk)
- Using a MicroBlaze-driven ICAP processor
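
As a sanity check, the measured reconfiguration time is consistent with the bitstream size and the per-byte cost. The sketch below assumes 1 KB = 1000 bytes (the slides do not say which convention is used) and uses the hypothetical helper name `reconfig_time_ms`.

```python
def reconfig_time_ms(bitstream_bytes: float,
                     clks_per_byte: float = 200,
                     clock_hz: float = 100e6) -> float:
    """Estimated reconfiguration time in milliseconds."""
    return bitstream_bytes * clks_per_byte / clock_hz * 1e3


if __name__ == "__main__":
    partial_bytes = 31.2e3  # partial bitstream for one PicoBlaze
    full_bytes = 1716e3     # full bitstream
    print(f"partial: {reconfig_time_ms(partial_bytes):.1f} ms")
    print(f"full/partial size ratio: {full_bytes / partial_bytes:.0f}x")
```

The estimate of about 62 ms is in line with the measured 66 ms, given that 200 clks/byte is itself a rounded figure. The full bitstream is 55 times larger than the partial one, which illustrates why reconfiguring a single tile is far cheaper than reloading the whole device.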



A single PicoBlaze PR region





# **Questions**







