3/02/2024

Reliability issues and IC failure in VLSI




Developing an IC is a time-, labor-, research- and money-intensive process. Physical and electrical failure mechanisms can spoil the whole effort, so we must understand and analyse what can make an IC fail. This article discusses reliability issues in CMOS and the main reasons that lead to IC failure in VLSI technology.

 

What is Reliability?

Reliability means how likely it is that a product/system/service will work well for a certain amount of time or under specific conditions without any issues.



In simpler terms, reliability is like:

i. Probability of success

ii. Durability

iii. Dependability

iv. Quality over time

v. Availability to do its job


Failure is deviation from compliance with the system specification for a given period of time. Failures can result from different types of faults: design bugs, manufacturing defects, wear-out of oxide or interconnect, external disturbances, or intentional mishandling of a product. Not all faults lead to errors, however. A number of physical failure mechanisms can affect the reliability of a CMOS ASIC.


Yield and reliability are two of the most important aspects in the development of a new technology. Designing reliable CMOS chips involves understanding and addressing the potential causes of failure.

Reliability Factors – I & II

If a device is used outside its intended use conditions, a failure may occur. The reliability of a device depends on how much stress it can handle. Some factors related to failure are:

i. Electric load, ii. Temperature, iii. Humidity, iv. Mechanical Stress, v. Static Electricity, vi. Effect of Repeated Stress

(i) Electric Load :

- Operating conditions determine the life of semiconductor devices.

- Electric power raises the junction temperature, which might lead to device failure. The electric current should be kept as low as possible.

- Surge current that flows when a switch is turned on or off, and the surge voltage of an inductive (L) load, must be handled so that they do not exceed the maximum rated values.

(ii) Temperature :

- Temperature affects the life of semiconductor products. A rapid or gradual temperature change degrades device characteristics and can lead to malfunction.

- The life “L” shortens as the temperature “T” rises. This dependence is commonly described by an Arrhenius-type relation.

- A ventilation device or heat radiation device is used to avoid overheating.
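The life-temperature formula itself is not reproduced above. A minimal sketch, assuming the usual Arrhenius form L = A × exp(Ea / (k T)), shows how strongly life depends on temperature. The function name, the activation energy of 0.7 eV and the two temperatures are illustrative assumptions, not values from this article.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_life_ratio(t_use_c, t_stress_c, ea_ev):
    """Ratio L(use) / L(stress), assuming L = A * exp(Ea / (k * T)).

    The constant A cancels, so only the two temperatures and the
    activation energy Ea (an assumed, mechanism-specific value) matter.
    """
    t_use_k = t_use_c + 273.15
    t_stress_k = t_stress_c + 273.15
    exponent = (ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_use_k - 1.0 / t_stress_k)
    return math.exp(exponent)


# Illustrative numbers only: 55 C in use versus 125 C under stress, Ea = 0.7 eV.
ratio = arrhenius_life_ratio(55.0, 125.0, 0.7)
print(f"Estimated life at 55 C is about {ratio:.0f}x the life at 125 C")
```

Because the constant A cancels in the ratio, the sketch can compare two temperatures without knowing the absolute life at either one.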

(iii) Humidity:

- Usually IC chips are covered with surface protective film to protect from humidity. If a device is operated under severe humidity conditions, it should be operated particularly carefully.

(iv) Mechanical Stress :

- If the device is strongly vibrated during transportation, or if an extremely strong force is applied to a device during installation, the device may be directly, mechanically damaged. In addition, moisture or a contaminant may enter the device through the damaged area, and may cause deterioration of the device.

(v)  Static electricity:

- Electrostatic charge damages equipment. Equipment incorporating devices is often charged with static electricity, particularly now that plastic is generally used for the casing and structure of equipment.

- Human bodies can also be charged with static electricity.

- While handling semiconductor devices, it is necessary to take preventive measures against static charge.

- This issue has become more serious as device dimensions are aggressively scaled and operating frequencies become higher.

(vi) Effect of repeated stress :

- A repeatedly applied stress can be more damaging than a steady stress.

- A high-low temperature cycle and an intermittent internal heat generation cycle can apply stresses repeatedly. The effects of such cycles, such as rearrangement of the material structure and fatigue deterioration of resistance to distortion, are examined and used in the evaluation of failures.


Failure Mechanisms  

(a) Time Dependent Dielectric Breakdown (TDDB):



Gate oxide thickness (Tox) has shrunk with each technology node, so the electric field across the oxide is getting ever stronger. Oxide film breakage is caused by: (i) an initial defect, (ii) deterioration of the oxide film. An initial defect leads to an early failure. Deterioration of the oxide film leads to a long-term reliability failure.

The oxide layer breaks down if the applied electric field exceeds its dielectric breakdown strength. Even a lower electric field, if applied for a longer period of time, may cause breakage as time elapses. This type of breakage is referred to as time dependent dielectric breakdown (TDDB). An empirical formula expresses the TDDB life in terms of the following quantities:

t: Life in practical use (h)

tt: Life in test (h)

β: Electric field acceleration factor

E: Electric field strength in practical use (MV/cm)

Et: Electric field strength in test (MV/cm)

Ea: Activation energy (eV)

k: Boltzmann constant (eV/K)

T: Temperature for actual use (K)

Tt: Test temperature (K)
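The formula itself is not reproduced above. One form often quoted in semiconductor reliability handbooks, and consistent with the symbols listed, is t = tt × 10^(β (Et − E)) × exp[(Ea / k) (1/T − 1/Tt)]. The sketch below assumes that form. The base-10 field term in particular is an assumption, and all numeric inputs are illustrative rather than measured values.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # k in eV/K


def tddb_life_hours(tt_hours, beta, e_use, e_test, ea_ev, t_use_k, t_test_k):
    """Extrapolate TDDB life from test conditions to use conditions.

    Assumed model (not confirmed by the article):
        t = tt * 10**(beta * (Et - E)) * exp((Ea / k) * (1/T - 1/Tt))
    Fields are in MV/cm, temperatures in kelvin, beta in decades per MV/cm.
    """
    field_term = 10.0 ** (beta * (e_test - e_use))
    temp_term = math.exp(
        (ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_use_k - 1.0 / t_test_k)
    )
    return tt_hours * field_term * temp_term


# Illustrative inputs only: 100 h survived at 8 MV/cm and 398 K in test,
# extrapolated to 5 MV/cm and 328 K in use.
life = tddb_life_hours(100.0, 1.0, 5.0, 8.0, 0.5, 328.0, 398.0)
print(f"Extrapolated life in use: {life:.3g} h")
```

The two multiplicative terms make the screening idea explicit: raising the field or the temperature during burn-in shortens the time needed to expose weak oxides.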

Effective methods to prevent these failures are:

(i) optimising the process in order to minimise variability,

(ii) forming an oxide film with fewer defects,

(iii) screening with a high electric field during inspection/burn-in.


(b) Negative Bias Temperature Instability (NBTI) :

Four types of electric charge exist in gate oxide films:

(1) Mobile ionic charge Qm , (2) Fixed oxide charge Qf ,

(3) Interface trapped charge Qit , (4) Oxide trapped charge Qot

NBTI is an increase in the absolute threshold voltage and a degradation of the mobility, drain current and transconductance of p-MOSFETs, occurring at either negative Vg or elevated temperature. The combination of the two produces a stronger and faster NBTI effect. Such fields and temperatures are typically encountered during burn-in and during routine operation of high-performance ICs.

Silicon has four valence electrons, so each atom in the crystal bonds to four neighbours. At the surface of the silicon crystal some neighbouring atoms are missing, which leaves dangling bonds that act as traps. The density of these interface states is denoted Dit. After oxidation most interface states are saturated with oxygen atoms and the interface quality improves. To reduce the number of dangling bonds further, the surface is annealed in forming gas (a mixture of hydrogen and nitrogen). The dangling silicon bonds are passivated by forming Si-H bonds, which reduces the number of electrically active interface states to an acceptable range.


These Si-H bonds have a lower binding energy. Elevated temperature and high electric fields break them, and the interface states are reactivated. The exact properties of the interface defects, which are trivalent silicon atoms with one unpaired valence electron, depend on the exact atomic configuration and on the orientation of the substrate. Holes interact with the Si-H bonds and weaken them, and at elevated temperature the bonds dissociate:

Si3≡Si-H + h+ → Si3≡Si• + H+

Bias temperature instability can be observed in both p-channel and n-channel MOSFETs. However, p-channel MOSFETs under negative Vg stress are more susceptible to this kind of degradation. It has been reported that cold holes in the channel are important for NBTI degradation. Since an n-channel MOSFET biased into accumulation also has holes at the surface of the substrate, its threshold voltage shift would be expected to be similar to that of a p-channel MOSFET. Therefore, a lack of holes cannot be the cause of the different degradation behavior.

Impact of NBTI on Circuits :

(i) Occurs primarily in p-channel MOSFETs with negative gate voltage bias and is negligible for positive gate voltage.

(ii) Usually occurs during the “high” state of p-MOSFET inverter operation.

(iii) Leads to timing shifts and potential circuit failure due to an increased spread in signal arrival times in logic circuits.

(iv) Asymmetric degradation in timing paths can make sensitive logic circuits non-functional and cause product field failures.


(c) Hot Carrier Injection (HCI):






Hot carrier injection is one of the most significant reliability problems of state-of-the-art MOSFETs. Because it is difficult to reduce the power supply voltage in step with device dimensions, the electric field strength in DSM and nanoscale devices keeps increasing. “Hot carrier” is a generic name for the high-energy electrons and holes generated in the transistor. Hot carriers injected into the gate oxide film generate interface states and fixed charge, and eventually degrade the Vt and Gm of the MOSFET. As the Vt of the FET increases, the circuit becomes slower and finally operates abnormally.

Hot carriers are easily generated when Vg < Vd/2. When Vd > Vg, the carriers in the channel collide with the Si crystal lattice and generate pairs of hot electrons and hot holes (impact ionization). These pairs act as hot carriers. Under a strong Vd, hot carriers gain enough energy to surmount the barrier at the Si/SiO2 interface and pass through the gate oxide into the gate. As a result, either the gate oxide film is charged or the Si/SiO2 interface is damaged.

This leads to a change in transistor characteristics. The generation mechanisms are: channel hot electrons (CHE), avalanche hot carriers (AHC) and substrate hot electrons (SHE).

AHC shows a remarkable change when devices are miniaturized.


(d) Soft Error : 

A very small amount of radioactive elements (U, Th, etc.) is present in the package material. The α particles radiated from these elements can cause abnormal operation of devices, a problem referred to as a soft error. The abnormal operation is temporary, so writing the data again restores normal operation. The problem is more dominant in advanced-node devices: at these dimensions the electric charge of the signals handled in the device is small, so the charge of the noise generated by α particles in the chip has an impact that cannot be ignored.

In the memory cell mode, the error occurs at the cell capacitor that stores 1-bit data (1 bit = minimum data unit of dynamic RAM). The α particles generate electron-hole pairs in the substrate and lose their energy in doing so. The electrons generated in this process can invert the data of the cell capacitor. A cell capacitor is considered “L” if electrons exist and “H” if electrons do not exist, so if α particles generate electrons in the cell capacitor, data “H” is inverted to data “L”. This is referred to as a soft error in the memory cell mode.

Soft errors affect memories, registers, and combinational logic. Memories use error detecting and correcting codes to tolerate soft errors, so these errors rarely turn into failures in a well-designed system.

The cell capacitor data is read out to the bit line by diffusion and then compared with the reference potential. If electrons generated by α particles flow into the bit line, the potential of the read-out data or the reference potential may be lowered. If the data potential is lowered, the data is inverted from “H” to “L”. If the reference potential is lowered, the data is inverted from “L” to “H”. This is referred to as a soft error in the bit line mode. If the operation cycle (cycle time) of the dynamic RAM is shortened, the reference potential is compared with the data potential more frequently, so soft errors in the bit line mode increase. On the other hand, a change in the cycle time does not affect soft errors in the memory cell mode.

Prevention of Soft Error :

(i) using package material that contains fewer radioactive elements (the source of α particles),

(ii) preventing α particles from entering the chip by coating the chip with organic material,

(iii) improving the bit line structure using wire materials such as Al and poly-Si, improving the sense amplifier, adopting the return bit line, etc.
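The text above notes that memories rely on error detecting and correcting codes to tolerate soft errors. As a minimal sketch of the idea, the Hamming(7,4) code below corrects any single flipped bit in a 7-bit word. It is a textbook code chosen for illustration, not a scheme this article attributes to any particular memory, and the function names are hypothetical.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits as the 7-bit codeword [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The syndrome is the 1-based position of the flipped bit, or 0 if none.
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


# Simulate a soft error: one stored bit is flipped, for example by an alpha particle.
stored = hamming74_encode([1, 0, 1, 1])
stored[4] ^= 1
recovered = hamming74_decode(stored)
print(recovered == [1, 0, 1, 1])
```

A code like this corrects one upset per word. Two simultaneous flips in the same word would be miscorrected, which is why practical schemes typically add an extra parity bit to detect double errors.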

(e) Electromigration (EM)


A chip may exceed 100 °C during practical operation. High-frequency power loss and the consequent heat dissipation contribute to the increased temperature, and the rise in temperature enhances solid-state metal ion diffusion. Electromigration is caused by scattering of the moving electrons with the ions, i.e., by momentum transfer between electrons and ions in metal interconnects. This ion-electron interaction is sometimes referred to as the "electron wind". It can cause a wire to break or to short-circuit to another wire. A void in an interconnect can lead to an open circuit, i.e. chip failure.

EM is one of the most menacing and persistent threats to interconnect reliability. The mean time to failure due to electromigration is given by Black's equation:

MTTF = A × J^(−n) × exp(Ea / (k T))

MTTF : Mean time to failure (h) , A : Constant of wire, J : Current density (A/cm2), n : Constant ,  Ea : Activation energy (eV) , k : Boltzmann constant (eV/K), T : Absolute temperature of wire (K)
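A minimal Python sketch of Black's equation, MTTF = A × J^(−n) × exp(Ea / (k T)), using the symbols defined above. The values n = 2 and Ea = 0.7 eV and the operating points are illustrative assumptions. Real values depend on the metal and the process.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # k in eV/K


def black_mttf(a_const, j_a_per_cm2, n, ea_ev, temp_k):
    """Mean time to failure from Black's equation.

    The result is in the units implied by a_const, which is wire-specific
    and normally fitted from accelerated test data.
    """
    return a_const * j_a_per_cm2 ** (-n) * math.exp(
        ea_ev / (BOLTZMANN_EV_PER_K * temp_k)
    )


def mttf_ratio(j1, temp1_k, j2, temp2_k, n, ea_ev):
    """MTTF at condition 1 divided by MTTF at condition 2 (a_const cancels)."""
    return black_mttf(1.0, j1, n, ea_ev, temp1_k) / black_mttf(
        1.0, j2, n, ea_ev, temp2_k
    )


# Illustrative comparison: halving the current density and cooling the wire
# from 398 K to 358 K, with assumed n = 2 and Ea = 0.7 eV.
gain = mttf_ratio(0.5e6, 358.0, 1.0e6, 398.0, 2.0, 0.7)
print(f"MTTF improves by a factor of about {gain:.1f}")
```

Working with ratios avoids needing the fitted constant A, which is usually the least well known quantity.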

The following factors can reduce the failures caused by electromigration:

a) Crystal structure (grain diameter, crystal orientation, etc.)

b) Addition of other elements to metal film

c) Laminated wiring structure

Electromigration depends on the current density J = I/(w t), where I is the current, w the wire width and t the wire thickness. It is more likely to occur in wires carrying a DC current, where the electron wind blows in a constant direction, than in wires with bidirectional currents.
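The current density relation J = I/(w t) is easy to get wrong by a unit conversion, since wire dimensions are usually given in µm and J in A/cm2. A minimal sketch of the conversion follows. The function name and the example wire are hypothetical.

```python
def current_density_a_per_cm2(current_a, width_um, thickness_um):
    """Current density J = I / (w * t) in A/cm^2 for a rectangular wire.

    Width and thickness are given in micrometres; 1 um = 1e-4 cm.
    """
    area_cm2 = (width_um * 1e-4) * (thickness_um * 1e-4)
    return current_a / area_cm2


# Example wire (illustrative): 1 mA through a 0.5 um wide, 0.4 um thick line.
j = current_density_a_per_cm2(1e-3, 0.5, 0.4)
print(f"J = {j:.2e} A/cm^2")
```

Halving either the width or the thickness doubles J for the same current, which is why narrow wires on scaled nodes are the first to run into electromigration limits.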


(f) Self Heating:

Bidirectional wires are less prone to EM, but their current density is still limited by self-heating. High currents dissipate power in the wire. Since the surrounding oxide or low-k dielectric is a thermal insulator, the wire temperature can become significantly greater than that of the underlying substrate. Hot wires exhibit greater resistance and delay. EM is also highly sensitive to temperature, so self-heating may cause temperature-induced electromigration problems in bidirectional wires. Brief pulses of high peak current may even melt the interconnect. A significant percentage of the device self-heat energy flows vertically and laterally to the interconnect layer. The local temperature rise depends on the thermal dissipation path(s) away from the element where the heat originates. Self-heating depends on the RMS current density. A conservative rule to control reliability problems with self-heating is to keep Jrms < 15 mA/µm2 for bidirectional aluminum wires on a silicon substrate.

The maximum capacitance of the wire can be estimated based on the RMS current. EM from high DC current densities is primarily a problem in power and ground lines. Self-heating limits the RMS current density in bidirectional signal lines.
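A minimal sketch of checking a bidirectional signal wire against the Jrms < 15 mA/µm2 rule quoted above. The sampled waveform, the wire dimensions and the function names are illustrative assumptions. In practice the RMS current would come from circuit simulation.

```python
import math

JRMS_LIMIT_MA_PER_UM2 = 15.0  # conservative limit quoted in the text


def rms(samples):
    """Root-mean-square value of equally spaced current samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def jrms_ma_per_um2(current_samples_ma, width_um, thickness_um):
    """RMS current density of a rectangular wire in mA/um^2."""
    return rms(current_samples_ma) / (width_um * thickness_um)


def passes_self_heating_rule(current_samples_ma, width_um, thickness_um):
    """True if the wire stays below the Jrms limit."""
    j = jrms_ma_per_um2(current_samples_ma, width_um, thickness_um)
    return j < JRMS_LIMIT_MA_PER_UM2


# A bidirectional waveform averages to zero, yet its RMS value does not.
waveform_ma = [2.0, 0.0, -2.0, 0.0]
print(passes_self_heating_rule(waveform_ma, 0.2, 0.4))  # narrow wire
print(passes_self_heating_rule(waveform_ma, 0.4, 0.4))  # wider wire
```

The example highlights why the average current is the wrong metric here: the charging and discharging currents cancel on average but both heat the wire.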

(g) Stress Migration :



Stress migration (SM), also called stress-induced voiding (SIV), is a wear-out failure mechanism in chip metallization. It causes an open circuit in the metal interconnects, especially at the via, which is the weakest link. SM is caused by the interaction between the thermo-mechanical stress in the interconnect system and the diffusion of vacancies. Thermal stress in the interconnect arises from the thermal expansion mismatch between the metal and the surrounding materials. The BEOL interconnect structure consists of several different materials such as metal, dielectric, diffusion barrier, silicon substrate and capping layer. Fabrication of the structure involves several thermal cycles from room temperature to about 400°C, so a large amount of stress can be introduced by the thermal expansion mismatch among these materials. The metal expands during heating and contracts during cooling, but it cannot return to its original state because it is constrained by the other materials. As a result, there is a tensile stress in the metal layer. Metal atoms move to relieve the stress, and a void is created. Voids in the metallization tend to nucleate and grow around the vias, where they block the flow of electrical current and create an open circuit.

(h) CMOS Latchup :


Latch-up is a destructive short-circuit phenomenon in the CMOS structure. It is a low-impedance path between the power supply rails of a MOSFET circuit, formed through the parasitic PNPN structure underneath. Latch-up disrupts the circuit function, and the currents are frequently large enough to cause permanent damage. The parasitic PNPN structure is equivalent to a silicon controlled rectifier (SCR): it is created by a PNP and an NPN transistor stacked next to each other.

Immediately after latch-up is triggered, one of the transistors starts conducting and the other follows. Both stay in saturation for as long as the structure is forward-biased and some current flows through it.

(i) Electrostatic Discharge:



ESD is the release of stored static electricity. The most famous (large-scale) ESD event is lightning. An ESD event that takes place in a chip is not visible. ESD destroys about 20 % of electronic components before they are installed into a system. An ESD event may only partially damage a component, but that damage can lead to further failures shortly afterwards during circuit operation.

A person can acquire charge simply by walking across a room. When such a charged person or object approaches an IC, an ESD event occurs, characterized by a high current within a few ns. When this high-voltage pulse enters, the circuit inside the IC may be partially damaged or may break down. Electrostatic straps are used in industry wherever electronic circuits are packaged and assembled. The dark grey packaging in which semiconductor components for computers are sold is an external protection against ESD events.

A high current density and/or electric field can damage the conductors, semiconductors and insulators in an IC. ESD damage in an IC usually starts with oxide breakdown, which results in a percolation path. The high current density damages the semiconductor devices through thin-film fusing, filamentation, and junction spiking. The high electric field, on the other hand, can cause failure through dielectric breakdown or charge injection.




Courtesy : Image by www.pngegg.com, www.pexels.com, www.pixabay.com