

# 3D Super-Via for Memory Applications

Hong Sangki (shong@tezzaron.com)

#### **Tezzaron Semiconductor Corporation**

Chicago - Singapore



### **Market Drivers for 3D**

| Market driver                          | Applications and factors                 |
|----------------------------------------|------------------------------------------|
| Density / functionality                | Mobile & wireless                        |
| Performance<br>✓ Faster<br>✓ Low power | Workstations<br>Supercomputers           |
| Cost / yield                           | Scaling cost<br>Yield<br>Packaging yield |





## **Market Drivers for 3D Memory**

| Driver                     | Functionality                              | Technical<br>Parameter # 1 | Technical<br>Parameter # 2 | Value Indicator      |
|----------------------------|--------------------------------------------|----------------------------|----------------------------|----------------------|
| Stacked NAND<br>Flash      | Cell Phones<br>Hard Drives<br>Flash Drives | Memory density             |                            | High packing density |
| Microprocessor<br>+ Memory | Workstations                               | Latency<br>bandwidth       | Power                      | Execution time       |
| Memory                     | Multiple                                   | Density                    | Latency                    | Varies               |





# Advanced packaging trends: 3D IC, SiP & SoC



Micro-Systems Packaging Initiative (MSPI) Packaging Workshop 2007


### **How Real is 3D Memory?**






#### The 2007 ITRS Roadmap will include 3D!

- Tables for TSV size, pitch
- Bond point pitch
- Wafer thickness

Low density: 10<sup>2</sup> to 10<sup>4</sup> (> 50 µm pitch), wafer-scale packaging





High density: 10<sup>5</sup> to 10<sup>6</sup> (5 to 25 µm pitch), wafer-scale packaging






#### Courtesy, Tezzaron Semi.



#### **2nd Wafer Stack & Thin**





#### **Wafer Backside after thinning**






#### **Backside Cu Metallization**





#### **3rd Wafer Stack**





#### Flip & I/O Pad Out





## **Stacking Process Sequential Picture**



→ After CMP → Si Recessed





#### **CPU + SRAM cross-section**



Wafer-to-wafer misalignment: ~0.4 µm




## **CPU/Memory Stack**

- R8051 CPU
  - 80 MHz operation; 140 MHz in lab test (high VDD)
  - 220 MHz memory interface
- IEEE 754 floating-point coprocessor
- 32-bit integer coprocessor
- 2 UARTs, interrupt controller, 3 timers, ...
- Crypto functions
- 128 KB per layer of main memory
- Completely synthesized, placed and routed in 3D with standard Cadence tools. Runs slightly better than predicted by models and tools.





Courtesy, Tezzaron Semi.



### **Speed Gain Demo: Stacking Memory on CPU**




## **Results & Demonstration of 3D CPU**

#### Stacked 8051 CPU vs. Dallas Semiconductor 80C420

- Video (split half screen): 3X faster
- Complex math calculation: 4.5X faster
- Memory operation: 7X faster
- Current consumption: 3X lower
- Power consumption: 10X lower



#### Courtesy, Tezzaron Semi.



### **Results of early Qual data**

- 100,000 device thermal cycles (-65 °C to 150 °C, 15-minute soak)
  - No failures
  - Two build lots
- 168-hour high-temperature test
  - No failures
  - Extended to 336 and then 504 hours with no failures
- Hot spot delamination testing
  - More than 10 W/mm², no failures
- Life test under bias
  - More than 10,000 hours, no failures

#### **Concept of Tezzaron's 3D DRAM**








But this requires millions of vertical interconnects!



### How many interconnects are required?





### **3D Interconnect**






## What can Tezzaron 3D DRAM Achieve?

- Faster Access Time
- Lower Power
- Denser
- Reliable
- Compatible
- Lower Costs









- Global Interconnect "problem"
- Span of Control



... in the older 1.0 µm Al/SiO<sub>2</sub> technology generation the transistor delay was ~20 ps and the RC delay of a 1 mm line was ~1.0 ps, while in a projected 35 nm Cu/low-κ technology generation the transistor delay will be ~1.0 ps, and the RC delay of a 1 mm line will be ~250 ps.<sup>[i]</sup>

In addition, in the 0.13 µm technology node approximately 51% of microprocessor power was consumed by interconnect, with a projection that, without changes in design philosophy, up to 80% of microprocessor power will be consumed by interconnect within the next 5 years.<sup>[ii]</sup>

[i] J. Davis and J. Meindl, Interconnect Technology and Design for Gigascale Integration, Kluwer Academic Publishers, 2003.

[ii] N. Magen, A. Kolodny, U. Weiser, N. Shamir, "Interconnect-Power Dissipation in a Microprocessor," ACM System-Level Interconnect Prediction Workshop, Feb 2004
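
To make the scale of the interconnect shift concrete, the short sketch below compares wire delay to transistor delay for the two technology generations quoted above. It uses only the four delay figures from the quotation; the dictionary keys and function name are illustrative choices, not part of the source.

```python
# Delay figures quoted from Davis and Meindl above, in picoseconds.
generations = {
    "1.0 um Al/SiO2": {"transistor_ps": 20.0, "wire_1mm_ps": 1.0},
    "35 nm Cu/low-k": {"transistor_ps": 1.0, "wire_1mm_ps": 250.0},
}


def wire_to_transistor_ratio(gen):
    """RC delay of a 1 mm line divided by the transistor delay."""
    return gen["wire_1mm_ps"] / gen["transistor_ps"]


ratios = {name: wire_to_transistor_ratio(g) for name, g in generations.items()}

# How far the balance moves from the older generation to the projected one.
shift = ratios["35 nm Cu/low-k"] / ratios["1.0 um Al/SiO2"]

for name, ratio in ratios.items():
    print(f"{name}: wire/transistor delay ratio = {ratio:g}")
print(f"relative shift between generations: {shift:g}x")
```

In the older generation the 1 mm wire costs a small fraction of a transistor delay; in the projected generation it costs hundreds of transistor delays, which is the "global interconnect problem" that shorter vertical connections are meant to address.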







## **DRAM wants 2 different processes!**

| Circuit block                                   | Requirement                                                               | Process features                                       |
|-------------------------------------------------|---------------------------------------------------------------------------|--------------------------------------------------------|
| Bit cells                                       | Low leakage<br>- slow refresh<br>- low power<br>- low GIDL                | High-Vt devices<br>Vneg well<br>Thick oxide            |
| Sense amps<br>Word line drivers<br>Device I/O   | High speed<br>- better sensitivity<br>- better bandwidth<br>- lower voltage | Low-Vt devices<br>Copper interconnect<br>Thin oxides |





#### **2. Lower Power!**

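$$P_{avg} = V_{DD} \times I_{avg} = C_{tot} \times V_{DD}^2 \times f_{clk}$$

$C_{tot}$ is mostly due to wiring, so it scales with the average wire length $l_{avg}$.

Therefore:

$$P_{avg} \propto l_{avg}$$

Or, since stacking shortens the average wire roughly in proportion to the number of layers: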

$$P_{avg \ stacked} \approx \frac{P_{avg \ single \ layer}}{\# \ of \ layers}$$
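
As a rough numerical illustration of the relations above, the sketch below evaluates the switching-power formula and then applies the stacked-layer approximation. The capacitance, clock frequency, and layer count are hypothetical values chosen only to show the arithmetic; the function names are likewise illustrative.

```python
def dynamic_power(c_tot_farads, vdd_volts, f_clk_hz):
    """Average switching power, P_avg = C_tot * VDD^2 * f_clk, in watts."""
    return c_tot_farads * vdd_volts ** 2 * f_clk_hz


def stacked_power(p_single_layer_watts, num_layers):
    """Slide's approximation: P_avg_stacked ~ P_avg_single_layer / (# of layers)."""
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    return p_single_layer_watts / num_layers


# Hypothetical inputs (not from the source): 1 nF switched capacitance,
# 1.2 V supply, 1 GHz clock, 4 stacked layers.
p_single = dynamic_power(1e-9, 1.2, 1e9)
p_stacked = stacked_power(p_single, 4)

print(f"single layer: {p_single:.2f} W, 4-layer stack: {p_stacked:.2f} W")
```

The division by layer count is only as good as the assumption that wiring dominates the capacitance and that wire length shrinks with each added layer.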

| Operation                                                                                           | Energy                                            |
|-----------------------------------------------------------------------------------------------------|---------------------------------------------------|
| 32-bit ALU operation                                                                                | 5 pJ                                              |
| 32-bit register read                                                                                | 10 pJ                                             |
| Read 32 bits from 8K RAM                                                                            | 50 pJ                                             |
| Move 32 bits across 10mm chip                                                                       | 100 pJ                                            |
| Move 32 bits off chip                                                                               | 1300 to 1900 pJ                                   |

Calculations using a 130 nm process operating at a core voltage of 1.2 V. (Source: Bill Dally, Stanford)
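
The energy figures above are easier to compare when normalized to the cheapest operation. The sketch below expresses each entry as a multiple of the 32-bit ALU operation; the values come straight from the table, and only the variable names are invented here.

```python
# Energy per operation in picojoules, as (low, high) ranges from the table.
energy_pj = {
    "32-bit ALU operation": (5, 5),
    "32-bit register read": (10, 10),
    "Read 32 bits from 8K RAM": (50, 50),
    "Move 32 bits across 10mm chip": (100, 100),
    "Move 32 bits off chip": (1300, 1900),
}

alu_pj = energy_pj["32-bit ALU operation"][0]

# Each operation's cost relative to one ALU operation.
relative_cost = {
    op: (low / alu_pj, high / alu_pj) for op, (low, high) in energy_pj.items()
}

for op, (low, high) in relative_cost.items():
    span = f"{low:g}x" if low == high else f"{low:g}x to {high:g}x"
    print(f"{op}: {span} an ALU operation")
```

Moving data off chip costs hundreds of ALU operations' worth of energy, which is why keeping memory traffic inside the stack matters for power.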





### 3. Lower Costs & Higher Yield!

- Less processing per layer
- Better optimization per wafer
- Higher bit density in memories
- Lower test cost using Bi-STAR<sup>TM</sup>
- Higher yield using Bi-STAR<sup>TM</sup>




#### **Standard DRAM Utilization**



#### 66% Savings in logic per memory cell






### **Increasing Die Overhead**



#### Chip size overhead of DDR3 relative to DDR2

| Device density       | 90 nm |      | 80 nm |      |
|----------------------|-------|------|-------|------|
|                      | DDR2  | DDR3 | DDR2  | DDR3 |
| Chip size            | 1     | 1.22 | 1     | 1.23 |
| Gross dice per wafer | 1     | 0.81 | 1     | 0.62 |





### **The Bandwidth Crisis:**

You know you have a problem when there is a log scale....





#### **The Detail**

|                                                    | 1/1/2007 | 7/1/2008 | 1/1/2010 | 7/1/2011 | 1/1/2013 | 7/1/2014 | 1/1/2016 |
|----------------------------------------------------|----------|----------|----------|----------|----------|----------|----------|
| Number of cores                                    | 4        | 8        | 16       | 32       | 64       | 128      | 256      |
| Clock (GHz)                                        | 2        | 3        | 3.3      | 3.63     | 3.993    | 4.3923   | 4.83153  |
| FLOP/core                                          | 2        | 4        | 4        | 4        | 4        | 4        | 4        |
| 0.5 byte/FLOP (GB/s)                               | 8        | 48       | 105.6    | 232.32   | 511.104  | 1124.429 | 2473.743 |
| 8 byte/FLOP (GB/s)                                 | 128      | 768      | 1689.6   | 3717.12  | 8177.664 | 17990.86 | 39579.89 |
| OPS/core                                           | 4        | 6        | 6        | 6        | 6        | 6        | 6        |
| 0.1 byte/OP (GB/s)                                 | 3.2      | 14.4     | 31.68    | 69.696   | 153.3312 | 337.3286 | 742.123  |
| 0.25 byte/OP (GB/s)                                | 8        | 36       | 79.2     | 174.24   | 383.328  | 843.3216 | 1855.308 |
| 4 byte/OP (GB/s)                                   | 128      | 576      | 1267.2   | 2787.84  | 6133.248 | 13493.15 | 29684.92 |
| Peak Memory Xfer rate per Channel (GB/s)           | 6.4      | 10.7     | 14.4     | 16.8     | 19.2     | 28.8     | 38.4     |
| Sustained Memory Xfer rate per Channel (GB/s)      | 3.2      | 5.35     | 7.2      | 8.4      | 9.6      | 14.4     | 19.2     |
| Best Case Number of channels to support Float OPS  | 3        | 9        | 15       | 28       | 54       | 79       | 129      |
| Power Required for I/O 40mW/pin (in Watts)         | 9.6      | 28.8     | 48       | 89.6     | 172.8    | 252.8    | 412.8    |
| Worst Case Number of channels to support Float OPS | 40       | 144      | 235      | 443      | 852      | 1250     | 2062     |
| Power Required for I/O 40mW/pin (in Watts)         | 128      | 460.8    | 752      | 1417.6   | 2726.4   | 4000     | 6598.4   |
| Best Case Number of channels to support OPS        | 1        | 3        | 5        | 9        | 16       | 24       | 39       |
| Power Required for I/O 40mW/pin (in Watts)         | 3.2      | 9.6      | 16       | 28.8     | 51.2     | 76.8     | 124.8    |
| Worst Case Number of channels to support OPS       | 40       | 108      | 176      | 332      | 639      | 938      | 1547     |
| Power Required for I/O 40mW/pin (in Watts)         | 128      | 345.6    | 563.2    | 1062.4   | 2044.8   | 3001.6   | 4950.4   |
| Best Case Number of channels to support mixed OPS  | 2        | 5        | 8        | 14       | 27       | 40       | 65       |
| Power Required for I/O 40mW/pin (in Watts)         | 6.4      | 16       | 25.6     | 44.8     | 86.4     | 128      | 208      |
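
The floating-point rows of the table appear to follow a simple chain: bandwidth demand is cores × clock × FLOP/core × bytes/FLOP, the channel count is that demand divided by the sustained per-channel rate and rounded up, and I/O power is channels × pins × 40 mW. The sketch below reproduces the first three columns under that reading. The 80 pins per channel is an assumption inferred from the table (3 channels at 9.6 W), not a figure stated in the source, and the function names are illustrative.

```python
import math

# Assumption: 80 pins per channel, inferred from the table itself
# (3 channels -> 9.6 W at 40 mW/pin). Not stated in the source.
PINS_PER_CHANNEL = 80
WATTS_PER_PIN = 0.040  # 40 mW/pin, from the table's row labels


def demand_gb_per_s(cores, clock_ghz, ops_per_core, bytes_per_op):
    """Required memory bandwidth in GB/s."""
    return cores * clock_ghz * ops_per_core * bytes_per_op


def channels_needed(demand, sustained_per_channel):
    """Whole channels needed to cover the demand at the sustained rate."""
    # Rounding first guards against floating-point noise around an integer.
    return math.ceil(round(demand / sustained_per_channel, 6))


def io_power_watts(channels):
    """I/O power for the given number of channels."""
    return channels * PINS_PER_CHANNEL * WATTS_PER_PIN


# (cores, clock in GHz, FLOP/core, sustained GB/s per channel), from the table.
columns = {
    "1/1/2007": (4, 2.0, 2, 3.2),
    "7/1/2008": (8, 3.0, 4, 5.35),
    "1/1/2010": (16, 3.3, 4, 7.2),
}

results = {}
for date, (cores, clock, flop, sustained) in columns.items():
    best = channels_needed(demand_gb_per_s(cores, clock, flop, 0.5), sustained)
    worst = channels_needed(demand_gb_per_s(cores, clock, flop, 8), sustained)
    results[date] = (best, worst)
    print(f"{date}: best {best} ch ({io_power_watts(best):.1f} W), "
          f"worst {worst} ch ({io_power_watts(worst):.1f} W)")
```

The same functions apply to the integer "OPS" rows by substituting OPS/core and the bytes/OP figures; the "mixed OPS" rows are not reproduced here because the table does not state how they are combined.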



#### **Market Size**

2006 Total Worldwide Semiconductor Market = \$247.7 Billion







#### **Addressable 3D Market**





### **3D Heterogeneous Integration**

#### Die Photograph of the Itanium 2 MPU (~2/3 of Area is Cache Memory)



Source: Intel

BEFORE: Intel photo used as proxy

#### **Only Memory Directly Compatible with Logic** (virtually no choice!)

- 2D IC: "all or nothing"
- Single die: ~430 mm²
- Wafer cost: ~\$6,000
- Low yield: ~15%, ~10 parts per wafer
- Memory cost: ~\$44/MB

AFTER: 3D IC (choice!)

- 14x increase in memory density: 128 MB, not 9 MB
- **4X logic cost reduction**
- 29x → 100x memory cost reduction
- Memory cost: ~\$1.50/MB → \$0.44/MB
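
The before and after figures can be cross-checked with a few lines of arithmetic. The sketch below assumes the cache's share of die cost equals its share of die area (about 2/3, per the die photograph caption); that allocation is an assumption made here, though it does reproduce the slide's ~\$44/MB. All constant names are illustrative.

```python
WAFER_COST_USD = 6000.0        # "Wafer Cost ~ $6,000"
PARTS_PER_WAFER = 10           # "~ 10 parts per wafer"
CACHE_AREA_FRACTION = 2 / 3    # "~2/3 of Area is Cache Memory"
CACHE_MB_2D = 9                # "128MB not 9MB"
CACHE_MB_3D = 128

# Cost of one good die on the 2D process.
die_cost = WAFER_COST_USD / PARTS_PER_WAFER

# Assumption: the cache's share of die cost equals its share of die area.
cost_per_mb_2d = die_cost * CACHE_AREA_FRACTION / CACHE_MB_2D

# Density and cost ratios quoted on the slide.
density_gain = CACHE_MB_3D / CACHE_MB_2D
reduction_low = 44 / 1.50
reduction_high = 44 / 0.44

print(f"2D die cost: ${die_cost:.0f}")
print(f"2D memory cost: ${cost_per_mb_2d:.2f}/MB")
print(f"memory density gain: {density_gain:.1f}x")
print(f"memory cost reduction: {reduction_low:.1f}x to {reduction_high:.0f}x")
```

Under these assumptions the slide's rounded figures (14x density, 29x to 100x cost reduction) are consistent with one another.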


