### Computer Architecture

#### Chapter 5. Storage and Other I/O Topics

Based on the original material of the book: D.A. Patterson and J.L. Hennessy, "Computer Organization and Design: The Hardware/Software Interface", 4<sup>th</sup> edition.

Escuela Politécnica Superior, Universidad Autónoma de Madrid

Instructors: G130 and G131: Iván González Martínez; G136: Francisco Javier Gómez Arribas

### Introduction

I/O devices can be characterized by

- Behaviour: input, output, storage
- Partner: human or machine
- Data rate: bytes/sec, transfers/sec
- I/O bus connections



### **I/O System Characteristics**

- Dependability is important
  - Particularly for storage devices
- Performance measures
  - Latency (response time)
  - Throughput (bandwidth)
  - Desktops & embedded systems
    - Mainly interested in response time & diversity of devices
  - Servers
    - Mainly interested in throughput & expandability of devices

### **Dependability Measures**



- Reliability: mean time to failure (MTTF)
- Service interruption: mean time to repair (MTTR)
- Mean time between failures
  - MTBF = MTTF + MTTR
  - Availability = MTTF / (MTTF + MTTR)
- Improving Availability
  - Increase MTTF: fault avoidance, fault tolerance, fault forecasting
  - Reduce MTTR: improved tools and processes for diagnosis and repair
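The availability formula above can be checked numerically. The sketch below is only illustrative: the disk figures (1,000,000 hours MTTF, 24 hours MTTR) are hypothetical values, not taken from the slides.

```python
def mtbf(mttf_hours: float, mttr_hours: float) -> float:
    """Mean time between failures: MTBF = MTTF + MTTR."""
    return mttf_hours + mttr_hours


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the service is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical device: fails every 1,000,000 hours on average, 24 hours to repair.
base = availability(1_000_000, 24)
# Two ways to improve availability: halve MTTR or double MTTF.
faster_repair = availability(1_000_000, 12)
longer_life = availability(2_000_000, 24)

print(f"base={base:.7f} faster_repair={faster_repair:.7f} longer_life={longer_life:.7f}")
```

Under these assumed numbers, halving MTTR and doubling MTTF give the same availability, since only the ratio MTTF/MTTR matters.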



### Nonvolatile, rotating magnetic storage





### **Disk Sectors and Access**

- Each sector records
  - Sector ID
  - Data (512 bytes, 4096 bytes proposed)
  - Error correcting code (ECC)
    - Used to hide defects and recording errors
  - Synchronization fields and gaps
- Access to a sector involves
  - Queuing delay if other accesses are pending
  - Seek: move the heads
  - Rotational latency
  - Data transfer
  - Controller overhead

### **Disk Access Example**

- Given
  - 512B sector, 15,000rpm, 4ms average seek time, 100MB/s transfer rate, 0.2ms controller overhead, idle disk
- Average read time
  - 4ms seek time
  - + $\frac{1}{2} / (15{,}000/60) = 2\text{ms}$ rotational latency
  - + 512B / 100MB/s = 0.005ms transfer time
  - + 0.2ms controller delay
  - = 6.2ms
- If actual average seek time is 1ms
  - Average read time = 3.2ms
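The arithmetic of this example can be reproduced with a short function. This is a sketch under the assumption that 1MB = 10<sup>6</sup> bytes and that the average rotational latency is half a rotation; the function and parameter names are made up for illustration.

```python
def avg_read_time_ms(sector_bytes, rpm, seek_ms, transfer_mb_per_s, controller_ms):
    """Average time to read one sector from an idle disk, in milliseconds."""
    rotation_ms = 60_000 / rpm                 # time for one full rotation
    rotational_latency_ms = rotation_ms / 2    # on average, wait half a rotation
    transfer_ms = sector_bytes / (transfer_mb_per_s * 1e6) * 1e3  # assumes 1 MB = 10^6 bytes
    return seek_ms + rotational_latency_ms + transfer_ms + controller_ms


# Values from the example: 512B sector, 15,000rpm, 100MB/s, 0.2ms controller overhead.
quoted = avg_read_time_ms(512, 15_000, seek_ms=4.0, transfer_mb_per_s=100, controller_ms=0.2)
measured = avg_read_time_ms(512, 15_000, seek_ms=1.0, transfer_mb_per_s=100, controller_ms=0.2)
print(f"{quoted:.1f} ms with 4ms seek, {measured:.1f} ms with 1ms seek")
```

Seek time and rotational latency dominate; the transfer of a single sector contributes only about 0.005ms.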

### RAID

- Redundant Array of Inexpensive (Independent) Disks
  - Use multiple smaller disks (c.f. one large disk)
  - Parallelism improves performance
  - Plus extra disk(s) for redundant data storage
- Provides fault tolerant storage system
  - Especially if failed disks can be "hot swapped"
- RAID 0
  - No redundancy ("AID"?)
    - Just stripe data over multiple disks
  - But it does improve performance

### **RAID 1 & 2**

- RAID 1: Mirroring
  - N + N disks, replicate data
    - Write data to both data disk and mirror disk
    - On disk failure, read from mirror
- RAID 2: Error correcting code (ECC)
  - N + E disks (e.g., 10 + 4)
  - Split data at bit level across N disks
  - Generate E-bit ECC
  - Too complex, not used in practice

### **RAID 3: Bit-Interleaved Parity**

- N + 1 disks
  - Data striped across N disks at byte level
  - Redundant disk stores parity
  - Read access
    - Read all disks
  - Write access
    - Generate new parity and update all disks
  - On failure
    - Use parity to reconstruct missing data
- Not widely used

### **RAID 4: Block-Interleaved Parity**

- N + 1 disks
  - Data striped across N disks at block level
  - Redundant disk stores parity for a group of blocks
  - Read access
    - Read only the disk holding the required block
  - Write access
    - Just read disk containing modified block, and parity disk
    - Calculate new parity, update data disk and parity disk
  - On failure
    - Use parity to reconstruct missing data
- Not widely used
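The parity operations described for RAID 3 and RAID 4 can be sketched with XOR. The example assumes simple XOR parity over equal-length blocks; the block contents and helper names are hypothetical.

```python
from functools import reduce


def parity(blocks):
    """XOR a list of equal-length byte blocks, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def small_write_parity(old_parity, old_data, new_data):
    """RAID 4 small write: only the modified block and the parity block are read."""
    return parity([old_parity, old_data, new_data])


# Three data disks holding one 2-byte block each (made-up contents).
data = [b"\x0f\xf0", b"\x33\x55", b"\xaa\x01"]
p = parity(data)

# On failure of disk 1, XOR the surviving blocks with the parity block.
rebuilt = parity([data[0], data[2], p])

# Overwrite the block on disk 1 without touching disks 0 and 2.
new_block = b"\xff\x00"
p_new = small_write_parity(p, data[1], new_block)

print(rebuilt == data[1], p_new == parity([data[0], new_block, data[2]]))
```

The small-write shortcut works because XOR-ing the old data out of the parity and the new data in gives the same result as recomputing parity over the whole stripe.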

### RAID 3 vs RAID 4



### **RAID 5: Distributed Parity**

- N + 1 disks
  - Like RAID 4, but parity blocks distributed across disks
    - Avoids parity disk being a bottleneck
- Widely used

RAID 4 (dedicated parity disk):

| Disk 0 | Disk 1 | Disk 2 | Disk 3 | Disk 4 |
|--------|--------|--------|--------|--------|
| 0      | 1      | 2      | 3      | P0     |
| 4      | 5      | 6      | 7      | P1     |
| 8      | 9      | 10     | 11     | P2     |
| 12     | 13     | 14     | 15     | P3     |
| 16     | 17     | 18     | 19     | P4     |
| 20     | 21     | 22     | 23     | P5     |

RAID 5 (parity rotated across disks):

| Disk 0 | Disk 1 | Disk 2 | Disk 3 | Disk 4 |
|--------|--------|--------|--------|--------|
| 0      | 1      | 2      | 3      | P0     |
| 4      | 5      | 6      | P1     | 7      |
| 8      | 9      | P2     | 10     | 11     |
| 12     | P3     | 13     | 14     | 15     |
| P4     | 16     | 17     | 18     | 19     |
| 20     | 21     | 22     | 23     | P5     |
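The difference between the two layouts is only where the parity block of each stripe is placed. A minimal sketch that reproduces the block numbering shown for five disks follows; the rotation rule (parity moves one disk to the left per stripe) is inferred from the layout, and real controllers may use other rotation schemes.

```python
def raid4_stripe(stripe, disks=5):
    """Blocks of one stripe with parity always on the last disk."""
    n = disks - 1
    return [str(stripe * n + i) for i in range(n)] + [f"P{stripe}"]


def raid5_stripe(stripe, disks=5):
    """Blocks of one stripe with parity rotated one disk to the left per stripe."""
    n = disks - 1
    blocks = [str(stripe * n + i) for i in range(n)]
    parity_disk = (disks - 1 - stripe) % disks
    blocks.insert(parity_disk, f"P{stripe}")
    return blocks


for stripe in range(6):
    print(raid4_stripe(stripe), raid5_stripe(stripe))
```

Because consecutive stripes keep their parity on different disks, small writes to different stripes do not all queue on one parity disk.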

### **RAID 6: P + Q Redundancy**

- N + 2 disks
  - Like RAID 5, but two lots of parity
  - Greater fault tolerance through more redundancy
- Multiple RAID
  - More advanced systems give similar fault tolerance with better performance

### **RAID Summary**

- RAID can improve performance and availability
  - High availability requires hot swapping
- Assumes independent disk failures
  - Too bad if the building burns down!
- See "Hard Disk Performance, Quality and Reliability"
  - http://www.pcguide.com/ref/hdd/perf/index.htm

### **Flash Storage**

Nonvolatile semiconductor storage

- 100× to 1000× faster than disk
- Smaller, lower power, more robust
- But more \$/GB (between disk and DRAM)





### **Flash Types**

- NOR flash: bit cell like a NOR gate
  - Random read/write access
  - Used for instruction memory in embedded systems
- NAND flash: bit cell like a NAND gate
  - Denser (bits/area), but block-at-a-time access
  - Cheaper per GB
  - Used for USB keys, media storage, …
- Flash bits wear out after thousands of accesses
  - Not suitable for direct RAM or disk replacement
  - Wear leveling: remap data to less used blocks
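Wear leveling can be illustrated with a toy remapping layer. This is a deliberately simplified sketch, not how any particular flash controller works: the class name, the "pick the least-erased free block" policy, and the one-erase-per-overwrite accounting are all assumptions made for illustration.

```python
class ToyWearLeveler:
    """Maps logical blocks to physical blocks, steering writes to less-used blocks."""

    def __init__(self, physical_blocks):
        self.erase_counts = [0] * physical_blocks
        self.mapping = {}                        # logical block -> physical block
        self.free = set(range(physical_blocks))

    def write(self, logical_block):
        # Choose the free physical block with the fewest erases (ties: lowest index).
        target = min(self.free, key=lambda b: (self.erase_counts[b], b))
        self.free.remove(target)
        old = self.mapping.get(logical_block)
        if old is not None:
            # The stale copy is erased and returned to the free pool.
            self.erase_counts[old] += 1
            self.free.add(old)
        self.mapping[logical_block] = target
        return target


# Rewrite the same logical block 100 times on a 4-block device.
flash = ToyWearLeveler(physical_blocks=4)
for _ in range(100):
    flash.write(logical_block=0)
print(flash.erase_counts)
```

Without remapping, all 99 erases would hit one physical block; with it, the erases are spread almost evenly over the four blocks.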

### **Interconnecting Components**

- Need interconnections between CPU, memory, I/O controllers
- Bus: shared communication channel
  - Parallel set of wires for data and synchronization of data transfer
  - Can become a bottleneck
- Performance limited by physical factors
  - Wire length, number of connections
- More recent alternative: high-speed serial connections with switches
  - Like networks

### **Bus Types**

- Processor-Memory buses
  - Short, high speed
  - Design is matched to memory organization
- I/O buses
  - Longer, allowing multiple connections
  - Specified by standards for interoperability
  - Connect to processor-memory bus through a bridge

### **Bus Signals and Synchronization**

- Data lines
  - Carry address and data
  - Multiplexed or separate
- Control lines
  - Indicate data type, synchronize transactions
- Synchronous
  - Uses a bus clock
- Asynchronous
  - Uses request/acknowledge control lines for handshaking

### **I/O Bus Examples**

|                     | Firewire             | USB 2.0                           | PCI Express                                 | Serial ATA | Serial<br>Attached<br>SCSI |
|---------------------|----------------------|-----------------------------------|---------------------------------------------|------------|----------------------------|
| Intended use        | External             | External                          | Internal                                    | Internal   | External                   |
| Devices per channel | 63                   | 127                               | 1                                           | 1          | 4                          |
| Data width          | 4                    | 2                                 | 2/lane                                      | 4          | 4                          |
| Peak<br>bandwidth   | 50MB/s or<br>100MB/s | 0.2MB/s,<br>1.5MB/s, or<br>60MB/s | 250MB/s/lane<br>1×, 2×, 4×,<br>8×, 16×, 32× | 300MB/s    | 300MB/s                    |
| Hot<br>pluggable    | Yes                  | Yes                               | Depends                                     | Yes        | Yes                        |
| Max length          | 4.5m                 | 5m                                | 0.5m                                        | 1m         | 8m                         |
| Standard            | IEEE 1394            | USB<br>Implementers<br>Forum      | PCI-SIG                                     | SATA-IO    | INCITS TC<br>T10           |

### **Processor/core Interconnect Buses**

### Quick Path Interconnect (QPI)

| Intel QuickPath Interconnect version | Year | Max. frequency | Max. link width | Max. bandwidth (bidirectional) | Max. bandwidth at 16-bit (unidirectional)* |
|--------------------------------------|------|----------------|-----------------|--------------------------------|--------------------------------------------|
| 1.0                                  | 2008 | 3.2 GHz        | 20-bit          | 25.6 GB/s                      | 12.8 GB/s                                  |

### HyperTransport (HT)

| HyperTransport version | Year | Max. HT frequency | Max. link width | Max. aggregate bandwidth (bi-directional) | Max. bandwidth at 16-bit unidirectional (GB/s) | Max. bandwidth at 32-bit unidirectional* (GB/s) |
|------------------------|------|-------------------|-----------------|-------------------------------------------|------------------------------------------------|-------------------------------------------------|
| 1.0                    | 2001 | 800 MHz           | 32-bit          | 12.8 GB/s                                 | 3.2                                            | 6.4                                             |
| 1.1                    | 2002 | 800 MHz           | 32-bit          | 12.8 GB/s                                 | 3.2                                            | 6.4                                             |
| 2.0                    | 2004 | 1.4 GHz           | 32-bit          | 22.4 GB/s                                 | 5.6                                            | 11.2                                            |
| 3.0                    | 2006 | 2.6 GHz           | 32-bit          | 41.6 GB/s                                 | 10.4                                           | 20.8                                            |
| 3.1                    | 2008 | 3.2 GHz           | 32-bit          | 51.2 GB/s                                 | 12.8                                           | 25.6                                            |
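The bandwidth columns of the HyperTransport table follow from the clock frequency and link width. The sketch below assumes double-data-rate signaling (two transfers per clock) and that the aggregate figure is twice the 32-bit unidirectional figure; those assumptions reproduce the table values, and the function name is illustrative.

```python
def ht_unidirectional_gb_per_s(clock_ghz, link_width_bits):
    """Peak one-way bandwidth in GB/s, assuming two transfers per clock cycle."""
    giga_transfers_per_s = clock_ghz * 2
    return giga_transfers_per_s * link_width_bits / 8


# (version, max clock in GHz) pairs taken from the table.
for version, clock_ghz in [("1.0", 0.8), ("2.0", 1.4), ("3.0", 2.6), ("3.1", 3.2)]:
    one_way_16 = ht_unidirectional_gb_per_s(clock_ghz, 16)
    one_way_32 = ht_unidirectional_gb_per_s(clock_ghz, 32)
    aggregate = 2 * one_way_32   # both directions of a 32-bit link
    print(f"HT {version}: {one_way_16:.1f} / {one_way_32:.1f} / {aggregate:.1f} GB/s")
```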

### System Architecture Single Processor - Intel

Depends on the microprocessor architecture and the chipset manufacturer

Intel Core i7



Intel<sup>®</sup> X58 Express Chipset Block Diagram

#### System Architecture Dual Processor - AMD



### I/O Management

- I/O is mediated by the OS
  - Multiple programs share I/O resources
    - Need protection and scheduling
  - I/O causes asynchronous interrupts
    - Same mechanism as exceptions
  - I/O programming is fiddly
    - OS provides abstractions to programs

### I/O Commands

- I/O devices are managed by I/O controller hardware
  - Transfers data to/from device
  - Synchronizes operations with software
- Command registers
  - Cause device to do something (read, write)
- Status registers
  - Indicate what the device is doing and occurrence of errors
- Data registers
  - Write: transfer data to a device
  - Read: transfer data from a device

### **I/O Register Mapping**

- Memory mapped I/O
  - Registers are addressed in same space as memory
  - Address decoder distinguishes between them
  - OS uses address translation mechanism to make them only accessible to kernel
- I/O instructions
  - Separate instructions to access I/O registers
  - Can only be executed in kernel mode
  - Example: x86 (Intel)

### Polling

- Periodically check I/O status register
  - If device ready, do operation
  - If error, take action
- Common in small or low-performance realtime embedded systems
  - Predictable timing
  - Low hardware cost
- In other systems, wastes CPU time
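A polling loop can be sketched against a simulated device. Everything here is hypothetical: the status bit values, the `SimulatedDevice` class, and the bound on the number of polls stand in for real controller registers, which are device specific.

```python
class SimulatedDevice:
    """Stand-in for an I/O controller with a status and a data register."""
    READY = 0x1   # assumed status bit: data is available
    ERROR = 0x2   # assumed status bit: the device failed

    def __init__(self, ready_after_polls, data):
        self._polls_left = ready_after_polls
        self._data = data

    def read_status(self):
        # Report "not ready" for the first few status reads.
        if self._polls_left > 0:
            self._polls_left -= 1
            return 0
        return self.READY

    def read_data(self):
        return self._data


def poll_and_read(device, max_polls=1000):
    """Check the status register repeatedly until the device is ready or fails."""
    for polls in range(1, max_polls + 1):
        status = device.read_status()
        if status & SimulatedDevice.ERROR:
            raise IOError("device reported an error")
        if status & SimulatedDevice.READY:
            return device.read_data(), polls
    raise TimeoutError("device never became ready")


value, polls = poll_and_read(SimulatedDevice(ready_after_polls=3, data=0x42))
print(hex(value), polls)
```

Every status read that returns "not ready" is CPU time spent doing no useful work, which is why polling is wasteful outside small systems with predictable timing.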

### Interrupts

- When a device is ready or error occurs
  - Controller interrupts CPU
- Interrupt is like an exception
  - But not synchronized to instruction execution
  - Can invoke handler between instructions
  - Cause information often identifies the interrupting device
- Priority interrupts
  - Devices needing more urgent attention get higher priority
  - Can interrupt handler for a lower priority interrupt

### **I/O Data Transfer**

- Polling and interrupt-driven I/O
  - CPU transfers data between memory and I/O data registers
  - Time consuming for high-speed devices
- Direct memory access (DMA)
  - OS provides starting address in memory
  - I/O controller transfers to/from memory autonomously
  - Controller interrupts on completion or error

### **DMA/Cache Interaction**

- If DMA writes to a memory block that is cached
  - Cached copy becomes stale
- If write-back cache has dirty block, and DMA reads memory block
  - Reads stale data
- Need to ensure cache coherence
  - Flush blocks from cache if they will be used for DMA
  - Or use non-cacheable memory locations for I/O

### **DMA/VM Interaction**

- OS uses virtual addresses for memory
  - DMA blocks may not be contiguous in physical memory
- Should DMA use virtual addresses?
  - Would require controller to do translation
- If DMA uses physical addresses
  - May need to break transfers into page-sized chunks
  - Or chain multiple transfers
  - Or allocate contiguous physical pages for DMA
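Breaking a transfer into page-sized chunks and chaining them can be sketched as follows. The 4KB page size, the dictionary used as a page table, and the (physical address, length) descriptor format are assumptions for illustration; real DMA controllers define their own descriptor layouts.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def split_into_page_chunks(virtual_address, length, page_size=PAGE_SIZE):
    """Split a virtual buffer into (address, size) pieces that never cross a page boundary."""
    chunks = []
    addr, remaining = virtual_address, length
    while remaining > 0:
        room_in_page = page_size - (addr % page_size)
        size = min(room_in_page, remaining)
        chunks.append((addr, size))
        addr += size
        remaining -= size
    return chunks


def build_descriptor_chain(virtual_address, length, page_table, page_size=PAGE_SIZE):
    """Translate each chunk to a physical address, giving a chain of DMA transfers."""
    chain = []
    for vaddr, size in split_into_page_chunks(virtual_address, length, page_size):
        frame = page_table[vaddr // page_size]   # virtual page number -> physical frame
        chain.append((frame * page_size + vaddr % page_size, size))
    return chain


# Hypothetical mapping: virtual pages 0, 1, 2 live in non-contiguous frames 7, 3, 9.
page_table = {0: 7, 1: 3, 2: 9}
chain = build_descriptor_chain(4000, 5000, page_table)
print(chain)
```

A 5000-byte buffer starting near the end of a page becomes three transfers, each contiguous in physical memory even though the frames are scattered.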

### **Measuring I/O Performance**

- I/O performance depends on
  - Hardware: CPU, memory, controllers, buses
  - Software: operating system, database management system, application
  - Workload: request rates and patterns
- I/O system design can trade-off between response time and throughput
  - Measurements of throughput often done with constrained response-time

### I/O vs. CPU Performance

- Amdahl's Law
  - Don't neglect I/O performance as parallelism increases compute performance
- Example
  - Benchmark takes 90s CPU time, 10s I/O time
  - Number of CPUs doubles every 2 years
    - I/O time unchanged

| Year | CPU time | I/O time | Elapsed time | % I/O time |
|------|----------|----------|--------------|------------|
| now  | 90s      | 10s      | 100s         | 10%        |
| +2   | 45s      | 10s      | 55s          | 18%        |
| +4   | 23s      | 10s      | 33s          | 31%        |
| +6   | 11s      | 10s      | 21s          | 47%        |
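The table can be regenerated with a few lines of code. The sketch assumes that doubling the number of CPUs halves the CPU time exactly; the table above rounds the intermediate CPU times (22.5s and 11.25s) to whole seconds.

```python
def io_share_over_time(cpu_s, io_s, periods):
    """Rows of (year offset, CPU time, elapsed time, I/O fraction), halving CPU time each period."""
    rows = []
    for period in range(periods + 1):
        cpu = cpu_s / 2 ** period
        elapsed = cpu + io_s
        rows.append((2 * period, cpu, elapsed, io_s / elapsed))
    return rows


rows = io_share_over_time(cpu_s=90, io_s=10, periods=3)
for year, cpu, elapsed, io_fraction in rows:
    print(f"+{year}: CPU {cpu:.2f}s, elapsed {elapsed:.2f}s, I/O {io_fraction:.0%}")
```

After three doublings the benchmark is less than five times faster overall, because the unchanged 10s of I/O now accounts for almost half of the elapsed time.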

### **I/O System Design**

- Satisfying latency requirements
  - For time-critical operations
  - If system is unloaded
    - Add up latency of components
- Maximizing throughput
  - Find "weakest link" (lowest-bandwidth component)
  - Configure to operate at its maximum bandwidth
  - Balance remaining components in the system
- If system is loaded, simple analysis is insufficient
  - Need to use queuing models or simulation

### **Server Computers**

- Applications are increasingly run on servers
  - Web search, office apps, virtual worlds, …
- Requires large data center servers
  - Multiple processors, network connections, massive storage
  - Space and power constraints
- Server equipment built for 19" racks
  - Multiples of 1.75" (1U) high

### **Rack-Mounted Servers**



#### Sun Fire x4150 1U server




### **I/O System Design Example**

- Given a Sun Fire x4150 system with
  - Workload: 64KB disk reads
    - Each I/O op requires 200,000 user-code instructions and 100,000 OS instructions
  - Each CPU: 10<sup>9</sup> instructions/sec
  - FSB: 10.6 GB/sec peak
  - DRAM DDR2 667MHz: 5.336 GB/sec
  - PCI-E 8× bus: 8 × 250MB/sec = 2GB/sec
  - Disks: 15,000 rpm, 2.9ms avg. seek time, 112MB/sec transfer rate
- What I/O rate can be sustained?
  - For random reads, and for sequential reads

### **Design Example (cont.)**

- I/O rate for CPUs
  - Per core: 10<sup>9</sup>/(100,000 + 200,000) = 3,333 ops/sec
  - 8 cores: 26,667 ops/sec
- Random reads, I/O rate for disks
  - Assume actual seek time is average/4
  - Time/op = seek + latency + transfer
    - = 2.9 ms/4 + 4 ms/2 + 64 KB/(112 MB/s) = 3.3 ms
  - 303 ops/sec per disk, 2424 ops/sec for 8 disks
- Sequential reads
  - 112MB/s / 64KB = 1750 ops/sec per disk
  - 14,000 ops/sec for 8 disks
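The disk rates above can be recomputed as follows. The sketch assumes decimal units (1KB = 10<sup>3</sup> bytes, 1MB = 10<sup>6</sup> bytes), which is what the slide arithmetic implies, and the function names are illustrative. The slide rounds the time per operation to 3.3ms before inverting, so unrounded results differ slightly in the last digits.

```python
KB, MB = 1e3, 1e6  # decimal units, as the slide arithmetic implies


def random_read_time_ms(avg_seek_ms, rpm, transfer_mb_per_s, request_kb, seek_fraction=0.25):
    """Time per random read: (fraction of quoted seek) + half a rotation + transfer."""
    seek_ms = avg_seek_ms * seek_fraction
    rotational_latency_ms = 0.5 * 60_000 / rpm
    transfer_ms = request_kb * KB / (transfer_mb_per_s * MB) * 1e3
    return seek_ms + rotational_latency_ms + transfer_ms


def sequential_ops_per_sec(transfer_mb_per_s, request_kb):
    """Sequential reads are limited only by the transfer rate."""
    return transfer_mb_per_s * MB / (request_kb * KB)


time_ms = random_read_time_ms(avg_seek_ms=2.9, rpm=15_000, transfer_mb_per_s=112, request_kb=64)
random_rate = 1000 / time_ms
sequential_rate = sequential_ops_per_sec(112, 64)
print(f"{time_ms:.1f} ms/op, {random_rate:.0f} random ops/sec, {sequential_rate:.0f} sequential ops/sec")
```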

### **Design Example (cont.)**

- PCI-E I/O rate
  - 2GB/sec / 64KB = 31,250 ops/sec
- DRAM I/O rate
  - 5.336 GB/sec / 64KB = 83,375 ops/sec
- FSB I/O rate
  - Assume we can sustain half the peak rate
  - 5.3 GB/sec / 64KB = 81,540 ops/sec per FSB
  - 163,080 ops/sec for 2 FSBs
- Weakest link: disks
  - 2424 ops/sec random, 14,000 ops/sec sequential
  - Other components have ample headroom to accommodate these rates
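Finding the weakest link amounts to taking the minimum over the per-component rates. The sketch below reuses the figures from this example with 64KB taken as 64,000 bytes; the disk and FSB rates are copied from the slides rather than recomputed, and the dictionary keys are just labels.

```python
def ops_per_sec(bandwidth_bytes_per_s, request_bytes=64e3):
    """I/O operations per second a link can carry for a given request size."""
    return bandwidth_bytes_per_s / request_bytes


def weakest_link(rates):
    """Return the name and rate of the component with the lowest sustainable rate."""
    name = min(rates, key=rates.get)
    return name, rates[name]


shared = {
    "CPUs (8 cores)": 8 * 1e9 / (100_000 + 200_000),
    "PCI-E 8x": ops_per_sec(2e9),
    "DRAM": ops_per_sec(5.336e9),
    "FSBs (2, at half of peak)": 163_080,   # value taken from the slide
}
random_case = dict(shared, **{"disks (8)": 2424})         # random reads, from the slide
sequential_case = dict(shared, **{"disks (8)": 14_000})   # sequential reads, from the slide

print(weakest_link(random_case), weakest_link(sequential_case))
```

In both workloads the disks set the limit; the next slowest component, the CPUs, still has almost twice the capacity needed for the sequential case.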

### **Concluding Remarks**

- I/O performance measures
  - Throughput, response time
  - Dependability and cost also important
- Buses used to connect CPU, memory, I/O controllers
  - Polling, interrupts, DMA
- I/O benchmarks
  - TPC, SPECSFS, SPECWeb
- RAID
  - Improves performance and dependability