# Variation-Aware Application Scheduling and Power Management for Chip Multiprocessors

Radu Teodorescu\* and Josep Torrellas

Computer Science Department University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu

i-acoma

\*now at Ohio State University












## Variation in transistor parameters

















[Figure: die-to-die variation]






















- CMPs: significant core-to-core variation in frequency and power
- We model a 20-core CMP, 32nm
- On average:

[Figure: fastest vs. slowest core on the die; labels C2 and C20]











- Current CMPs run at the frequency of the slowest core
- We can run each core at the maximum frequency it can achieve
  - 15% average frequency increase
  - Support present in AMD's 4-core Opteron
- Heterogeneous system
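
The arithmetic behind the frequency gain can be sketched as follows. This is a minimal illustration, not data from the talk: the per-core maximum frequencies are made-up values picked so that the result lands near the 15% quoted above, and `average_frequency_gain` is a hypothetical helper name.

```python
def average_frequency_gain(core_fmax_ghz):
    """Relative gain in mean core frequency when every core runs at its
    own maximum instead of being clamped to the slowest core."""
    slowest = min(core_fmax_ghz)
    mean_per_core = sum(core_fmax_ghz) / len(core_fmax_ghz)
    return mean_per_core / slowest - 1.0


# Hypothetical per-core maximum frequencies (GHz) for a 4-core die.
fmax = [4.0, 4.4, 4.8, 5.2]
gain = average_frequency_gain(fmax)
print(f"average frequency increase: {gain:.1%}")
```

The gain grows with the spread between the slowest core and the rest, which is why the resulting system behaves like a heterogeneous one.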









- Expose variation in core frequency and power to the OS
- Variation-aware application scheduling algorithms
- Variation-aware global power management subsystem
  - On-line optimization algorithm that maximizes system performance at a power budget
  - 12-17% CMP throughput improvement at the same power





#### Outline





- Variation-aware scheduling
- Variation-aware power management
  - Defining the optimization problem
  - Implementation
- Evaluation
- Conclusions










| C1       | C2  | C3  | C4  | C5  |
|----------|-----|-----|-----|-----|
| L2 Cache |     |     |     |     |
| C6       | C7  | C8  | C9  | C10 |
| C11      | C12 | C13 | C14 | C15 |
| L2 Cache |     |     |     |     |
| C16      | C17 | C18 | C19 | C20 |




## Variation-aware scheduling



















Additional information to guide scheduling decisions:

- Per-core frequency and static power
- Application behavior
  - Dynamic power consumption
  - Compute intensity (IPC)
- Multiple possible goals:
  - Reduce power
  - Improve performance





When the goal is to reduce power consumption:

- VarP: assign applications to low static power cores first
- VarP&AppP: assign applications with high dynamic power to low static power cores
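
The two power-oriented policies can be sketched as simple sorts. This is a minimal sketch under stated assumptions: the `Core` and `App` records, their field names, and the wattage values are hypothetical, and the sketch assumes there are no more applications than cores.

```python
from dataclasses import dataclass


@dataclass
class Core:
    name: str
    static_power_w: float  # assumed to come from per-core profiling


@dataclass
class App:
    name: str
    dynamic_power_w: float  # assumed to come from runtime profiling


def schedule_varp(apps, cores):
    """VarP: fill the lowest-static-power cores first, ignoring app behavior."""
    ranked = sorted(cores, key=lambda c: c.static_power_w)
    return {app.name: core.name for app, core in zip(apps, ranked)}


def schedule_varp_appp(apps, cores):
    """VarP&AppP: pair the highest-dynamic-power apps with the
    lowest-static-power cores."""
    ranked_cores = sorted(cores, key=lambda c: c.static_power_w)
    ranked_apps = sorted(apps, key=lambda a: a.dynamic_power_w, reverse=True)
    return {app.name: core.name for app, core in zip(ranked_apps, ranked_cores)}


# Made-up values for illustration only.
cores = [Core("C1", 2.0), Core("C2", 1.2), Core("C3", 1.6)]
apps = [App("appA", 3.0), App("appB", 5.0)]
print(schedule_varp(apps, cores))
print(schedule_varp_appp(apps, cores))
```

Both policies leave the highest-static-power core idle when there are fewer applications than cores; they differ only in which application lands on which of the remaining cores.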








When the goal is to improve performance:

- VarF: assign applications to high-frequency cores first
- VarF&AppIPC: assign applications with high IPC to high-frequency cores
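
The performance-oriented policies can be sketched the same way. This is an illustrative sketch, not the authors' implementation: it assumes, by analogy with VarP&AppP, that VarF&AppIPC gives the highest-IPC application the fastest core, and it treats each application's IPC as independent of core frequency. All names and numbers are hypothetical.

```python
def schedule_varf(app_ipc, core_freq_ghz):
    """VarF: hand out the highest-frequency cores first, in arrival order."""
    fastest_first = sorted(core_freq_ghz, key=core_freq_ghz.get, reverse=True)
    return dict(zip(app_ipc, fastest_first))


def schedule_varf_appipc(app_ipc, core_freq_ghz):
    """VarF&AppIPC (assumed rule): highest-IPC app gets the fastest core."""
    fastest_first = sorted(core_freq_ghz, key=core_freq_ghz.get, reverse=True)
    highest_ipc_first = sorted(app_ipc, key=app_ipc.get, reverse=True)
    return dict(zip(highest_ipc_first, fastest_first))


def throughput_mips(mapping, app_ipc, core_freq_ghz):
    """Sum of frequency x IPC over all scheduled apps (GHz * 1000 = MHz)."""
    return sum(core_freq_ghz[core] * 1000 * app_ipc[app]
               for app, core in mapping.items())


# Made-up values for illustration only.
core_freq_ghz = {"C1": 4.0, "C2": 4.6, "C3": 4.2}
app_ipc = {"appA": 0.5, "appB": 1.5}

varf = schedule_varf(app_ipc, core_freq_ghz)
varf_ipc = schedule_varf_appipc(app_ipc, core_freq_ghz)
print(varf, throughput_mips(varf, app_ipc, core_freq_ghz))
print(varf_ipc, throughput_mips(varf_ipc, app_ipc, core_freq_ghz))
```

Under these assumptions both policies use the same two cores, but matching the high-IPC application to the fastest core yields the higher aggregate frequency x IPC.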














## Variation-aware global power management






CMP power management

- Per core dynamic voltage and frequency scaling (DVFS)
- Challenge: find best (V,F) for each core
  - Core-level decisions less effective in large CMPs
  - Global (CMP-wide) power management solution is needed

Variation makes the problem more difficult



































Given a mapping of threads to cores (variation-aware), find the best $(V_i, F_i)$ for each core:

- Goal: maximize system throughput (MIPS)
- **Constraint:** keep total power below budget
- **Runtime** system adaptation




















- Exhaustive search: too expensive
- Simulated annealing (SAnn)
  - Not practical at runtime
- Linear programming (*LinOpt*)
  - Simpler, faster
  - Requires some approximations
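
A quick count shows why exhaustive search is ruled out. The sketch below assumes each core independently picks one of a fixed number of discrete (V, F) levels; the 20 cores match the modeled CMP, while the 10 levels per core is an assumed value, not a figure from the talk.

```python
def exhaustive_search_size(num_cores, levels_per_core):
    """Number of chip-wide (V, F) assignments when every core independently
    picks one of `levels_per_core` discrete settings."""
    return levels_per_core ** num_cores


# 20 cores as in the modeled CMP; 10 levels per core is an assumed value.
combos = exhaustive_search_size(num_cores=20, levels_per_core=10)
print(f"{combos:.1e} candidate settings")
```

The count grows exponentially with the number of cores, so any method that must run periodically at runtime has to avoid enumerating settings.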


















Linear programming:

- Maximize objective function: $f(x_1,\ldots,x_n)$, with $x_1,\ldots,x_n$ independent
- Subject to constraints such as: $g(x_1,\ldots,x_n) \le C$
- $f$, $g$ are linear functions
- **Unknowns:** voltages $V_1,\ldots,V_n$ for all cores
- **Objective function:** maximize CMP throughput
  - Throughput (MIPS) = Frequency × IPC = $f(V_1,\ldots,V_n)$
- **Constraint:** keep power under $P_{target}$
  - Power = $g(V_1,\ldots,V_n)$
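
Written out, the program could look like the following. This is a sketch under stated assumptions rather than the exact formulation from the talk: it assumes each core's frequency and power have been linearized in voltage, with hypothetical per-core coefficients $a_i, b_i$ (frequency) and $c_i, d_i$ (power), and hypothetical voltage bounds $V_{min}, V_{max}$.

$$
\begin{aligned}
\max_{V_1,\ldots,V_n}\quad & \sum_{i=1}^{n} \mathrm{IPC}_i \,(a_i V_i + b_i) \\
\text{subject to}\quad & \sum_{i=1}^{n} (c_i V_i + d_i) \le P_{target}, \\
& V_{min} \le V_i \le V_{max}, \quad i = 1,\ldots,n
\end{aligned}
$$

Both the objective and the constraint are linear in the unknowns $V_i$, which is what makes the linear programming machinery applicable; the linearization is where the approximations enter.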







- LinOpt works together with the OS scheduler
  - OS scheduler maps applications to cores (e.g. VarF&AppIPC)
  - LinOpt then finds (V,F) settings for each core
- LinOpt runs periodically as a system process
  - On a core, or
  - On a power management unit (PMU), e.g., Foxton
- LinOpt uses profile information as input
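
The optimization step can be illustrated with a small solver. This is a hedged sketch, not the LinOpt implementation: the talk does not say which LP solver is used, and the code below relies on the fact that a linear program with a single budget constraint and per-variable bounds reduces to a fractional knapsack, which a greedy pass solves exactly. It assumes frequency and power are linear in voltage with positive slopes; all function names, parameter names, and numbers are hypothetical.

```python
def linopt_sketch(ipc, freq_slope, power_slope, power_offset,
                  v_min, v_max, p_target):
    """Pick per-core voltages maximizing sum(ipc[i] * freq_slope[i] * V[i])
    subject to sum(power_slope[i] * V[i] + power_offset[i]) <= p_target
    and v_min <= V[i] <= v_max. All slopes are assumed positive; constant
    frequency offsets do not change the optimum and are omitted."""
    n = len(ipc)
    volts = [v_min] * n
    # Power left after every core is placed at the minimum voltage.
    headroom = p_target - sum(power_slope[i] * v_min + power_offset[i]
                              for i in range(n))
    if headroom < 0:
        raise ValueError("power budget infeasible even at v_min")
    # Raise voltage first where throughput gained per watt is highest.
    order = sorted(range(n),
                   key=lambda i: ipc[i] * freq_slope[i] / power_slope[i],
                   reverse=True)
    for i in order:
        step = min(v_max - v_min, headroom / power_slope[i])
        volts[i] += step
        headroom -= step * power_slope[i]
    return volts


# Hypothetical 3-core example; every coefficient below is made up.
volts = linopt_sketch(
    ipc=[1.5, 0.5, 1.0],
    freq_slope=[4.0, 4.0, 4.0],      # GHz per volt
    power_slope=[10.0, 10.0, 20.0],  # watts per volt
    power_offset=[0.0, 0.0, 0.0],
    v_min=0.6, v_max=1.0, p_target=30.0,
)
print(volts)
```

With more constraints (for example per-core power or temperature limits) the knapsack shortcut no longer applies and a general LP solver would be needed. Each core's frequency then follows from its chosen voltage through the assumed linear relation.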









Post-manufacturing profiling

Each core: frequency, static power

Dynamic profiling

Each app: dynamic power, IPC
































#### **Evaluation infrastructure**

- Process variation model: VARIUS [IEEE TSM'08]
  - Monte Carlo simulations for 200 chips
- SESC cycle-accurate microarchitectural simulator
- SPICE model for leakage power
- HotSpot for temperature estimation








- 20-core CMP
  - 2-issue, OOO cores
  - Shared L2 cache
- 32nm technology, 4GHz

- Multiprogrammed workload:
  - From a pool of SPECint and SPECfp benchmarks









**Goal:** Improve CMP throughput












- VarF: up to 9% throughput improvement over Naive
- VarF&AppIPC scales better with number of threads: 5-10% improvement over Naive




Global power management algorithms:

- Goal: maximize throughput
- Constraint: keep power below budget (75W)

- Foxton+: baseline
- LinOpt: proposed scheme
- SAnn: approximate upper bound



















- LinOpt: 12-17% improvement over Foxton+, at the same power
  - 30-38% reduction in ED<sup>2</sup>
- LinOpt within 2% of SAnn
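
As a rough consistency check, the ED<sup>2</sup> figure can be related to the throughput gain. The sketch assumes a fixed amount of work, unchanged power, and delay inversely proportional to throughput; these are simplifying assumptions, and the reported numbers come from the authors' measurements rather than from this formula.

```python
def ed2_reduction(throughput_gain):
    """Fractional ED^2 reduction for a fixed amount of work at fixed power."""
    speedup = 1.0 + throughput_gain
    delay = 1.0 / speedup          # normalized execution time
    energy = 1.0 * delay           # power unchanged, so energy tracks delay
    return 1.0 - energy * delay ** 2


for gain in (0.12, 0.17):
    print(f"{gain:.0%} throughput gain -> {ed2_reduction(gain):.1%} lower ED^2")
```

Under these assumptions a 12-17% throughput gain corresponds to roughly 29-38% lower ED<sup>2</sup>, close to the reported 30-38% range.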









[Figure: x-axis is number of threads; legend shows power budgets of 50 W, 75 W, and 100 W]

- Low overhead even for large problem size
  - Up to 6 µs for 20 threads
  - LinOpt runs on a core every 1-10 ms: negligible impact
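
The "negligible impact" claim follows from simple arithmetic on the two figures above: a solve time of up to 6 µs and an invocation period of 1 to 10 ms. The helper name below is hypothetical.

```python
def overhead_fraction(solve_time_us, period_ms):
    """Fraction of one core's time spent running the optimizer."""
    return solve_time_us / (period_ms * 1000.0)


for period_ms in (1, 10):
    frac = overhead_fraction(solve_time_us=6, period_ms=period_ms)
    print(f"every {period_ms} ms: {frac:.2%} of one core")
```

Even at the shortest period the optimizer occupies well under 1% of a single core.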






#### Conclusions

- We showed the value of exposing variation in core frequency and power to the OS
- Proposed a set of scheduling algorithms
  - reduce CMP power consumption (2-16%)
  - improve CMP throughput (5-10%)
- Proposed a power management algorithm
  - improve CMP throughput for a given power budget (12-17%)




