
Transparent application acceleration by intelligent scheduling of shared library calls on heterogeneous systems

João Colaço, **Adrian Matoga**, Aleksandar Ilic, Nuno Roma, Pedro Tomás, Ricardo Chaves *adrian.matoga@inesc-id.pt* 

Signal Processing Systems INESC-ID / IST Portugal



#### **PPAM 2013** – 10th International Conference on Parallel Processing and Applied Mathematics, September 8 – 11, 2013, Warsaw, Poland


Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa

Transparent application acceleration by intelligent scheduling of shared library calls on heterogeneous systems September 9, 2013



- **Transparent** application acceleration by *shared library interposing*



GPU, FPGA...

First proposed by Beisel et al. 2010




- Select the optimal implementation using a dynamic performance model.
- Where possible, balance the load across multiple processors.




#### Outline




- Framework Architecture
  - Selection and Partitioning policies
  - Library Generation
- Run-time adaptive policies
  - Best-performance selection
  - Load balancing
- Experimental results
  - Accelerating BLAS and FFTW
  - Overheads


#### **Framework Architecture**

- Legacy applications and libraries do not fully utilize modern heterogeneous computers.
- Calls to shared libraries can be intercepted to choose an optimized implementation for a particular call, based on the problem size, and execute it.
- Some tasks can be partitioned so that multiple devices execute their portions simultaneously.
- Partitioning depends on the particular function.
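To illustrate why partitioning is function-specific, here is a minimal Python sketch of one possible split for a matrix product: the rows of the left operand are divided among devices, each device multiplies its block by the whole right operand, and the blocks are concatenated. The slides do not say which split the framework uses for dgemm, so the row-wise scheme and the names `split_rows` and `partitioned_matmul` are assumptions made for illustration only.

```python
def matmul(a, b):
    """Plain product of two matrices given as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def split_rows(n_rows, shares):
    """Turn relative shares into contiguous (start, stop) row ranges covering n_rows."""
    total = sum(shares)
    bounds, start, acc = [], 0, 0
    for share in shares:
        acc += share
        stop = round(n_rows * acc / total)
        bounds.append((start, stop))
        start = stop
    return bounds


def partitioned_matmul(a, b, shares):
    """Compute a @ b block by block; each block stands in for one device's portion."""
    result = []
    for start, stop in split_rows(len(a), shares):
        # In a real framework each block would be dispatched to a different device.
        result.extend(matmul(a[start:stop], b))
    return result
```

A function such as a 2D FFT would need a different decomposition, which is why the partitioning logic has to be supplied per function rather than once for the whole library.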






## Best performance selection policy (for indivisible workloads)

- Input: problem size.
- Output: which *plugin* to run.
- Performance model: an ordered set of pairs (problem size, execution time).
- For a given input problem size, find an exact match or the two neighboring points in the performance model.


- Which *plugin* offers the best performance?
- If this cannot be determined, run *both*.
- Update the model after the call finishes.
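The policy above can be sketched in Python as follows. This is a hedged illustration, not the framework's code: the class and function names are hypothetical, and linear interpolation between the two neighboring points is an assumption, since the slides only say that the neighbors are looked up. When any plugin has no exact match and no bracketing pair of points, the sketch returns every plugin so that all of them are run and measured.

```python
from bisect import bisect_left


class PerformanceModel:
    """Ordered set of (problem size, execution time) pairs for one plugin."""

    def __init__(self):
        self.sizes = []
        self.times = []

    def update(self, size, time):
        """Record a measurement, replacing any earlier one for the same size."""
        i = bisect_left(self.sizes, size)
        if i < len(self.sizes) and self.sizes[i] == size:
            self.times[i] = time
        else:
            self.sizes.insert(i, size)
            self.times.insert(i, time)

    def estimate(self, size):
        """Exact time, interpolated time, or None if the size is not bracketed."""
        i = bisect_left(self.sizes, size)
        if i < len(self.sizes) and self.sizes[i] == size:
            return self.times[i]
        if i == 0 or i == len(self.sizes):
            return None  # fewer than two neighbors: cannot decide
        s0, s1 = self.sizes[i - 1], self.sizes[i]
        t0, t1 = self.times[i - 1], self.times[i]
        return t0 + (t1 - t0) * (size - s0) / (s1 - s0)  # assumed linear interpolation


def select_plugins(models, size):
    """Return the single fastest plugin, or all plugins if any estimate is missing."""
    estimates = {name: model.estimate(size) for name, model in models.items()}
    if any(t is None for t in estimates.values()):
        return list(models)
    return [min(estimates, key=estimates.get)]
```

After the chosen plugin (or plugins) finish, the caller would pass the measured time to `update` so that later calls with a similar size can be decided from the model.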



## Best performance selection policy: example

[Figure: execution time versus problem size.]

## Load balancing policy

- For workloads that can be efficiently partitioned among multiple devices.
- Uses Lastovetsky and Reddy's algorithm (*Functional Performance Models*).
- The overall execution time is optimal when all devices finish at the same time.
- Input: problem size.
- Output: sizes of the portions to be assigned to successive devices.
- If the model is empty, split equally.
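The "all devices finish at the same time" condition can be illustrated with a simplified Python sketch. It is not Lastovetsky and Reddy's algorithm as published: it bisects directly on a common deadline and assumes that each device is described by a non-decreasing function from portion size to execution time (for instance, one interpolated from the performance model). The names `balance`, `largest_size_within` and `equal_split` are hypothetical.

```python
def largest_size_within(time_fn, limit, deadline):
    """Largest integer x in [0, limit] with time_fn(x) <= deadline (0 if none fits)."""
    lo, hi = 0, limit
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if time_fn(mid) <= deadline:
            lo = mid
        else:
            hi = mid - 1
    return lo


def balance(total, time_fns, steps=60):
    """Split `total` units so that the devices finish at about the same time."""
    if not time_fns:
        raise ValueError("need at least one device")
    # The fastest device working alone gives a deadline that is always feasible.
    lo, hi = 0.0, min(fn(total) for fn in time_fns)
    for _ in range(steps):
        deadline = (lo + hi) / 2
        if sum(largest_size_within(fn, total, deadline) for fn in time_fns) >= total:
            hi = deadline
        else:
            lo = deadline
    parts = [largest_size_within(fn, total, hi) for fn in time_fns]
    # Integer rounding can overshoot; trim the surplus from the largest portions.
    surplus = sum(parts) - total
    while surplus > 0:
        parts[parts.index(max(parts))] -= 1
        surplus -= 1
    return parts


def equal_split(total, n_devices):
    """Fallback for an empty model: split as evenly as integer sizes allow."""
    base, extra = divmod(total, n_devices)
    return [base + (1 if i < extra else 0) for i in range(n_devices)]
```

For example, with two devices where the first takes twice as long per unit as the second, `balance(300, [lambda x: 2.0 * x, lambda x: 1.0 * x])` assigns one third of the work to the first device and two thirds to the second.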




#### **Experimental setup**




- CPU: Intel i7-950 (4 cores, 3.07 GHz)
- RAM: 12 GB DDR3-1033
- GPU: 2x NVIDIA GTX 580
- Reference implementations (CPU only):
  - MKL BLAS
  - FFTW3
- Application: Octave

#### BLAS dgemm: execution time

[Figure: execution time (µs) versus matrix dimension.]

#### BLAS dgemm: speedup over MKL with 4 cores

[Figure: speed-up vs. MKL (4 cores) against the number of columns of a square matrix, from 1000 to 9000. Series: BPS policy; LB policy (4 cores + 1 GPU); LB policy (4 cores + 2 GPUs).]

#### FFT 2D: speedup over FFTW

[Figure: speed-up vs. FFTW against 2D FFT size (N x N).]

## Temporal diagram with load balancing

[Figure: timeline of the BLAS dgemm function multiplying two 8703x8703 matrices. Legend: transfer of A ( $H \rightarrow D$ ), transfer of B ( $H \rightarrow D$ ), transfer of C ( $D \rightarrow H$ ), total overhead (total amount = 242 µs).]

#### **Overheads**


| Source of overhead | Time | # |
|---|---|---|
| Library function interception, redirection and return | 0.16 µs | C |
| BPS: selection | 3.16 µs | C |
| BPS: model update | 0.34 µs | 1 |
| BPS: thread dispatch | 36 µs | D |
| LB: model update | 0.42 µs | C |
| LB: computing the distribution | 25.05 µs | C |
| LB: thread dispatch | 36 µs | $C \times D$ |
| cuFFT initialization | 1.3 s | 1 |
| cuBLAS initialization | 0.273 s | 1 |

- C – number of calls for a given work size
- D – number of devices



#### Summary

- **Transparently** accelerate existing applications without any modifications to them.
- Use dynamic, adaptive scheduling and partitioning policies.
- Speedups of up to 7.86 (matrix multiplication) and 4.6 (FFT).



### **Questions?**





#### Library generation


