## Real-time scheduling of STM transactions on multi-core platforms

António Barros, Patrick Meumeu Yomsi, Luís Miguel Pinho CISTER seminar series xxth February 2015







## The problem Single core Memory

#### Well understood theory and practice on unicore platforms!

### The problem

Memory

#### Multi-core



## • Multiple cores (tens, hundreds,...)

F2

 $\mathbf{\nabla}\mathbf{\nabla}$ 

F1

- No cache-coherency
- Single memory bus

Current and future embedded architectures...

ry

F2

 $\mathbf{\nabla}\mathbf{\nabla}$ 

F1

COLE

Current and future embedded architectures... • Multiple cores (tens, hundreds,...) No cache-coherency Single memory bus

Maybe OK (?) for sets of independent tasks...

ry

F2

 $\overline{\mathbf{v}}$ 

F1

COLE

Current and future embedded architectures... • Multiple cores (tens, hundreds,...) No cache-coherency Single memory bus

Maybe OK (?) for sets of independent tasks...

What if tasks independent?

ry

#### Multi-core



#### Multi-core



#### **Multi-core**



Are independent tasks really independent?

### Practical case: DO-178C

- New programming paradigm (enhancing explicit dependencies between functionalities).
- Spatial and temporal isolation among functionalities, depending on their criticality.
- Functionalities must be statically assigned to cores.
- Data dependencies must be mapped.



### Attempted solutions

Recent proposal: FMLP\*

- Global resources can be short or long (designer's choice), depending on length of critical sections
  - When blocked: busy waits on short, suspends on long
- Nested requests dictates joining resources into resource groups
  - one lock (queue lock or semaphore) per group
  - exclusively short- and long-groups
- Critical section code is executed non-preemptively

\* A. Block, H. Leontyev, B. Brandenburg, and J. Anderson. A flexible real-time locking protocol for multiprocessors. In Proceedings of RTCSA 2007, pages 71–80, 2007.

### Attempted solutions

Our idea: STM + SRP-TM

- No groups and locks (at least, seen by the programmer)
- Contention is checked at run-time: just-in-time parallelism
- Upper-bound atomic section response time
- Limit blocking times

### Background

### Locks

Coarse-grained locking

### Locks

Coarse-grained locking





### Locks



### Locks

#### TASK 1

Get\_Lock(1)
Write(A, x)
Release\_Lock(1)



Locks

#### TASK 1

Get\_Lock(1)
Write(A, x)
Release\_Lock(1)

#### TASK 2

Get\_Lock(1)
Write(C, y)
Release\_Lock(1)



Locks

#### TASK 1

Get\_Lock(1)
Write(A, x)
Release\_Lock(1)

# TASK 2 Get\_Lock(1) Write(C, y) Release\_Lock(1)



### Locks

#### TASK 1

Get\_Lock(1)
Write(A, x)
Release\_Lock(1)

#### TASK 2

Get\_Lock(1)
Write(C, y)
Release\_Lock(1)





Critical sections can not progress in parallel!

### Locks

#### TASK 1

Get\_Lock(1)
Write(A, x)
Release\_Lock(1)

#### TASK 2

Get\_Lock(1)
Write(C, y)
Release\_Lock(1)

### Locks

• Fine-grained locking





### Locks



### Locks



### Locks

#### TASK 1

Get\_Lock(1)
Get\_Lock(3)
Read(C, x)
Write(A, x)
Release\_Lock(3)
Release\_Lock(1)



Locks

#### TASK 1

Get\_Lock(1)
Get\_Lock(3)
Read(C, x)
Write(A, x)
Release\_Lock(3)
Release\_Lock(1)

#### TASK 2

Get\_Lock(3)
Get\_Lock(1)
Read(B, x)
Write(C, y)
Release\_Lock(1)
Release\_Lock(3)



Locks

#### TASK 1

Get\_Lock(1)
Get\_Lock(3)
Read(C, x)
Write(A, x)
Release\_Lock(3)
Release\_Lock(1)

#### TASK 2

Get\_Lock(3)
Get\_Lock(1)
Read(B, x)
Write(C, y)
Release\_Lock(1)
Release\_Lock(3)



Increases system complexity with a negative impact on composability and maintainability!



• Fine-grained locking

Locks

#### TASK 1

Get\_Lock(1)
Get\_Lock(3)
Read(C, x)
Write(A, x)
Release\_Lock(3)
Release\_Lock(1)

#### TASK 2

Get\_Lock(3)
Get\_Lock(1)
Read(B, x)
Write(C, y)
Release\_Lock(1)
Release\_Lock(3)







#### TASK 1

Transaction()
Write(A, x)
Commit()



#### TASK 1

Transaction()
Write(A, x)
Commit()

#### TASK 2

Transaction()
Write(C, y)
Commit()





#### TASK 1

Transaction()
Write(A, x)
Commit()

#### TASK 2

Transaction()
Write(C, y)
Commit()





#### TASK 1

Transaction()
Write(A, X)
Commit()

#### TASK 2

Transaction()
Write(C, y)
Commit()





### TASK 1

Transaction()
Write(A, X)
Commit()

## TASK 2









### TASK 1





### TASK 1

Transaction()
Write(E, x)
Commit()

### TASK 2





### TASK 1

Transaction()
Write(E, x)
Commit()

### TASK 2





### TASK 1

Transaction()
Write(E, X)
Commit()

### TASK 2





### TASK 1

Transaction()
Write(E, x)
Commit()

### TASK 2

Transaction()
Write(E, y)
Commit()

### TASK 2





### TASK 1

Transaction()
Write(E, x)

Commit()

TASK 2

Transaction()
Write(E, y)
Commit()

### TASK 2





### TASK 1

Transaction()
Write(E, x)

Commit()

TASK 2

Transaction()
Write(E, y)
Commit()

### TASK 2



### Polite Exponential back-off, eventually commit.



### Aggressive Kill the enemy!!!





### Polite Exponential back-off, eventually commit.

Aggressive Kill the enemy!!!

**Randomized** Abort with p or Wait with (1-p).



### **Timestamp** Older transaction survives.



g**ressive** e enemy!!!





### Polite Exponential back-off, eventually commit.

Aggressive Kill the enemy!!!

**Randomized** Abort with p or Wait with (1-p).

Karma Accesses and aborts accounts for karma.





### Aggressive Kill the enemy!!!

mized /ith p or :h (1-p).

> Eruption Priority rises if others are waiting.





# Managing contention DETERMINISTIC **NOT DETERMINISTIC**

Timestamp Older transaction survives.

### Polite Exponential back-off, eventually commit.

Aggressive Kill the enemy!!!

Randomized Abort with p or Wait with (1-p).

Karma Accesses and aborts accounts for karma.

Eruption Priority rises if others are waiting.





Model of computation and scheduling strategy

# Computation platform

- Multi-core
- Single memory bus
- Data shared in globally accessed memory, controlled by a STM system

# Application characteristics

- Application functionality divided into tasks.
- Each task is statically assigned to a core, before run-time.
- Each task releases a potentially infinite number of jobs.
  - Task: C (execution time), T (period), D (deadline)
  - Job: r (release time), d (absolute deadline)



# Application characteristics

- Application functionality divided into tasks.
- Each task is statically assigned to a core, before run-time.
- Each task releases a potentially infinite number of jobs.
  - Task: C (execution time), T (period), D (deadline)
  - Job: r (release time), d (absolute deadline)



# Serialisation of transactions in a RT environment

Problem solved!

 The order of serialisation of to once a transaction starts!

• The order of serialisation of transactions in progress is determined

Problem solved!

- The order of serialisation of to once a transaction starts!
- ... or maybe not!

• The order of serialisation of transactions in progress is determined

Problem solved!

- once a transaction starts!
- ... or maybe not!

• The order of serialisation of transactions in progress is determined

What if jobs can be preempted while executing a transaction?

Problem solved!

- once a transaction starts!
- ... or maybe not!

  - the same core?

• The order of serialisation of transactions in progress is determined

• What if jobs can be preempted while executing a transaction?

What if multiple transactions can be simultaneously in progress on

# Preemptions and serialisation

### Core 2





# Preemptions and serialisation

### Write(A, 2) Write(A, 2)

### Core 2







# Preemptions and serialisation

## Write(A, 2) Write(A, 2)

### Core 2





# Preemptions and serialisation Write(A, 2) Write(A, 2)

### Core 2





# Preemptions and serialisation Write(A, 2) Write(A, 2) Core 2



Write(A, 3)







# Preemptions and serialisation Write(A, 2) Write(A, 2)

### Core 2









# What to do?

Increase resistance to preemptions if a transaction can affect concurrent parallel transactions in jobs, while meeting all timing requirements.

Restrict to, at most, ONE transaction in progress, per core.

- No deadlocks.
- No transgression to FIFO serialisation.

# What to do?

Increase resistance to preemptions if a transaction can affect concurrent parallel transactions in jobs, while meeting all timing requirements.

Restrict to, at most, ONE transaction in progress, per core.

- No deadlocks.
- No transgression to FIFO serialisation.



# Scheduling jobs with transactions: SRP-TM

# Assumptions

### General scheduling rule: P-EDF.

While a transaction is in progress on a core: SRP  $\Rightarrow$  SRP-TM.

- Adds static preemption levels to tasks.
- Adds static preemption level to transactions.
- Adds variable ceiling to cores.
  - Highest preemption level of a task that could be waiting for the current transaction in progress to commit.

## Assigning preemption levels to tasks

Just like SRP, assign preemption levels to all tasks in set by increasing order of relative deadline...

... independently of core affinities.

| Task   | Relative deadline | Preemption<br>level |
|--------|-------------------|---------------------|
| T5     | 120               | 1                   |
| T2     | 100               | 2                   |
| T3     | 80                | 3                   |
| Τ4     | 70                | 4                   |
| T6, T7 | 60                | 5                   |
| T1     | 50                | 6                   |

### Assigning preemption levels to transaction

tasks that have one transaction that may depend on it to progress.

 $T_1 @ core 1$  $DS_{1} = \{A\}$ 

 $T_2 @ core 2$  $DS_2 = \{A, B\}$ 

 $T_3 @ \text{ core } 3$  $DS_3 = \{B\}$ 

Assign to each transaction the highest preemption level from all



### Assigning preemption levels to transaction

Assign to each transaction the highest preemption level from all tasks that have one transaction that may depend on it to progress.

 $T_1 @ core 1$  $DS1 = \{A\}$ 

 $T_2 @ core 2$  $DS_2 = \{A, B\}$ 

 $T_3 @ \text{ core } 3$  $DS_3 = \{B\}$ 





### Assigning preemption levels to transaction

Assign to each transaction the highest preemption level from all tasks that have one transaction that may depend on it to progress.









| cal exar                                               | mple                                                                           |
|--------------------------------------------------------|--------------------------------------------------------------------------------|
| $D = 120 \\ (1, 1) \\ \Omega 1 \\ \omega_5$            | $DS1 = DS5 = \{03\}$<br>$DS2 = \{01\}$<br>$DS3 = \{01, 02\}$<br>$DS4 = \{02\}$ |
| $   \begin{aligned}                                  $ |                                                                                |















### Transaction vs. Transactionless



## Transaction vs. Transactionless















## Transaction vs. Transaction







# Transaction vs. Transaction







0

# Transaction vs. Transaction





# Mixing all together















# SRP-TM operations in short

Transaction starts:

Transaction commits:

• Core ceiling is reset to zero.

• Core ceiling is set to the preemption level of the transaction.

### SRP-TM scheduling decisions in short

- Job in front of ready queue has transaction:
  - Core ceiling is raised to the preemption level of this task.
  - Job with transaction in progress is executed on behalf of job in front of ready queue.
- Job in front of ready queue does not have transaction:
  - Preempt running job iff has earlier absolute deadline than running job, and higher preemption level than core ceiling.

### Response time of a transaction $T_1 @ core 1$ $DS_{1} = \{A\}$ OK free to commit $T_2 @ core 2$ $DS_2 = \{A, B\}$ abort OK free to commit $T_3 @ \text{ core } 3$ $\mathsf{DS3} = \{\mathsf{B}\}$ abort abort OK abort





Transaction response time...









 The response time of the last transaction in a sequence of the last two attempts, for each transaction in the sequence.



transactions is upper bounded by the sum of the response time of

 The response time of the last transaction in a sequence of the last two attempts, for each transaction in the sequence.



transactions is upper bounded by the sum of the response time of

- interference:
  - **IT CAN BE ANALYTICALLY UPPER BOUNDED!** •
- Maximum response time of a transaction...
  - and choose the maximum value... **COMBINATIONAL ORDER!!!**
  - **PESSIMISTIC, but LINEAR ORDER!**

• The response time of the last two transactions depends exclusively on intra-core

### • Determine every possible sequence, sum response times of last two attempts

• For every processor, choose the maximum response time of last two attempts of a transaction that belongs to the same contention group, and sum them all...

Response time of a task





# Blocking and interference

Interference

Interference + IB



di,j



# Blocking and interference



# Simulation results

# Simulation conditions

- Scheduling policies:
  - pure P-EDF
  - NPUC
  - NPDA

• FLMP

- SRP-TM

## Simulation conditions

- Experiment 1: varying system size
  - Variable number of cores:  $m \in \{2, 4, 8, 16, 32, 64\}$
  - Number of transactional objects linear with  $m: p \in \{5, 10, 20, 40, 80, 160\}$ , so each object is accessed by 3 task, on average.
  - Number of contention groups linear with m: g ∈ {1, 2, 4, 8, 16, 32}, so each group maintains the same size and the same expected number of tasks.

## Simulation conditions

- Experiment 2: varying size of contention groups
  - Constant number of cores: m = 64.
  - Constant number of transactional objects linear: p = 160.
  - Variable number of contention groups: g ∈ {1, 2, 4, 8, 16, 32}, so to observe the effects of granularity of contention groups for systems with same size.



SRPTM Ο FMLP 

PEDF

NPUC

NPDA

 $\mathbf{\nabla}$ 

Ð

| (8, 4)   | 80% | 68% | 86% | 82% | 38% | (64, 16)         | 4% | 0% | 0% | 0% |
|----------|-----|-----|-----|-----|-----|------------------|----|----|----|----|
| (16, 8)  | 66% | 30% | 68% | 60% | 6%  | (64, 8)          | 0% | 0% | 0% | 0% |
| (32, 16) | 34% | 0%  | 12% | 12% | 0%  | (64, 4)          | 6% | 0% | 0% | 2% |
| (64, 32) | 2%  | 0%  | 0%  | 0%  | 0%  | <b>T</b> (64, 2) | 6% | 0% | 6% | 2% |
|          |     |     |     |     |     | (64, 1)          | 4% | 0% | 2% | 2% |
|          |     |     |     |     |     | $\mathbf{r}$     |    |    |    |    |
|          |     |     |     |     |     | $\mathbf{U}$     |    |    |    |    |





Experiment 2

0% 0% 0% 0%

✓ PEDF
 ● NPUC
 ● NPDA
 ● SRPTM
 ● FMLP

| (8, 4)   | 34826  | 35215  | 34696  | 30713  | 81666   | (64, 16) | 443036 | 429845 | 413484 | 387422 | 12322 |
|----------|--------|--------|--------|--------|---------|----------|--------|--------|--------|--------|-------|
| (16, 8)  | 85609  | 95317  | 90535  | 68950  | 220041  | (64, 8)  | 473085 | 454010 | 433336 | 404777 | 13444 |
| (32, 16) | 188286 | 202247 | 193040 | 156309 | 459582  | (64, 4)  | 475156 | 459013 | 439262 | 405838 | 13692 |
| (64, 32) | 432176 | 438299 | 423635 | 373730 | 1006744 | (64, 2)  | 484011 | 444793 | 430496 | 421342 | 13414 |
|          |        |        |        |        |         | (64, 1)  | 503295 | 471157 | 454371 | 438009 | 1383  |





Experiment 2



| (8, 4)   | 1221  | 2562  | 1948  | 607   | 5718   | (64, 16)  | 24273 | 51199 | 46335 | 17629 | 6228  |
|----------|-------|-------|-------|-------|--------|-----------|-------|-------|-------|-------|-------|
| (16, 8)  | 3225  | 6973  | 6549  | 1535  | 17392  | (64, 8)   | 33908 | 55863 | 48934 | 24574 | 12004 |
| (32, 16) | 6425  | 17433 | 14948 | 5258  | 45713  | (64, 4)   | 26135 | 55589 | 51026 | 17298 | 13464 |
| (64, 32) | 28501 | 53829 | 49013 | 23435 | 179048 | ) (64, 2) | 25766 | 48684 | 43582 | 17525 | 13384 |
|          |       |       |       |       |        | (64, 1)   | 26879 | 57208 | 49244 | 20375 | 13829 |
|          |       |       |       |       |        | ב         |       |       |       |       |       |

ノ



Experiment 1



Experiment 2



| (4, 2)   | 24% | 21% | 21% | 13% | 33% | (64, 16) | 35% | 65% | 66% | 25% | 141   |
|----------|-----|-----|-----|-----|-----|----------|-----|-----|-----|-----|-------|
| (8, 4)   | 29% | 31% | 31% | 20% | 40% | (64, 8)  | 36% | 67% | 67% | 26% | 323   |
| (16, 8)  | 32% | 58% | 59% | 21% | 63% | (64, 4)  | 36% | 68% | 69% | 26% | 682   |
| (32, 16) | 32% | 58% | 58% | 22% | 63% | (64, 2)  | 37% | 67% | 67% | 27% | 1 451 |
| (64, 32) | 33% | 62% | 62% | 23% | 69% | (64, 1)  | 37% | 70% | 70% | 26% | 3 074 |
|          |     |     |     |     |     |          |     |     |     |     |       |





Experiment 2

141% 323% 682% 451% 074%



| (4, 2)   | 24%  | 21%  | 21%  | 13%  | 33%  |         |     |     |     |     |     |
|----------|------|------|------|------|------|---------|-----|-----|-----|-----|-----|
| (8, 4)   | 29%  | 31%  | 31%  | 20%  | 40%  | (64, 8) | 36% | 67% | 67% | 26% | 3   |
| (16, 8)  | 32%  | 58%  | 59%  | 21%  | 63%  | (64, 4) | 36% | 68% | 69% | 26% | 6   |
| (32, 16) | 32%  | 58%  | 58%  | 22%  | 63%  | (64, 2) | 37% | 67% | 67% | 27% | 14  |
| (64, 32) | 33%  | 62%  | 62%  | 23%  | 69%  | (64, 1) | 37% | 70% | 70% | 26% | 3 ( |
| (04, 32) | 5570 | 0270 | 0270 | 23/0 | 0370 |         |     |     |     |     |     |





#### Experiment 2





Wrapping up

# Conclusion (1/2)

- FIFO serialisation is the predictable and fair.
- Scheduling has an effect on the performance of transactions.
  - SRP-TM extends P-EDF when a transaction is in progress.
    - Takes into account **possible** concurrent parallel transactions with earlier deadlines, **without sharing** scheduling data between cores.
    - Allows jobs with earlier deadlines to preempt or speed up a transaction in progress.

## Conclusion (2/2)

- We provide an analytical method of transactions under SRP-TM.
- We provide an analytical method of tasks under SRP-TM.

We provide an analytical method to upper bound the response time

• We provide an analytical method to upper bound the response time

### That's it! Thanks! Questions?