#### CS3400 - Principles of Software Engineering: Software Engineering for Multicore Systems

V. Krishna Nandivada (IIT Madras)

# A pattern language for parallel programs



# Overall big picture




Part I

Patterns


# Finding concurrency and Algorithm Structure



#### Implementation Mechanisms



#### **UE** management

- UE: unit of execution (a process / thread / activity).
- Recall the difference between a process, a thread, and an activity.
- Management = creation, execution, termination.
- Varies with the underlying language.
- Go back to the first few lectures for a recap.

#### Synchronization: Memory synchronization and fences




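| Initially done = false |                 |
|------------------------|-----------------|
| Thread 1               | Thread 2        |
| done = true;           | while (!done) ; |

Thread 2 may never observe the write to done:

- The value may be present in a cache; cache coherence may take care of this.
- The value may be present in a register; the culprit is the compiler.
- The value may not be read at all. How?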

| Initially x = y = 0 |           |
|---------------------|-----------|
| Thread 1            | Thread 2  |
| 1: r1 = x           | 4: x = 1  |
| 2: y = 1            | 5: r3 = y |
| 3: r2 = x           |           |

r1 == r2 == r3 == 0. Possible?
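One way to explore the question in the two-thread example above is to enumerate every sequentially consistent interleaving of the threads. The Python sketch below does that. The helper names (`interleavings`, `execute`, `sc_outcomes`) are hypothetical, and the "optimized" Thread 1, where a compiler is assumed to reuse r1 instead of re-reading x, is an illustrative assumption that does not come from the slides.

```python
def interleavings(a, b):
    """Yield every merge of a and b that keeps each list's internal order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def execute(schedule):
    """Run one interleaving against a single shared memory; return (r1, r2, r3)."""
    memory = {"x": 0, "y": 0}
    regs = {}
    for kind, source, target in schedule:
        if kind == "write":      # ("write", variable, value)
            memory[source] = target
        elif kind == "read":     # ("read", variable, register)
            regs[target] = memory[source]
        else:                    # ("copy", from_register, to_register)
            regs[target] = regs[source]
    return regs["r1"], regs["r2"], regs["r3"]


def sc_outcomes(thread1, thread2):
    return {execute(s) for s in interleavings(thread1, thread2)}


THREAD1 = [("read", "x", "r1"), ("write", "y", 1), ("read", "x", "r2")]
THREAD2 = [("write", "x", 1), ("read", "y", "r3")]
# Assumed compiler transformation: the second read of x is replaced by r2 = r1.
THREAD1_OPT = [("read", "x", "r1"), ("write", "y", 1), ("copy", "r1", "r2")]

print((0, 0, 0) in sc_outcomes(THREAD1, THREAD2))      # False
print((0, 0, 0) in sc_outcomes(THREAD1_OPT, THREAD2))  # True
```

In this model the all-zero outcome never appears for the program as written, but it does appear once the redundant read is eliminated, which is one way a compiler can expose behaviour that no interleaving of the source program allows.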



- A memory fence guarantees that the UEs see a consistent view of memory.
- Writes performed before the fence are visible to reads performed after the fence.
- Reads performed after the fence obtain a value written no earlier than the latest write before the fence.
- Relevant only for shared memory.
- Explicit fence management is error prone. Higher-level alternatives: OpenMP flush and shared, Java volatile. *Read up on these yourself.*
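Python does not expose memory fences directly, so the sketch below uses `threading.Event` from the standard library as a high-level stand-in for a shared `done` flag. The names `payload`, `seen`, and `waiter` are hypothetical; the point is only that the waiting thread reads data written before the flag was set.

```python
import threading

done = threading.Event()   # plays the role of a shared `done` flag
payload = {}
seen = []


def waiter():
    done.wait()                    # blocks until the flag is set
    seen.append(payload["value"])  # written before set(), so it is visible here


thread = threading.Thread(target=waiter)
thread.start()
payload["value"] = 42   # data write happens before the flag is raised
done.set()
thread.join()
print(seen)
```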

Barrier is a synchronization point at which every member of a collection of UEs must arrive before any member can proceed.

- Examples: MPI\_Barrier, join, finish, clocks, phasers.
- Implemented underneath via message passing.
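As a small shared-memory illustration of the barrier definition above, the sketch below uses `threading.Barrier` from the Python standard library. The names `worker`, `log`, and `N` are hypothetical. No thread records its "after" entry until every thread has recorded its "before" entry.

```python
import threading

N = 4
barrier = threading.Barrier(N)
log = []
log_lock = threading.Lock()


def worker(ident):
    with log_lock:
        log.append(("before", ident))
    barrier.wait()               # nobody proceeds until all N UEs have arrived
    with log_lock:
        log.append(("after", ident))


threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()

phases = [phase for phase, _ in log]
print(phases)
```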





- Barriers
- Mutual exclusion: Java synchronized, omp\_set\_lock, omp\_unset\_lock.





#### Communication

- UEs need to exchange information.
  - Shared memory: communication is easy. The challenge is to synchronize memory accesses so that results are correct irrespective of scheduling.
  - Distributed memory: there is not much need for synchronization to protect resources, so communication plays a big role.
- One-to-one communication.
- Between all UEs in one event: collective communication.

#### Collective communication

When multiple UEs participate in a single communication event, the event is called a collective communication operation. Examples:

- Broadcast: a mechanism to send a single message to all UEs.
- Barrier: a synchronization point.
- Reduction: take a collection of objects, one from each UE, and "combine" them into a single value.
  - Is the combined value present on only one UE?
  - Is the combined value present on all UEs?



#### Serial reduction



- Reduction with *n* items takes *n* steps.
- Useful especially if the reduction operator is not associative.
- Only one UE knows the result.
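A serial reduction can be sketched as a left-to-right fold. The function name `serial_reduce` is hypothetical, and the step counter counts applications of the operator (one fewer than the number of items). Subtraction is used because it is not associative, so the fixed left-to-right order matters.

```python
def serial_reduce(values, op):
    """Combine values strictly left to right; return (result, operator applications)."""
    accumulator = values[0]
    steps = 0
    for value in values[1:]:
        accumulator = op(accumulator, value)
        steps += 1
    return accumulator, steps


# ((10 - 3) - 2) - 1
print(serial_reduce([10, 3, 2, 1], lambda a, b: a - b))  # (4, 3)
```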


#### Recursive doubling

- Reduction with $2^n$ items takes *n* steps.
- What if number of UEs < number of data items?
- All UEs know the result.
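The recursive doubling scheme can be simulated sequentially, with one list slot per UE. This is a sketch under the assumption of a butterfly exchange pattern, where in each step UE `i` combines its value with that of UE `i XOR distance`; the function name is hypothetical.

```python
def recursive_doubling(values, op):
    """Simulate recursive doubling; return (per-UE results, number of steps)."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch assumes a power-of-two number of UEs")
    current = list(values)
    distance, steps = 1, 0
    while distance < n:
        # Every UE exchanges with its partner; the lower-ranked operand goes first
        # so that associative but non-commutative operators still work.
        current = [
            op(current[min(i, i ^ distance)], current[max(i, i ^ distance)])
            for i in range(n)
        ]
        distance *= 2
        steps += 1
    return current, steps


print(recursive_doubling([1, 2, 3, 4, 5, 6, 7, 8], lambda a, b: a + b))
```

With eight items the simulation finishes in three steps and every slot holds the full sum, matching the claim that all UEs know the result.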

### Tree based reduction



- Reduction with  $2^n$  items takes *n* steps.
- What if number of UEs < number of data items?
- Only one UE knows the result.
- Requires an associative and commutative operator, or an application that does not care about the combining order (example?)
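A tree-based reduction can be sketched by combining neighbouring pairs level by level. The function name `tree_reduce` is hypothetical. Because this sketch always pairs neighbours in their original order, only associativity matters here; an implementation that pairs UEs in some other order would also need commutativity.

```python
def tree_reduce(values, op):
    """Combine neighbouring pairs level by level; return (result, number of levels)."""
    level = list(values)
    steps = 0
    while len(level) > 1:
        paired = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # an odd element is carried to the next level
            paired.append(level[-1])
        level = paired
        steps += 1
    return level[0], steps


print(tree_reduce([1, 2, 3, 4, 5, 6, 7, 8], lambda a, b: a + b))  # (36, 3)
# Subtraction is not associative: (10 - 3) - (2 - 1) differs from ((10 - 3) - 2) - 1.
print(tree_reduce([10, 3, 2, 1], lambda a, b: a - b))             # (6, 2)
```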


# Implementation Mechanisms





#### Memory Consistency Models



- Simple to reason about.
- Compiler optimizations preserve these semantics.
- Independent operations can execute in parallel.






[Lamport] "A multiprocessor system is sequentially consistent if the result of any execution is the same as if the operations of all processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by the program"


# Understanding Program Order. Dekker's Algorithm

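Initially Flag1 = Flag2 = 0

| P1                                 | P2                                 |
|------------------------------------|------------------------------------|
| Flag1 = 1                          | Flag2 = 1                          |
| if (Flag2 == 0) *critical section* | if (Flag1 == 0) *critical section* |

#### Execution:

| P1 (Operation, Location, Value) | P2 (Operation, Location, Value) |
|---------------------------------|---------------------------------|
| Write, Flag1, 1                 | Write, Flag2, 1                 |
| Read, Flag2, 0                  | Read, Flag1, \_\_\_\_           |

Reads of 1 from Flag1 and Flag2 are valid; under sequential consistency the two reads cannot both return 0.

#### Problematic situation

- Write buffers with read bypassing.
- Overlap or reorder writes/reads by compiler / hardware.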
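The effect of a write buffer with read bypassing on the Dekker fragment above can be illustrated with a toy model. The `Processor` class and the `dekker` function are hypothetical names, and the model (a FIFO buffer that is drained explicitly) is a simplifying assumption rather than a description of any particular hardware.

```python
class Processor:
    """A processor with a FIFO write buffer in front of shared memory (toy model)."""

    def __init__(self, memory):
        self.memory = memory
        self.buffer = []   # pending (location, value) writes

    def write(self, location, value):
        self.buffer.append((location, value))   # not yet visible to others

    def read(self, location):
        # Forward from this processor's newest pending write to the location, if any.
        for pending_location, value in reversed(self.buffer):
            if pending_location == location:
                return value
        # Otherwise the read bypasses the buffered writes and goes to memory.
        return self.memory[location]

    def drain(self):
        for location, value in self.buffer:
            self.memory[location] = value
        self.buffer.clear()


def dekker(drain_before_read):
    memory = {"Flag1": 0, "Flag2": 0}
    p1, p2 = Processor(memory), Processor(memory)
    p1.write("Flag1", 1)
    p2.write("Flag2", 1)
    if drain_before_read:   # writes reach memory before any read is issued
        p1.drain()
        p2.drain()
    r1 = p1.read("Flag2")
    r2 = p2.read("Flag1")
    p1.drain()
    p2.drain()
    return r1, r2


print(dekker(drain_before_read=True))   # (1, 1)
print(dekker(drain_before_read=False))  # (0, 0): both enter the critical section
```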




### Sequential consistency

Result of an execution appears as if:

- All operations executed in some sequential order.
- Memory operations of each process in program order.
- Nothing specified about caches, write buffers.

#### Understanding Program Order. Ex 2

Initially A = Flag = 0

| P1        | P2                    |
|-----------|-----------------------|
| A = 23;   | while (Flag != 1) {;} |
| Flag = 1; | ... = A;              |

#### Execution:

| P1             | P2                 |
|----------------|--------------------|
| Write, A, 23   | Read, Flag, 0      |
| Write, Flag, 1 | Read, Flag, 1      |
|                | Read, A, \_\_\_\_  |

#### **Problematic situation**

- Overlap or reorder writes/reads by compiler / hardware.



Initially A = B = C = 0

| P1     | P2     | P3               | P4               |
|--------|--------|------------------|------------------|
| A = 1; | A = 2; | while (B != 1) ; | while (B != 1) ; |
| B = 1; | C = 1; | while (C != 1) ; | while (C != 1) ; |
|        |        | tmp1 = A;        | tmp2 = A;        |

Q: What are the possible values of tmp1 and tmp2?

Q: Can tmp1 = 1 and tmp2 = 2 be possible? How?

- Cache coherence protocol must serialize writes to same location.
- Writes to same location should be seen in same order by all.


### Sequential Consistency implementation

Implementations of this model must satisfy the following:

- Program order requirement: the operations of the same processor must be executed in program order.
- Write atomicity: all writes appear to be instantaneous (no buffer).
- All processors must see all write operations in the same order (cache coherence).
- Easier to implement in architectures with no caches, no write buffers, and blocking reads.

#### Atomicity Ex 2



#### Sequential Consistency - issues

- Sequential Consistency constraints
  - write $\rightarrow$ read
  - write $\rightarrow$ write
  - read $\rightarrow$ read, write
  - Implications (not allowed):
    - Read others' write early.
    - Read own write early.
    - Unserialized writes to the same location.
- Simple model for reasoning about parallel programs.
- Makes it very hard to transform a parallel program (automatically or manually):
  - Processor reordering for performance: write buffers, overlapped writes, non-blocking reads.
  - Compiler transformations: scalar replacement, register allocation, instruction scheduling.
  - Programmer reordering of code for aesthetics / SE requirements.

- Many architectures do not provide SC.
- Compiler optimizations under SC are limited.
- Software engineering issues.
- Give up!
- Use weaker models: relax the program order requirement and the write atomicity requirement.

- Memory operations of each process happen in program order.
- Any valid interleaving of read and write operations is OK.
- All processes must see the same interleaving.



# Sequential consistency examples

| P1 | W(x)1 |       |       |       |       |
|----|-------|-------|-------|-------|-------|
| P2 |       | W(x)2 |       |       |       |
| P3 |       |       | R(x)2 |       | R(x)1 |
| P4 |       |       |       | R(x)2 | R(x)1 |

Sequentially consistent, as both P3 and P4 see the writes in the same sequential order.

| P1 | W(x)1 |       |       |       |       |
|----|-------|-------|-------|-------|-------|
| P2 |       | W(x)2 |       |       |       |
| P3 |       |       | R(x)2 |       | R(x)1 |
| P4 |       |       |       | R(x)1 | R(x)2 |

Not sequentially consistent, as P3 and P4 see the writes in two different sequential orders.
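The two histories above are small enough to check by brute force. The sketch below tries every global order of the operations, keeps only those that respect each process's program order, and asks whether every read returns the most recent write. The function name and the tuple encoding `(kind, variable, value)` are hypothetical choices, and the approach only scales to tiny histories.

```python
from itertools import permutations


def sequentially_consistent(history, initial=0):
    """Brute force: does one global order explain every read in the history?"""
    ops = [(p, i) for p, seq in enumerate(history) for i in range(len(seq))]
    for order in permutations(ops):
        next_op = [0] * len(history)
        memory = {}
        for p, i in order:
            if i != next_op[p]:          # violates program order of process p
                break
            next_op[p] += 1
            kind, var, value = history[p][i]
            if kind == "W":
                memory[var] = value
            elif memory.get(var, initial) != value:
                break                    # read does not return the latest write
        else:
            return True
    return False


W1, W2 = ("W", "x", 1), ("W", "x", 2)
R1, R2 = ("R", "x", 1), ("R", "x", 2)

first = [[W1], [W2], [R2, R1], [R2, R1]]    # P3 and P4 agree on the order
second = [[W1], [W2], [R2, R1], [R1, R2]]   # P3 and P4 disagree

print(sequentially_consistent(first))   # True
print(sequentially_consistent(second))  # False
```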



| P1           | P2           | P3           |
|--------------|--------------|--------------|
| x = 1;       | y = 1;       | z = 1;       |
| print(y, z); | print(x, z); | print(x, y); |

Inconsistent execution (P2's print runs before its own y = 1, which violates program order):

1. x = 1;
2. print(y, z);
3. print(x, z);
4. y = 1;
5. z = 1;
6. print(x, y);
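The three-process program above has six statements, so its orderings can be enumerated directly. The sketch below counts how many of the 720 permutations respect every process's program order and checks the execution listed as inconsistent. The names `PROGRAM` and `respects_program_order` are hypothetical.

```python
from itertools import permutations

PROGRAM = {
    "P1": ["x = 1", "print(y, z)"],
    "P2": ["y = 1", "print(x, z)"],
    "P3": ["z = 1", "print(x, y)"],
}


def respects_program_order(order):
    """True if each process's statements appear in their program order."""
    for statements in PROGRAM.values():
        positions = [order.index(s) for s in statements]
        if positions != sorted(positions):
            return False
    return True


all_statements = [s for statements in PROGRAM.values() for s in statements]
all_orders = list(permutations(all_statements))
valid_orders = [o for o in all_orders if respects_program_order(o)]
print(len(all_orders), len(valid_orders))  # 720 90

listed = ("x = 1", "print(y, z)", "print(x, z)", "y = 1", "z = 1", "print(x, y)")
print(respects_program_order(listed))      # False
```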


### **Causal Consistency**

- Slightly weaker than the sequential consistency model.
- Causally related memory operations (issued by the same processor, or accessing the same memory location) are seen by every node in causal order.
- Causal order is transitive.
  - Memory operations that are causally related must have a total order, and
  - program order holds for the ones issued by the same processor.
- Hence such memory operations must be seen in the same order by all processors.
- Here, write atomicity has been slightly weakened.
- Sequential consistency, in contrast, requires that all nodes see all writes in the same order.

| P1 | W(x)1 |       |       | W(x)3 |       |       |
|----|-------|-------|-------|-------|-------|-------|
| P2 |       | R(x)1 | W(x)2 |       |       |       |
| P3 |       | R(x)1 |       |       | R(x)3 | R(x)2 |
| P4 |       | R(x)1 |       | R(x)2 | R(x)3 |       |

Causally consistent, but neither sequentially nor strictly consistent.

- Processors may see different orders.
- All orders respect the causal order (program order and read-write order).
- There is no global order, only a partial order for each processor.

#### **PRAM** consistency

- All processes see memory writes from one process in the order in which that process issued them.
- Writes from different processes may be seen in a different order on different processes.
- There are no guarantees about the order in which different processes see writes, except that two or more writes from a single source must arrive in order, as though they were in a pipeline.

| P1 | W(x)1 |       |       |       |       |
|----|-------|-------|-------|-------|-------|
| P2 |       | R(x)1 | W(x)2 |       |       |
| P3 |       |       |       | R(x)2 | R(x)1 |
| P4 |       |       |       | R(x)1 | R(x)2 |

- PRAM $\leq$ Causal $\leq$ SC $\leq$ Strict
- Also known as FIFO consistency. (Processor consistency is closely related: it additionally requires all processors to agree on the order of writes to the same location.)
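PRAM consistency can be checked per process: each process must be able to explain its own operations, merged with the writes of every other process, while keeping each source's writes in issue order. The sketch below encodes that idea. The function names and the `(kind, variable, value)` encoding are hypothetical, and the search is exhaustive, so it only suits tiny histories.

```python
def explainable(sequences, initial=0):
    """True if some merge of the sequences (each kept in order) explains every read."""
    def search(positions, memory):
        if all(p == len(s) for p, s in zip(positions, sequences)):
            return True
        for i, seq in enumerate(sequences):
            if positions[i] == len(seq):
                continue
            kind, var, value = seq[positions[i]]
            advanced = positions[:i] + (positions[i] + 1,) + positions[i + 1:]
            if kind == "W":
                if search(advanced, {**memory, var: value}):
                    return True
            elif memory.get(var, initial) == value and search(advanced, memory):
                return True
        return False

    return search((0,) * len(sequences), {})


def pram_consistent(history):
    """Each process sees its own ops plus every other process's writes, in issue order."""
    for i, own in enumerate(history):
        others = [[op for op in seq if op[0] == "W"]
                  for j, seq in enumerate(history) if j != i]
        if not explainable([own] + others):
            return False
    return True


W1, W2 = ("W", "x", 1), ("W", "x", 2)
R1, R2 = ("R", "x", 1), ("R", "x", 2)

slide_history = [[W1], [R1, W2], [R2, R1], [R1, R2]]
print(pram_consistent(slide_history))        # True
# Writes from a single source must arrive in order, so this one fails.
print(pram_consistent([[W1, W2], [R2, R1]]))  # False
```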


| P1 | W(x)1 |       |       |       |       |
|----|-------|-------|-------|-------|-------|
| P2 |       | R(x)1 | W(x)2 |       |       |
| P3 |       |       |       | R(x)2 | R(x)1 |
| P4 |       |       |       | R(x)1 | R(x)2 |

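- Violates causal consistency: P2 reads x = 1 before writing 2, so W(x)2 causally follows W(x)1, yet P3 sees the two writes in the opposite order.
- Removing the read from P2 makes the execution causally consistent.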

#### Weak Ordering


- Divide memory operations into data operations and synchronization operations.
- Synchronization operations act like a fence.
  - All data operations before a synch in program order must complete before the synch is executed.
  - All data operations after a synch in program order must wait for the synch to complete.
  - Synchronizations are performed in program order.
  - All accesses to synchronization variables are seen by all processes (or nodes, processors) in the same (sequential) order. Accesses to critical sections are seen sequentially.
  - All other accesses may be seen in different orders on different processes.
- The illusion of write atomicity has to be maintained.
- Hardware implementation of a fence: the processor has a counter that is incremented when a data op is issued and decremented when a data op completes.
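The counter-based fence described in the last bullet can be mimicked in a few lines. This is only a software analogy under the assumption that a synchronization operation may proceed exactly when the counter is zero; the class and method names are hypothetical.

```python
class FenceCounter:
    """Tracks outstanding data operations; a sync may proceed only at zero."""

    def __init__(self):
        self.outstanding = 0

    def issue_data_op(self):
        self.outstanding += 1

    def complete_data_op(self):
        if self.outstanding == 0:
            raise RuntimeError("no data operation is outstanding")
        self.outstanding -= 1

    def sync_may_proceed(self):
        return self.outstanding == 0


fence = FenceCounter()
fence.issue_data_op()
fence.issue_data_op()
blocked = not fence.sync_may_proceed()   # two data ops still in flight
fence.complete_data_op()
fence.complete_data_op()
ready = fence.sync_may_proceed()         # all earlier data ops have completed
print(blocked, ready)                    # True True
```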




- The programmer has to manage synchronization explicitly.
- Weak  $\leq$  PRAM  $\leq$  Causal  $\leq$  SC  $\leq$  Strict

| P1 | W(x)1 | W(x)2 | Sync |      |       |
|----|-------|-------|------|------|-------|
| P2 |       |       |      | Sync | R(x)1 |

- After synchronizing, P2 must observe the most recent write to x, which has the value 2. Thus, this is not a valid sequence.

#### **Release Consistency**

- A problem with weak consistency: when a synchronization variable is accessed, we do not know whether the process has finished writing shared data or is about to start reading it.
- Synchronization instructions are divided into Acquire (such as lock) and Release (such as unlock).
- Acquire: any memory operation after the acquire must be executed only after the acquire has completed (and been seen by all).
- Release:
  - The release must be executed only when all preceding memory operations are complete.
  - Accesses after the release in program order do not have to wait for the release (unless protected by another acquire).
- Doing an acquire means that writes by other processors to the protected variables become known.
- Doing a release means that writes to the protected variables are exported; they are seen by other machines either when those machines do a lock (lazy release consistency) or immediately (eager release consistency).
- A total order among all synchronization instructions must be maintained.
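The acquire/release discipline above corresponds to how a lock is used in everyday code. The sketch below uses `threading.Lock` from the Python standard library; the `counter` dictionary and `worker` function are hypothetical. It illustrates the programming pattern only and says nothing about how CPython implements memory visibility internally.

```python
import threading

lock = threading.Lock()
counter = {"value": 0}   # the protected variable


def worker(increments):
    for _ in range(increments):
        with lock:                 # acquire: earlier releases are visible from here
            counter["value"] += 1  # access to the protected variable
        # leaving the `with` block releases the lock and exports the write


threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter["value"])   # 4000
```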


# Weak and Release comparison

- Weak: Shared data can be counted on to be consistent only after a synchronization is done.
- Release: Shared data are made consistent when a critical region is exited.



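- Example (valid under release consistency: P3 does not acquire the lock, so it may still read the old value):
  - P1: L W(x)1 W(x)2 U
  - P2: L R(x)2 U
  - P3: R(x)1
- RC $\leq$ Weak $\leq$ PRAM $\leq$ Causal $\leq$ SC $\leq$ Strict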

- Delta and eventual consistency models
  - **Delta consistency**: write operations propagate through the shared memory system, and all replicas become consistent after a fixed time period $\delta$.
    - If an object is modified, reads during the short period following the modification may not be consistent.
    - After the fixed time period, the modification has propagated and reads are consistent.
  - **Eventual consistency**: writes propagate eventually (there is no fixed bound on the delay).
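Delta consistency can be illustrated with a toy replicated register driven by a logical clock. The class `DeltaRegister`, its parameters, and the integer timestamps are all assumptions made for this sketch: a write is visible immediately at the replica where it was issued and at every other replica once `delta` time units have passed.

```python
class DeltaRegister:
    """Toy replicated register: remote replicas see a write `delta` ticks later."""

    def __init__(self, delta, initial=0):
        self.delta = delta
        self.initial = initial
        self.writes = []   # (time, origin_replica, value)

    def write(self, time, replica, value):
        self.writes.append((time, replica, value))

    def read(self, time, replica):
        visible = [
            (t, value)
            for t, origin, value in self.writes
            if (t if origin == replica else t + self.delta) <= time
        ]
        # The most recent visible write wins; with none visible, return the initial value.
        return max(visible)[1] if visible else self.initial


register = DeltaRegister(delta=5)
register.write(time=10, replica=0, value=42)
print(register.read(time=11, replica=0))  # 42: the local replica sees its own write
print(register.read(time=11, replica=1))  # 0: still within the inconsistency window
print(register.read(time=15, replica=1))  # 42: delta has elapsed
```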




#### Programmer centric models

- The problem with relaxed models is that most of them are defined in terms of the performance optimizations they permit.
- From a programmer's perspective, it is not clear how to use them effectively:
  - how to reason about programs on systems with relaxed memory models;
  - how to use the safety nets minimally to get the desired semantics from a program.
- Even sequential consistency is not simple enough.
- We need models that are simple for the programmer, yet provide enough information about the program to apply optimizations and get efficiency.




#### Programmer centric models

#### Programmers understand their code:

- Different operations have different semantics.

| P1        | P2                 |
|-----------|--------------------|
| A = 23;   | while (Flag != 1); |
| B = 37;   | ... = B;           |
| Flag = 1; | ... = A;           |

- Flag = Synchronization; A, B = Data
- Can reorder data operations
- Distinguish data and synchronization

#### Data Race Free 0 - DRF0

Data-Race-Free-0 Program

- All accesses distinguished as either synchronization or data
- All races distinguished as synchronization (in any SC execution)

Data-Race-Free-0 Model

- Guarantees SC to data-race-free-0 programs
- (For others, reads return value of some write to the location)

#### Programming with Data Race Free 0 - DRF0

- Information required:
  - This operation never races (in any SC execution)

1. Write program assuming SC
2. For every memory operation specified in the program do:

#### Problems with data race free model

- It does not define any semantics for programs with data races.
- This is a concern for safe languages like Java, which provide safety for any program and cannot let the behavior of a program be ambiguous.
- Either define safe semantics for such programs, or identify them and prevent their execution.
- Define higher abstractions for programmers that are inherently data race free.
- Expensive for hardware to implement.

# Goals of Memory model

- Programmability? Lost intuitive interface of SC
- Portability? Many different models
- Performance? Can we do better?

#### Future:

- Parallel programs today are inherently nondeterministic.
- We need deterministic outcomes from our parallel programs.
- Deterministic outcomes from inherent nondeterminism. Possible?



#### Sources

- Patterns for Parallel Programming: Mattson, Sanders, Massingill.
- multicoreinfo.com
- Wikipedia
- fixstars.com
- Jernej Barbic slides.
- Loop Chunking in the presence of synchronization.
- Vivek Sarkar's slides.
- Sarita Adve's slides.
- Nimit Singhania's presentation.
- http://regal.csep.umflint.edu/ swturner/Classes/csc577/Online/Chapter06/Chapter06.html
- Java Memory Model JSR-133: "Java Memory Model and Thread Specification Revision"


