# Scheduling (Part 1): Introduction and Acyclic Scheduling

CS 380C: Advanced Compiler Techniques
Thursday, October 11th 2007

## Lecture Overview

#### Code Generator

- Back end part of the compiler (code generator)
  - Instruction scheduling
  - Register allocation

#### Instruction Scheduling

- Input: set of instructions
- Output: total order on that set

#### Lectures

- Introduction and acyclic scheduling (today)
- Software pipelining (Tuesday 23)

#### Today

- Definition of instruction scheduling
- Constraints
- Scheduling process
- Acyclic scheduling: list scheduling

## Introduction

#### Context

- Back end part of the compiler chain (code generation)
- Inputs: set of instructions (assembly instructions)
- Outputs: a schedule
  - Set of scheduling dates (one date per instruction)
  - Total order

#### Goal

- Minimize the execution time (number of cycles)
- Other possible objective functions to minimize:
  - Power consumption
  - ...

## Constraints

- Is it possible to generate any schedule? Is it possible to change the instruction order?
- Example: `a = b + c ; d = a + 3 ; e = f + d ;`
- No, because of data dependences
  - Flow dependences on a and d
- Data dependences enforce a partial order for the final schedule
- Other types of constraints?
- Example: `a = b + c ; d = e + f ;`
  - Target architecture with 1 ALU
  - Impossible to use the same functional unit concurrently
  - Resource constraints

#### Constraints

- Two types of constraints: data dependences and resource usage

#### Rule

- The final schedule must respect these constraints

## Constraints influencing Instruction Scheduling

#### Dealing with constraints

- How to represent such constraints so that the scheduling process can deal with them?
- Data dependences $\rightarrow$ graph
- Resource constraints $\rightarrow$ reservation tables or automaton

## Data Dependence Representation

#### Data Dependence Graph (DDG)

- 1 node $\Leftrightarrow$ 1 instruction
- 1 edge $\Leftrightarrow$ 1 flow dependence (directed graph)
- Edge label = parameters of the dependence
  - Latency (# of cycles)
  - Distance (# of iterations)
- Example (1-cycle latency):

```
a = b + c ; // ADD1
d = a + 3 ; // ADD2
e = a + d ; // ADD3
```

- Resulting edges: ADD1 $\rightarrow$ ADD2 (on a), ADD1 $\rightarrow$ ADD3 (on a), ADD2 $\rightarrow$ ADD3 (on d)

## Data Dependence Representation - Example 2

- Daxpy loop: double-precision alpha times X plus Y
  - $y \leftarrow \alpha \times x + y$
- Targeting the Itanium ISA:
  - LD: load from memory (latency 6 cycles from L2 cache)
  - ST: store to memory
  - FMA: fused multiply-add (latency 4 cycles)

## Data Dependence Representation - Example 3

- Daxpy loop with inter-iteration dependence
- C-like code: `for (i = 0; i < N; i++) Y[i+2] = alpha * X[i] + Y[i];`
- Inter-iteration dependence
  - Distance of 2

## Data Dependence Representation - Remarks

#### Remarks

- Circuits (cycles in the graph) are allowed when the distance is > 0
- For a basic block, the DDG is only a DAG

#### Drawbacks

- One fixed number for the latency
  - Fixed latencies
  - May not be suitable for cache/memory accesses
- One number for the distance
  - Only uniform dependences

## Resource Constraint Representation

#### Resources

- Second set of constraints: resource usage/assignment

#### Overview

- Need to check if two instructions may race for the same resource (functional unit, bus, pipeline stage, ...)
- Can be several cycles ahead (latency > 1)

#### State-of-the-art

- 2 representations: reservation tables and automaton

## Reservation Tables - Definition

#### Reservation tables

- Intuitive way: resource usage of one instruction as a 2D table

#### Semantics

- Rows: one per cycle of the instruction's latency
- Columns: one per resource available in the target architecture
- Cell (i,j) is marked $\Leftrightarrow$ the instruction requires the $j^{th}$ resource during its $i^{th}$ cycle of execution
- Binary tables
- Several tables per instruction (alternatives/options)

## Reservation Tables - Example 1

#### Example with pipelined resources

- 2 fully pipelined resources (ALU): ALU0 and ALU1
- 2 instructions: ADD and MUL
- Constraints:
  - ADD can be executed on ALU0 or ALU1
  - MUL can only be executed on ALU1

#### ADD instruction

| Cycle | ALU0 | ALU1 |
|-------|------|------|
| 0     | X    |      |

OR

| Cycle | ALU0 | ALU1 |
|-------|------|------|
| 0     |      | X    |

#### MUL instruction

| Cycle | ALU0 | ALU1 |
|-------|------|------|
| 0     |      | X    |

- Are the following sequences valid? (`|` separates instructions issued in the same cycle, `;` moves to the next cycle)
  - `ADD | ADD`: ✓
  - `ADD | MUL`: ✓
  - `MUL | MUL`: ×
  - `ADD ; ADD`: ✓
  - `ADD | MUL ; MUL`: ✓
- Test if instructions can be scheduled together: AND operation
- Update resource usage: OR operation

## Reservation Tables - Example 2

#### Example with complex resources

- 2 resources: ALU and LD/ST
- 3 instructions: ADD, SUB and LD
- Constraints:
  - ADD instructions have a latency of 1 cycle
  - SUB instructions have a latency of 2 cycles
  - LD first uses the ALU for 1 cycle and then the LD/ST resource for 1 cycle

#### ADD instruction

| Cycle | ALU | LD/ST |
|-------|-----|-------|
| 0     | X   |       |

#### SUB instruction

| Cycle | ALU | LD/ST |
|-------|-----|-------|
| 0     | X   |       |
| 1     | X   |       |

#### LD instruction

| Cycle | ALU | LD/ST |
|-------|-----|-------|
| 0     | X   |       |
| 1     |     | X     |

- Are the following sequences valid?
  - `ADD | SUB`: ×
  - `ADD | ADD`: ×
  - `SUB | LD`: ×
  - `LD ; ADD`: ✓
  - `LD ; SUB`: ✓
  - `SUB ; LD`: ×
  - `ADD ; SUB ; LD`: ×
  - `LD ; ADD ; SUB`: ✓
- Test and update according to the latencies of the instructions

## Reservation Tables - Summary

- AND operation to check if several instructions can be scheduled together
- OR operation to update the resource state

#### Advantages

- Intuitive representation
- Small storage

#### Drawbacks

- Many tests
- Redundant information

## Automaton

- Pre-processing of possible resource usages

#### Semantics

- 1 state of the automaton $\Leftrightarrow$ 1 assignment of resources
- 1 transition of the automaton $\Leftrightarrow$ scheduling of an instruction at the current cycle

#### Transition label

- Label of a transition: the instruction to schedule
- Special label: NOP instruction to advance the current cycle

## Automaton - Example 1

- 2 fully pipelined resources $\Rightarrow$ 2 bits per state
- Are the following sequences valid?
  - `ADD | ADD`: ✓
  - `ADD | MUL`: ✓
  - `MUL | MUL`: ×
  - `ADD ; ADD`: ✓
  - `ADD | MUL ; MUL`: ✓

## Automaton - Example 2

- Instructions ADD, SUB and LD, with the same reservation tables as in Reservation Tables - Example 2
- Are the following sequences valid?
  - `ADD | SUB`: ×
  - `ADD | ADD`: ×
  - `SUB | LD`: ×
  - `LD ; ADD`: ✓
  - `LD ; SUB`: ✓
  - `SUB ; LD`: ×
  - `ADD ; SUB ; LD`: ×
  - `LD ; ADD ; SUB`: ✓

## Automaton - Summary

#### Use

- An instruction can be scheduled in the current cycle if there is an outgoing arc from the current state labeled with this instruction
- Update the state by following this arc

#### Advantages

- Low query time: table lookup

#### Drawbacks

- Huge computational time (offline)
- Large storage
  - $\Rightarrow$ split into several automata
- Not very flexible
  - e.g. hard to schedule instructions when not proceeding cycle by cycle

## Scheduling Process

#### Scheme of a classical scheduler

- High-level part: main heuristic taking care of the data dependences and driving the scheduling process
- Low-level part: storage of the resource usages and updates of the global assignments

#### Scheduling process

- Process begins in the high-level part
- Pick the next instruction to insert in the partial schedule
- Query the low-level part for resource assignments:
  - If okay, then go on with another instruction
  - Otherwise backtrack

## Acyclic Scheduling: List Scheduling

#### Context

- Schedule a basic block $\Rightarrow$ acyclic scheduling
- Goal: minimize the length of the generated code
- Must respect data dependences and resource constraints

#### Example

- Sum the first elements of 3 vectors X, Y and Z into the first cell of array A: `A[0] = X[0] + Y[0] + Z[0];`
- 3 instructions: ADD, LD, ST (1-cycle latency)
- 3 fully pipelined resources: ALU, LD0 and LD/ST1 units

## Acyclic Scheduling - Example

- Reservation tables and DDG? [Figure: reservation tables and DDG of the example; the DDG ends in the ST(A) node]
- A possible schedule?
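The two requirements on a schedule (data dependences and resource constraints) can be checked mechanically. The sketch below does so for the `A[0] = X[0] + Y[0] + Z[0]` example. The original figure is not available, so both the DDG edges and the unit alternatives are assumptions: a load may use LD0 or LD/ST1, a store only LD/ST1, and an add only the ALU. The names `EDGES`, `ALTERNATIVES`, `opcode`, `fits` and `is_valid` are hypothetical helpers. Resource usage is tested with AND and updated with OR, as in the reservation-table summary.

```python
from itertools import product

# One bit per resource of the example (ALU, LD0, LD/ST1).
ALU, LD0, LDST1 = 0b001, 0b010, 0b100

# Assumed one-cycle reservation tables, one bit mask per alternative.
ALTERNATIVES = {
    "ADD": [ALU],
    "LD": [LD0, LDST1],  # assumption: a load fits either memory unit
    "ST": [LDST1],       # assumption: a store only fits LD/ST1
}

# Assumed DDG: (producer, consumer, latency in cycles).
EDGES = [
    ("LD(X)", "ADD1", 1), ("LD(Y)", "ADD1", 1),
    ("ADD1", "ADD2", 1), ("LD(Z)", "ADD2", 1),
    ("ADD2", "ST(A)", 1),
]


def opcode(name):
    """Map an instance name such as 'LD(X)' or 'ADD1' to its opcode."""
    return "".join(ch for ch in name.split("(")[0] if ch.isalpha())


def fits(masks):
    """Return True if the chosen unit masks can share one cycle."""
    used = 0
    for mask in masks:
        if used & mask:   # AND: test for a conflict
            return False
        used |= mask      # OR: update the resource state
    return True


def respects_dependences(dates):
    """Each consumer starts at least `latency` cycles after its producer."""
    return all(dates[dst] >= dates[src] + lat for src, dst, lat in EDGES)


def respects_resources(dates):
    """Every cycle needs at least one conflict-free choice of alternatives."""
    by_cycle = {}
    for name, cycle in dates.items():
        by_cycle.setdefault(cycle, []).append(name)
    for names in by_cycle.values():
        choices = [ALTERNATIVES[opcode(n)] for n in names]
        if not any(fits(combo) for combo in product(*choices)):
            return False
    return True


def is_valid(dates):
    return respects_dependences(dates) and respects_resources(dates)


candidate = {"LD(X)": 1, "LD(Y)": 1, "ADD1": 2, "LD(Z)": 2, "ADD2": 3, "ST(A)": 4}
print(is_valid(candidate))                   # True
print(is_valid({**candidate, "LD(Z)": 1}))   # False: three loads, two memory units
```

Because every unit is assumed fully pipelined with a one-cycle table, checking each cycle in isolation is enough here. Multi-cycle tables such as SUB or LD in Reservation Tables - Example 2 would need one mask per future cycle.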
- A possible schedule respecting both constraints and minimizing the total length:

```
LD(X) | LD(Y) ; // Cycle 1
ADD1 | LD(Z) ; // Cycle 2
ADD2 ; // Cycle 3
ST ; // Cycle 4 = length
```

- It is good to execute as many instructions as possible in each cycle
- Picking the right instruction is crucial (LD(X) and LD(Y) before LD(Z))
- Be careful with explicit resource assignments through reservation tables:
  - Only one valid combination to execute a ST and a LD in the same cycle

## List Scheduling

#### Principle

- The list scheduling algorithm is based on this approach:
  - Sort the instructions according to a priority based on data dependences
  - Pick one ready instruction in priority order
  - Repeat until every instruction has been scheduled

#### Priority

- Many priority schemes exist
- We will use the *height-based priority*:
  - Priority of a node is the longest path from that node to the furthest leaf
  - The path is weighted by latencies

## Conclusion

#### Instruction scheduling

- Generate a total order of a set of instructions

#### Constraints

- Data dependences
  - Represented as a graph: DDG
- Resource usages
  - Represented as reservation tables or automaton

#### Acyclic scheduling

- List scheduling
  - Assign priority to instructions according to their contribution to the critical path
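As an illustration of list scheduling with the height-based priority, here is a minimal sketch applied to the `A[0] = X[0] + Y[0] + Z[0]` example. It is not the lecture's implementation. The DDG (`SUCCS`) and the unit alternatives (`UNITS`) are assumptions reconstructed from the schedule shown in the example, and `opcode`, `height` and `list_schedule` are hypothetical names. The scheduler proceeds cycle by cycle and never backtracks, which is sufficient for this small case.

```python
from functools import lru_cache

LATENCY = 1  # every instruction of the example has a 1-cycle latency

# Assumed DDG of the example: node -> successors.
SUCCS = {
    "LD(X)": ["ADD1"], "LD(Y)": ["ADD1"], "LD(Z)": ["ADD2"],
    "ADD1": ["ADD2"], "ADD2": ["ST(A)"], "ST(A)": [],
}

# Assumed unit alternatives, tried in the listed order.
UNITS = {"LD": ["LD0", "LD/ST1"], "ADD": ["ALU"], "ST": ["LD/ST1"]}


def opcode(name):
    """Map an instance name such as 'LD(X)' or 'ADD1' to its opcode."""
    return "".join(ch for ch in name.split("(")[0] if ch.isalpha())


@lru_cache(maxsize=None)
def height(node):
    """Latency-weighted longest path from `node` to the furthest leaf."""
    return max((LATENCY + height(s) for s in SUCCS[node]), default=0)


def list_schedule():
    """Return a dict mapping each instruction to its issue cycle (from 1)."""
    preds = {n: [] for n in SUCCS}
    for src, dsts in SUCCS.items():
        for dst in dsts:
            preds[dst].append(src)

    # Highest priority first; ties keep the declaration order of SUCCS.
    order = sorted(SUCCS, key=lambda n: -height(n))
    dates, cycle = {}, 1
    while len(dates) < len(SUCCS):
        free = {"ALU", "LD0", "LD/ST1"}  # fully pipelined: all free each cycle
        for node in order:
            if node in dates:
                continue
            # High-level part: are all operands available in this cycle?
            ready = all(p in dates and dates[p] + LATENCY <= cycle
                        for p in preds[node])
            # Low-level part: is a suitable unit still free in this cycle?
            unit = next((u for u in UNITS[opcode(node)] if u in free), None)
            if ready and unit is not None:
                free.remove(unit)
                dates[node] = cycle
        cycle += 1
    return dates


schedule = list_schedule()
for cycle in sorted(set(schedule.values())):
    print(cycle, [n for n in schedule if schedule[n] == cycle])
print("length =", max(schedule.values()))  # length = 4
```

With these assumptions, LD(X) and LD(Y) have height 3 and LD(Z) has height 2, so the two loads feeding ADD1 are issued first and the result has the same 4-cycle length as the schedule in the example. Trying LD0 before LD/ST1 for loads is a simple way to keep LD/ST1 available for a store. A production scheduler would query reservation tables or an automaton instead.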