## Compiler Construction 2009/2010 Instruction Selection

Peter Thiemann

December 1, 2009

▲□▶ ▲□▶ ▲三▶ ▲三▶ 三三 のへで

## **Optimal Tiling**

No two adjacent tiles can be replaced by a larger tile of lower cost.

#### **Optimum Tiling**

The total cost of the tiling is minimal among all possible tilings.

◆□▶ ◆□▶ ▲□▶ ▲□▶ ■ ののの

Tiling is optimum ⇒ tiling is optimal

Temp munchExpr (Tree.Exp e) {
 test patterns from largest to smallest

choose the first matching pattern with instruction INS

foreach (e\_i : wildcard (pattern, e))
 recursively invoke temp\_i = munchExpr (e\_i)

```
return temp_0
```

}



▲□▶ ▲□▶ ▲□▶ ▲□▶ = 三 のへで



▲□▶ ▲□▶ ▲□▶ ▲□▶ = 三 のへで



# $\begin{array}{rcl} \text{ADDI} & r_1 & \leftarrow & r_0 + 1 \\ \text{LOAD} & r_1 & \leftarrow & M[r_1 + 2] \end{array}$

▲□▶ ▲□▶ ▲□▶ ▲□▶ ▲□ ● のへぐ

```
void matchExpr (Tree.Exp e) {
  for (Tree.Exp kid : e.kids())
    matchExpr (kid);
  cost = INFINITY
  for each pattern P i
    if (P i.matches (e)) {
      cost i = cost(P i)
             + sum ((wildcard (P i, e)).mincost)
      if (cost_i < cost) { cost = cost_i; choice =
    }
  e.matched = P_{choice}
  e.mincost = cost
}
```

◆□▶ ◆□▶ ▲□▶ ▲□▶ ■ ののの

```
Temp emission (Tree.Exp e) {
  foreach (e_i : wildcard (e.matched, e)) {
    temp_i = emission (e_i)
  }
```

◆□▶ ◆□▶ ▲□▶ ▲□▶ □ のQ@

```
return temp_0
```

}

- Additional side conditions (e.g., size of constants, special constants)
- Matching of patterns can be done with a decision tree that avoids checking the same node twice
- The bottom up matcher can remember partial matches and avoid rechecking the same nodes

< □ > < 同 > < 三 > < 三 > < 三 > < ○ < ○ </p>

 $\Rightarrow$  tree automata

A <u>bottom-up tree automaton</u> is  $\mathcal{M} = (Q, \Sigma, \delta, F)$  where

- Q is a finite set of states
- Σ a ranked alphabet (the tree constructors)
- $\delta \subseteq \Sigma^{(n)} \times Q^n \times Q$  ( $\forall n$ ) the transition relation
- $F \subseteq Q$  the set of final states

 ${\cal M}$  is deterministic if  $\delta$  is a function.

Define 
$$\Rightarrow$$
 on  $T_{\Sigma+Q}$  by  
 $t[F(q_1, \dots, q_n)] \Rightarrow t[q_0]$  if  $(F, q_1, \dots, q_n, q_0) \in \delta$   
 $t \in L(\mathcal{M})$  if  $t \Rightarrow^* q$  with  $q \in F$ 

(ロ) (同) (三) (三) (三) (三) (○) (○)

#### Tree automaton for



▲□▶ ▲□▶ ▲□▶ ▲□▶ ▲□ ● のへぐ

• 
$$Q = \{q_t, q_c, q_a, q_m\}$$
  
•  $F = \{q_m\}$   
•  $\delta = \underbrace{\sum \quad q_1 \quad q_2 \quad q_{out}}_{\text{CONST}} \qquad q_c \quad q_c \quad q_t \quad q_t \quad q_t \quad q_t \quad q_t \quad q_a \quad MEM \quad q_a \quad q_m$ 

- Generate a bu tree automaton for each pattern
- Simulate them in parallel on expression tree
- At each node
  - determine all patterns whose root matches the current node

(日) (日) (日) (日) (日) (日) (日)

- compute their cost and mark the node with the minimum cost pattern
- There are tools to compile a pattern specification to such an automaton ⇒ BURG (Fraser, Hanson, Proebsting)

- Extension: Different pattern sets leading to different kinds of results
- Some architectures habe different kinds of registers that obey different restrictions
- Set of patterns for each kind of register
- Example: M680x0 distinguishes data and address registers, only the latter may serve for address calculations and indirect addressing

(日) (日) (日) (日) (日) (日) (日)

 $\Rightarrow$  Tree grammar needed

A context-free tree grammar is defined by  $\mathcal{G} = (\textit{N}, \Sigma, \textit{P}, \textit{S})$  where

- N is a finite set of non-terminals
- Σ is a ranked alphabet
- $S \in N$  is the start symbol
- $P \subseteq N \times T_{\Sigma+N}$

Define  $\Rightarrow$  on  $T_{\Sigma+N}$  by

$$t[A] \Rightarrow t[r]$$
 in  $A \rightarrow r \in P$ 

< □ > < 同 > < 三 > < 三 > < 三 > < ○ < ○ </p>

 $t \in L(\mathcal{G})$  if  $S \Rightarrow^* t \in T_{\Sigma}$ 



## Efficiency of Tiling

- N number of nodes in input tree
- T number of patterns
- *K* average number of labeled nodes in pattern
- *K'* maximum number of nodes to check for a match
- T' average number of patterns that match at each node
- Maximal munch. Each match consumes K nodes: test for matches at N/K nodes. At each candidate node, choose pattern with K' + T' tests.

(K' + T')N/K steps on average. Worst case: K = 1.

- **Dynamic programming.** Tests every pattern at every node: (K' + T')N.
- ⇒ same linear worst-case complexity. (K' + T')/K is constant, anyway.

| RISC                            | CISC                                              |
|---------------------------------|---------------------------------------------------|
| 32 registers                    | few registers (16, 8, 6)                          |
| one class of registers          | different classes with re-<br>stricted operations |
| ALU instructions only be-       | ALU operations with mem-                          |
| tween registers                 | ory operands                                      |
| three-adress instructions       | two-address instructions                          |
| $r_1 \leftarrow r_2 \oplus r_3$ | $r_1 \leftarrow r_1 \oplus r_2$                   |
| one addressing mode for         | several addressing modes                          |
| load/store                      |                                                   |
| every instruction 32 bits long  | different instruction lengths                     |
| one result / instruction        | instructions w/ side effects                      |

### Pentium / x86 (32-bit)

- six GPR, sp, bp
- multiply / divide only on eax
- generally two-address instructions

## MC 680x0 (32-bit)

• 8 data registers, 7 address registers, 2 stack registers

◆□▶ ◆□▶ ▲□▶ ▲□▶ ■ ののの

- ALU operations generally on data registers, indirect addressing only through address registers
- generally two-address instructions
- esoteric addressing modes (68030)

- [Few Registers] generate temporaries and rely on register allocation
- [Restricted Registers] generate extra moves and hope that register allocation can get rid of them. Example:
  - Multiply on Pentium requires one operand and destination in eax
  - Most-significant word of result stored to edx

Hence for  $t_1 \leftarrow t_2 \cdot t_3$  generate

▲□▶ ▲圖▶ ▲≣▶ ▲≣▶ ▲≣ めるの

[Two-address instructions] Generate extra move instructions. For  $t_1 \leftarrow t_2 + t_3$  generate mov  $t_1, t_2$   $t_1 \leftarrow t_2$ add  $t_1, t_3$   $t_1 \leftarrow t_1 + t_3$ ; Special addressing modes] Example: memory addressing mov eax, [ebp-8] add [ebp-8], ecx add eax, ecx mov [ebp-8], eax

Two choices:

- Ignore and use separate load and store instructions. Same speed, but an extra register gets trashed.
- Avoid register pressure and use addressing mode. More work for the pattern matcher.

### • [Variable-length instructions] No problem for instruction selection or register allocation. Assembler deals with it.

[Instructions with side effects]
 Example: autoincrement after memory fetch (MC 680x0)

$$r_2 \leftarrow M[r_1]; \qquad r_1 \leftarrow r_1 + 4$$

(ロ) (同) (三) (三) (三) (三) (○) (○)

Hard to incorporate in tree-pattern based instruction selection.





Different algorithm for instruction selection

Class hierarchy for representing instructions

Instr



◆□▶ ◆□▶ ▲□▶ ▲□▶ ■ ののの

Each instruction specifies a

- set of defined temporaries
- set of used temporaries
- set of branch targets

each of which may be empty

## Abstract Assembly Language

Creating an Operation



Independent of register allocation and jump labels

An operation's def and use set must account for <u>all</u> defined and used registers.

• Example: the multiplication instruction on Pentium

```
new OPER ("mul 's0",
    L (pentium.EAX, L (pentium.EDX, null))
    L (argTemp, L (pentium.EAX, null)));
```

< □ > < 同 > < 三 > < 三 > < 三 > < ○ < ○ </p>

- Example: a procedure call trashes many registers (see the calling convention of the architecture)
  - return address
  - return-value register
  - caller-save registers