AN2203 Freescale Semiconductor / Motorola, AN2203 Datasheet - Page 50

Description
MPC7450 RISC Microprocessor Family Software Optimization Guide
Manufacturer
Freescale Semiconductor / Motorola
Other Optimizations Worth Investigating
4.4.2 Software Pipelining
With longer pipelines, more functional units, and a higher instruction issue rate, the MPC7450 can exploit
more instruction-level parallelism (ILP) than previous microprocessors. Loops that have long dependency
chains may benefit from software pipelining. In those loops, software pipelining increases ILP by
overlapping the execution of several iterations of the loop.
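As a rough illustration of the idea, the following C sketch software-pipelines a simple multiply-accumulate loop by hand. The function names (dot_plain, dot_pipelined) and the loop itself are hypothetical and do not come from this guide; the sketch only assumes a loop whose body is a load, multiply, add dependency chain. The loads for iteration i are started before the multiply-add of iteration i-1, so work from two iterations is in flight at once.

```c
#include <stddef.h>

/* Hypothetical baseline: each iteration is a load -> multiply -> add chain. */
long dot_plain(const int *a, const int *b, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (long)a[i] * b[i];
    return sum;
}

/* Hand software-pipelined version (a sketch, not the guide's code). */
long dot_pipelined(const int *a, const int *b, size_t n)
{
    long sum = 0;
    if (n == 0)
        return 0;

    /* Prologue: start the loads for iteration 0. */
    int x = a[0];
    int y = b[0];

    for (size_t i = 1; i < n; i++) {
        /* Loads for iteration i are issued first... */
        int next_x = a[i];
        int next_y = b[i];
        /* ...while the multiply-add for iteration i-1 uses values already loaded. */
        sum += (long)x * y;
        x = next_x;
        y = next_y;
    }

    /* Epilogue: finish the last iteration. */
    sum += (long)x * y;
    return sum;
}
```

Both functions return the same result; the pipelined form trades a prologue and epilogue for a loop body in which the loads no longer wait behind the previous multiply-add. An optimizing compiler may already perform this transformation, so measuring before restructuring by hand is advisable.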
4.4.3 Loop Unrolling for Long Pipelines
Inner loops with small bodies may benefit from unrolling on the MPC7450 more than on prior microprocessors
that implement the PowerPC architecture. Unrolling increases the number of instructions in the loop body and
reduces the number of times the loop executes, which reduces possible stalls. The drawback of this technique
is the increased code size. In some cases, the larger code can cause more instruction cache misses, which
may cost more performance than the loop unrolling gained.
The cost of the set-up and fix-up code may also affect the loop unrolling trade-off.
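The set-up and fix-up costs can be seen in a small C sketch. The function name sum_unrolled4 and the summation loop are hypothetical, chosen only to illustrate the trade-off: the set-up code computes how many four-element blocks to process, and the fix-up loop handles up to three leftover elements when the element count is not a multiple of four.

```c
#include <stddef.h>

/* Hypothetical example: sum an array with the loop body unrolled four times. */
long sum_unrolled4(const int *v, size_t n)
{
    long sum = 0;
    size_t i = 0;

    /* Set-up: number of four-element blocks (a loop count of n/4). */
    size_t blocks = n / 4;

    /* Unrolled body: one loop branch per four elements. */
    for (; blocks != 0; blocks--) {
        sum += v[i];
        sum += v[i + 1];
        sum += v[i + 2];
        sum += v[i + 3];
        i += 4;
    }

    /* Fix-up: zero to three remaining elements. */
    for (; i < n; i++)
        sum += v[i];

    return sum;
}
```

For short arrays the set-up and fix-up work can outweigh the branches saved, which is one reason the trade-off depends on typical loop counts.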
Loop unrolling can be applied to further extend the code example first used in Section 3.1.1, “Fetching.”
Because every taken branch on the MPC7450 represents at least one cycle of lost fetch opportunity, unrolling
loops can often be more advantageous than it has been in the past. The following code assumes that the loop
can be unrolled four times (that is, the loop size is evenly divisible by four) and that a value of
loopsize/4 was previously loaded into the CTR (unlike the prior two examples, which assumed the loop
size was loaded into the CTR).
xxxxxx00  loop: lwzu r10,0x4(r9)
xxxxxx04        add  r11,r11,r10
xxxxxx08        lwzu r10,0x4(r9)
xxxxxx0C        add  r11,r11,r10
xxxxxx10        lwzu r10,0x4(r9)
xxxxxx14        add  r11,r11,r10
xxxxxx18        lwzu r10,0x4(r9)
xxxxxx1C        add  r11,r11,r10
xxxxxx20        bdnz loop

Table 4-1 shows that the fetch supply is no longer the bottleneck for the above code sequence. At this point,
the limiting bottleneck becomes the single cache port. For this code, one effective iteration (lwzu/add) is
completing per cycle. Loop unrolling doubles the performance of the aligned example case.

Table 4-1. MPC7450 Execution of One to Two Iterations of Code Loop Example

Instruction   0    1    2    3    4    5    6    7    8    9
lwzu (1)      D    I    E0   E1   E2   C    -    -    -    -
add (1)       D    I    -    -    -    E    C    -    -    -
lwzu (2)      -    D    I    E0   E1   E2   C    -    -    -
add (2)       -    D    I    -    -    -    E    C    -    -
lwzu (3)      -    -    D    I    E0   E1   E2   C    -    -
add (3)       -    -    D    I    -    -    -    E    C    -
lwzu (4)      -    -    -    D    I    E0   E1   E2   C    -
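The regular pattern in Table 4-1 can be summarized arithmetically. The C sketch below is only a reading of the table, not a model of the MPC7450 pipeline; the structure and function names (pair_timing, table_4_1_timing) are hypothetical. It assumes, as the table rows suggest, that the lwzu/add pair for effective iteration k (counting from 1) dispatches in cycle k-1 and that each later stage follows at a fixed offset. The table stops at lwzu (4), so the add values for k = 4 are an extrapolation.

```c
/* Cycle numbers for effective iteration k, as read off Table 4-1. */
struct pair_timing {
    int dispatch;       /* D  for both lwzu and add */
    int issue;          /* I  for both lwzu and add */
    int lwzu_e0;        /* first of the three lwzu execute stages */
    int lwzu_complete;  /* C  for lwzu */
    int add_execute;    /* E  for add, once the loaded value is available */
    int add_complete;   /* C  for add */
};

struct pair_timing table_4_1_timing(int k)
{
    struct pair_timing t;
    t.dispatch      = k - 1;
    t.issue         = k;
    t.lwzu_e0       = k + 1;
    t.lwzu_complete = k + 4;  /* E0, E1, E2, then C */
    t.add_execute   = k + 4;  /* same cycle in which the lwzu completes */
    t.add_complete  = k + 5;
    return t;
}
```

Under this reading, consecutive iterations complete one cycle apart, which matches the statement that one effective lwzu/add iteration completes per cycle.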