AN2203: MPC7450 RISC Microprocessor Family Software Optimization Guide
Freescale Semiconductor / Motorola
Load/Store Unit (LSU)
Note that instruction 2 stalls in stage E1 (in the RA latch in Table 3-27). This stall occurs because
instruction 2 requires the same cache line that instruction 0 missed on. Instruction 2 does not finish
execution until cycle 40 (that is, eight cycles after instruction 0). This delay has two major components.
The first component is that instruction 0 finishes by using critical forwarded data, whereas instruction
2 must wait for the full cache line to arrive before it can start execution (a 4-cycle delay in this example).
The second component is due to the cache being updated and the occurrence of a pipeline restart
condition.
The second issue that this example shows is that the misses are not fully pipelined. Instructions 0 and 4 miss
in the data cache and L2 cache but hit in the L3 cache. The stall caused by the line miss alias between
instructions 0 and 2 delays the start of the miss access for instruction 4 by many cycles. A simple
reordering of the code, as shown in Table 3-28, allows the two load misses to pipeline to the L3
cache, improving performance by nearly 50 percent.
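The effect of the reordering can be illustrated with a toy model. The sketch below is not a timing simulation: it assumes 32-byte L1 cache lines (consistent with offsets 0x0 and 0x4 sharing a line while 0x20 does not), assumes loads access the cache in program order, and simply counts how many distinct lines have been touched before the first load that aliases an earlier line. The names `LINE_BYTES`, `line_of`, and `misses_before_first_alias` are hypothetical, and the "original" load order is inferred from the text because Table 3-27 is not reproduced here.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed L1 data cache line size in bytes (inferred from the example offsets). */
#define LINE_BYTES 32u

/* Cache line index of a byte offset from a line-aligned base address. */
static uint32_t line_of(uint32_t offset)
{
    return offset / LINE_BYTES;
}

/*
 * Toy in-order model: return the number of distinct cache lines touched
 * before the first load that targets an already-touched line. If no load
 * aliases an earlier one, every load counts and n is returned.
 */
static size_t misses_before_first_alias(const uint32_t *offsets, size_t n)
{
    size_t distinct = 0;
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < i; j++) {
            if (line_of(offsets[j]) == line_of(offsets[i]))
                return distinct; /* this load aliases an earlier line */
        }
        distinct++;
    }
    return distinct;
}
```

With the inferred original load order (offsets 0x0, 0x4, 0x20) the function returns 1, because the second load aliases the first before the 0x20 miss is reached. With the reordered offsets (0x0, 0x20, 0x4) it returns 2, which corresponds to both misses being launched before the aliasing load stalls.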
This type of stall is common in some kinds of code, including simple data streaming and striding array
accesses. For example, in a long stream of vector loads with addresses incrementing by 16 bytes (a quad word)
per load, every other load stalls in this manner, and no miss pipelining occurs. This stall becomes
an even larger performance bottleneck when cache misses must go to the system bus, because the
opportunities to pipeline system bus misses are lost. This performance problem can be solved by code
reordering as shown in Table 3-28 or by the use of prefetch instructions (dcbt or dst).
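The prefetch approach can be sketched in C. This is a minimal illustration, not code from this guide: it uses the GCC/Clang `__builtin_prefetch` builtin, which compilers targeting PowerPC generally lower to a dcbt-style touch (an assumption to verify in the generated assembly). The function name `sum_with_prefetch`, the 8-word line size, and the prefetch distance of two lines are hypothetical choices; the distance in particular should be tuned by measurement.

```c
#include <stddef.h>
#include <stdint.h>

#define LINE_WORDS 8u           /* assumed: 32-byte line / 4-byte word */
#define PREFETCH_LINES_AHEAD 2u /* hypothetical distance, tune by measurement */

/*
 * Sum a word array while touching a later cache line once per line, so
 * that its miss can be in flight while the current line is consumed.
 */
uint32_t sum_with_prefetch(const uint32_t *data, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i % LINE_WORDS == 0) {
            size_t ahead = i + PREFETCH_LINES_AHEAD * LINE_WORDS;
            if (ahead < n) {
                /* rw = 0 (read), locality = 3 (keep in cache) */
                __builtin_prefetch(&data[ahead], 0, 3);
            }
        }
        sum += data[i];
    }
    return sum;
}
```

The prefetch is only a hint, so the result is the same as a plain summation loop. Whether it helps depends on the data being line-aligned and on the miss latency relative to the work done per line.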
The MPC7450 performs back-end allocation of the L1 data cache, which means that it selects the line
replacement (and pushes to the six-entry castout queue as needed) only when a cache reload returns.
Because any new miss transaction may later require a castout, a new miss is not released to the memory
subsystem until a castout queue slot is guaranteed.
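One way to read the "slot is guaranteed" rule is as a reservation check. The sketch below is an interpretation of the paragraph above, not a description of the actual MPC7450 logic. It assumes every outstanding miss reserves one castout slot until its reload returns. The structure `castout_state` and the function `can_release_miss` are hypothetical names. Only the six-entry queue size comes from the text.

```c
#include <stdbool.h>

#define CASTOUT_QUEUE_ENTRIES 6u /* six-entry castout queue, from the text */

/* Hypothetical bookkeeping for the reservation model. */
struct castout_state {
    unsigned occupied;           /* castouts currently waiting in the queue */
    unsigned outstanding_misses; /* released misses whose reload has not returned */
};

/*
 * Assumed rule: release a new miss only if a castout slot would still be
 * free even if every outstanding miss ends up needing a castout.
 */
static bool can_release_miss(const struct castout_state *s)
{
    return s->occupied + s->outstanding_misses < CASTOUT_QUEUE_ENTRIES;
}
```

Under this model, a state with two queued castouts and three outstanding misses still admits a new miss, while three queued castouts and three outstanding misses do not.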

Table 3-28. Load Miss Line Alias Example With Reordered Code

Instr. No.   Instruction
0            lwz r3,0x0(r9)
1            add r4,r3,r20
2            lwz r7,0x20(r9)
3            lwz r5,0x4(r9)
4            add r6,r5,r4
5            add r8,r7,r6

Cycle Number columns: 0, 1, 2, 3, 4–31, 32, 33, 34, 35–36, 37–39, 40, 41, 42, 43
Stage entries used in the cycle cells: D, I, E, E0, E1, E2, Miss, LMQ0, LMQ1, C
