AN2203 Freescale Semiconductor / Motorola, AN2203 Datasheet - Page 40

Manufacturer Part Number: AN2203
Description: MPC7450 RISC Microprocessor Family Software Optimization Guide
Manufacturer: Freescale Semiconductor / Motorola
Datasheet: AN2203.pdf (76 pages)


Load/Store Unit (LSU)

Note that instruction 2 stalls in stage E1 (in the RA latch in Table 3-27). This stall occurs because
instruction 2 requires the same cache line that instruction 0 missed on. Instruction 2 does not finish
execution until cycle 40 (that is, eight cycles after instruction 0). This delay has two major components.
First, instruction 0 finishes by using critical forwarded data, whereas instruction 2 must wait for the
full cache line to appear before it can start execution (a 4-cycle delay in this example). Second, the
cache is being updated and a pipeline restart condition occurs.

The second issue that this example shows is that the misses are not fully pipelined. Instructions 0 and 4
miss in the data cache and L2 cache but hit in the L3 cache. The stall caused by the line miss alias
between instructions 0 and 2 delays the start of the miss access for instruction 4 by many cycles. A simple
reordering of the code, as shown in Table 3-28, allows the two load misses to pipeline to the L3 cache,
improving performance by nearly 50 percent.
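
The effect of the reordering can be illustrated with a toy model. The sketch below is not a cycle-accurate simulation. It assumes a 32-byte L1 line, assumes every first touch of a line misses, and assumes no line returns within the sequence. The name delayed_misses and the original load order (offsets 0x0, 0x4, 0x20, inferred from the text's description of instructions 0, 2, and 4) are illustrative assumptions.

```python
LINE_BYTES = 32  # assumed L1 data cache line size

def delayed_misses(load_offsets, line_bytes=LINE_BYTES):
    """Count first-touch misses that sit behind an earlier same-line (alias) load.

    Toy model: a load to a line that already has a miss outstanding stalls,
    and a later load to a new line cannot start its miss until the stall clears.
    """
    lines_missed = set()
    alias_stall_seen = False
    delayed = 0
    for offset in load_offsets:
        line = offset // line_bytes
        if line in lines_missed:
            alias_stall_seen = True      # same line as an outstanding miss
        else:
            if alias_stall_seen:
                delayed += 1             # new miss is queued behind the stall
            lines_missed.add(line)
    return delayed

original_order = [0x0, 0x4, 0x20]   # inferred order of the loads before reordering
reordered = [0x0, 0x20, 0x4]        # load order used in Table 3-28

print(delayed_misses(original_order))  # 1
print(delayed_misses(reordered))       # 0
```

In the reordered sequence the load to the second line is issued before the aliasing load, so both misses are outstanding together, which is the pipelining the text describes.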

This type of stall is common in some code, including simple data streaming or striding array accesses.
For example, a long stream of vector loads with addresses incrementing by 16 bytes (a quad word) per load
results in every other load stalling in this manner, and no miss pipelining occurs. The stall becomes an
even larger performance bottleneck when cache misses must go to the system bus, because opportunities to
pipeline system bus misses are lost. This performance problem can be solved by code reordering, as shown
in Table 3-28, or by the use of prefetch instructions (dcbt or dst).
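
The streaming case can be sketched in the same spirit. The model below assumes a 32-byte line (two quad words per line) and a cold cache. The helper names classify_stream and first_touch_first are hypothetical. The first function labels each load in a 16-byte-stride stream, and the second shows one possible load order in which one load per line is issued before the same-line loads.

```python
LINE_BYTES = 32  # assumed L1 line size (two 16-byte quad words per line)

def classify_stream(n_loads, stride=16, line_bytes=LINE_BYTES):
    """Label each load in a cold, fixed-stride stream as 'miss' or 'alias'."""
    labels = []
    lines_missed = set()
    for i in range(n_loads):
        line = (i * stride) // line_bytes
        if line in lines_missed:
            labels.append("alias")   # same line as an outstanding miss: stalls
        else:
            labels.append("miss")    # first touch of the line
            lines_missed.add(line)
    return labels

def first_touch_first(offsets, line_bytes=LINE_BYTES):
    """Reorder so one load per line comes first, followed by the same-line loads."""
    seen = set()
    first, rest = [], []
    for off in offsets:
        line = off // line_bytes
        if line in seen:
            rest.append(off)
        else:
            seen.add(line)
            first.append(off)
    return first + rest

print(classify_stream(6))
# ['miss', 'alias', 'miss', 'alias', 'miss', 'alias']
print(first_touch_first([0, 16, 32, 48]))
# [0, 32, 16, 48]
```

In real code such a reordering would be applied within a small unrolled window and must respect data dependencies and register pressure. The sketch only shows the ordering idea.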

The MPC7450 performs back-end allocation of the L1 data cache, which means that it selects the line to
replace (and pushes it to the six-entry castout queue as needed) only when a cache reload returns.
Because any new miss transaction may later require a castout, a new miss is not released to the memory
subsystem until a castout queue slot is guaranteed.
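
The reservation rule can be pictured with a very small model. This is only a sketch of the idea, not the hardware's actual bookkeeping. It assumes that each outstanding miss reserves exactly one of the six castout slots, and the function name can_release_miss is hypothetical.

```python
CASTOUT_SLOTS = 6  # six-entry castout queue, per the text

def can_release_miss(castouts_pending, misses_outstanding, slots=CASTOUT_SLOTS):
    """Return True if a castout slot can be guaranteed for one more miss.

    Assumption: every outstanding miss may need a castout when its reload
    returns, so each one reserves a slot alongside the castouts already queued.
    """
    return castouts_pending + misses_outstanding < slots

print(can_release_miss(0, 0))  # True
print(can_release_miss(2, 3))  # True
print(can_release_miss(3, 3))  # False
```

Under this assumption, a backlog in the castout queue holds back new misses even when the miss itself would otherwise be ready to go.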

Table 3-28. Load Miss Line Alias Example With Reordered Code

Instr. No.   Instruction
0            lwz r3,0x0(r9)
1            add r4,r3,r20
2            lwz r7,0x20(r9)
3            lwz r5,0x4(r9)
4            add r6,r5,r4
5            add r8,r7,r6

[Cycle Number columns not recoverable from the extracted page. Surviving column labels: 4–31, 35–36, 37–39. Surviving cell values: Miss, LMQ0, LMQ1, LMQ0/E2, LMQ1/E2, LMQ0/C, LMQ1/C.]

Freescale Semiconductor, Inc. For More Information On This Product, Go to: www.freescale.com
