4/9/2022

Delay Slot Instruction Mips

Delay

The instruction in the load delay slot cannot use the data loaded by the load instruction. The load delay slot can be filled with an instruction that is not dependent on the load; a nop is substituted if such an instruction cannot be found. MIPS I has instructions to perform addition and subtraction. . Branch likely instructions do not execute the instruction in the delay slot if the branch is not taken (they are said to nullify the instruction in the delay slot). Although the Branch Likely instructions are included in this specification, software is strongly encouraged to avoid the use of the Branch Likely instructions, as they will be. The MIPS architecture executes one instruction during such load delay slots. In many cases, the compiler can rearrange the order of instructions in order to fill the delay slot with useful instruction. For example, a value could be loaded a few instructions earlier than it is actually needed, as long as the target register was not being used. There are 10 branch instructions: BEQ, BNE, BLEZ, BGEZ, BLTZ, BGTZ, J, JAL, JR and JALR. These all update the pc. The MIPS makes use of a branch delay slot to remove the need to flush the pipeline when a branch is taken. In other words, the instruction immediately following a branch will alwaysbeexecutedregardlessof whetherthebranch is takenor. The instruction in the branch delay slot executes twice. It executes once because it’s in the branch delay slot. It executes again because it’s the destination of the branch. One final note is the case of emulated instructions.

In computer architecture, a delay slot is an instruction slot being executed without the effects of a preceding instruction. The most common form is a single arbitrary instruction located immediately after a branchinstruction on a RISC or DSP architecture; this instruction will execute even if the preceding branch is taken. Thus, by design, the instructions appear to execute in an illogical or incorrect order. It is typical for assemblers to automatically reorder instructions by default, hiding the awkwardness from assembly developers and compilers.

Branch delay slots[edit]

When a branch instruction is involved, the location of the following delay slot instruction in the pipeline may be called a branch delay slot. Branch delay slots are found mainly in DSP architectures and older RISC architectures. MIPS, PA-RISC, ETRAX CRIS, SuperH, and SPARC are RISC architectures that each have a single branch delay slot; PowerPC, ARM, Alpha, and RISC-V do not have any. DSP architectures that each have a single branch delay slot include the VS DSP, μPD77230 and TMS320C3x. The SHARC DSP and MIPS-X use a double branch delay slot; such a processor will execute a pair of instructions following a branch instruction before the branch takes effect. The TMS320C4x uses a triple branch delay slot.

Delay Slot Instruction Mips

The following example shows delayed branches in assembly language for the SHARC DSP including a pair after the RTS instruction. Registers R0 through R9 are cleared to zero in order by number (the register cleared after R6 is R7, not R9). No instruction executes more than once.

The goal of a pipelined architecture is to complete an instruction every clock cycle. To maintain this rate, the pipeline must be full of instructions at all times. The branch delay slot is a side effect of pipelined architectures due to the branch hazard, i.e. the fact that the branch would not be resolved until the instruction has worked its way through the pipeline. A simple design would insert stalls into the pipeline after a branch instruction until the new branch target address is computed and loaded into the program counter. Each cycle where a stall is inserted is considered one branch delay slot. A more sophisticated design would execute program instructions that are not dependent on the result of the branch instruction. This optimization can be performed in software at compile time by moving instructions into branch delay slots in the in-memory instruction stream, if the hardware supports this. Another side effect is that special handling is needed when managing breakpoints on instructions as well as stepping while debugging within branch delay slot.

The ideal number of branch delay slots in a particular pipeline implementation is dictated by the number of pipeline stages, the presence of register forwarding, what stage of the pipeline the branch conditions are computed, whether or not a branch target buffer (BTB) is used and many other factors. Software compatibility requirements dictate that an architecture may not change the number of delay slots from one generation to the next. This inevitably requires that newer hardware implementations contain extra hardware to ensure that the architectural behavior is followed despite no longer being relevant.

Load delay slot[edit]

A load delay slot is an instruction which executes immediately after a load (of a register from memory) but does not see, and need not wait for, the result of the load. Load delay slots are very uncommon because load delays are highly unpredictable on modern hardware. A load may be satisfied from RAM or from a cache, and may be slowed by resource contention. Load delays were seen on very early RISC processor designs. The MIPS I ISA (implemented in the R2000 and R3000 microprocessors) suffers from this problem.

The following example is MIPS I assembly code, showing both a load delay slot and a branch delay slot.

See also[edit]

External links[edit]

Retrieved from 'https://en.wikipedia.org/w/index.php?title=Delay_slot&oldid=982126913'

Since MIPS and SPARC use branch delay slots, we're faced with an interesting issue on how to implement them correctly. There are two issues: basic support for branch delay slots, and support for conditionally executed delay-slot instructions (SPARC 'annulled' delay slots).

Basic functionality

In the context of M5 instruction execution, PC is the current instruction's PC and NPC is the PC of the next instruction. Conventional non-delayed branches (as in Alpha) write to NPC to change the next instruction executed. Conceptually all non-branch instructions also update NPC to PC+4, though currently this is implemented in the CPU model and not in the ISA. (Obviously this would have to change if we had variable-length instructions.) Thus between each pair of instruction executions we simply set PC to NPC to advance the program counter.

In delayed-branch architectures, we need an additional PC, NNPC, which is the PC of the 'next next' instruction. A simple delayed branch can be implemented by writing the target address to NNPC instead of NPC. Non-branch instructions set NNPC to NPC+4. Between each pair of instruction executions we set PC to NPC and NPC to NNPC. Note that this model (I think) automatically supports some of the funky SPARC delayed branch semantics, such as having a branch in the delay slot of another branch. In MIPS, executing a branch in a branch delay slot results in UNDETERMINED behavior.

Instruction

Conditional delay slot instructions

Delay slot instruction mips software

Things get more complicated when the delay-slot instruction is effectively predicated on the branch direction. SPARC supports 'annulled' branches in which the delay-slot instruction is not executed if the branch is not taken. (Actually for unconditional branches the annul bit means never execute the delay slot instruction, which is kind of the opposite of what it means for conditional branches.)

Apparently MIPS is even more complex, with bits that allow the delay-slot instruction to be predicated in either direction (execute only if taken or only if not taken).(Korey, can you fill in some specifics here?)

I dont think MIPS allows the delay-slot instructions to be predicated. (I may have miscommunicated a bit on my original post) MIPS actually has separate opcodes to distinguish between branches where the delay-slot is always executed and instructions where the delay-slot is conditionally instructed. Additionally, I gather that there is no difference between how MIPS and SPARC treats these delay-slot instructions. The only difference is between the annulled-bit and the opcode to specify which 'handling' is required. Luckily, the decoder/ISA scheme makes the annulled-bit/opcode difference a non-issue. --Ksewell 18:37, 16 February 2006 (EST)
By predicated I just meant conditionally executed. The only other possibility that MIPS might have that SPARC doesn't is a flavor of branch that only executes the delay slot instruction if the branch is *not* taken (as opposed to only if the branch *is* taken). --Stever 19:52, 16 February 2006 (EST)

The 'formal semantics' of SPARC delayed branches are shown in Table 13 of Section 6.3.4 of the SPARC v9 reference manual (p. 100 in this pdf). Basically there are four possibilities, plus two special cases for returning from traps. Following the SPARC conventions, 'B' is an unconditional branch while 'Bcc' is a conditional branch, and 'Tcc' is a conditional trap. (Yes, SPARC has a branch that's unconditionally not taken... don't ask why.)

SPARC InstructionsNPCNNPC
Non-control-transfer instructions, non-taken non-annulled B & Bcc, non-taken Tcc NPC NPC + 4
Taken Bcc, taken non-annulled B, CALL, RETURN, JMP NPC EA
Non-taken annulled B & Bcc NPC + 4 NPC + 8
Taken annulled B, taken Tcc EA EA + 4
DONE TNPC[TL] TNPC[TL] + 4
RETRY TPC[TL] TNPC[TL]

-DONE and RETRY are two flavors of 'return from exception' in SPARC. --Stever 19:52, 16 February 2006 (EST)


MIPS InstructionsNPCNNPC
Non-control-transfer instructions, non-taken Branches (B) & Branch-Likely(Bl), non-taken Trap NPC NPC + 4
Taken Bl, taken B, CALL, RETURN, JMP NPC EA
Non-taken B & Bl NPC + 4 NPC + 8
Taken B, Bl, taken Trap EA EA + 4
ERET, DERET TNPC[TL] TNPC[TL] + 4

-ERET and DERET are the MIPS return from exceptions. DERET is return from debug exception.--Ksewell 16:32, 17 February 2006 (EST)

It seems like the easiest way to implement this is to expose NPC and NNPC to the ISA definition and let each instruction set either or both of these explicitly as necessary (with the default behavior being no change to NPC and NNPC = NPC + 4). Note that this model should work just fine for FastCPU and SimpleCPU where each instruction executes its semantic definition atomically (no pipelining), without any additional state beyond NNPC.

I think the best way to handle this for pipelined (incl. out-of-order) execution is to treat it as an extension of branch prediction. Right now we predict NPC and catch mispredictions by comparing the predicted & actual NPC values. For SPARC and MIPS we'll have to predict both NPC and NNPC and catch mispredictions by comparing the predicted & actual values of both of them as well. (It appears that comparing the predicted & actual NNPC values might be enough, since looking at the table it's hard to see a case where you could get NNPC right but NPC wrong. But I wouldn't count on that.) The prediction mechanism really just needs one additional bit to distinguish among the first four cases above as opposed to the current two cases (NPC = PC+4 vs. NPC = EA).

I believe it's impossible to get the NNPC right and NPC wrong since when the CPU model sees a branch it makes the prediction for that branch in the fetch stage. If that branch is taken then the other instructions aren't even allowed into the pipeline. --Ksewell 18:37, 16 February 2006 (EST)
I'm more worried about a situation where there's a branch in a branch delay slot or a return from exception into a branch delay slot or something like that... things that MIPS generally doesn't have to worry about but SPARC does. Just to be safe we should always compare both. --Stever 19:52, 16 February 2006 (EST)

Other ideas

1. Keep some type of 'condition-code'-type register(s) in the miscRegs file.

2. For branch delay slot instructions, we can add an extra source register to them. The source register would simply be the destination register of the preceding branch.

I think #2 would work and the extra source register could be added in the rename stage...

The catch here is that just looking at the target may not be sufficient, depending on how you do it. For example if you had this:
then the 'add' could look at the branch target (NPC) but it would be 0x1008 regardless of whether the branch was taken or not taken.

The only drawback is that it wouldnt necessarily allow a branch delay slot instruction to execute the same cycle as it's branch and that kind of defeats the purpose of having delay slot instructions.

Delay Slot Instruction Mips Instructions

Just to clarify: delay slots really make sense only for single-issue in-order pipelines without BTBs, where the delay slot instruction executes in the cycle *after* the branch, hiding the branch-taken bubble that comes from not having a BTB. In this situation the squashing delayed branches are not that hard to implement, as the branch always resolves before the delay slot instruction hits writeback, so you can execute it speculatively and just not write back its result. They were never intended to be executed in the same cycle as the branch.

CPU independent implementation

Should code for these be implemented in the CPU with #define MIPS or #define SPARC ... Or should there be architecture-specific files and/or functions for this branch-delay code?

It seems like the main question is do we need an NNPC or not. We could probably cover most cases by putting in an NNPC but not using it when we don't need it (e.g. in Alpha), though this could be confusing. Unless we do extra work to maintain NNPC in Alpha though there will be a difference when it comes to doing branch predictions in pipelined CPU models (see above). If we can generalize both MIPS and SPARC to a unified model, then we could have a single flag that enables or disables that model and set it in MIPS and SPARC and leave it cleared in Alpha.


Would it be possible for the PC to be considered a control register which lives in the ISA? There could be special getPC/stepPC/whatever functions to hook into for the CPU, and NPC and NNPC could be maintained totally internally to the ISA and wouldn't exist if they aren't needed. I'm not sure how this complicates things for the CPU model, though. --Gblack 13:26, 16 February 2006 (EST)
It's not entirely clear what would be less complex. Regardless of whether or not the PC is a control register, there will still have to be differences between the branch prediction in a pipelined model. One problem with the PC and NPC and NNPC living totally in the ISA is with dynamic instructions. Currently they hold a variety of information that is necessary to keep track of with the instruction, such as PC and NextPC. If you were to push the PC/NPC/NNPC to the ISA level, then you'd also have to define within the ISA some sort of struct that is all the ISA state information that needs to be carried along with a dynamic instruction. DynInst already has quite a bit of indirection to it, so I'm a bit hesitant to add another level of indirection. It may also be difficult to get the interface correct; for certain registers (such as the PC) you'd have to call xc->getISAState()->setPC(pc), while other registers would just be xc->setIntReg(5, 1000). It'd be nice to keep them uniform so that the programmer doesn't have to keep checking if the register is in the ISA or can be accessed normally. On the other hand, this ISA-dependent state struct may be useful to store, for example, updates to IPRs which won't happen until commit time. I'm really not sure at the moment...it could be useful, or it could just be more complication. --Ktlim 15:44, 16 February 2006 (EST)

Delay Slot Instruction Mips Bike Helmet

I'm fine with RegFile being an ISA-dependent type. As long as it stays that way then each ISA can choose whether to have an NNPC or not. They can all export get/set NNPC functions and the Alpha versions can panic. If we were going to have ten different ISAs with 8 different PC/NPC schemes it might be worth more effort, but if there are just two schemes and no third one on the horizon that I can see it's not worth the effort to get all abstract about it, especially for something so common and potentially performance sensitive. --Stever 21:09, 16 February 2006 (EST)

Hazard Clearing Instructions

MIPS has hazard clearing instructions which clear instruction and execution hazards (most of the hazards have to do with the special system (or CP0) registers). MIPS execution hazards are defined as 'those created by the execution of one instruction, and seen by the execution of another instruction' and instruction hazards as ' those created by the execution of one instruction, and seen by the instruction fetch of another instruction'. A table of the hazard clearing instructions is presented below and a list of all the hazard cases (e.g. disable-interrupt (producer) to interrupt-instruction(consumer)) can be found in 7.0 of vol.III of the MIPS manual.

The reason I mention this on this WIKI page is because the jalr_hb and jr_hb instructions require the hazard clearing to take place after the branch delay-slot gets executed. I was thinking to do one of two things:

1. Schedule a CPU event for cycle N+2 (where N is the cycle of the jump instruction executing). However, this solution quickly breaks down if the branch delay-slot instruction takes more than one cycle.

2. Create a flag in the CPU for a pending hazard clearing.

With respect to the Simple CPU model, I am not sure if a complicated solution is needed in the first place so I would be up for implementing #2. Currently, I have these hazard clearing instructions call a function called 'clear_exe_inst_hazards()'. This function can be configured later depending on our design decision.

Delay Slot Instruction Mips Helmet

Thoughts???

Delay Slot Instruction Mips Software

Retrieved from 'http://www.m5sim.org/wiki/index.php?title=Branch_delay_slots&oldid=2109'