This post is a follow-up to the last post about the Alphatronic P2’s CPU speed. Back then I created a sound that was supposed to be exactly one second long, but it actually lasted 1.28 seconds. So I was doing something wrong…

Wait state?

But I already had a clue: maybe a so-called “wait state” is added because the memory is too slow for the fast 3 MHz CPU? I’m no longer sure where exactly I got this clue, but the datasheet for the 8085AH provides a sample circuit for inserting a wait state. And the “Intel 8080/8085 Assembly Language Programming” reference says under “Timing Information” (page 3-1):

This basic timing factor can be affected by the operating speed of the memory in your system. With a fast clock cycle and a slow memory, the processor can outrun the memory. In this case, the processor must wait for the memory to deliver the desired instruction or data. In applications with critical timing requirements, this wait can be significant. Refer to the appropriate manufacturer’s literature for memory timing data.

So, in case the system has slow memory, this slows down the CPU as well.

Machine cycles, T-states

First we should clarify some terms. In the summary on its first page, the datasheet prominently advertises a “1.3 µs Instruction Cycle (8085A)”. This is the time needed to fully execute the fastest instruction (e.g. DCR - decrement) at a CPU clock speed of 3 MHz. Each instruction cycle consists of one or more “machine cycles”. The machine cycles are well defined and each is one of seven types, as shown in table 3 “8085A Machine Cycle Chart”.

Here are the different machine cycles:

  1. Opcode Fetch (OF)
  2. Memory Read (MR)
  3. Memory Write (MW)
  4. I/O Read (IOR)
  5. I/O Write (IOW)
  6. Acknowledge of Interrupt (INA)
  7. Bus Idle (BI)

Each machine cycle in turn consists of one or more T-states. The datasheet says that each machine cycle normally consists of 3 T-states, except for Opcode Fetch, which has either four or six. A T-state is the smallest unit of time: one clock cycle. Here the clock runs at 3 MHz, so one T-state is 333 ns. The DCR instruction consists of only one machine cycle, an Opcode Fetch. It is a simple instruction, so it takes only 4 T-states, and 4*333 ns is the advertised 1.3 µs. DCX (the 16-bit decrement), on the other hand, takes 6 T-states. In total there are 10 different T-states, one of which is called “T_wait”.
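The conversion from T-states to time can be sketched in a few lines of Python. This is only an illustration: the names `CLOCK_HZ`, `t_state_ns` and `duration_us` are my own and do not come from the datasheet.

```python
CLOCK_HZ = 3_000_000  # CPU clock of 3 MHz, as stated above


def t_state_ns(clock_hz=CLOCK_HZ):
    """Length of one T-state (one clock cycle) in nanoseconds."""
    return 1e9 / clock_hz


def duration_us(t_states, clock_hz=CLOCK_HZ):
    """Execution time in microseconds for a given number of T-states."""
    return t_states * 1e6 / clock_hz


print(round(t_state_ns()))       # one T-state, about 333 ns
print(round(duration_us(4), 2))  # DCR: 4 T-states, about 1.33 µs
print(duration_us(6))            # DCX: 6 T-states, 2.0 µs
```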

Tone generator with reviewed timings

Now that the terms are clarified, let’s review the last program. First, a look at the timings of the instructions used:

Instruction   Cycles (= machine cycles)   States (= clock cycles)   Page
DCR           1                           4                         3-20
JNZ           2 or 3                      7 or 10                   3-29
MVI           2                           7                         3-37
OUT           3                           10                        3-41
RET           3                           10                        3-48

And here’s the program again:


; address       machine code    mnemonic        comments
F000            06 05           MVI B, 5H                       ; 7 cycles
F002    G       0E DC           MVI C, DCH      ; 0xDC=220      ; 7 cycles
F004    C       3E 01           MVI A, 1H                       ; 7 cycles
F006            D3 12           OUT 12H         ; 0x12=18       ; 10 cycles
F008            16 F3           MVI D, F3H      ; 0xF3=243      ; 7 cycles
F00A    A       15              DCR D                           ; 4 cycles
F00B            C2 0A F0        JNZ A           ; A -> 0xF00A   ; 7/10 cycles
F00E            3E 00           MVI A, 0H                       ; 7 cycles
F010            D3 12           OUT 12H         ; 0x12=18       ; 10 cycles
F012            16 F3           MVI D, F3H      ; 0xF3=243      ; 7 cycles
F014    B       15              DCR D                           ; 4 cycles
F015            C2 14 F0        JNZ B           ; B -> 0xF014   ; 7/10 cycles
F018            0D              DCR C                           ; 4 cycles
F019            C2 04 F0        JNZ C           ; C -> 0xF004   ; 7/10 cycles
F01C            0E DC           MVI C, DCH      ; 0xDC=220      ; 7 cycles
F01E    F       3E 00           MVI A, 0H                       ; 7 cycles
F020            D3 12           OUT 12H         ; 0x12=18       ; 10 cycles
F022            16 F3           MVI D, F3H      ; 0xF3=243      ; 7 cycles
F024    D       15              DCR D                           ; 4 cycles
F025            C2 24 F0        JNZ D           ; D -> 0xF024   ; 7/10 cycles
F028            3E 00           MVI A, 0H                       ; 7 cycles
F02A            D3 12           OUT 12H         ; 0x12=18       ; 10 cycles
F02C            16 F3           MVI D, F3H      ; 0xF3=243      ; 7 cycles
F02E    E       15              DCR D                           ; 4 cycles
F02F            C2 2E F0        JNZ E           ; E -> 0xF02E   ; 7/10 cycles
F032            0D              DCR C                           ; 4 cycles
F033            C2 1E F0        JNZ F           ; F -> 0xF01E   ; 7/10 cycles
F036            05              DCR B                           ; 4 cycles
F037            C2 02 F0        JNZ G           ; G -> 0xF002   ; 7/10 cycles
F03A            C9              RET

The one second is supposed to be one pass through loop “G”, from 0xF002 to 0xF037, including all the inner loops.

If we count the required clock cycles as listed, we get:

7+7+220*(7+10+7+243*14+7+10+7+243*14+14)+7+220*(7+10+7+243*14+7+10+7+243*14+14)+14
= 3021075
= 1.007025 seconds

So that is roughly our intended 1 second.
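To make the long sum easier to check, here is a small Python sketch that rebuilds it from the loop structure of the listing. The function and parameter names are my own. Like the sum above, it counts every JNZ with 10 cycles (jump taken) and includes the initial MVI B.

```python
CLOCK_HZ = 3_000_000  # 3 MHz CPU clock


def program_t_states(mvi=7, out=10, dcr=4, jnz=10, delay=243, periods=220):
    """Clock cycles for one pass, mirroring the sum in the text.

    Assumption: every JNZ is counted as taken, as in the original sum.
    """
    delay_loop = delay * (dcr + jnz)            # DCR D / JNZ, 243 times
    half_period = mvi + out + mvi + delay_loop  # MVI A, OUT, MVI D, delay
    tone_loop = periods * (2 * half_period + dcr + jnz)  # plus DCR C / JNZ
    # MVI B, MVI C, first loop, MVI C, second loop, DCR B / JNZ G
    return mvi + mvi + tone_loop + mvi + tone_loop + dcr + jnz


cycles = program_t_states()
print(cycles)             # 3021075
print(cycles / CLOCK_HZ)  # about 1.007 seconds
```

Passing other per-instruction costs to `program_t_states` lets you try out different timing assumptions without retyping the sum.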

Now let’s naively add one additional clock cycle (say, one T_wait state) to each instruction. This means that DCR now takes 5 cycles instead of 4, and so on.

8+8+220*(8+11+8+243*16+8+11+8+243*16+16)+8+220*(8+11+8+243*16+8+11+8+243*16+16)+16
= 3452280
= 1.15076 seconds

So, that’s a bit more than 1 second, but still not our measured 1.28 seconds. So, that can’t be the solution.

How many wait states need to be added? When exactly are they added? The reference circuitry for “Generating an 8085A wait state” (Page 1-16) describes how “to insert one WAIT state in each 8085A machine cycle”. Note: each machine cycle, not each instruction.

So, let’s calculate again. DCR still takes 5 clock cycles instead of 4, because it consists of only one machine cycle. But MVI now takes 9 clock cycles instead of 7, because it consists of two machine cycles: an opcode fetch (OF) to read the opcode from memory and a memory read (MR) to read the immediate operand. Normally OF takes 4 cycles and MR takes 3; with wait states that becomes 5+4=9 instead of 4+3=7. OUT consists of OF, MR and IOW, which is 4+3+3=10, or 5+4+4=13 with wait states. JNZ consists of either OF and MR (4+3=7) or OF, MR and MR (4+3+3=10); with wait states that is 5+4=9 or 5+4+4=13. The shorter form applies when the jump condition is not met (which is known by the 2nd machine cycle): the high byte of the target address is then not read at all, which saves some clock cycles.
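The per-instruction numbers can be derived mechanically from the machine cycles. The following Python sketch does that. The dictionaries and the `t_states` helper are my own illustration and only cover the instructions used in this program (DCR here means the register form).

```python
# Base length of each machine cycle type in T-states (no wait states).
BASE = {"OF": 4, "MR": 3, "IOW": 3}

# Machine cycles of the instructions used in the tone generator.
INSTRUCTIONS = {
    "DCR": ["OF"],
    "MVI": ["OF", "MR"],
    "OUT": ["OF", "MR", "IOW"],
    "JNZ taken": ["OF", "MR", "MR"],
    "JNZ not taken": ["OF", "MR"],
}


def t_states(name, waits=0):
    """T-states of an instruction with `waits` wait states per machine cycle."""
    return sum(BASE[cycle] + waits for cycle in INSTRUCTIONS[name])


for name in INSTRUCTIONS:
    print(name, t_states(name), t_states(name, waits=1))
```

With `waits=1` this gives 5 for DCR, 9 for MVI, 13 for OUT and 13 or 9 for JNZ, matching the numbers above.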

Now the same calculation again with one wait state per machine cycle:

9+9+220*(9+13+9+243*18+9+13+9+243*18+18)+9+220*(9+13+9+243*18+9+13+9+243*18+18)+18
= 3884365
= 1.294788333 seconds

So, that’s now very close to the 1.28 seconds we measured.
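Another way to look at the same result: one wait state per machine cycle adds exactly one clock cycle for every machine cycle executed. So it is enough to count T-states and machine cycles together. Here is a rough Python sketch of that idea. All names are my own, and every JNZ is again counted as taken.

```python
CLOCK_HZ = 3_000_000  # 3 MHz CPU clock

# (T-states, machine cycles) per instruction; JNZ assumed taken.
MVI, OUT, DCR, JNZ = (7, 2), (10, 3), (4, 1), (10, 3)


def add(*items):
    """Sum several (T-states, machine cycles) pairs element-wise."""
    return tuple(map(sum, zip(*items)))


def times(n, item):
    """Repeat a (T-states, machine cycles) pair n times."""
    return (n * item[0], n * item[1])


def one_pass(delay=243, periods=220):
    """Totals for one pass through the program, following the listing."""
    delay_loop = times(delay, add(DCR, JNZ))
    half = add(MVI, OUT, MVI, delay_loop)
    tone = times(periods, add(half, half, DCR, JNZ))
    return add(MVI, MVI, tone, MVI, tone, DCR, JNZ)


def seconds(waits):
    """Run time with `waits` wait states per machine cycle."""
    t, m = one_pass()
    return (t + waits * m) / CLOCK_HZ


def implied_waits(measured_s):
    """Average wait states per machine cycle that would explain a measurement."""
    t, m = one_pass()
    return (measured_s * CLOCK_HZ - t) / m


print(one_pass())           # (3021075, 863290)
print(seconds(1))           # about 1.2948 seconds
print(implied_waits(1.28))  # about 0.95
```

Under these assumptions the measured 1.28 seconds corresponds to roughly 0.95 wait states per machine cycle, which fits one wait state per machine cycle reasonably well given how roughly the duration was measured.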

Solution?

This looks like the solution: additional wait states are inserted, which makes the whole system run slower (but stably, of course).

If we apply this information to the floppy index hole measurements program, we get the following:

LPS=103448! 'loops per second: 3 MHz/29 cycles - dur. of one loop

With “7 (INX) + 13 (IN) + 5 (ANA) + 13 (JZ) = 38” this becomes:

LPS=78947! 'loops per second: 3 MHz/38 cycles - dur. of one loop

As seen in the photo in that previous post, the raw counter result was about 16000, so: 60*78947/16000 = 296 rpm. So, that’s indeed closer to the real number.
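The same conversion as a small Python sketch, so both variants can be compared side by side. The function names are my own, and the counter value of 16000 is only the approximate reading from the photo.

```python
CLOCK_HZ = 3_000_000  # 3 MHz CPU clock


def loops_per_second(t_states_per_loop):
    """How often the measuring loop runs per second."""
    return CLOCK_HZ / t_states_per_loop


def rpm(counter, t_states_per_loop):
    """Rotation speed from the loop count of one revolution."""
    return 60 * loops_per_second(t_states_per_loop) / counter


print(int(loops_per_second(29)), round(rpm(16000, 29)))  # 103448 388
print(int(loops_per_second(38)), round(rpm(16000, 38)))  # 78947 296
```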

Conclusion

The question remains: how can we verify that this is indeed what is happening here?

The CPU enters a wait state when the READY signal is held low. So monitoring the READY signal with an oscilloscope could be one way. Another way would be to have a close look at the chips that are used. Maybe the wait state circuitry can be identified. At the very least, the datasheets of the RAM and ROM chips can give information about the maximum expected access times.

The 8085 datasheet also specifies some timing requirements: t_LDR is the time from “ALE to Valid Data During Read”. The datasheet lists a maximum of 460 ns. This maximum depends on the clock cycle time T and can be calculated: for T = 333 ns, t_LDR=(4/2)*T-180=486ns (the listed 460 ns is what the formula gives for T = 320 ns). So, if the ROM/RAM chips can provide the data faster than that, no wait state is needed. Otherwise, the CPU needs to wait.
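As a quick check of the formula, here is a minimal Python sketch. The function name is my own. The `waits` parameter is an assumption on my part: it presumes that each inserted T_wait extends the window by one full clock period, which I have not verified against the datasheet.

```python
def t_ldr_max_ns(t_cycle_ns, waits=0):
    """Maximum ALE-to-valid-data time in ns: (4/2)*T - 180.

    Assumption (not taken from the datasheet): every wait state adds
    one clock period T to the available time.
    """
    return (4 / 2 + waits) * t_cycle_ns - 180


print(t_ldr_max_ns(320))           # 460.0, the value listed in the datasheet
print(t_ldr_max_ns(333))           # 486.0 at our 3 MHz clock
print(t_ldr_max_ns(333, waits=1))  # 819.0 under the assumption above
```

If the access times from the RAM and ROM datasheets, plus the delays of the address decoding, end up above the first value but below the last one, that would be consistent with one wait state per machine cycle.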

References