- Wait state?
- Machine cycles, T-states
- Tone generator with reviewed timings
This post is a follow-up to the last post about the Alphatronic P2’s CPU speed. Back then I created a sound that was supposed to be exactly one second long, but it actually lasted 1.28 seconds. So I was doing something wrong…
But I already had a clue: maybe a so-called “wait state” is added because the memory is too slow for the fast 3 MHz CPU? I’m not sure anymore where exactly I got this idea, but the datasheet for the 8085AH provides a sample circuit for inserting a wait state. And the “Intel 8080/8085 Assembly Language Programming” reference says under “Timing Information” (page 3-1):
This basic timing factor can be affected by the operating speed of the memory in your system. With a fast clock cycle and a slow memory, the processor can outrun the memory. In this case, the processor must wait for the memory to deliver the desired instruction or data. In applications with critical timing requirements, this wait can be significant. Refer to the appropriate manufacturer’s literature for memory timing data.
So, in case the system has slow memory, this slows down the CPU as well.
Machine cycles, T-states
First we should clarify some terms. The datasheet prominently advertises “1.3 µs Instruction Cycle (8085A)” in the summary on its first page. This is the time the fastest instructions (e.g. DCR, decrement) need to execute fully at a CPU clock speed of 3 MHz. Each instruction cycle consists of one or more “machine cycles”. The machine cycles are well defined, and there are seven types, as shown in table 3, “8085A Machine Cycle Chart”.
Here are the different machine cycles:
- Opcode Fetch (OF)
- Memory Read (MR)
- Memory Write (MW)
- I/O Read (IOR)
- I/O Write (IOW)
- Acknowledge of Interrupt (INA)
- Bus Idle (BI)
Each machine cycle in turn consists of one or more T-states. The datasheet says that each machine cycle normally consists of 3 T-states, except for Opcode Fetch, which has either four or six T-states. A T-state is the smallest unit of time: one clock cycle. Here the clock runs at 3 MHz, so one T-state is 333 ns. The DCR instruction consists of only one machine cycle, an Opcode Fetch. It is a simple instruction, so it takes only 4 T-states, which is 4*333 ns, or about 1.3 µs. DCX (the 16-bit decrement), on the other hand, takes 6 T-states.
There are in total 10 different T-states, one of which is called “T_wait”.
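The relationship between clock speed, T-states and instruction time can be checked with a few lines of Python. This is only a small sketch; the names `CLOCK_HZ` and `instruction_time_us` are made up for this example.

```python
CLOCK_HZ = 3_000_000  # CPU clock as stated in the post (3 MHz)

# Duration of one T-state (one clock cycle) in nanoseconds
t_state_ns = 1e9 / CLOCK_HZ

def instruction_time_us(t_states: int) -> float:
    """Execution time in microseconds for a given number of T-states."""
    return t_states * t_state_ns / 1000

print(f"one T-state:      {t_state_ns:.0f} ns")
print(f"DCR (4 T-states): {instruction_time_us(4):.2f} us")
print(f"DCX (6 T-states): {instruction_time_us(6):.2f} us")
```

The 4 T-states of DCR come out at roughly 1.33 µs, which matches the “1.3 µs Instruction Cycle” from the datasheet summary.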
Tone generator with reviewed timings
Now that the terms are clear, we should review the program from the last post. First, let’s have a look at the timings of the instructions used:
| Instruction | Cycles (= machine cycles) | States (= clock cycles) | Page |
|---|---|---|---|
| JNZ | 2 or 3 | 7 or 10 | 3-29 |
And here’s the program again:
    ; address         machine code  mnemonic    comments
    F000              06 05         MVI B, 5H   ; 7 cycles
    F002       G      0E DC         MVI C, DCH  ; 0xDC=220 ; 7 cycles
    F004       C      3E 01         MVI A, 1H   ; 7 cycles
    F006              D3 12         OUT 12H     ; 0x12=18 ; 10 cycles
    F008              16 F3         MVI D, F3H  ; 0xF3=243 ; 7 cycles
    F00A       A      15            DCR D       ; 4 cycles
    F00B              C2 0A F0      JNZ A       ; A -> 0xF00A ; 7/10 cycles
    F00E              3E 00         MVI A, 0H   ; 7 cycles
    F010              D3 12         OUT 12H     ; 0x12=18 ; 10 cycles
    F012              16 F3         MVI D, F3H  ; 0xF3=243 ; 7 cycles
    F014       B      15            DCR D       ; 4 cycles
    F015              C2 14 F0      JNZ B       ; B -> 0xF014 ; 7/10 cycles
    F018              0D            DCR C       ; 4 cycles
    F019              C2 04 F0      JNZ C       ; C -> 0xF004 ; 7/10 cycles
    F01C              0E DC         MVI C, DCH  ; 0xDC=220 ; 7 cycles
    F01E       F      3E 00         MVI A, 0H   ; 7 cycles
    F020              D3 12         OUT 12H     ; 0x12=18 ; 10 cycles
    F022              16 F3         MVI D, F3H  ; 0xF3=243 ; 7 cycles
    F024       D      15            DCR D       ; 4 cycles
    F025              C2 24 F0      JNZ D       ; D -> 0xF024 ; 7/10 cycles
    F028              3E 00         MVI A, 0H   ; 7 cycles
    F02A              D3 12         OUT 12H     ; 0x12=18 ; 10 cycles
    F02C              16 F3         MVI D, F3H  ; 0xF3=243 ; 7 cycles
    F02E       E      15            DCR D       ; 4 cycles
    F02F              C2 2E F0      JNZ E       ; E -> 0xF02E ; 7/10 cycles
    F032              0D            DCR C       ; 4 cycles
    F033              C2 1E F0      JNZ F       ; F -> 0xF01E ; 7/10 cycles
    F036              05            DCR B       ; 4 cycles
    F037              C2 02 F0      JNZ G       ; G -> 0xF002 ; 7/10 cycles
    F03A              C9            RET
The one second is supposed to be one execution of loop “G”, from 0xF002 to 0xF037, including all the inner loops.
If we count the needed clock cycles as is, we get:
7+7+220*(7+10+7+243*14+7+10+7+243*14+14)+7+220*(7+10+7+243*14+7+10+7+243*14+14)+14 = 3021075 clock cycles, which at 3 MHz is 1.007025 seconds
So, that’s roughly our intended 1 second.
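The same count can be written as a small Python function, which makes it easier to see which term belongs to which part of the program. This is a sketch under my reading of the formula: the three single 7s are taken to be MVI B and the two MVI C instructions, and every JNZ is counted with 10 states, as in the formula (the final, not-taken jump of each loop would actually be 7). The function name `total_states` is made up for this example.

```python
CLOCK_HZ = 3_000_000

# T-states per instruction without wait states (JNZ always counted as taken)
MVI, OUT, DCR, JNZ = 7, 10, 4, 10

def total_states(mvi, out, dcr, jnz, outer=220, inner=243):
    """Clock cycles of the tone program, structured like the formula above."""
    delay = inner * (dcr + jnz)                      # DCR D / JNZ delay loop
    half_wave = mvi + out + mvi + delay              # MVI A, OUT, MVI D, delay
    tone_loop = outer * (2 * half_wave + dcr + jnz)  # loop "C" or loop "F"
    # MVI B, then (MVI C + tone loop) twice, then DCR B / JNZ G
    return mvi + 2 * (mvi + tone_loop) + dcr + jnz

states = total_states(MVI, OUT, DCR, JNZ)
print(states, "clock cycles =", states / CLOCK_HZ, "seconds")
```

Passing other per-instruction values to `total_states` allows the count to be repeated for different timing assumptions.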
Now let’s naively add one additional clock cycle (say, one T_wait state) to each instruction. This means that DCR now takes 5 cycles instead of 4, and so on.
8+8+220*(8+11+8+243*16+8+11+8+243*16+16)+8+220*(8+11+8+243*16+8+11+8+243*16+16)+16 = 3452280 clock cycles = 1.15076 seconds
That’s a bit more than 1 second, but still not our measured 1.28 seconds. So that can’t be the solution.
How many wait states need to be added? When exactly are they added? The reference circuit for “Generating an 8085A wait state” (page 1-16) is described as a way “to insert one WAIT state in each 8085A machine cycle”. Note: each machine cycle, not each instruction.
So, let’s calculate again, this time with one extra clock cycle per machine cycle. Typically OF takes 4 cycles and MR takes 3 cycles.
- DCR consists of only one machine cycle (OF), so it again takes 5 clock cycles instead of 4.
- MVI consists of two machine cycles: one opcode fetch to read the opcode from memory and one memory read to read the immediate operand from memory. That is 4+3=7 cycles, and with wait states 5+4=9.
- OUT has OF, MR and IOW, which is 4+3+3=10, and with wait states 5+4+4=13.
- JNZ has OF and MR, which is 4+3=7, or OF, MR and MR, which is 4+3+3=10. With wait states: 5+4=9 or 5+4+4=13. If the condition for the jump is not met (which is known by the 2nd machine cycle), the high byte of the jump address is not read at all, which saves some clock cycles.
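The per-instruction numbers follow mechanically from the machine cycles, so they can be derived with a small lookup table. This is just an illustrative sketch: the dictionaries and the `states` helper are names invented here, and the T-state counts per machine cycle are the ones quoted in the text.

```python
# Normal T-states per machine cycle, as used in the text
BASE_STATES = {"OF": 4, "MR": 3, "IOW": 3}

# Machine cycles that make up each instruction of the tone program
INSTRUCTIONS = {
    "DCR": ["OF"],
    "MVI": ["OF", "MR"],
    "OUT": ["OF", "MR", "IOW"],
    "JNZ (not taken)": ["OF", "MR"],
    "JNZ (taken)": ["OF", "MR", "MR"],
}

def states(cycles, waits_per_cycle=0):
    """Total T-states when each machine cycle gets extra wait states."""
    return sum(BASE_STATES[c] + waits_per_cycle for c in cycles)

for name, cycles in INSTRUCTIONS.items():
    print(f"{name:16} {states(cycles):2d} -> {states(cycles, 1):2d}")
```

With one wait state per machine cycle, the instructions with more machine cycles are slowed down more, which is exactly the difference to the naive “one extra cycle per instruction” model.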
Now the same calculation again with one wait state per machine cycle:
9+9+220*(9+13+9+243*18+9+13+9+243*18+18)+9+220*(9+13+9+243*18+9+13+9+243*18+18)+18 = 3884365 clock cycles = 1.294788333 seconds
That is now very close to the 1.28 seconds we measured.
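To see how well different numbers of wait states fit the measurement, the calculation can be repeated for 0, 1 and 2 wait states per machine cycle. This is a rough sketch with invented names (`duration_s`, `MACHINE_CYCLES`); it counts every JNZ as taken, like the formulas above, and the case of 2 wait states is my own addition for comparison.

```python
CLOCK_HZ = 3_000_000
MEASURED_S = 1.28  # measured duration from the previous post

BASE = {"MVI": 7, "OUT": 10, "DCR": 4, "JNZ": 10}          # T-states, no waits
MACHINE_CYCLES = {"MVI": 2, "OUT": 3, "DCR": 1, "JNZ": 3}  # cycles per instruction

def duration_s(waits):
    """Run time in seconds with `waits` wait states per machine cycle."""
    s = {k: BASE[k] + waits * MACHINE_CYCLES[k] for k in BASE}
    loop = s["DCR"] + s["JNZ"]                    # DCR x / JNZ pair
    half = 2 * s["MVI"] + s["OUT"] + 243 * loop   # MVI A, OUT, MVI D, delay
    tone = 220 * (2 * half + loop)                # loop "C" or loop "F"
    total = 3 * s["MVI"] + 2 * tone + loop        # MVI B, 2x MVI C, DCR B/JNZ G
    return total / CLOCK_HZ

for w in range(3):
    print(f"{w} wait state(s): {duration_s(w):.3f} s")

best = min(range(3), key=lambda w: abs(duration_s(w) - MEASURED_S))
print("closest to the measurement:", best, "wait state(s)")
```

Of these three cases, one wait state per machine cycle comes closest to the measured 1.28 seconds.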
This looks like the solution: additional wait states are inserted, which make the whole system run slower (but stable, of course).
If we apply this information to the floppy index hole measurement program, the line
LPS=103448! 'loops per second: 3 MHz/29 cycles - dur. of one loop
becomes, with “7 (INX) + 13 (IN) + 5 (ANA) + 13 (JZ) = 38” cycles per loop:
LPS=78947! 'loops per second: 3 MHz/38 cycles - dur. of one loop
As seen in the photo in the previous post, the raw counter result was about 16000, so: 60*78947/16000 = 296 rpm. That is indeed closer to the real number.
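The effect of the loop duration on the computed drive speed can be reproduced quickly. In this sketch, `RAW_COUNT` is the approximate counter value mentioned above and the function name `rpm` is made up for the example.

```python
CLOCK_HZ = 3_000_000
RAW_COUNT = 16000  # approximate raw counter result from the previous post

def rpm(states_per_loop, count=RAW_COUNT):
    """Revolutions per minute from the loop count of one revolution."""
    loops_per_second = CLOCK_HZ / states_per_loop
    return 60 * loops_per_second / count

print("without wait states (29 cycles):", round(rpm(29)), "rpm")
print("with wait states    (38 cycles):", round(rpm(38)), "rpm")
```

The first line shows what the old assumption of 29 cycles per loop would give for the same counter value, the second the corrected result.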
The question now remains: how can we verify that this is indeed what is happening here?
The CPU enters a wait state when the READY signal is held low. So monitoring the READY signal with an oscilloscope could be one way. Another way would be to have a close look at the chips that are used. Maybe the wait state circuitry can be identified. At the very least, the datasheets of the RAM and ROM chips can give information about the maximum expected access times.
The 8085 datasheet also specifies some timing requirements: t_LDR is the time from “ALE to Valid Data During Read”, and the table lists a maximum of 460ns. This maximum depends on the clock cycle time T: t_LDR=(4/2)*T-180, so the 460ns correspond to T=320ns. For our T of 333ns the limit is (4/2)*333-180=486ns. So, if the ROM/RAM chips can provide the data faster than that, no wait state is needed. Otherwise, the CPU needs to wait.
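The formula is easy to play with in Python. This is a simplified sketch: it only looks at t_LDR, the function names are invented, and the two example delays (450 ns and 600 ns) are hypothetical values, not taken from any datasheet of the chips in the P2.

```python
def t_ldr_max_ns(t_cycle_ns):
    """Maximum ALE-to-valid-data time, using the formula quoted above."""
    return (4 / 2) * t_cycle_ns - 180

def needs_wait_state(data_delay_ns, t_cycle_ns=333):
    """True if data arrives later after ALE than t_LDR allows."""
    return data_delay_ns > t_ldr_max_ns(t_cycle_ns)

print("limit at T=320 ns:", t_ldr_max_ns(320), "ns")
print("limit at T=333 ns:", t_ldr_max_ns(333), "ns")

# Hypothetical delays, only to illustrate the comparison
for delay in (450, 600):
    print(delay, "ns ->", "wait state needed" if needs_wait_state(delay) else "ok")
```

A real check would also have to consider the other read timings in the datasheet and the delays of any address decoding logic between CPU and memory.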