
Technically this isn't just about video since it applies to any regularly scheduled DMA¹ from a non-CPU subsystem, but video is the most common application of this technique so I'll use that as the example.

The 6502 needs access to memory only half the time, during what is usually called the ϕ2 phase of the clock. The other half of the time, so long as you tri-state the CPU address pins,² the bus and memory are available to other systems. This is often used for video systems that have a frame buffer in RAM directly accessible to the CPU; the CPU updates the frame buffer during ϕ2 and the video system reads the buffer during ϕ1, and there is never contention for memory access between the two.³
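
To make the interleaving concrete, here's a toy Python model of the scheme; the names and the single-RAM shape are my own illustration, not any particular machine's logic:

    # Toy model: one RAM shared by CPU and video, with bus ownership
    # strictly alternating between the two halves of each clock period.
    RAM = [0] * 65536

    def phi1_video_read(addr):
        # The video scanner owns the bus during phi1; it only ever reads.
        return RAM[addr]

    def phi2_cpu_access(addr, value=None):
        # The CPU owns the bus during phi2; read or write, never contended.
        if value is None:
            return RAM[addr]
        RAM[addr] = value

    # One full clock period gives each subsystem exactly one memory slot,
    # so neither ever has to wait for the other.
    def clock_period(video_addr, cpu_addr, cpu_write=None):
        return phi1_video_read(video_addr), phi2_cpu_access(cpu_addr, cpu_write)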

The Z80 does not have such a well-defined, synchronous system of accessing RAM. Many cycles don't need RAM access, but many do, and which particular ones do and do not depends on the instruction mix. Thus, at least on many early systems, when the frame buffer was in memory shared by the CPU and video subsystem (and the memory was not dual-ported) there was the possibility that both would try to access the memory at the same time and one would have to be denied access. This could be done by pausing the CPU, thus slowing down program execution, or by denying access to the video system, which would cause effects such as snow on the screen because the data to be displayed could not be retrieved.

Given a subsystem such as a video display that needed a certain amount of guaranteed bandwidth and had its frame buffer in single-ported RAM directly accessible by the CPU (i.e., the CPU could execute code from that RAM), was it possible on the Z80 to set things up so that the subsystem would always get that bandwidth and yet the CPU would never be paused or slowed, as can be done on the 6502? What sort of compromises, if any, would be involved in doing this? (By "compromise" I'm thinking of differences in behaviour that you would not have had if you had used dual-ported RAM to share the memory.)


¹ Wikipedia gives a good definition: "Direct memory access (DMA) is a feature of computer systems that allows certain hardware subsystems to access main system memory (random-access memory) independently of the central processing unit (CPU). ¶ Without DMA, when the CPU is using programmed input/output, it is typically fully occupied for the entire duration of the read or write operation, and is thus unavailable to perform other work. With DMA, the CPU first initiates the transfer, then it does other operations while the transfer is in progress, and it finally receives an interrupt from the DMA controller (DMAC) when the operation is done."

² On some versions of the 6502, such as the Commodore 64's 6510, there is a signal to request that the CPU do this itself. On others, such as the Apple II's 6502, external tri-state buffers are needed.

³ There were some systems, such as the Commodore 64, where more than half the memory bandwidth was sometimes needed, and from time to time these would "steal" cycles from the CPU as well as using all the ϕ1 time. But many, such as the Apple II, were designed so that the video system never needed to do this.

  • It's a mistake to think that the 6502 video systems had no impact on the CPU. They all stole power from it in one way or another. Many of the systems demonstrate significant benchmark improvements with the video "off". And it's not just interrupt servicing. For many of the early computers, the primary task was servicing video; the fact that we got any cycles to do real work was a coincidence. Video on these systems has always been expensive, and typically the drum beat to which the rest of the system marched. Commented Nov 13, 2021 at 21:46
  • @WillHartung I covered this in footnote 3. There were plenty of systems where you could remove the entire video subsystem and it would not get any faster. I don't believe that there is a single 6502 system using a ϕ1 video scanner scanning a frame buffer in main RAM that is greatly slowed by the video system. The whole point of that design is that the CPU doesn't do the video generation, of course. I suspect you are thinking of entirely different video systems from the type described in this question. (Atari VCS, perhaps?)
    – cjs
    Commented Nov 14, 2021 at 1:14
  • As for video being "the drum beat to which the rest of the system marched," yes, of course. I thought that trivial enough not to mention. If your ϕ1 clock is designed around your video circuitry, the complementary phase ϕ2 naturally must be as well. As one example, this determined the Apple II's clock speed, which was slightly different for NTSC and PAL versions.
    – cjs
    Commented Nov 14, 2021 at 1:19
  • @Spud: The clock stretching on the Apple II causes every 65th clock cycle to be four chroma clocks long rather than 3.5; interestingly, it would use the standard scan rate precisely if it didn't add such clock stretching, but standard NTSC would have the chroma phase alternate on every scan line. Stretching the scan line from 227.5 chroma clocks to 228 means that showing a solid color will require showing a stripe pattern rather than a checkerboard.
    – supercat
    Commented Nov 14, 2021 at 20:46
  • @Spud: If one doesn't mind having to generate different display signals on alternate fields, using 227.5 chroma cycles per line and 263 lines per frame will offer a better appearance than using 228 chroma clocks per line. What didn't you like about it?
    – supercat
    Commented Nov 17, 2021 at 17:35

6 Answers


Given a subsystem such as a video display that needed a certain amount of guaranteed bandwidth, was it possible on the Z80 to set things up so that the subsystem would always get that bandwidth and yet the CPU would never be paused or slowed, as can be done on the 6502?

No, at least not unless the video runs far slower than the CPU. And even in that case it might need buffering and/or inserted wait states in fringe cases.


While the 6502 is essentially double clocked, using memory only every other cycle, the Z80 has a less strict format. There is only one window in each instruction that can be used for transparent access, if certain conditions are met: the third and fourth clock cycles (T3/T4) of the first machine cycle (M1) of each instruction.

This is where a Z80 usually puts out the RAM refresh address, marked by /RFSH during these two cycles, while ignoring any returned data. It would be possible to use this window for video access. There are two main considerations when going this route:

  • M1 cycles come at varying intervals
  • When using dynamic RAM, refresh has to be provided by other means.

While the second can be easily satisfied by crafting a video address layout that spreads out the video memory in a way that makes sure every row gets accessed in time, the first point holds the real challenge, as instructions are of different lengths, mixed in random sequence (*1).

Instructions come in machine-cycle combinations, in order of likelihood:

  1. M1 only (like NOP or LD A,B)
  2. M1 and up to 4 following M-cycles
  3. Dual M1 and up to 4 following M-cycles (all prefixed instructions)

If there were only M1 cycles, the situation would be almost exactly like that of a 6502, but the second case (*2), instructions with additional machine cycles, complicates things because

  • only the M1 cycle is 4 T-cycles; all others are 3 T-cycles.
  • the number of T-cycles between two consecutive M1 cycles varies between 4 and 16 clocks

Any solution that avoids slowing the CPU will have to cover both, and will need at least one 'read ahead' buffer to equalize the phase shifts introduced by differing M-cycle lengths. When 'empty', this register would be refilled with the next byte to be displayed on the next M1 cycle that comes along.

Let's do a quick back-of-the-envelope estimate for a 40x25 character display, equal to 320 pixels across in B&W. On a 15 kHz screen this means reading 40 bytes per scan line within ~50 µs, or one byte every 1.25 µs. Since the maximum time between two M1 cycles on a Z80 is 20 clocks, such a display system would work fine with a CPU clock rate of 16 MHz.
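
For the skeptical reader, here is the same arithmetic as a small Python sketch (the 20-clock worst case is the figure assumed above):

    # Back-of-the-envelope check of the numbers above.
    bytes_per_line = 40               # 40x25 characters, one byte each
    line_time_us = 50                 # visible part of a 15 kHz scan line
    byte_budget_us = line_time_us / bytes_per_line    # 1.25 us per byte

    worst_m1_gap_t = 20               # assumed worst-case M1 spacing
    # The worst-case M1 gap must fit within the per-byte budget:
    min_clock_mhz = worst_m1_gap_t / byte_budget_us   # 16.0 MHz
    print(byte_budget_us, min_clock_mhz)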

So yes, completely transparent display access can be done, but the CPU clock speed requirements are rather harsh, even if it's just a home-computer-like display (40x24 characters or 320x240 in black-and-white).

A variation with some impact could use such a one-byte buffer and a lower clock rate, stopping the CPU only if the buffer runs empty right before the next byte is to be fetched. How much slowdown this produces depends greatly on each program's instruction mix, much like estimating cache hit rates.

It's up to the reader to work out case-dependent estimates.

And as with a bigger cache, one could use a FIFO of, for example, 2 or 4 one-byte buffers, increasing buffer-hit situations quite a lot. It seems plausible that with a 4-byte FIFO such a display could run tied to a 4 MHz CPU with no or only very little impact - not only because it smooths out the instruction-length variation quite well, but also because it increases the available access time, as the FIFO will be filled during blanking, essentially building up a time buffer of 16 clocks.
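
To get a feel for how FIFO depth changes the underrun rate, here is a crude Python simulation; the gap distribution and the drain interval are arbitrary illustrative assumptions, not measurements of any real instruction mix:

    import random

    # Crude model: M1 gaps drawn uniformly from 4..16 T-cycles, video
    # draining one byte every `drain` T-cycles, and the FIFO refilled
    # by one byte per M1 window. Counts underruns, i.e. CPU stalls.
    def underruns(depth, drain=10, instructions=100_000, seed=1):
        rng = random.Random(seed)
        fill, next_drain, t, misses = depth, drain, 0, 0
        for _ in range(instructions):
            t += rng.randint(4, 16)       # time to the next M1 window
            while t >= next_drain:        # video consumed bytes meanwhile
                if fill:
                    fill -= 1
                else:
                    misses += 1           # underrun: CPU would stall here
                next_drain += drain
            fill = min(depth, fill + 1)   # refill one byte in this M1
        return misses

    for depth in (1, 2, 4):
        print(depth, underruns(depth))    # deeper FIFO, fewer stalls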

Of course, all of this comes with an increased chip count. It needs to be carefully checked at what point the effort needed for buffering exceeds that of adding conditional dual-port access, one which may only slow down the CPU if it accesses video memory during a displayed line, a rather rare occurrence (*3).

Now, if this is about a computer design intended to be sold in high numbers, a controller with an on-chip FIFO could tip the scale again. Here the buffer can be much larger, maybe a whole scan line, which should allow 80 characters (640 px) on a 4 MHz Z80 without much delay. It would be kind of like the VIC, but without the slowdown.


Having said all of that, I would either

Go the way of adding wait states when the CPU is accessing video.

The performance decrease is minuscule, as it only extends the accessing M-cycle by maybe 0-4 T-cycles. So even within an LDIR instruction, the slowdown due to concurrent access will be less than 20%.
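
The arithmetic behind that figure, as a one-line sketch (assuming at most one contended, fully stretched access per LDIR iteration):

    # LDIR takes 21 T-cycles per iteration; if its one video-contended
    # access is stretched by the maximum of 4 wait T-cycles:
    print(4 / 21)   # ~0.19, i.e. just under 20% in the worst case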

Or make all CPU access go through an access port instead.

If the port is 'empty', writing will be immediate; if not, the CPU will be put on hold. This should be quite rare, as the port will usually be able to write to video RAM within 4 CPU T-cycles (*4), far faster than the Z80 can issue a follow-up write. Even LDIR needs 21 T-cycles between two consecutive writes.

Reading will be slower and introduce a 0..4 T-cycle wait penalty to synchronize the access. Then again, reading happens far less often than writing, so it might not be of great concern.
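
Here is a toy Python model of such a write port under the timing assumed above; `port_busy` is the assumed 4-T-cycle VRAM write latency:

    # Toy model of a write-through port: it latches a byte immediately
    # and finishes the VRAM write within `port_busy` T-cycles; the CPU
    # is held only if it writes again before the port is free.
    def total_stall(write_gaps, port_busy=4):
        t, free_at, stalled = 0, 0, 0
        for gap in write_gaps:
            t += gap
            if t < free_at:              # port still flushing last byte
                stalled += free_at - t
                t = free_at
            free_at = t + port_busy      # port busy writing this byte
        return stalled

    # LDIR issues a write only every 21 T-cycles, far above the 4-cycle
    # port latency, so there are no stalls at all:
    print(total_stall([21] * 1000))      # 0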

This is also the way all TMS9918-compatible VDPs are interfaced, and a reason why the 9918 was so widely used with Z80 machines. Of course, it only works with dedicated video memory.


*1 - From the point of view of the video logic, that is.

*2 - The third case can be ignored as it appears as a combination of a type 1 followed by a type 2.

*3 - As usual, depending on application type.

*4 - Assuming a timing like a similar 6502 setup.

  • Nice work; this is exactly the kind of analysis I was looking for. It's probably worth noting that as you increase buffering you are effectively creating frame buffer memory that is no longer shared with the CPU, so that's starting to move away from the "fully shared memory" type of system. (How far one can go down this road before deciding that the memory isn't shared enough any more is of course a matter of opinion, which is why I'm glad you investigated this route.)
    – cjs
    Commented Nov 13, 2021 at 16:26
  • Well, kinda. While technically any buffer can be counted as 'more screen memory', it's not really memory, but a pipeline. More importantly, even with 80 bytes it's tiny compared to screen memory. For a text page it's about a 4% increase; for graphics it's less than 0.5%. So no, not really. Personally I'd rather go the usual way of wait states for the CPU accessing video. Added a section for that.
    – Raffzahn
    Commented Nov 13, 2021 at 20:03
  • "While the second can be easy satisfied by crating a video address layout that spreads out the video memory in a way making sure every row gets accessed in time, the second holds the real challenge, as instructions are of different length, mixed in random sequence" Is that a typo -- should it be "the first holds the real challenge"?
    – NobodyNada
    Commented Nov 14, 2021 at 0:05
  • Using RAM mediated by a port (I assume you're thinking TMS9918-style) doesn't count for this question. I'd mentioned "directly accessible" RAM near the top, but I've restated it at the bottom to make that clearer. Also, if you're going through another device, you could add buffering at that point so that there are no delays at all.
    – cjs
    Commented Nov 14, 2021 at 1:05
  • How about designing a system with a 1kbyte RAM which is connected to the Z80 bus during vertical blank, along with circuitry so that during refresh cycles address bits 7-10 come from bits 3-5 of the scan line counter, and which is connected to display circuitry during the displayed frame? A 4MHz Z80 would generate at least 128 refresh cycles during any group of eight scan lines, so 64 scan lines of vblank could copy 1024 bytes of data from main RAM to the display without interfering with Z80 operation at all.
    – supercat
    Commented Nov 17, 2021 at 17:43

TL;DR: As you stated in the question:

"Many cycles don't need RAM access, but many do, and which particular ones do and do not depend on the instruction mix.

There is no "do nothing" solution to this constraint, and Z80-based systems that wanted consistent video output had to solve it one way (hardware) or the other (software).


The best solution, and the one used when consistent video output was paramount, was simply to add a video subsystem. Here, a display processor with its own RAM ("VRAM") would provide a hardware controller through which the CPU accessed the non-shared memory. See, for example, Sega's Z80-based Master System game console.

Eschewing such a hardware solution meant implementing a simpler synchronization mechanism to block either the display or the CPU as needed, just as you describe in the question. So, through software, a consistent display would be created by the programmer carefully managing CPU access to the shared display memory.

It seems that no "third way" is either necessary or possible.

  • A third way would appear later: dual-port memory.
    – Joshua
    Commented Nov 14, 2021 at 0:37
  • @Joshua Many DPRAMs want the second system to wait if both sides access concurrently. Commented Nov 14, 2021 at 12:16
  • @thebusybee: In the asynchronous-read DPRAMs I looked at in the 1990s, if a read overlapped a write, and a bit which held a 1 was read while it was being written with 0, or vice versa, then that bit in the read side's output could change arbitrarily between 0 and 1 during the read cycle. Waiting in such cases would be necessary if one wants to ensure the read yields either entirely old data or entirely new data, but not if any arbitrary mix of old and new data would be equally acceptable.
    – supercat
    Commented Nov 15, 2021 at 20:50
  • @supercat Sure, that is possible, but the use cases are rare, aren't they? The DPRAM emits wait signals to delay the access cycle of the inferior side, so in most implementations it's simple to wait. Commented Nov 15, 2021 at 21:14
  • @thebusybee: Many systems can be made simpler if timings are predictable than if they aren't. On an asynchronous-read part, sampling a byte three times and then computing (read1 | read3) & read2 may be simpler than adding wait conditions, especially if there's a guaranteed limit on the difference in propagation delay between bits within a single part.
    – supercat
    Commented Nov 15, 2021 at 21:41

If the memory works at the same speed as the Z80 (DRAM cycle time is 1 CPU clock cycle), DMA can have the memory for at least 2 cycles out of 3, presuming that every Z80 memory access takes exactly one DRAM cycle.

This idea was implemented in some Russian ZX Spectrum clones, the best known of which is the "Pentagon".

A similar idea is used in Amiga computers, where the 68000 CPU is able to run out of chip memory without wait cycles (provided there are not too many bitplanes or other DMAs active).

Update: As @cjs has some doubts about my statements, here is an example diagram containing the clock, Z80 cycles, some Z80 signals, and how DRAM cycles could be scheduled. The pattern shown is valid for an instruction like LD IX,(addr).

[Diagram: Z80 cycles vs DRAM cycles]

Update 2: In the Pentagon ZX clone, the video fetch requires at least 2 bytes in every 4-clock video cycle (which maps to an 8-pixel byte shown on the screen). It can be seen from the diagram that the video will always get that amount, with any possible additional accesses simply going unused.
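
The "2 cycles out of 3" claim can be checked mechanically. Here is a sketch under the assumptions above: one DRAM cycle per CPU clock, at most one memory transfer per M-cycle, and refresh cycles not hitting the shared DRAM:

    # Fraction of DRAM cycles left for video, given a stream of Z80
    # M-cycle lengths (one memory transfer per M-cycle assumed).
    def video_share(m_cycle_lengths):
        total_t = sum(m_cycle_lengths)        # one DRAM cycle per T-cycle
        cpu_accesses = len(m_cycle_lengths)   # one transfer per M-cycle
        return (total_t - cpu_accesses) / total_t

    # LD IX,(addr): M-cycles of 4+4+3+3+3+3 T-states, six transfers.
    print(video_share([4, 4, 3, 3, 3, 3]))    # 0.7, i.e. at least 2 of 3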

  • What kind of memory speeds compared to Z80 speeds are we talking about? For typical early 2 MHz and later 4 MHz Z80 systems, were typical contemporary memory speeds up to that, or would one have to use much faster memory than was commonly used?
    – cjs
    Commented Nov 15, 2021 at 23:59
  • Obviously for a 2 MHz Z80 the same 500 ns cycle-time memory as used in the C64 would fit. For 4 MHz, one needs 250 ns cycle-time memory.
    – lvd
    Commented Nov 16, 2021 at 2:37
  • That seems to contradict Raffzahn's answer, where he said the video would have to run "way slower than the CPU. And even in that case it might need buffering and/or insert wait states in fringe cases." What am I missing here? Perhaps you could edit your answer to add the cycle sequences for this, so they could be compared with Raffzahn's?
    – cjs
    Commented Nov 16, 2021 at 6:46
  • Related: retrocomputing.stackexchange.com/questions/7655/… Commented Nov 16, 2021 at 19:58
  • @cjs so I updated my answer with a diagram.
    – lvd
    Commented Nov 17, 2021 at 15:52

To answer the specific question in a very literal way:

was it possible on the Z80 to set things up so that the subsystem would always get that bandwidth and yet the CPU would never be paused or slowed, as can be done on the 6502?

Yes. The Galaksija is a computer that does this. The display subsystem gets full bandwidth and the CPU is not technically slowed down.

What sort of compromises, if any, would be involved in doing this? (By "compromise" I'm thinking of differences in behaviour that you would not have had if you had used dual-ported RAM to share the memory.)

The compromise is limiting the opcodes that may be executed while the display is running.

The way this works is that the display generation hardware is set up to read the byte from the screen buffer directly off the data bus during T3. At this time, the Z80 does a read from an address involving R, the register designed for DRAM refresh. The Galaksija hardware rigs this to fetch the character from the screen buffer (this byte is looked up in a character ROM, and the result is clocked out for the next eight pixels).

As you can imagine, variation in the timing of the instruction stream would be detrimental to the picture on the screen. So the machine code running at this time needs to be very carefully timed: it uses only M1 cycles for the 32 cycles during which the computer is generating the active part of the display (i.e., 31 single-byte opcodes followed by one more instruction).

Parts of this same machine code happen to also be useful data: they contain the string BREAK and the floating-point constant 1.0. The same code also counts the scan lines, so it can set the hardware up to start displaying the next one. And of course, it keeps track of when it needs to break out of this loop.
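
A rough timing check of the scheme; the clock value here is my assumption about the Galaksija, not something stated in this answer:

    # 32 M1-only "instructions" of 4 T-states each per scan line.
    clock_mhz = 3.072             # assumed Galaksija Z80 clock rate
    t_states = 32 * 4             # 128 T-states of M1-only code
    print(t_states / clock_mhz)   # ~41.7 us, which fits the visible
                                  # part of a ~64 us scan line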

  • Except, it does not run any user code during this time. It's more a variation of the cheap-video / ZX80 method, making the CPU part of the video generation.
    – Raffzahn
    Commented Nov 16, 2021 at 11:41
  • @Raffzahn, yes, but the OP also doesn't mention concurrently running the user code. Commented Nov 16, 2021 at 13:27
  • Well ... that's a very fine line to dance about, isn't it? :)
    – Raffzahn
    Commented Nov 16, 2021 at 13:55
  • @Raffzahn It's not actually a fine line at all because in the very first paragraph I do indicate that the CPU should be running arbitrary user code concurrently with the video transfer. I've added a footnote quoting the Wikipedia definition of "DMA", which describes the definition I'd intended.
    – cjs
    Commented Nov 16, 2021 at 14:22
  • To be clear, if there's a trick involving the built-in refresh system of the Z80 generating (or helping to generate) the video addresses, that could fall under the definition of "DMA," so long as the CPU is still running arbitrary user code. But it sounds to me here as if the CPU is simply doing the video I/O, or at least generating the addresses for the video I/O system in a way that essentially falls under the definition of "programmed I/O." (But I could be misreading this answer.)
    – cjs
    Commented Nov 16, 2021 at 14:25

The Amstrad CPC range of Z80 machines 'solved' this problem by interleaving RAM reads by the video gate array with Z80 reads and writes.

As some Z80 M-states last 3 clock cycles and some last 4, the video gate array uses the WAIT line to pause the Z80 every time it accesses video memory. This doesn't affect the 4-cycle states, but effectively stretches the 3-cycle states to 4. Because of this, the 4 MHz Z80 appears to run at around 3.3 MHz.
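
The "around 3.3 MHz" figure follows from the stretching arithmetic; the instruction mixes below are illustrative only:

    # Effective speed when every 3-T-cycle M-state is stretched to 4.
    def effective_mhz(m_states, clock=4.0):
        original = sum(m_states)
        stretched = sum(4 if t == 3 else t for t in m_states)
        return clock * original / stretched

    print(effective_mhz([4, 3, 3]))   # a 10T instruction -> ~3.33 MHz
    print(effective_mhz([3, 3]))      # worst case, all 3T  -> 3.0 MHz
    print(effective_mhz([4, 4]))      # best case, all 4T   -> 4.0 MHz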


To add to the Spectrum answer: transparent video access was used in the Spectrum 128, not just the clones. Not that it matters. Indeed, the method used required a memory access time of 1 clock cycle. But there were only 32 characters per line.

Now suppose we want 64 bytes per line. Using 6 pixels per character we could even squeeze in 80 characters for a CP/M machine. Use a 10 MHz clock for both the CPU and the pixel clock. Static memory chips of 128K x 8 bits with 55 ns access time are plentiful and cheap. No problem!
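
The arithmetic behind "no problem", as I read it; the visible-line figure is my assumption for a standard 15 kHz display:

    # 80 characters of 6 pixels each at a 10 MHz pixel clock:
    active_us = 80 * 6 / 10            # 48 us of a ~52 us visible line
    fetch_ns = 6 / 10 * 1000           # one byte fetch every 600 ns
    slots = fetch_ns // 55             # 55 ns SRAM cycles in between
    print(active_us, fetch_ns, slots)  # plenty of slots left for the CPU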
