;-*- Mode:Text -*-

Set associative means the cache is addressed by part of the access address.
There is one comparator per set, and probably one chip-depth of memory.
Each set is addressed by that part of the access address; the tag field in
the addressed location is compared with the rest of the address bits,
and if the tag is valid and matches, the cache data field is returned
for a "hit".

Only one of the sets can possibly contain a valid match, so multiple
sets work without arbitration.  The "hit" out from each is OR'd to
give the "ready" response to the instruction stream.
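
The lookup described above can be sketched in software.  This is only
an illustration, not the hardware design: INDEX_BITS (1K locations per
set), the Entry record, and the names make_cache, split, lookup, and
fill are all assumptions made for the example.  "Set" is used in the
same sense as in these notes (one comparator per set).

```python
from dataclasses import dataclass

INDEX_BITS = 10  # assumption: 1K locations per set


@dataclass
class Entry:
    valid: bool = False
    tag: int = 0
    data: int = 0


def make_cache(n_sets):
    """Build n_sets independent memories, all entries invalid."""
    return [[Entry() for _ in range(1 << INDEX_BITS)] for _ in range(n_sets)]


def split(addr):
    """Low bits address the set; the remaining bits are the tag."""
    return addr & ((1 << INDEX_BITS) - 1), addr >> INDEX_BITS


def lookup(sets, addr):
    """Probe every set at the same index; OR the per-set hits into 'ready'."""
    index, tag = split(addr)
    hits = [s[index].data for s in sets
            if s[index].valid and s[index].tag == tag]
    ready = bool(hits)  # at most one set should match
    return ready, (hits[0] if ready else None)


def fill(sets, which, addr, data):
    """Write one entry; the caller picks the set and must not duplicate a tag."""
    index, tag = split(addr)
    sets[which][index] = Entry(True, tag, data)
```

Example: after fill(sets, 1, 0x12345, 99), lookup(sets, 0x12345) reports
a hit from set 1, while an address with the same index but a different
tag misses.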

For a "miss", the instruction stream is frozen while the cache is filled.
 1.  Fill whole block of 2^N words, then restart original access.
	Simplest logic; slowest access to desired word, fastest access
	to adjacent words.
 2.  Read desired word; return "ready" as data is present, then
	fill in rest of block.  Processor is not delayed for rest
	of block if words are read in ascending order, but timing
	is complicated in ensuring that "ready" can be returned
	for a just-completing cache-fill cycle even if cache does not
	contain the data yet.
    Optimization: start cache-fill at desired word, stop at block-size
	boundary, and have separate valid bit for each word in block.
	Is fastest but hairiest.
    Suggests that cross with (1) is better: fill in whole block,
	but return "ready" as desired word is written.
	Note that RAM with common I/O causes "hit" hardware to compute
	a valid "hit" response on data being written into the cache;
	suggests that fast ready response is cheap.
    Following words continue to be filled in; cache is busy being filled
	as next processor fetch appears, but same incoming-hit response
	works for this too.  Is easier if cache is synced with machine cycle.
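
The fill policies above differ in which words get written and in how
many words must be written before "ready" can be returned.  A minimal
sketch of that difference, counting words only (no timing); the policy
names and the function fill_plan are made up for the example, and the
cross policy is assumed to fill in ascending order from the start of
the block:

```python
def fill_plan(policy, desired, block_words=4):
    """Return (fill order, words written before "ready") for one miss.

    desired is the word offset within the block that the processor wants.
    """
    if policy == "whole_block":          # (1): fill everything, then restart
        order = list(range(block_words))
        ready_after = block_words
    elif policy == "ready_on_desired":   # cross: fill everything, early ready
        order = list(range(block_words))
        ready_after = desired + 1
    elif policy == "start_at_desired":   # optimization: needs per-word valid bits
        order = list(range(desired, block_words))
        ready_after = 1
    else:
        raise ValueError(policy)
    return order, ready_after
```

For desired word 2 of a 4-word block, the three policies return "ready"
after 4, 3, and 1 words written respectively; only the last leaves words
0 and 1 invalid.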

Tentative machine cycle is 70ns.
256K DRAM: access 120ns, cycle 230ns, nybble-burst 60ns.
1M DRAM: access 100ns, cycle 200ns, nybble-burst 50ns.

If synced with machine cycle, access is 140, cycle 280, nybble 70.
Easier said than done; cache cycle is skewed from machine clock.

Synced time for 4-word load is 140 + 3*70 = 350, not counting precharge of 140 (total 490)
Max time for 4-word load is 120 + 3*60 = 300, precharge 110 (total 410) (350 / 420)
For 1M is 100 + 3*50 = 250, precharge 100 (total 350)
Overall synced 1M is 280; with precharge 350 (420?) (280 / 350; probably 280 / 420)
128-bit parallel 1M (1M x 16 bytes = 16MB) cache fill in 100, with precharge 200. (140 / 210)
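
The first three 4-word load figures above follow from access time plus
three nybble-burst times, with precharge taken as cycle minus access.
A small check of that arithmetic; burst_load is a name invented for
the example, and the 128-bit parallel case is not modelled:

```python
def burst_load(access, burst, cycle, words=4):
    """Return (load time, load time plus precharge) in ns."""
    precharge = cycle - access          # assumption: precharge = cycle - access
    load = access + (words - 1) * burst
    return load, load + precharge


SYNCED = burst_load(access=140, burst=70, cycle=280)     # synced with 70ns clock
DRAM_256K = burst_load(access=120, burst=60, cycle=230)
DRAM_1M = burst_load(access=100, burst=50, cycle=200)
```

This gives (350, 490) synced, (300, 410) for 256K parts, and (250, 350)
for 1M parts, matching the figures above.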

Sequential cache-miss-and-fill:
First takes 140; rest take 70, precharge 70.


If processor wants successive words as they are filled in, it gets
them with no delay at all.  But if the block size is one word,
the cache becomes fragmented and the advantage of the burst mode
is lost.  So, it is better to stick with the 4-word blocks and
freeze the processor to load all four.

Hmmm.

Easiest thing to do is freeze the machine with the current address
asserted; then no cache addr mux is required.  Can do 50ns this way.

Note that all times include the 70ns that is allowed anyway.

(1)
Freeze processor until whole block is filled in.
Saves cache addr mux time.
Cache miss:
	70ns	normal cache cycle; detect miss
	140	first word access time
	3 * 70	load rest of 4-word burst
	70	repeat normal cache cycle to get desired data
	===
	490	access time for miss; DRAM is always idle.

(2)
Release processor as soon as desired word is filled in.
Requires cache addr mux to finish cache-fill if next addr
from processor is not the expected addr; requires addr comparator
to freeze processor if addr mux inputs don't match.
Max benefit ...
Cache miss from idle DRAM:
	70ns	normal cache cycle on first word; detect miss
	140	first word access time
	N * 70	load rest of 4-word burst,
		but release processor as soon as desired word is read.
	===
	210 - 420
		access time for miss, from idle DRAM.
		For in-line code, miss is on first word = 210ns.
		For jump, avg. is 315ns.

Cache miss on word that is being filled in:
	===
	70	access time for next word of burst (free)

Cache miss from busy DRAM:
	N * 70	load rest of 4-word burst
	140	precharge, overlaps 70ns cache cycle to detect miss.
	140	access first word
	N * 70	load rest of 4-word burst, up to desired word
	===
	280 - 700
		Random jump: average; 490ns.
		In-line miss: best case; 280ns.

Method (1) has fixed miss time of 490ns for jumps and in-line misses.
Method (2) has miss time of 490ns for jumps, 210-280 for in-line misses.
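
The miss times for (1) and (2) can be reproduced from the step lists
above.  A sketch of the arithmetic, assuming the desired word and the
number of words left in a previous burst are each uniformly 0 to 3; the
function names are invented for the example:

```python
CYCLE = 70       # machine cycle, also synced burst time per word
ACCESS = 140     # synced first-word access
PRECHARGE = 140  # synced precharge
BLOCK = 4        # words per block


def method1_miss():
    """Detect miss, access, rest of burst, repeat cache cycle."""
    return CYCLE + ACCESS + (BLOCK - 1) * CYCLE + CYCLE


def method2_idle(desired):
    """Detect miss, access, then burst up to the desired word."""
    return CYCLE + ACCESS + desired * CYCLE


def method2_busy(remaining, desired):
    """Finish the previous burst, precharge (covers miss detect), access, burst."""
    return remaining * CYCLE + PRECHARGE + ACCESS + desired * CYCLE


idle = [method2_idle(d) for d in range(BLOCK)]
busy = [method2_busy(r, d) for r in range(BLOCK) for d in range(BLOCK)]
idle_avg = sum(idle) / len(idle)
busy_avg = sum(busy) / len(busy)
```

This yields 490 for method (1); 210 to 420 with average 315 for method
(2) from idle DRAM; and 280 to 700 with average 490 from busy DRAM.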

Most misses are (?) for jumps, so the in-line code advantage of (2)
is probably not worth the extra hardware.

;;;;;;;;;;;;;;;;

5/4
current plan:
load Icache in 32-bit accesses:
	70ns	detect cache miss
	140	first word access
	7*70	rest of 8 32-bit words
	70	cache hit
	===
	770ns	cache miss

averaged over 4 instructions:
	192ns / instruction (instead of 70)

load Icache in 64-bit accesses:
	70ns	detect cache miss
	140	first word access
	3*70	rest of 4 64-bit words
	70	cache hit
	===
	490ns	cache miss

averaged over 4 instructions:
	122ns / instruction
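
The two totals and per-instruction averages can be checked the same
way.  A sketch, assuming each block holds 4 instructions (as the
averaging above implies); miss_time is a name invented for the example:

```python
def miss_time(transfers, cycle=70, access=140):
    """Detect miss, first access, remaining transfers, then the cache hit."""
    return cycle + access + (transfers - 1) * cycle + cycle


INSTRUCTIONS_PER_BLOCK = 4  # assumption taken from "averaged over 4 instructions"

miss_32 = miss_time(8)   # eight 32-bit transfers
miss_64 = miss_time(4)   # four 64-bit transfers
avg_32 = miss_32 / INSTRUCTIONS_PER_BLOCK
avg_64 = miss_64 / INSTRUCTIONS_PER_BLOCK
```

The totals come out to 770 and 490; the averages are 192.5 and 122.5,
which the notes above give truncated as 192 and 122.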

;;;;;;;;;;;;;;;;

two ways of doing Icache:

15ns tag ram / 25 ns data ram:
	55 ns total
	1K x 10 x 2 sep I/O 15ns cache tag ram
	4K x 64 x 2 common I/O 25ns cache data ram
	saves 16 '374s plus pins on cache data ram
requires hairy Cypress ram chips.

25ns tag ram / 25ns data ram:
	50 ns total
	1K x 10 x 2 sep I/O 25ns cache tag ram
	4K x 64 x 2 sep I/O 25ns cache data
	requires 16 '374s plus I/O pins on cache data
still requires 25ns sep I/O chips.

;;;;;;;;;;;;;;;;

Other Icache issues:

parity:
compute parity on write
check parity on read
timing doesn't matter -- signal error during next instruction.
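
A sketch of that parity scheme, with the check result latched and
reported later instead of holding up the read.  Everything here is
illustrative: a single even-parity bit per word is an assumption, and
ParityRam and take_error are invented names.

```python
def parity(word):
    """Even parity: 1 if the word has an odd number of 1 bits."""
    return bin(word).count("1") & 1


class ParityRam:
    def __init__(self, size):
        self.data = [0] * size
        self.par = [0] * size
        self.pending_error = False

    def write(self, addr, word):
        """Compute parity on write and store it beside the data."""
        self.data[addr] = word
        self.par[addr] = parity(word)

    def read(self, addr):
        """Return data immediately; latch any mismatch for later."""
        word = self.data[addr]
        if parity(word) != self.par[addr]:
            self.pending_error = True
        return word

    def take_error(self):
        """Called during the next instruction: report and clear the latch."""
        err = self.pending_error
        self.pending_error = False
        return err
```

Because the error is only latched, the read path stays as fast as an
unchecked read, which is the point of "timing doesn't matter".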

Run from ROM:
anything better than 299's?  Shift 64 bits into parallel outputs,
then clock into IR.

Accessed bit per page; spy path

Do separate input/output 299-style shift registers exist?

;;;;;;;;;;;;;;;;

disabling cache?

run just from IR, or require at least cache data to work?
	IR ...

;;;;;;;;;;;;;;;;

cache block size of 4 words ok?
probably good for Icache;
maybe 2 for Dcache?
	cons = 2 words, even aligned
	symbols = 5 words ...

;;;;;;;;;;;;;;;;

data cache:

complicated by cache-before-maps

Simplest:
Can't write valid cache entry unless data memory is also written,
so don't WRITE directly into cache on a write.  DO invalidate
the cache entry.
Which cache entry?  Detect a hit, and if HIT and WRITE, rewrite
cache to invalidate.

Better:
only invalidate cache on write if the maps show that the real
word can't be written.  Assumes that rest of cache block is
already correct.  Good, because it prevents continuous faults
on blocks that are read and written a lot.
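
The two write policies can be sketched as follows.  This is one reading
of the notes, not a settled design: the Block record, the 4-word default,
and the map_allows_write flag are assumptions, and updating the cached
word in place on a permitted write is inferred from "rest of cache block
is already correct".

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    valid: bool = False
    tag: int = 0
    words: list = field(default_factory=lambda: [0] * 4)


def write_simplest(block, tag):
    """On any write that hits, rewrite the entry as invalid."""
    if block.valid and block.tag == tag:
        block.valid = False


def write_better(block, tag, word_index, value, map_allows_write):
    """Invalidate on a hit only if the maps say the real word can't be written."""
    if not (block.valid and block.tag == tag):
        return                           # miss: nothing cached to fix up
    if map_allows_write:
        block.words[word_index] = value  # assumption: update in place
    else:
        block.valid = False
```

With write_better, a block that is read and written repeatedly stays
valid, so it does not miss again after every write.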

;;;;;;;;;;;;;;;;