This is another testbed for prototyping the lightweight dynamic translation
scheme in plex86.

You can tweak parameters at the top of 'dt.h' before compiling,
then just type 'make'.  I will be adding more parameters and
modeling to this code over time.

Guest code is actually generated in 'guest.c'.  I will also
be adding different kinds of guest code sequences to that file.

The first rev of this guest code exercises static out-of-page
branches in nearly the worst possible scenario.  I generate
code which weaves a path branching back and forth between two
arbitrary guest pages.  At the end, this whole process is
repeated according to 'DT_LOOP_COUNT'.

I did this purposely to magnify the overhead imposed by the branch
handling shims and code as much as possible, so that we can work
on trimming them down by trying various techniques.  This is not
anywhere near realistic for code, but it's useful for development.

Some results thus far, based on the command 'time ./dt' are below.
Once this is trimmed down, I may model some more realistic guest
code.

  Native: 0.72 seconds



   DT    Slowdown factor
 54.68s  76:1              Wed Jan 10 17:04:06 EST 2001
                           First effort.  Shims require that DT
                           code be constrained to a given SS, so
                           that SS can be used to save state.  All
                           branches vectored through handler routine.

 46.90s  65:1              Fri Jan 12 00:29:15 EST 2001
                           Shims assume GS is virtualized and used as a
                           data segment for the tcode.  This lets us
                           save guest state more efficiently.  Branches
                           still going through handler routine.

  5.22s   7:1              Added dynamic direct branch backpatching.
                           The handler routines patch in the address
                           of the tcode at the branch target address
                           dynamically.  The generated code first examines
                           an inline ID token and compares it to a global
                           ID token, which increments for each context
                           switch (change of the page tables etc).
                           As long as there is a match, the direct branch
                           is taken.  This is a simple and fairly efficient
                           way to have branches revalidate direct linkage
                           to other tcode sequences across context switches,
                           while maintaining no branch tables of any kind.


Considering that the pseudo-guest code does nothing but thrash
with out-of-page branches (weaving back and forth), that there
are no in-page branches (which would translate 1:1) or tight-loop
code cycles, and that the pipelines are never allowed to fill,
the last number is pretty good.  This is an exercise of pure overhead.

Essentially each static out-of-page branch only uses the branch handler
routines once after every context switch, where a new binding is
built.  For fun, I may add a signal handler and timer interrupt
to increment the context switch token.

Now, to play with dynamic (computed) branches.  If branch target
lookups can be done with reasonable efficiency, I think we have
something.

One thing that comes to mind: I wonder how, with a limited
code cache (however big we end up making it), infrequently
used code paths will compete for space with ones which are
very active.  Our translation will be quite
lightweight, so retranslating is not as bad as it could be,
but perhaps it will pay to associate a hit-count with
each code page, and strictly emulate within that page until
a certain threshold is reached, and/or inline similar instrumentation
in the tcode.

Anyways, I like how things look so far...

======================================================================
Fri Jan 12 15:33:49 EST 2001

The workload before each guest branch instruction was previously only
a NOP.  I changed this to create guest code which had a small tight-loop
section before each out-of-page branch.  I chose to use a cascading add
loop to make sure the CPU was kept busy and didn't parallelize things
(and thus compress time spent in the loop).  You can set the number
of tight-loops with 'DT_MicroLoops' in dt.h.  Some results:

                       Native       DT     factor (DT/Native)
  DT_MicroLoops==  5:   0.89      1.46      1.64
  DT_MicroLoops== 10:   2.02      2.50      1.24
  DT_MicroLoops==100:  17.62     18.08      1.03

This doesn't factor in dynamic (computed) branches, which are used
in dense switch statements, C++ virtual functions (and C function
pointers), and (implicitly, via the stack) by the 'return' statement.

Anyways, you can see that as the real workload that is executed
between overhead instructions (like out-of-page branches) increases
(approaching more realistic code), the overhead factor significantly
improves.  Even low tight-loop iteration counts yield pretty good
performance.

======================================================================
Fri Jan 12 22:53:19 EST 2001

The generated tcode for static out-of-page branches has a token
embedded inline which is backpatched along with the direct tcode
branch address.  When this token matches a global token, the tcode
knows the direct branch address is valid.  Up to now, the global
token was not incremented as it would be in a real VM environment.

The idea is that the global token is incremented each time there
are changes to the page mappings, like a PDBR reload etc.  This
lets the tcode dynamically re-adapt itself to possible page mapping
differences since the last time the code was executed.  For example,
since the last context switch, a code page could have been swapped out
or have been otherwise remapped.  We would not want to execute associated
tcode for that page until we have revalidated that the conditions
under which the tcode was generated are the same for the page.
The inline token enables this check.

The cost of this method is extra storage per static out-of-page
branch, and the execution time of the token compare and eflags
state management.  The upside is the simplicity.  No big tables
or branch graphs are stored, or need invalidation/revalidation
management every guest context switch.

I used the setitimer()/signal() services to simulate a context
switch and for now, just increment the global token.  This
will force all the branches to use the handler routine for the
first time they are executed after the context switch.  The
handler routine backpatches the new token and tcode address
inline.  There will eventually be some constraints checking
on the given codepage.  For now I don't do any.

Anyways some results.  I scaled up the macro loops count.

  execution time    guest timeslice
  14.55             500000 us   (2Hz)
  14.60              10000 us (100Hz)

Only an extra 0.3% overhead for a higher frequency context switch,
on top of the overhead imposed by the extra code in the branch tcode
talked about previously.  In other words, the factors listed in
the previous section already include most of the overhead.  The
extra revalidation doesn't weigh that heavily.  There will be more
overhead with a real VM system, but it's comforting.  I think this
area is where a lot of user-space-only DT strategies that use more
complicated flow graphs would run into trouble.  User-space DT
efforts can make assumptions about the consistency of
linear->physical mappings across context switches, since the OS
takes care of paging automagically.  There are only a few system
calls you have to watch out for and dump tcode on.

But a system-oriented DT strategy cannot make these assumptions.  So
we are faced with the following choices:

  1) Use direct branches to target tcode.  Maintain branch trees of
     some sort.  There would have to be a certain amount of maintenance
     involved with fixing up the use of direct branches every context
     switch.  This overhead gets magnified with increases in the
     guest context switch frequency (using 1000Hz instead of 100Hz),
     and with clock skewing to keep the guest time reference in sync
     with the host.  It is also more complicated and requires more
     memory.  The direct branches would be much faster, but I'm not
     sure how this will balance out with the extra context switch
     burden.  Note that with higher loads on the host, more clock
     skewing needs to be applied to the guest which magnifies the
     effective guest context switch frequency.  Thus this method
     gets incrementally worse with higher host loads.

  2) Always generate a call to a branch handler.  This is simple,
     but slower.

  3) Generate a simple check inline, then use the direct branch most
     of the time.

I chose #3.  Admittedly, it strives for "pretty good" rather than
"great" or "excellent".  A decent balance of performance, simplicity,
and scalability to host load.

Working backwards, now it's a little easier to explain why the
methods I proposed in the Plex86 Internals Guide (PIG) are
page oriented - because of this context switch dynamic tcode address
revalidation process.  It's also why there are constraints (or
perhaps just a constraints ID) notated on the meta information
for each page, on the included graphics.

======================================================================
Sun Jan 14 20:57:15 EST 2001

Dynamic branches.  I hacked some guest code which resembles
a dense switch statement like:

  for (macro_loops=DT_MacroLoops; macro_loops>=0; macro_loops--) {
    for (s=31; s>=0; s--) {
      switch (s) {
        case 0:  WORKLOAD(); break;
        case 1:  WORKLOAD(); break;
        case 2:  WORKLOAD(); break;
        ...
        case 31: WORKLOAD(); break;
        }
      }
    }

The inner loop races through the case targets.  The outer loop just
repeats the inner loop.  WORKLOAD() can be varied to be a NOP instruction,
or a repeating (variable by DT_MicroLoops) add cascade code block to
keep the CPU busy.

My hand coded guest uses a branch table lookup, like a compiler
would for a dense target.  Such computed branches are worse than
static targets, since they always have to be computed.

My first effort generates DT code which always calls the branch
handler assembly routine which always saves all the guest state,
calls the C function, and restores guest state.  This is not
nearly optimal; an initial hash table lookup could be coded inline
(with the downside of code bloat), or in the assembly shim before
all state is saved.  The static branch case didn't force me to do
that yet, because once the target was found, the direct address was
backpatched.  So the suboptimal handler case was not used enough
to matter.

Here are the results of the first effort:

  workload      microloops    native      DT     factor(DT/native)
  ----------------------------------------------------------------
  NOP                           .52      9.69    18.6
  add cascade     5            1.87     11.08     5.9
  add cascade    10            3.59     12.83     3.6
  add cascade   100           27.24     36.43     1.3

As you can see, always diverting branch lookups through the
C handler code is not very efficient.  It takes too many non-overhead
instructions (workload) to average out the cost of the expensive computed
branch handling.

Fortunately, the initial hash table lookup and single cache line
oriented search can be done in an assembly shim quite simply.
If there is a miss, then the C code can be called.  I'll try
that next for a second effort.  I won't be able to get the
factor down as low as with static branches; computed branches
are more of a worst-case scenario.

======================================================================
Tue Jan 16 12:17:09 EST 2001

For a second effort at dynamic branches, I handcoded an assembly
handler which performs the target address hash function and then
searches each of four entries in the set.  If no match is found,
the C handler code is then called.

======================================================================
Fri Jan 19 11:46:43 EST 2001

Oops, after adding a little instrumentation, I realized I had
a bug.  Fixed that and reran to get a new set of numbers to
replace the ones I posted.  New results follow; they're close
to the ones before.


                                              (DT/native)
                                           2nd effort  1st effort
  workload      microloops  native    DT   factor      factor
  ----------------------------------------------------------------
  NOP                         .52    6.39   12.3       18.6
  add cascade     5          1.87    7.55    4.0        5.9
  add cascade    10          3.59    9.43    2.6        3.6
  add cascade   100         27.24   33.07    1.2        1.3

These numbers show a little better performance for dynamic branches.
It took only about 5 loops through the add cascade to average out
the slow-down factor to a similar level as 10 loops in the previous
table.

I suppose some more gain could be had from inlining at least
the first hash table entry lookup/compare in the generated
code, rather than do it all in the handler, but that means
more code bloat.  So, good enough for now.  With some real
instrumentation data gathered running real code in the future, it
should be more evident what the code cache size constraints are.  And
how increasing the generated code footprint affects the native CPU
caches.

That's about all of the non-1:1 instructions I'm interested
in working on.  There are some others we can translate rather
than trap on, like selector reads (setting the RPL bits as
expected), eflags reads and such, but they're not especially
interesting or complex to handle.  Branches are the main concern.

Onward to hash out some other architectural details.  I'll manage
a small TODO list.  When that list is depleted, it'll be time to
hammer this stuff into plex86.

======================================================================
Sat Jan 20 23:27:06 EST 2001

Ramon reorganized things so that he could play with more
advanced DT techniques in parallel to the mainstream lightweight
DT development.

======================================================================
Tue Jan 23 19:17:15 EST 2001

Ah, much better!  The previous rev required backpatching to
be off in order to test dynamic branches.  Ramon noticed this
and made a fix to allow static branch backpatching to work
along with the dynamic branch code, as it should.  There
were also static branches in the test guest code, so disabling
the direct tcode address backpatching was dragging the
performance down.  It now takes much less actual workload
to average out the cost of the dynamic branch handling technique.

                                          (DT/native)
                                           3rd effort
  workload      microloops  native    DT   factor
  ----------------------------------------------------------------
  NOP                         .52    2.68    5.2
  add cascade     5          1.87    4.08    2.2
  add cascade    10          3.59    5.76    1.6
  add cascade   100         27.24   29.35    1.1
