		Things To Do in CRM114 (rough priority list)



1) redirect OUTPUT  (DONE!)  "output <flags> (filename) /interp_val/

2) write WINDOW  (timeout) /dropregex/ /readregex/ drops to end of dropregex,
   then reads in till readregex or timeout.  (DONE!  Well, at least done enough
   to parse a syslog.) (DONE!)

6) get MATCH to bridge end-of-line. (DONE!)

8) get MATCH to work on fromnext/fromend/tillnew  ( DONE! )

4) write ALTER  ( DONE!!)

12) make <flags> and (vars) optional, rather than having matches 
    wierdly succeed if they aren't present.  ( DONE! )

3) write LEARN and CLASSIFY  (DONE!)

9) get argv (or the command line) and the environmment into the VHT table.
   ( DONE  !!!)

15) if WINDOW is first statement, don't suck in till EOF, let window
    do the work. (  DONE!!!)

16) let WINDOW check for match by character, newline, or eof (flags
    (with flags "bychar", "byline", and "byeof").  (DONE except for 
     some skankyness when doing bychar on a TTY)

5) allow MATCH on variables or other files.  (: means var, otherwise 
   it's a file?  Or should we have an explicit flag for other-than-input?)
   ( DONE!!! )

13) write ISOLATE  (DONE !!!)

7) get EXEC to run other executables (see netcat for help)  (DONE!!!)

18) fix cdw/tdw bug in ALTER.  ( DONE !!!)

19) fix EOF bug in SYSCALL  ( DONE !!)

20) get rid of all those "unused variable" warnings.  ( DONE!!! )

**** prealpha release here ****  (DONE!!!)

* Require that ';' be escaped unless it's an end-of-statement.  ( DONE!!!)

* Have preprocessor do "#insert"s.    ( DONE!!! )

* Make Quick Reference Card .txt (  DONE!!!)

* Write ReadMe  ( DONE  !!!)

*****  alpha release here ******

10) allow command-line program input {execute anything inside curly braces}
    ( DONE!!! )

17) fix memory leak in INPUT...  make it an ALTER.  ( DONE!!! )  

* Make a reasonable way to to keep an "SYSCALL" process around, so the exec
  process retains useful state.  (DONE!!!)

* Fix bug in output so that a /'ed filename isn't taken as the output 
  pattern.  ( DONE !!!!)

* Fix bug in output to allow appending to a file with a  " +" on the filename.
  ( DONE!!! )

* Write up "N useful idioms" in ReadMe  ( DONE !!! )

***** FRESHMEAT release here *****

* have previous match start and end points be part of the variable being
  matched against, not against the buffer (cdw or tdw) being matched 
  against. ( DONE!!!)

* maybe make a builtin variable for the entire data window, maybe 
  call it :_dw: ?  Make :_dw: the default matched
  variable.   ( DONE !!!)

* bugfix: change :*: to evaluate _once_, not iteratively... and have it
  evaluate unset variables to their valu sans :'s.  (DONE!!!)

* merge cdw and tdw into one big string storage space.  This would
  make the autosizing problem somewhat easier. (DEFERRED -maybe not -
  will really slow down string alterations)

* bugfix - should give a better error msg than SEGFAULTing when
  SYSCALLing with zero or one (paren) argument. (DONE!!!)

* bugfix - if you use a '[' ( including [[:foo:]] patterns) in the
  match pattern, the '[' gets sucked in as a sub=space match.  The
  temporary workaround is to always specify a subspace match first, but
  that's a kludge.  The right fix is to do a better microcompiler parser.
  (DONE!!!)

* let INPUT specify a file, rather than the kludge of "cat myfile.txt",
  which is also a kludge.  (DONE!!!)

* Add "union" and "intersection" operators (partially as a workaround of 
  the bug in the Gnu Regex library that causes patterns of the form
  ".*literal" to fail when line-spanning is turned on and the searched
  string is more than 20479 characters long (ask how I found that bug!)
  (DONE!!!)

* Change :_data_window_: to be :_dw:, newline to :_nl:, etc...  (DONE!!!)

* Make a real Bison-type parser (optional).  (SORTA DONE!)

8) come up with generalized flag finder for match, output, etc.
   (nonpositional parser?)  Issue: what about giving an error on a 
   nonrecognized flag?  The current nonpositional finder just ignores them.
   ( DONE!!! )

21) make INPUT have consistent behavior across both stdin and files (i.e.
    always read to EOF unless by-line is requested)  (DONE!!!)

22) clean up SYSCALL and OUTPUT so that () is variables and [] is outputs.
    (low priority)  (CANCELLED - think on this more)

24) have EXIT return an output value.(DONE!!!!)

25) extend the mail filter to pop open BASE64 stuff before doing a classify.
    ( DONE !!! )

30) Go to a full Bayesian match rather than a raw score match.  ( DONE !!! )

*) have the gz file unpack into a subdirectory "install", not in the
 current dir. (DONE!!!)

32) fix OUTPUT so the <append> flag does what + does now.  (DONE !!!)

33) add a steppable tracer/debugger.  Or at least the ability to
  single-step if desired.  (-t, -t nnn = run nnn statements, then go into
  single step mode.  In singlestep, p /whatever/ pretty-prints
  whatever, a :var: /whatever/ alters var, n single-steps the next
  statement, c continues execution at full speed, d / D toggle user
  and system debugging, any decimal number runs that many more
  statements, h gives you help.  ( DONE !!! )


35) 8-bit-safe the data paths  ( DONE EXCEPT FOR THE REGEX ENGINE )  (DONE!!!)

34) Decide to change #insert to just plain "insert" .  (***DONE***)

39) add "kalman-belly-gaze" mode to autotrain on some fraction of the 
processed but unverified text.  Definitely need to flag this forward
to the user so the user can detrain.  (***DONE***)

31) add ELSE and BBREAK  (else flips sense of skip when encountered, BBREAK
  skips immediately to closing Bracket without doing any ELSEstuff).  Maybe
  need to think about this some more?  (****DONE by ALIUS statement ****)

45) add a fault catcher along the lines of ALIUS for catchable errors
    (***DONE**** with FAULT and TRAP)

44) add options to cssutil to rescale overflowing bins.  (***DONE***)

47) add a third error class beyond FATAL - APOCALYPTIC errors, which are
so bad that the trap system can't gaurantee a clean recovery. (***DONE***)

41) add a third class/result from the classifier: "Neither", where the
number of hits in both categories is lower than would be statistically
expected (i.e. it's an email spam that's being intentionally evasive).
(*** SUBSUMED *** - see #48)

48) have a FAULT-NEITHER flag that compares the number of hits
expected in the two categories to the actual number encountered.  If
the actual is way below the expected, throw a fault.  (*** SUBSUMED *** -
the :stat: variable now has the actual feature counts, and so the user
can make their own decision)

49) have the preprocessor set the line number so that sometnhing more 
than "line 0" can be reported.  (if possible).   (***DONE***)

51) upgrade the .css files to have 32-bit ceilings and crosscut hashing to
minimize hash-clash errors. (***DONE***)

37) Mailfilter: Add transforms to insert fake phrases based on heuristic
(spamassassin like) features for things like lots of html, base64
armoring, so features can be seen in CLASSIFY. (*** DONE *** via rewrites.mfp)

50) fix TRAP so that it doesn't run the trap statement until
it's sure that it can catch the trap.  (in other words, don't run the statement
evaluator.  The reason for this is that nonfatals which would have been 
simply continued from the current statement now don't trap right.)
(*** DONE ***)

47) have BASE64 expansion and spammus interruptus decommenting happen
only if they're successful, otherwise they don't modify the matched text
variable m_text.  (***DONE***)

52) set up "extra stuff" so it's an attachment rather than just some text
stuck on the end.  Use cribbed formatting... (***DONE***)

53) let WINDOW work on any var, not just :_dw:.  (***DONE***)

43) make CLASSIFY run against multiple categories, not just go/nogo.
(separate categories with "|" and allow multiple .css files per category)
(***DONE***)

57) HASH should take :_dw: by default (***DONE***)

11) decide if locations still need to be :whatever: (***YES*** that way,
    new statements can be added without problems)  (***DONE***)

46) get EVAL working properly (evaluates both sides till no more :*:, then
does the assignment)  (***DONE***)

54) let WINDOW take input from anywhere, not just stdin (i.e. from a
file?  from a variable?)  (***DONE***)

58) WINDOW should take an input (:var:) as source and modify a [:var:].
See #54  (***DONE***) 

23) let INPUT keep around a file that's already been partially read. 
    (*** DONE *** - well, obviated, because you can now do random access
    reads and writes)

59) Put in "fseek" operators into INPUT and OUTPUT - but this needs
a design - how do you delineate the fseeks?  How do you have them 
compatible with substring operators in EVAL? (***DONE***)

61) Come up with useful semantics of CALL, ROUTINE, and RETURN.  Should
variables be mapped in, out, both?  What about overlaps?  (***DONE***)

56) create an "emacs mode" for CRM114 programs (***DONE***, thanks to
a contributor)

67) finish chapter in the book on classifiers (***DONE***)

69) refactor crm_expandvar.c into a single stream  (***DONE***)

70) incorporate ** or $ or % or somesuch double-dereference (i.e. this
var holds the name of the var that holds the actual data - needed for 
subroutines to have call-by-name work right)  (***DONE***  - the :+:
now does this)

73) Refactor memory mapping so that it's all in one place/cached/easy
to retarget to other OSes.  (***DONE***)





*********************************************************************
************* Stuff not done yet below this line ********************
*********************************************************************


14) write PERM (for now though, you can just exec cat >> them into a file, 
    then suck 'em back in again.. so this isn't really all that much more
    powerful.)    (Held in consideration...)

26) consider making the sizes of data windows and pattern buffers autosizing,
    so you can never "run out".

27) let INPUT use regexes to determine "end of input". 

28)  Add the proper hooks so that GDB et al can look at the executing process.

29)  add time flags to syscall for timeouts; should have both absolute
  time and delta time since last successful output; unless <async> is
  specified, looping continues until EOF occurs after waiting at least
  the timeout.  Suggested values: to01, to1, to10, to100 (.1, 1, 10,
  100 seconds total time), and dto01, dto1, dto10, dto100 (delta since
  last output of .1, 1, 10, 100 seconds.)  Or come up with a way to
  cleanly parameterize this.

36) 8-bit-safe the program variable names

38) Mailfilter: send bounce messages when mailfilter rejects something???
 
40)  use 1/tokencount and 1-(1/tokencount) for probabilities.
(or something from the Gary Robinson article)

  Witten-Bell = unseen words = P(wi|d) = ni+1 / N1 + seen_once

  LMMSE linear mean minimization estimator.

 (this is orthogonal to the switch to a Markov model)

42) make declensional parser aware of statement type, so it parses
(and errors) on what it expects, rather than the general case of everything.
(** half done - the flags do checking this way **)

55) create a setup "wizard" for creating mailfilterconfig.crm & .forward

60) Do "virtual variables" that are matched on-the-fly have any use?
Or are they merely providing functionality that already exists?  (i.e.
  DEFINE (:foo:) [:bar: 10 1000] /[[:alnum:]]+/

62) Add comment stripping to optimize execution speed.  Use "-o" to turn
this on. (trade off longer compile time for faster run time).

63) Add -vv to print out compile-time configuration information.

64) modify crm_setvar so it takes an optional VHT index, to save 
double lookups.

65) Abstract out mapping files & add cacheing.

66) add a marker that the flags have been parsed & didn't contain a
    variable & so reuse old flag values (JIT improvement).

68) rewrite Markovian & OSB classifiers into uniform framework

71) Refactor/rewrite mailfilter.crm

72) Make full CAMRAM in native mode work (window size limit?)

