		The CRM114 & Mailfilter HOWTO

		    -Bill Yerazunis, 2003-09-18
			(last update 2007-01-01)


This is the CRM114 Mailfilter HOWTO.  It describes how to set up CRM114 
and Mailfilter to filter your incoming mail, as of the version 
CRM114-20060209-ReaverSecondBreakfast.

This HOWTO doesn't describe _how_ CRM114, Mailfilter, Mailtrainer, or
Mailreaver work.  It will just set you up well enough that you can
start using CRM114 and Mailfilter to filter your mail.  It assumes you
are running on a Linux box; getting the system running on *BSD, MacOS,
or Windows will require considerably more work than we describe here
(and is a subject for future HOWTOs).

    ------------------------------------------------------

      Remember, the CRM114 package is released under the GPL (license
      is enclosed in any of the downloads).  There is NO WARRANTY
      WHATSOEVER for this software to be useful in any way; it's going
      to tamper with your incoming mail and you can easily imagine the
      dangers in that.
    
   ----------------------------------------------------------

That said, I hope CRM114, Mailfilter, and Mailreaver are useful to you;
they've been very useful to me.  CRM114 has been keeping my mailbox
clear of clutter since 2002; I'm convinced it has better performance
than I-the-human at killing spam without accidentally deleting
important mail.  I've tested myself, and I-the-human is only about
99.7% or 99.8% accurate at best; CRM114 is considerably more accurate
than that - easily two or three times more accurate.  (As of December
2003, it was 99.95% accurate (N+1 statistics) on my incoming mail
stream to a non-business account.)

Something to Remember:

    CRM114 is a *language* designed to write text filters and
    classifiers in.  It makes it easy to tweak code.

    Mailfilter is just _one_ of the possible filters; there are many
    more out there and if Mailfilter doesn't do what you want, it's
    easy to create one that does.  

    Mailreaver is another one of the filters, with different (and
    better, I hope) designs, that can use Mailtrainer (yet another
    filter) to build even better statistics files.

    There are yet other filters written in CRM114; you can read all
    about them on the web page:

	  crm114.sourceforge.net

    (and if you create one, and want to share it, put it on a web
    page and send me an email so I can add a pointer.)

	     - Bill Yerazunis (wsy@merl.com)

-------------------------------------------------------------------

	Step 0:  Scientes Inamicae  (Know Thy Enemy)

These are the major steps in using CRM114 Mailfilter.  The steps are
pretty simple:

      1) Downloading what you need  

	 (it's just 1 or 2 megabytes in a single .gz file)

      2) Setting up the executables

	 (not more than ten commands to type, even if you're building
	 from the fresh source)

      3) Configuring Mailfilter or Mailreaver

         (editing one file, most likely change is ONE line, and we tell
	 you which one)

      4) Setting up the needed auxiliary files 

	 (not more than 2 files to edit of no more than 5 lines each,
	 plus typing one or two commands)

      5) Engaging Mailfilter

	 (if you are using Procmail, this is cut-and-paste about ten lines,
	 otherwise it's create one file containing one line, and typing
	 up to three commands)

      6) Training CRM114 and Mailfilter

	 (whenever you get an error, you send it back to yourself,
	 using your current mail tool.  How hard can that be?  Now, you
	 can also use mailtrainer to bulk-train in whole directories
	 of your old spam and good email.)

      7) Adding Priority Lists, Whitelists, and Blacklists

         Mailfilter supports whitelists, blacklists, term rewriting, and
	 some other features.  You can use these for "guaranteed delivery"
	 from people you really trust - or really hate.

      8) Useful Utilities

         Details on the cssutil, cssdiff, and cssmerge utilities.  You
	 don't need to know this, but you may find it useful.



-------------------------------------------------------------------------

                  Step 1: Downloading.

Get yourself a copy of a CRM114 kit.  The kits can always be found by
visiting the CRM114 homepage at:

   http://crm114.sourceforge.net

You will need at least the statically-linked binary kit (compiled to
run on any i386 or better Linux box); for best performance it is
suggested you get the source kit and compile it on the processor you
will be running CRM114 on.  If you do not have root privs on the box
you will be running CRM114 on, it is suggested you stay with the
statically linked binaries (this is because the recommended "TRE" REGEX
library requires either root to install, or major workaround mojo).

The kits are named:

        crm114-<version>.i386.tar.gz   (statically linked binaries)
and
	crm114-<version>.src.tar.gz    (complete source code + tests)

These kit .gz files are fairly small; usually less than one megabyte
(currently around 800 Kbytes) so they will download quickly.

You will need to decide if you will be starting off with a pre-learned
set of .css files (.css means CRM114 Sparse Spectra) or if you will be
creating your own .css files from your own samples of spam and
nonspam.  You can think of a .css file as being a "cerebral memory" of
what a particular kind of mail (good or spam) looks like; .css files
are how CRM114 remembers what spam and good mail look like.  With empty
.css files a CRM114 system acts like a total amnesiac - it has absolutely
no conception of "good" or "bad".

In general, the pre-learned .css files will give you an initially more
accurate filter, but after some use and training the self-created
filter files will catch up with pre-learned files, and then the
two filters will achieve about equal accuracy.  However, there may
be some "glitches" in the mid-term while some edge cases in the
prelearned files are _unlearned_.

If you decide to start from pre-learned files anyway (we recommend
building your own; see Step 4), you will also need to download a
set of pretrained .css files like these:

        crm114-<version>.css.tar.gz

The .css files are rather large; this download may approach 50 megabytes.
(currently it's 8+ megabytes)

Download the kits you will need (at least one of .src.tar.gz or
.i386.tar.gz or .i386.rpm) and then proceed to "Step 2: Setting Up the
Executables"



--------------------------------------------------------------------------




                       Step 2: Setting Up the Executables

In this step, you will install four binaries into your system.  
The four binaries are:

    crm - the CRM114 "compute engine".  It's called "crm" because "crm114"
	  is too hard to type.
    cssutil - the .css file check/verify/edit program
    cssdiff - the .css file diff program
    cssmerge - the .css file merging program

One important point: do NOT install CRM114 or any of its utilities
setuid or sgid to root.  If you do, that's just an invitation for
someone to utterly hose your system without even trying.  We're not
talking an intentional attack, just an inadvertent command or script
gone weird could do it.

This is also why we recommend using a _static_ linking  of the
executable, so that a LD_LIBRARY_PATH attack can't falsely insert
a subversive version of a library.
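
Once the binaries are installed, a quick shell test along these lines
should tell you whether any of them ended up setuid or setgid.  This is
only a sketch: the helper name is_setid is made up for this example,
and any of the four programs not found on your PATH is simply skipped.

```shell
# is_setid is a hypothetical helper: true if FILE has the setuid or setgid bit.
is_setid() {
  [ -u "$1" ] || [ -g "$1" ]
}

for bin in crm cssutil cssdiff cssmerge; do
  # Skip any program that is not installed (or not on PATH) yet.
  bin_path=$(command -v "$bin" 2>/dev/null) || continue
  if is_setid "$bin_path"; then
    echo "WARNING: $bin_path is setuid/setgid; fix with: chmod ug-s $bin_path"
  else
    echo "ok: $bin_path"
  fi
done
```

If you see a WARNING line, clear the bits (as root) before going any
further.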

  -----

There are three ways you can set up these executables.  You can:

      a) install with a .rpm kit

      b) install with a .i386.tar.gz   (tarball of statically linked binaries)

      c) install with a .src.tar.gz    (tarball of complete source)

Note 1:
   If you do not have root on the machine you are installing on, you 
   may have some problems during the installation.  You may want to 
   consider using the statically linked binaries instead of compiling
   from sources.

  -----

  Step 2 Method A: Installing from .i386.tar.gz        

First, untar the binary release.  Type:
	
	   tar -zxvf crm114-<version>.i386.tar.gz

You should now become root.  If you do not have root on your machine,
you _can_ execute CRM114 programs directly from your home directory,
by changing your $PATH appropriately; see your shell man page for how
to do this for your particular shell (it varies with the shell, so 
I can't tell you here how to do it) and skip to the end of this step.
Or- you can run the binary explicitly from your current directory
by invoking it as ./crm114_tre.
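
For Bourne-style shells (sh, bash, and relatives) the $PATH change
usually looks something like the sketch below.  The directory name
$HOME/crm114-bin is purely an assumption for illustration; use
whatever directory you actually unpacked the binaries into.  Other
shells (csh, tcsh) use a different syntax, so check your man page.

```shell
# Hypothetical location of the unpacked CRM114 binaries.
CRM_BIN_DIR="$HOME/crm114-bin"

# Prepend the directory to PATH, but only if it is not already there.
case ":$PATH:" in
  *":$CRM_BIN_DIR:"*) ;;
  *) PATH="$CRM_BIN_DIR:$PATH"; export PATH ;;
esac

echo "PATH is now: $PATH"
```

To make the change stick across logins, you would put the same lines
into your shell's startup file.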

If you're installing, become root, then type:

	    cd crm114-<version>

	    make install_binary_only

This will install the pre-built binaries of CRM114 and the utilities
into /usr/bin.  This is the default install location for CRM114.  If
you want them installed in a different place, edit the Makefile and
change BINDIR (near the top of the Makefile) to a different
directory.

Note that if you type "make clean" you'll _delete_ your prebuilt
binaries, so don't do that!

Now, you can test your work.  Type

     crm -v

which will cause CRM114 to print out the version of itself you
just installed.  

You can also run a quick "Hello, world!" by typing:

  crm '-{ output /Hello, world!  This is CRM114 version :*:_crm_version: .\n/}'

then hit ^D (end-of-file on *nix).  You'll get back a response like:

  Hello, world!  This is CRM114 version 20040118-BlameEric .


Congratulations!  You've now completed the installation of CRM114 and
utilities from prebuilt binaries.  Proceed to "Step 3: Configuring
Mailfilter or Mailreaver".

  -----

  Step 2 Method B: Compiling from .src.tar.gz (source)

This method is the most complex.  Start by uncompressing and untarring the
big .src.tar.gz with the command:

	   tar -zxvf crm114-<version>.src.tar.gz

Now cd down into the crm114-<version> directory.  You will see many files
here.  

You now have a choice: you can build CRM114 with either the GNU regex
libraries (not recommended, as GNU regex can't handle embedded NULL
bytes and has other issues), or with the TRE regex library
(recommended; this is what you get with the precompiled binary kit).

By default, you will use the TRE regex library; however, this means
you have to build and install TRE.  You can either grab the most
recent version from the TRE homepage at http://laurikari.net/tre, OR
you can use the version that is pre-packaged with your CRM114
download.  (The pre-packaged version is tested against CRM114 and will
have all appropriate patches installed, while the fresh one may have
new features.  Take your choice- it's good stuff either way)

Fortunately, building and installing TRE is easy.  The TRE regex
library will peacefully coexist on the same system as the GNU regex
library.

   Caution: if you are building from sources, you should install
   the TRE regex library ***first***.  TRE is the recommended regex
   library for CRM114 (fewer bugs and more features than Gnu Regex).

To install TRE, become root, then type this ( BIG BIG WARNING - DO NOT
FORGET to tell configure to "--enable-static" ) :

	     cd crm114-<version>

	     cd tre-<tre_version_number>

	     ./configure --enable-static
	     
	     make
	     
	     make install

You have now installed the TRE regex library as /usr/local/lib/libtre .

If you make a mistake and need to rerun the make commands, be aware
that in some versions of TRE, a 'make clean' command will delete test
files that are needed when running the build process again.
Unfortunately, the safest course of action is to delete the TRE source
directory and restore it from the tar ball.

Depending on your choices in static versus dynamic linking, you _may_
need to also add /usr/local/lib to /etc/ld.so.conf, and then run
ldconfig as root.  Or not.  If, during the next steps, you get
annoying messages on the order of "can't find ltre" then this is
the thing to try.
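
If you want to see whether /usr/local/lib is already listed before you
edit anything, a check like the following sketch may help.  The helper
name has_lib_dir is made up for this example, and the
/etc/ld.so.conf.d/ directory is an assumption about your distribution
(many Linux systems have it; some do not).  It only looks for an exact
whole-line match, so a line with trailing blanks would be missed.

```shell
# has_lib_dir is a hypothetical helper: true if DIR appears as a whole line
# in any of the listed config files.  Unreadable or missing files are ignored.
has_lib_dir() {
  lib_dir=$1
  shift
  grep -qxF -- "$lib_dir" "$@" 2>/dev/null
}

if has_lib_dir /usr/local/lib /etc/ld.so.conf /etc/ld.so.conf.d/*.conf; then
  echo "/usr/local/lib is already listed"
else
  echo "not listed: add /usr/local/lib to /etc/ld.so.conf, then run ldconfig as root"
fi
```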

Once TRE is built and installed you can compile CRM114 and the
accompanying utilities (cssutil, cssdiff, and cssmerge).  By default,
CRM114 installs into /usr/bin (_not_ /usr/local/bin - if you want to
change this, change the definition of BINDIR near the top of the
file "Makefile").  Cssutil gives you some insight into the state of a
.css file, cssdiff lets you check the differences between two .css
files, and cssmerge lets you merge two css files.


Change directory back up to the CRM114 directory, then become root,
then (noting that no ./configure step is necessary; the CRM114
Makefile is self-contained and presupplied) type:

          cd ..

	  make

This will compile and link the CRM114 executable and the utilities.  
You can test this executable if you want.  Just type:

        make megatest

which will run for about a minute and exercise most of the code paths
inside CRM114.  This tests the version of CRM114 in your local
directory.  Note that this only works if you've installed the TRE
engine.  The GNU regex engine has enough "fascinating behaviors" that
it will get a lot of things wrong; the GNU regex package also doesn't
handle approximate regexes at all, and since those are in the test
set, you'll error out on each of those as well.

If "megatest" reports any differences between the supplied
"megatest_knowngood.log" and your own results, OTHER than on lines that
say "OK_IF_blahblahblah" results, please file a bug report to me and
we'll figure out what went wrong.

If you are happy with the executable, type

        make install

This will install the executable into /usr/bin/crm (by default).  If 
you want another install location, you can change it in the Makefile.
You can now check the installed version with:

	crm -v

and CRM114 will report back the version of the install.

You can also run a quick "Hello, world!" by typing:

  crm '-{ output /Hello, world!  This is CRM114 version :*:_crm_version: .\n/}'

then hit ^D (end-of-file on *nix).  You'll get back a response like:

  Hello, world!  This is CRM114 version 20040118-BlameEric .


Congratulations!  You've now completed the installation of CRM114 and
utilities from source.  Move on to the next step - "Step 3: Configuring
Mailfilter or Mailreaver".



------------------------------------------------------------------------
------------------------------------------------------------------------

	Step 3: Configuring Mailfilter or Mailreaver

In this step you will tell Mailfilter or MailReaver what you want it
to do with your mail.  All of the options are controlled by editing
one file, named "mailfilter.cf" .  Mailfilter and MailReaver use most
of the same flags (and all of the same important ones) so both use
the same mailfilter.cf file.

By default, both Mailfilter and Mailreaver look for the file
mailfilter.cf in the initial directory.  If you want to change that,
use "--fileprefix=/some/where/else/" on the command line, so these
filters will look for mailfilter.cf (and the other runtime filtering
files!)  in the "/some/where/else/" directory.  This --fileprefix mode
is handy when you are setting up many users.  (remember to use a final
closing slash on the directory name or you will end up nowhere)
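
Since the closing slash is so easy to forget, here is a small sketch
that adds it when it is missing before you build the command line.
The directory shown is just the placeholder from the text above, not a
real location.

```shell
# Hypothetical directory; substitute wherever your filter files really live.
prefix="/some/where/else"

# Append the closing slash only if it is missing.
case "$prefix" in
  */) ;;
  *)  prefix="$prefix/" ;;
esac

echo "--fileprefix=$prefix"
```

This prints "--fileprefix=/some/where/else/", which is the form the
filters expect.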

The format of mailfilter.cf itself is pretty simple.

0) blank lines are OK.
1) comments start with a # in column 1.  
2) Anything not a comment is a var setting, in the format:

   :var_to_set: /Value_to_set_goes_here/
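
To see just the settings in a file of this format (skipping comments
and blank lines), something like the sketch below should work.  The
sample file is written to a scratch directory so nothing of yours is
touched, and the two variable names in it are invented for
illustration; they are NOT real mailfilter.cf options.

```shell
# Build a throwaway sample file in the mailfilter.cf format.
cf_dir=$(mktemp -d)
cat > "$cf_dir/mailfilter.cf" <<'EOF'
# a comment line (the # is in column 1)

:example_flag: /yes/
:example_address: /someone@example.com/
EOF

# Drop comment lines and blank lines; what is left are the var settings.
settings=$(grep -v -e '^#' -e '^[[:space:]]*$' "$cf_dir/mailfilter.cf")
printf '%s\n' "$settings"
```

Point the grep at your own mailfilter.cf to get a compact list of what
is currently set.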

All of the user-settable configuration vars have setup lines in
mailfilter.cf, and you only need to change three lines for a "default"
average setup: one is a password you make up, and the other two have
only a few possibilities each, and we list those possibilities for
you.

       The Three Things you MUST do in mailfilter.cf :

1: First, you MUST change the secret password.  This is defined near
   the top of the file.  Your password may contain a-z, A-Z, 0-9, but
   no blanks or punctuation (at least for now).  You _must_ set this
   password to something not easily guessable.  If you don't set it,
   you won't be able to use mailfilter's remote commanding facility.

2: Second, you MUST set whether to use base64 decodes, or not, and if
   so, which decoder your system supports.  Just type the options into
   BASH, one at a time (like "mewdecode <cr>") and use the first one
   that doesn't give you an error message.

3: Third: you MUST set the cache_dupe_command according to whether
   your system supports linking (as in, has an "ln" command; *NIX
   does, but Windows doesn't) or whether full copies of texts need to
   be used in the reaver.

Other than that, everything else in the mailfilter.cf file can be left
alone, at least for initial testing.
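
If you would rather not try the decoders by hand, a loop like this
sketch reports the first candidate that exists on your PATH.  The
helper name find_first_command is made up for this example.  Of the
candidate names, only mewdecode comes from this HOWTO; mimencode and
base64 are assumptions, so replace the list with the decoders your own
mailfilter.cf actually offers.  Note that "exists on the PATH" is only
a first approximation of "runs without an error message".

```shell
# find_first_command is a hypothetical helper: print the first argument
# that names a program available on PATH; fail if none is.
find_first_command() {
  for candidate in "$@"; do
    if command -v "$candidate" >/dev/null 2>&1; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Candidate list is an assumption; edit it to match your mailfilter.cf.
decoder=$(find_first_command mewdecode mimencode base64) || decoder=""
if [ -n "$decoder" ]; then
  echo "first available decoder: $decoder"
else
  echo "none of the candidate decoders were found"
fi
```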

At first, you will probably want to leave the "log_to_allmail.txt"
enabled while you get used to CRM114.  Likewise, leave
"log_rejections" set to yes as well; that way you can easily see (with
"less" or "tail") just what is being rejected.  Once you get more
experience with CRM114, you can set these to "no" and not use up disk
space in these "extra safety" logs.

You can skim-read the rest of mailfilter.cf .  There are three
typical cases for most users:

1) If you ARE using Procmail or another filtering MDA:

  --> You probably will NOT need to change any of the other options.  

2) If you ARE NOT using Procmail, but your mail reading program can
   sort out email into folders based on whether the SUBJECT header
   contains the telltale string "ADV:" (most mail readers can do this):

  --> You probably will NOT need to change any of the other options.

3) You are NOT using Procmail, and your mail reading program is "dumb"
   (cannot sort email into folders based on subject line):

  --> You probably will want to define a separate account that will
   receive all spam caught (otherwise, you'll just get all your spam
   delivered as usual, with additional headers telling you it was
   spam).

   To do this, look down to ":general_fails_to:".  Insert the full
   username@domainname.tld mail address where you want your spam to be
   sent.
   

  Note on mime decoders: There are a number of them available; the defaults
  given in mailfilter.cf may or may not be valid on your system.  Further,
  it may have a different path than the default given in mailfilter.cf.
  Yet further, you may want to load your own, like "normalizemime" (see
  the crm114.sourceforge.net web page for details on the download).

You can also configure the verboseness (or not) of your filtered 
results.  You can go from "no changes" (not even a statistical label
in the headers) to complete results including an expansion of any
base64 texts and HTML decommented strings.  

Feel free to change things to get the look and feel you want; after
all, what good is open source if you don't change it?  :-)

HOWEVER, Please don't muck with variables that aren't in the 
mailfilter.cf file. "You make a mess, you clean it up."  :-(

After making these changes, write out "mailfilter.cf".  You may
later go back and change the configuration options, but the options
as already set are good for most users.  You do not need to do anything
to "load in" the new options, as CRM114 reads them in fresh from the 
file during initialization for each email. 

Now, proceed to "Step 4: Setting Up Other Needed Files" .


--------------------------------------------------------------------
--------------------------------------------------------------------


		Step 4: Setting Up Other Needed Files

Now that the crm114 language is working, you need to set up your
.css files,  your rewrites.mfp file, and your priolist.mfp file.

All of these files need to exist in the directory where CRM114 will
"run" when an actual mail comes in (either by actually being there,
or by being symlinked into it).  Usually this is your per-user
directory on the mail server (if your mail server is also the machine
holding your home directory, then it's your home directory).
If this is inconvenient,
you can use the --fileprefix option on the command line to
tell CRM114 to "change over" to a different directory.  The files
that need to be in the home (or --fileprefix) directory are:

    rewrites.mfp     
    spam.css
    nonspam.css
    priolist.mfp
    blacklist.mfp     [ only for mailfilter; mailreaver ignores it ]
    whitelist.mfp     [ only for mailfilter; mailreaver ignores it ]

Here's a quick overview of these files; we'll get into the details
further on.  If you are in a hurry, you can have *empty* files
for all four of the .mfp files and things will still work reasonably
well (and you can upgrade later).  You DO need to create the
proper-sized .css files, though, or you won't be able to classify
email at all (depending on your setup, it may be discarded,
may be returned to sender, or may actually just get mangled and
forwarded.  None of these are a good thing, in the long run)
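
The "in a hurry" setup can be sketched in shell as follows.  To keep
the sketch harmless it works in a scratch directory; in real use you
would set run_dir to your home (or --fileprefix) directory instead.
Existing .mfp files are left alone, and the .css files are only
checked, never created, because those have to come from cssutil or
mailtrainer as described further on.

```shell
# Stand-in for your home (or --fileprefix) directory.
run_dir=$(mktemp -d)

# Create each .mfp file as an empty file, unless it already exists.
for f in rewrites.mfp priolist.mfp blacklist.mfp whitelist.mfp; do
  [ -e "$run_dir/$f" ] || : > "$run_dir/$f"
done

# The .css files must exist and be non-empty; just report what is missing.
missing=""
for f in spam.css nonspam.css; do
  [ -s "$run_dir/$f" ] || missing="$missing $f"
done

if [ -n "$missing" ]; then
  echo "still needed in $run_dir:$missing"
fi
```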

 --- Summary of each file ---

   [[ rewrites.mfp ]]

The rewrites.mfp file controls how to "rewrite" incoming email so
that your incoming email conforms more closely to what might be
considered "archetypal".  The rewrites.mfp setup is optional;
if you build your own .css files (either from empty files, or 
from corpora) you can actually replace rewrites.mfp with an
empty file; you just won't be able to share your .css files with
anyone else.

   [[ spam.css and nonspam.css ]]

The .css files themselves ( CRM114 Sparse Spectra files) are the
"memory" that crm114 uses to statistically describe the words and
phrases that characterize various kinds of mail.  Although it depends
on the classifier you are using, by default the .css files are in a
hashed binary format and it is not easy (or sometimes, even possible!)
to reconstruct your email from the .css files.  However, it *is*
possible to determine from the .css files if certain words or phrases
have ever been trained into your classifier, so .css files do have
some possible security implications.

DMCA note: CRM114 statistics files are "effectively encrypted"
according to the provisions of the DMCA - all parties are hereby
notified that the copyright owner/author of any particular statistics
file (.css , .cfc, .cor, .cwc, .chp, or other) is the creator of that
file, not the author(s) of CRM114 itself, and said creator may invoke the
draconian punishments of the DMCA on any party attempting to extract
the encoded information without prior approval.  So there.

  [[ priolist.mfp ]]

The priolist.mfp file is a sequential list of tests to be run; each
test starts with a + or a - (thumbs up or thumbs down), then a regex
pattern; if the pattern matches, the mail is either accepted
unconditionally or sent to the spam bucket unconditionally.  The
blacklist.mfp and whitelist.mfp files are "match this, you're spam"
and "match this, you're good" regex pattern sets.  If this seems
redundant, you're right; all you need is priolist.mfp, but enough
folks have historically requested "blacklists" and "whitelists" as an
explicit marketing checkoff that we've put them into mailfilter.crm.
Priolist.mfp is the preferred method of doing blacklists and
whitelists now; if a P.H.B. asks "does it have blacklists and
whitelists", you can now say "yes, and they're even _prioritized_
blacklists and whitelists!".



       Step 4 Part 1 - Setting up the Rewrites file.

To set up the rewrites.mfp file, edit the file "rewrites.mfp" and
replace the placeholders (in this case, "wsy", "merl.com", and
"mail.merl.com") with your corresponding username, domain name, and
mail server information.  These rewrite rules will be used to "scrub"
your sample text of user-specific strings.  (note that this is only
strictly necessary if you want to use the pre-built .css files.
However, it is in general recommended, so that you can "share/merge"
your .css files with your friends.)

Note the "arrowheads" in the file.  They look like this:

     >->

or

     >-->

This is a rewrite operator.  Anything that matches the regex on the
left-hand side of the arrowhead will be replaced with the text on
the right-hand side of the arrowhead.  (the "arrowheads" that have 
one hyphen in them will rewrite only if the entire left-hand match
is found on a single line; if you use two hyphens, to make a ">-->"
instead of ">->" then the left-hand match can be multi-lined.)

Example: if your name was Agent Smith, your email account 
AgentSmith@the.matrix.net, and your mail router was mail.matrix.net at
IP address 192.168.10.5, then the rewrites.mfp file should look like:

  AgentSmith@the.matrix.net>->MyEmailAddress 
  [[:space:]]Agent Smith>-> MyEmailName
  mail.matrix.net>->MyLocalMailRouter
  192.168.10.5>->MyLocalMailRouterIP

The idea is to turn your email headers into headers that don't refer
to any of your own actual name, address, etc, but contain only the
strings "MyEmailAddress", "MyEmailName", "MyLocalMailRouter", and
"MyLocalMailRouterIP".

If you have more than one incoming email name, email address, server,
router, etc, add lines in rewrites.mfp for each email name, email
address, server, router, and so forth.  This is something you really
_should_ do if you have more than one email path leading to the
account that is being filtered by CRM114.  (If you don't, a lot of
learning will have to be repeated for each path, which will cost you
accuracy and use up valuable feature slots in the .css files that you
could use in more valuable ways otherwise.  On the other hand, if you
have multiple email addresses that all channel through one CRM114
fileset, and the addresses receive very different ratios of spam and
nonspam (or very different *types* of spam), then it _might_ be to
your advantage to not use rewrites.mfp (just replace it with an empty
file), so that the extra statistical information of the incoming email
address is not lost.)

If all this confuses you to no end, just make rewrites.mfp be an
empty file and everything should work decently well.
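
If you would like to see roughly what the Agent Smith rules above do
to a message before committing to them, the following sketch imitates
them with sed on two made-up header lines.  This is NOT how Mailfilter
applies rewrites.mfp; it is only an approximation for illustration.
In particular, the dots are escaped here so that sed matches them
literally, and whether Mailfilter replaces every occurrence the way
sed's "g" flag does is an assumption.

```shell
# Two invented header lines containing the user-specific strings.
sample='To: Agent Smith <AgentSmith@the.matrix.net>
Received: from mail.matrix.net (192.168.10.5)'

# One sed expression per rewrite rule, in the same order as the example file.
scrubbed=$(printf '%s\n' "$sample" | sed \
  -e 's/AgentSmith@the\.matrix\.net/MyEmailAddress/g' \
  -e 's/[[:space:]]Agent Smith/ MyEmailName/g' \
  -e 's/mail\.matrix\.net/MyLocalMailRouter/g' \
  -e 's/192\.168\.10\.5/MyLocalMailRouterIP/g')

printf '%s\n' "$scrubbed"
```

The two lines come out as "To: MyEmailName <MyEmailAddress>" and
"Received: from MyLocalMailRouter (MyLocalMailRouterIP)", with nothing
user-specific left in them.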

       -----

       Step 4 Part 2 - Setting up the .CSS files


You have a choice here.  You can either build your own files from your
own spam and nonspam email, or you can use the pre-learned .css files
available from crm114.sourceforge.net . We recommend that you build
your own files dynamically, as that will result in the best final
accuracy.

In either case your .css files should be in the same directory as your
mailfilter will "run" in (as we mentioned above, default is your home
directory on your mailserver).  The particular directory that the
mailfilter "runs" in is variable and depends on your local setup.
Assuming you will use the ".forward" hook, there are two likely
situations.

If your mail service runs on your local machine (say, you have just
one machine - and I do hope you have a firewall in that case), then
mailfilter will almost certainly "run" in your home directory- the 
directory you're in when you log in.

If your mail service runs on a mail server (not your local machine),
then you will probably have a "home directory" on that machine as well,
and that's the directory that the mail filter will run in.

If neither of these is the case, you should ask your system 
administrator what the correct directory is.

  -----

  Step 4 Part 2 Method A - Build Your Own Empty .CSS Files

This method will give you the best final accuracy, but you will
spend more time training.  

This is the recommended method for users wanting the best accuracy.

To start from scratch, you need to create empty .css files.  The
cssutil program will do that for you.  Just type:

   cssutil -b -r spam.css
   
   cssutil -b -r nonspam.css

and you will have created _empty_ spam.css and nonspam.css files in
your current directory (that is, the files are full-size, but contain
no information.  They'll be full of binary zeroes). 

Once you have these empty files you will have a high (50% or so)
error rate for the first few hours, till you have 'taught' CRM114
what your particular mix of spam and nonspam looks like.  Proceed 
below to "Step 4 Part 3 - Checking your installation".

Many people want to "preload" their spam collection into CRM114.  This
used to be a bad idea.  CRM114 is optimized for TOE learning - "Train
Only Errors" learning; testing something like a quarter of a million
test cases has proven that it is better to train only errors, and
_only_ _as_ _they_ _occur_, than to preload a bulk database into
CRM114.

Note that the previous paragraph says "used to be".  The new program
"mailtrainer.crm" can do rapid TOE or DSTTTR training and build your 
.css files out of stored spam and good mail collections.  You can read
all about mailtrainer.crm in Appendix 1 of this document.

If you're wondering, the statistics from the "torture test" (about
40,000 messages) are that training _only_ errors, in realtime, will
give about 2.1 times better accuracy than force-training a big corpus,
even if the messages are the same messages and presented in the same
order.  The "why" is mathematically complicated, but there's an
intuitive description in the FAQ.

Again: you will achieve the best possible accuracy if you let CRM114
itself make errors that you correct in real time.

  -----

  Step 4 Part 2 Method B - Pre-LEARNed files:

This is the simplest method, but less accurate than method A.

If you choose to use the pre-learned .css files, you need to download
the appropriate crm114 .css.tar.gz file, and then you can just type:

   tar -zxvf crm114-<version_number>.css.tar.gz

and you'll get the two files "spam.css" and "nonspam.css" in your
current directory.

Note that the download is fairly large - between 8 and 50 megabytes,
and although this will give you a good starting point for your own
statistics, you will have a better (smaller, faster) final
configuration if you build your own .css files from scratch.


  -----  

    Step 4 Part 2 Method C - BETA TEST - Using mailtrainer.crm to
    Build .CSS Files

New in 20060101 is the "mailtrainer.crm" program.  This program
accepts two directories of "archetype" good and spam email, and runs
an iterative training procedure to produce some very high quality
.css files from these examples.  The example files need to be 
"SMTP Virgin" - that is, exactly what was received at SMTP time 
by your mail server, with _nothing_ changed.  (Any changes 
will affect accuracy, probably negatively.)

The mailtrainer training will typically take something like 1 to 10
minutes per 1000 messages in your training set.  Mailtrainer.crm will
create your spam.css and nonspam.css files automatically.

Mailtrainer.crm will also read your mailfilter.cf configuration file,
and rewrites.mfp, so be sure to set up those files _first_ (if you're
doing things in order, you're in good shape).

The full description of how to use mailtrainer.crm is in Appendix 1 at
the end of this document.  So, jump there, read Appendix 1, run
mailtrainer.crm, and then proceed to the next section- checking your
.css files.

  -----

      Step 4 Part 2 Method D - ALPHA TEST -- MAKEFILE Build And
        Preload .CSS Files From Fresh Spam and Nonspam

 CAUTION - this applies ONLY to kits 20060606 and later!!!  DO NOT DO
 THIS if you are running a pre-20060606 makefile!  It will hose you!


If you, by any chance, happen to have un-altered examples of spam and
nonspam, you can use these to pre-build a set of .css files.  (As of
versions 20060606 and later ONLY.  Previous versions had a bad
implementation of this that took different arguments and tended to
produce bloated .css files that didn't function well.  Post 20060606,
the mailtrainer system is used and that works very well indeed)

You also need to be sure your emails are "SMTP Virgin" - that is, they
are exactly as received at SMTP time, not with headers or footers
added or taken out by your mail delivery agent or your mail reading
program.  (if this isn't true, the headers will be rather bogus and
you will lose significant accuracy and you should use method A above
instead).

If you are OK with this, here's what to do:

 1) Put copies of (or symlinks/hardlinks to) all of your example 
 spam into a subdirectory named spam.dir in the local directory.

 2) Put copies of (or symlinks/hardlinks to) all of your example
 good email into a subdirectory named good.dir in the local directory.

 3) IF you want to train from scratch (not necessarily good or bad,
 but your option... choose well):

	 rm -rf spam.css

	 rm -rf nonspam.css

 4) Invoke the mailtrainer

	 make cssfiles

 to build your new spam.css and nonspam.css files.

That's all.  It'll take a few minutes to run but mailtrainer will give
you running status so it's not like things have hung.

Again, let me emphasize that doing this is ONLY recommended on full
installs of 20060606 or later.  Versions prior to that will hose you
if you do this.
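
Before running "make cssfiles" it can be worth confirming that both
example directories exist and are not empty.  The following is only a
sketch; the helper names count_examples and check_training_dirs are
made up for illustration and are not part of the CRM114 kit.  It
assumes a find that supports -maxdepth (GNU and BSD find both do).

```shell
# Sketch only: count the regular files (and symlinks to files) in a directory.
# "count_examples" is a hypothetical helper, not a CRM114 tool.
count_examples () {
    # -L follows symlinks, so symlinked examples are counted too
    find -L "$1" -maxdepth 1 -type f | wc -l | tr -d ' '
}

# Refuse to go on unless spam.dir and good.dir each hold at least one file.
check_training_dirs () {
    for d in spam.dir good.dir; do
        if [ ! -d "$d" ] || [ "$(count_examples "$d")" -eq 0 ]; then
            echo "missing or empty: $d" >&2
            return 1
        fi
    done
    echo "spam: $(count_examples spam.dir)  good: $(count_examples good.dir)"
}
```

Run check_training_dirs from the directory that holds spam.dir and
good.dir; if it prints the two counts, go ahead with "make cssfiles".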

 --------

  Step 4 Part 3 - Checking your installation

Once you have set up mailfilter.cf, rewrites.mfp, the *list.mfp files,
and the .css files, you can test your configuration by typing the
following (The '^D' at the end is a control-D, which is an END-OF-FILE
on Linux.  Other systems may use a different END-OF-FILE character):

    ./mailfilter.crm 
    This is a test.  Just type a few lines of text
    that you might ordinarily get, like a short rant on why
    Perl is useless for big projects, or why Linux is
    superior or inferior to NetBSD.
    ^D

or (to use mailreaver instead)

    ./mailreaver.crm 
    This is a test.  Just type a few lines of text
    that you might ordinarily get, like a short rant on why
    Perl is useless for big projects, or why Linux is
    superior or inferior to NetBSD.
    ^D


If you have set up Mailfilter for Procmail-style filtering you will
always get a small report back saying something like either of these
(the actual numbers and some minor text strings will change, but you
should have something that _vaguely_ looks like the following):

  From foo@bar  Thu Sep 18 19:20:35 2003
  X-CRM114-Status: Good  ( pR: 12.630237  )

   ** ACCEPT: CRM114 PASS SBPH/BCR TEST** 
  Probabilistic match quality: 1.000000, pR: 12.630237 
  P(succ): 1.000000e-00, P(fail): 2.342950e-13 
  Features: 336, S hits : 4313, F hits : 5901 
 
or:

  From foo@bar  Thu Sep 18 19:19:39 2003
  X-CRM114-Status: SPAM  ( pR: -2.866484  )

   ** REJECT: CRM114 FAIL SBPH/BCR TEST** 
  Probabilistic match quality: 0.001358, pR: -2.866484 
  P(succ): 1.358082e-03, P(fail): 9.986419e-01 
  Features: 144, S hits : 2337, F hits : 3313 
 
If you are using "mail to spamtrap account" filtering, then you will
either get an "accept" report back (the first report above is an
"accept") or the text you typed in will be mailed to your spamtrap
address.  If you don't get a report back, check the spamtrap address
and see if your test text ended up there.

If all the numbers are zero, or the result is "UNSURE", that's OK,
it just means there isn't enough statistical information in the .css
files yet to actually decide if it's spam or not.  This is a good 
situation.

If you don't get _either_ of the above, something is broken, either in
your installation of CRM114 or in your configuration file.  You need
to fix the problem before you engage Mailfilter.
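
If you want to check the result from a script rather than by eye, the
verdict and the pR value can be pulled out of the X-CRM114-Status
header.  This is a minimal sketch that assumes the header looks like
the two samples above; crm_status is a hypothetical helper name, not
something shipped with CRM114.

```shell
# Sketch: print "<verdict> <pR>" from the first X-CRM114-Status header on stdin.
# Assumes the header format shown in the sample reports; prints nothing
# if no such header is present.
crm_status () {
    sed -n 's/^X-CRM114-Status: *\([A-Za-z]*\) *( *pR: *\(-\{0,1\}[0-9.]*\).*/\1 \2/p' | head -n 1
}
```

For example, piping the output of the test run above into crm_status
should give a line such as "Good 12.630237" or "SPAM -2.866484"; empty
output means no status header was found.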

If your installation and configuration passes the above test,
congratulations!  You have now configured mailfilter.crm .

  -----

  Step 4 Part 4 - OPTIONAL - CHECKING YOUR .CSS FILES

For all four methods of setting up your .css files, you can
check that the .css files are reasonable.  Use the "cssutil" utility.

Note: this works fine for the default classifiers like Markov, OSB,
and OSB Unique, but _not_ for Winnow, Hyperspace, or Correlative
classifiers; for OSBF classifiers use osbf-util instead of cssutil.

Type in:

    cssutil -b -r spam.css
    
    cssutil -b -r nonspam.css

You should get back a report something like this:

     Sparse spectra file spam.css statistics: 

     Total available buckets          :      1048576 
     Total buckets in use             :       506987  
     Total hashed datums in file      :      1605968
     Average datums per bucket        :         3.17
     Maximum length of overflow chain :           39  
     Average length of overflow chain :         1.84 
     Average packing density          :         0.48

Note that the packing density is 0.48; this means that this .css file
is about half full of features.  Once the packing density gets above
about 0.9, you will notice that CRM114 will take longer to process
text.  The penalty is small at packing densities below about 0.95
and only about a factor of 2 at 0.97.
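
The packing density is just the buckets in use divided by the total
buckets available.  As a rough sketch (packing_density is a made-up
helper name, and the field labels are assumed to match the sample
report above), it can be recomputed from a cssutil report like this:

```shell
# Sketch: read a cssutil report on stdin and print the packing density,
# recomputed as (buckets in use) / (total available buckets).
# Assumes the "label : value" layout of the sample report.
packing_density () {
    awk -F: '
        /Total available buckets/ { total = $2 + 0 }
        /Total buckets in use/    { used  = $2 + 0 }
        END { if (total > 0) printf "%.2f\n", used / total }
    '
}
```

Typical use would be "cssutil -b -r spam.css | packing_density"; for
the sample report, 506987 / 1048576 comes out as 0.48.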

Note - do NOT believe "ls -la" with respect to .css files!  Because
CRM114 uses memory mapping instead of file I/O (because it's much
faster to go through the page-fault tables than through the file I/O
system), the m-time (time last modified) and c-time (time of last
status change) never change; only the a-time (time last accessed)
does.  Even the a-time only changes if your file system had the proper
compile-time options to keep track of the a-time, and that defaults to
"not keep track".  Believe in what cssutil tells you- if new features
show up after learning (because the bucket counts change), you _are_
learning and "ls -la" is lying to you!  Conversely, if the bucket
counts do NOT change, you have a file redirection or file protection
problem and your system is NOT learning.  That's bad and you need to
figure out the problem and fix it.
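
One way to make that check routine is to save a cssutil report before
and after a training run and compare the bucket counts.  The sketch
below assumes the report layout shown earlier; buckets_in_use and
learned_something are hypothetical helper names.

```shell
# Sketch: print the "Total buckets in use" count from a cssutil report on stdin.
buckets_in_use () {
    awk -F: '/Total buckets in use/ { printf "%d\n", $2 + 0; exit }'
}

# Compare two saved reports ($1 = before, $2 = after).
# Succeeds only if both counts were found and they differ.
learned_something () {
    before=$(buckets_in_use < "$1")
    after=$(buckets_in_use < "$2")
    [ -n "$before" ] && [ -n "$after" ] && [ "$before" -ne "$after" ]
}
```

For instance: save "cssutil -b -r spam.css" to before.txt, train one
message, save the report again to after.txt, and then run
"learned_something before.txt after.txt && echo learning".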


You can also see how easy it will be for CRM114 to differentiate
spam from nonspam with your .css files.  The utility "cssdiff" will
compare the statistical features of two .css files (again, only
for Markov, OSB, and OSB Unique classifiers).  Try it:

    cssdiff spam.css nonspam.css

and you'll get back a report like:

   Sparse spectra file spam.css has 1048577 bins total
   Sparse spectra file nonspam.css has 1048577 bins total 

   File 1 total features            :      1605968
   File 2 total features            :      1045152

   Similarities between files       :       142039
   Differences between files        :      1279964

   File 1 dominates file 2          :      1463929
   File 2 dominates file 1          :       903113

Note that there's a big difference between the two files; in this case
there are about 9 times as many differences between the two files as
there are similarities.  That's pretty much typical- and it's a good sign
that your filtering should be quite accurate.
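
The ratio of differences to similarities can be worked out from the
report directly.  This is only a sketch: cssdiff_ratio is a made-up
name, and the labels are assumed to match the sample cssdiff output
above.

```shell
# Sketch: read a cssdiff report on stdin and print differences / similarities
# to one decimal place.  Prints nothing if the similarity count is missing.
cssdiff_ratio () {
    awk -F: '
        /Similarities between files/ { sim = $2 + 0 }
        /Differences between files/  { dif = $2 + 0 }
        END { if (sim > 0) printf "%.1f\n", dif / sim }
    '
}
```

With the sample numbers (1279964 differences, 142039 similarities),
"cssdiff spam.css nonspam.css | cssdiff_ratio" would print 9.0.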

Now, move on to "Step 5: Engaging Mailfilter".

----------------------------------------------------------------------------
----------------------------------------------------------------------------

	Step 5: Engaging Mailfilter

There are two common ways to engage Mailfilter.crm on your incoming
mail stream: you can use Procmail recipes and have Mailfilter run as a
procmail subprocess, or you can use the .forward hook of Sendmail (and
Sendmail clones which also support .forward).

In the first method (recommended), you use Procmail's ability to
execute a program as part of a Procmail recipe to run CRM114, which
adds headers as needed to let Procmail or your mail-reading program do
the sorting.
  
In the .forward method, you (or your system manager) must add a link
from an execution command of crm114 to the directory /etc/smrsh.  This
is because sendmail will NOT run any program that isn't "approved" by
the system manager (by linking it into /etc/smrsh/whatever).  The output
of mailfilter is then directly appended to your /var/spool/mail file
(or possibly forwarded to your spam-bucket account).

  -----

      Step 5 Method A: For Procmail and Maildrop Users

For Procmail users just add a procmail recipe to .procmailrc to run
CRM114 and mailfilter whenever your other procmail rules fail to
decide what to do.

Here's a sample Procmail recipe set.  Notice that we actually have TWO
recipes - one to actually run crm114 and mailfilter, the other to 
then sort the mail based on the result.

   #
   #

   :0fw: .msgid.lock
   | /usr/bin/crm -u /home/my_user_directory mailfilter.crm
  
   :0:
   * ^X-CRM114-Status: SPAM.*
   mail/crm-spam

That's all that Procmail users should need.  Mailfilter should now be
active - send yourself a test message and see where it ends up.

To use mailreaver instead of mailfilter, just put "mailreaver.crm"
in instead of "mailfilter.crm" .

If you get the test message, proceed to "Step 6: Training CRM114".

-----

( note: Sub-Method A-one)

If you use an MUA that can highlight on headers, you can use something
like this in your procmail (from Philipp Weiss):

in .procmailrc

   CRMSCORE=`$HOME/bin/crmstats.sh`
   :0fw: .formail.crm114.lock
     | formail -I "X-CRM114-Score: $CRMSCORE"

where ~/bin/crmstats.sh is a simple script:

   #!/bin/bash
   grep -a -v "^X-CRM114" | \
     /usr/bin/crm -u $HOME/.crm114 mailfilter.crm --stats_only


------

 (note: Sub-Method A-two)

If you're using maildrop ( http://www.courier-mta.org/maildrop.html ), you
can put this in your ~/.mailfilter (from Stefan Seyfried and Joost van Baal)

   CRMSCORE=`grep -a -v "^X-CRM114" | crm -u $HOME/.crm114/ /usr/share/crm114/mailfilter.crm --stats_only`

   xfilter "formail -I \"X-CRM114-Score: $CRMSCORE\""

   if ($CRMSCORE < -1)
   {
       xfilter "formail -I \"X-CRM114-Spam: yes\""
   }

   log "Spam: $CRMSCORE"

   if (/^X-CRM114-Spam: yes/)
   {
       to Mail/spam/inbox
   }

----------------------------------------------------------------------------
----------------------------------------------------------------------------


Advanced Topic: Huge Emails and Denial Of Service Avoidance

CRM114 has a number of built-in anti-Denial-of-Service (anti-DoS)
features; one of them is that it will not grow buffers beyond a
certain limit, No Matter What.  This default maximum is altered
with the -w parameter.

However, you may find that you actually receive emails bigger than
this limit.  In these cases, it is effective to simply filter on
the first few tens of kilobytes of incoming text; that will speed
things up a lot.

[[ Obsolescence note: CRM114 builds prior to about 20050601 need the
method described below.  After that, mailfilter has the built-in
option :decision_length: in mailfilter.cf which defaults to 16000 chars ]]

This is easy to do with "head".  head -c 10000 gives the first 10,000 
characters of input, which is usually adequate for CRM114 to get a 
good decision on.  This can be directly piped in right in the procmail
command:

   :0fw: .msgid.lock
   | head -c 10000 | /usr/bin/crm -u /home/my_user_directory mailfilter.crm
  
   :0:
   * ^X-CRM114-Status: SPAM.*
   mail/crm-spam



  -----

    Step 5 Method B: The .forward hook file

For .forward hook users you should be aware that you should NOT put a
direct link to crm in /etc/smrsh; since crm can do arbitrary things
(such as SYSCALL to invoke any command; it'd be like putting BASH
there), you ought to attempt to control the damage as much as possible.

 1) add a small wrapper script in /etc/smrsh that runs crm114's
   executable binary in /usr/bin, by becoming root and typing:

     cat > /etc/smrsh/crmfilter 
     /usr/bin/crm mailfilter.crm >> /var/spool/mail/your_account_name_here
     ^D
   
 2) add a .forward file to your account by typing:

     cat > .forward
     |/etc/smrsh/crmfilter
     ^D

That's all.  The mailfilter should now be active - send yourself a test
message and see where it ends up.

  ----

Once you have engaged CRM114 mailfilter, you now get to train it to 
recognize spam and nonspam.  Proceed to "Step 6: Training CRM114".

Note: CRM114 contains a design decision that you may have to play 
with.  Instead of doing memory management games, which both consume
significant runtime CPU and present a major denial-of-service
opportunity, CRM114 has an upper limit on the window size and it simply
won't exceed that limit (it gives an error message if an incoming
message tries to exceed the limit).

You -can- change the maximum memory limit at runtime with the -w nnnnn
flag; for example, if you want 100 megabytes of memory available, you can 
set that with

    ...  -w 100000000

to set 100,000,000 bytes as the hard limit ceiling on per-buffer memory 
usage.  Actual usage may be about five times that number, as CRM114 does
a buffer-shuffling dance to minimize time spent reclaiming and 
compactifying memory.

---------------------------------------------------------------------------


	Step 6: Training CRM114 and Mailfilter

One of the great strengths of CRM114 Mailfilter is that it has no
preconceived notions of "spam" and "nonspam".  It _learns_ what you
consider spam, and what you consider nonspam.

For the first few days CRM114 will make a lot of mistakes sorting
spam and nonspam.  It is _very_ important that you train each mistake
back into CRM114, otherwise it will never learn what you consider spam
or nonspam.

You should train in the mistake as quickly as possible.  Start one
morning and try to train every hour for the first few hours at least.
Don't think you're training a computer- pretend you're housebreaking a
new puppy.

You train mistakes right from your mail reader.  There are several ways
to do this.  Note that you can use mailfilter.crm _or_ mailreaver.crm
interchangeably here; the instructions say "mailfilter.crm" but
mailreaver.crm works exactly the same way from the user point of view.

   * Mail-to-Myself with In-Line Commands to retrain  (Method A)
   * shell commands to retrain  (Method B)
   * Mutt direct interface    (Method C)
   * Some Other Interface    (Method D)



Whatever Way You Train : try to train _approximately_ equal amounts of spam and
nonspam.  If you are within 50% one way or the other, performance will
be very good.

If you are running mailfilter.crm:

 Train only errors!  This is called TOE training.  (TOE :== Train Only
 Errors) It's not necessary to train near-misses; experiments show that
 the performance increase on training near misses is minuscule at best,
 and may be negative at times.

If you are running mailreaver.crm:

 Some messages may come through with a header that says "I am unsure
 about this message.  Please train it either way." - so do exactly
 that.  This is one reason mailreaver learns faster than mailfilter,
 and why it's also more accurate.

It's best for at least the first day or so that you check your mail at
least every hour or so and send training information back to CRM114.
This will help it rapidly converge on a good set of statistics for
your particular mix of spam and nonspam.

It will take several days worth of errors for CRM114's mailfilter to
approach 95% accuracy, and around two weeks to a month to reach 99+
per cent accuracy.  I usually exceed 99.9% accuracy (less than one
error per thousand).



     Step 6 Method A: Mail-to-Myself

The first way is to use the in-line command feature.  Just forward
the mistake back to yourself, with full headers (except edit out any
CRM114-added headers or text).

Just before the first line of the text to be "learned" as spam or
nonspam insert a COMMAND line.  Everything from the command
line to the end of the message will be learned (so edit the text
to remove things you _don't_ want considered indicative of spam/nonspam
nature).

The command line looks like this:

	command <yoursecretpassword> spam
or
	command <yoursecretpassword> nonspam  (for mailfilter.crm)
	command <yoursecretpassword> good  (for mailreaver.crm)

The "c" in "command" must be in column 1, and you must put your 
secret password into the command line.  Don't use the <> brackets,
use JUST your secret password.

Examples: If your secret password was "Ihatespam", then the command line
to learn something as spam would be:

	command Ihatespam spam

and the command to learn something as nonspam would be:

        command Ihatespam nonspam    (for mailfilter.crm users)
or
        command Ihatespam good       (for mailreaver.crm users)

 [[ Mailreaver users: if you have the cache enabled (which is the
 default) and the message you mail to yourself contains an intact SFID
 (Spam Filter ID), either in the Message-Id: field or in the
 X-CRM114-CacheID: field, then you don't need to worry about editing
 the text so that extra headers, footers, etc. are removed.  The
 cached version of the message is saved during the first time the
 message was seen by mailreaver, and so headers, footers, etc. that
 are added by your MDA or MUA or other stuff will NOT affect
 accuracy. ]]


If you are a mailreaver user, you also have a priority system you can
access, either by editing your priolist.mfp file directly or by
sending yourself email in the following forms (where mypwd is the
command password and a_regex_pattern is what will be used for priority
matching.  Priority matches can occur in both the headers and body of
the text.)

    command mypwd maxprio +a_regex_pattern      - sets a maximum priority GOOD
    command mypwd maxprio -a_regex_pattern      - sets a maximum priority SPAM
    command mypwd minprio +a_regex_pattern      - sets a minimum priority GOOD
    command mypwd minprio -a_regex_pattern      - sets a minimum priority SPAM
    command mypwd delprio a_regex_pattern       - deletes the first priority
                                                 list entry that fully matches
                                                 the regex pattern





    Step 6 Method B: Shell commands to retrain

   >> For mailfilter users (mailreaver is different - skip to below! <<

The second way to train in spam and nonspam is to use mailfilter.crm's
shell command line options.  When you find a spam that was mistakenly 
accepted as good mail, pipe it through mailfilter.crm with the 
"--learnspam" flag set, like this:

	 bash> mailfilter.crm --learnspam < the_spam.txt

Likewise, if you get an email that was falsely classified as a spam,
pipe it through mailfilter with the "--learnnonspam" flag set, like
this:

         bash> mailfilter.crm --learnnonspam < the_NON_spam.txt

(yes, if you have a scriptable mail reader, you can put these 
functions right on the menu bars somewhere.  Yes, that's a hint.  :) )

 [[ If you are using mailreaver.crm instead of mailfilter.crm, and
 caching is enabled, you don't even need to pipe the full text in;
 all that's needed is either the intact X-CRM114-CacheID: line or the
 Message-ID line containing an intact sfid.  That's another reason to
 switch to mailreaver! :) ]]

              >> For mailreaver.crm users <<

You're in luck, assuming you have taken the default and left caching
turned on.  All you need to pipe into mailreaver for training is any
text or text fragment containing an intact X-CRM114-CacheID: line or
the Message-ID line containing an intact sfid; mailreaver will go get
the exact incoming text of the message and train it, so you don't need
to worry about munged headers.

The command looks like this:

    crm mailreaver.crm [options] < some_text.txt

The command options you have available in mailreaver command line are:

   --spam              - train the incoming text as SPAM (if there's a
                         recognizable cacheid, use the cached msg). 

   --good              - train the incoming text as GOOD (if there's a
                         recognizable cacheid, use the cached msg). 

   --cache             - default is to train using the text stored
                         in the reavercache.  Use --cache=NO to *not*
                         use the cached version, if for some reason you
                         don't want to.

   --dontstore         - default is that every incoming message that isn't
                         a training message (that is, --spam or --good) is
                         put into the cache.  Use --dontstore to not put
                         into the cache (for example, "seekrit" users who
                         aren't allowed to train or who might get msgs
                         that you don't want archived).
  
   --stats_only        - Don't do a full report or forwarding, just 
                         report the pR value on stdout.  This is a value
                         between (roughly) -1000 and +1000 where negative
                         values indicate spammyness and positive values
                         indicate goodness.  For a simple test, just look
                         at the first nonblank character.  If it's a "-"
                         sign, the input was spam.  Because there's no
                         other output, --stats_only forces --dontstore.

   --outbound          - This message is "outbound" - that is, known to
                         be good.  If it would classify as spam, train
                         and cache it.  Otherwise, no action.

   --undo              - To the extent possible, undo a training with this
                         text (cached will be used if possible).  --undo
                         requires either --spam or --good as well.

   --fileprefix=dir    - Assume that the config file "mailfilter.cf" 
                         and the .css files are in directory "dir".
                         Remember to use a final closing slash on the
                         directory name, e.g. /my/home/dir1/ instead
                         of /my/home/dir1.  Otherwise, the filename will
                         be spliced together from the last component
                         of your --fileprefix and the nominal names,
                         and you almost certainly don't want that.

   --config=file       - Don't use mailfilter.cf as the configuration
                         file; instead use the file so noted.
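
The "look at the first nonblank character" test for --stats_only can
be wrapped in a few lines of shell.  This is a minimal sketch;
stats_verdict is a hypothetical helper name, and treating empty output
as "unknown" is an assumption made here, not documented behavior.

```shell
# Sketch: classify a --stats_only result by its first nonblank character.
# Reads the pR value on stdin; prints "spam", "good", or "unknown".
stats_verdict () {
    first=$(tr -d ' \t\n' | cut -c1)
    case "$first" in
        -)  echo spam ;;
        "") echo unknown; return 1 ;;   # no output at all from the filter
        *)  echo good ;;
    esac
}
```

It would be used along the lines of
"crm mailreaver.crm --stats_only < some_text.txt | stats_verdict".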




     Step 6 Method C: For Mutt Users

(Contributed by Mathieu Doidy and Joost van Baal:) 

In your ~/.muttrc, put:

    macro index \es "<pipe-entry>crmlearn --learnspam\n<save-entry>=spam/done\n" "crm114 learn as spam, save in spam/done"
    macro index \eh "<pipe-entry>crmlearn --learnnonspam\n" "crm114 learn as ham"

where crmlearn is this script

   grep -a -v "^X-CRM114" | \
    /usr/share/crm114/mailfilter.crm -u $HOME/.crm114/ $1 | \
   grep -a "^X-CRM114"

Now you have two new macros in the Mutt index menu:

   * esc-s will tag a message, falsely classified as ham, as spam;
   * esc-h will tag a message, falsely classified as spam, as ham.



    Step 6 Method D: Some Other Method


There are at least five other ways to retrain CRM114.  Some interface
with common mail readers, some are command line tricks.

Rather than catalog them here (which would quickly go out of date) you 
should go to the CRM114 web page (crm114.sourceforge.net) and browse
the list of applications under "Cool Stuff".  

Some of these are plugins, some are web-based MUAs, and some
are entirely new mail filters. 


      What To Do if CRM114 says "LEARNING UNNECESSARY..."
      ---------------------------------------------------

Occasionally, some CRM114 configurations may refuse to learn an
error, claiming that it "got it right the first time" (yes, this is a
subtle bug that is not allowing itself to be found, but there is
reason to believe it has to do with the interaction of mail clients
and headers and that some mail readers are lying to the user when they
claim they are forwarding with full headers).

While we applaud this self confidence, the error is still there, so
you need to "force" the learning.  You can do this either from BASH or
from the mail-to-yourself command line.
From BASH, add --force to the command line:

    bash> mailfilter.crm < the_error_text --learnspam --force

for mail-to-yourself, add "force" to the command line:

    command mysecretpassword spam force

(and similarly for nonspam).



         The training files "spamtext.txt" and "nonspamtext.txt"
	 ------------------------------------------------------

	 [[  Note: this section is becoming obsoleted by the
	 reavercache, which does more, better, and easier. ]]

Whenever CRM114 learns a new spam or nonspam, it not only modifies
the .css files, but it also keeps the source text of that learning
in the files "spamtext.txt" or "nonspamtext.txt".  

These two files can be considered the "source code" of your .css
files; they're all you really need to rebuild your .css files if/when
you upgrade CRM114 and the .css file is changed but the algorithm is
similar.  For example, upgrading from Markovian filtering (the
default) to Winnow or OSBF is "incompatible", and you might want to
start with these files as a kickstart.

... but not necessarily; some filtering is radically different than
Markovian; as we add new filters as technology moves forward, 
sometimes we will be able to kickstart, and sometimes we can't.

   - for upgrades that can use the current .css files, we will say so;

   - for upgrades that cannot use the current .css files, but *can*
     get kickstarted from spamtext.txt and nonspamtext.txt, we will
     say so;

   - for upgrades that are radically different enough that you must
     relearn from scratch, we will say so (and have you rename your
     old spamtext and nonspamtext files so that they will not be
     accidentally reused).

If your mail system is so short of disk that you cannot afford to keep
these (relatively) small files, then you may either delete them or
symlink these files to /dev/null; you don't absolutely *need* them.
These files are quite small though- I have been running CRM114 daily
for nearly five years now and my *total* example text sizes are 678
Kbytes for nonspam and 893 Kbytes for spam (out of about a gigabyte of
email).



-----------------------------------------------------------------------



     Step 7: Adding Priority Lists, Whitelists, and Blacklists

If you really want, you can add white, black, and priority lists
to CRM114.  Most people don't need them, but there are always
exceptions.

	[[ Note to mailreaver.crm users - mailreaver.crm uses ONLY
	the priolist.mfp, and does NOT support whitelist.mfp or
	blacklist.mfp.  This really is no loss of functionality,
	because anything you can do with a whitelist or blacklist,
	you can also do with a priolist, and more besides. ]]

For example, your lawyer, your boss, and your paramour all probably
rate being on your "whitelist", so whatever they send to you is always
marked "nonspam".  Likewise, your ex-girlfriend/boyfriend, your
nagging acquaintance, and the stalker from the library should all get
blacklisted.

Whitelisting, blacklisting, and prio-listing are all based on regex
matching.  If the regex you put in the file "whitelist.mfp" matches
the incoming mail _anywhere_, the mail will be marked "good" no matter
how it scores statistically.  Similarly, if the mail matches any regex
in "blacklist.mfp", the mail will be marked as "spam", no matter how
it compares statistically.


Note that sometimes this can cause considerable confusion, for example
"ac.com" in a whitelist will not just match "billing.ac.com", but also
"drac.complete.viagra.sales.com"  (the  match  being the  'ac.com'  in
"drac.complete").  To prevent this, use  ^ and $ to "anchor" the start
and end of the regex, if possible.
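
The difference is easy to see from the shell.  The sketch below uses
grep -E as a stand-in, which is an assumption: CRM114 uses its own
regex library and matches against the whole mail, so how ^ and $
behave there may differ from this one-line test.  The helper name
"matches" is made up, and escaping the dots is an extra tightening not
mentioned above.

```shell
# Sketch: does an extended regex ($1) match a string ($2) anywhere?
# grep -E is only a stand-in for CRM114's own regex engine.
matches () {
    printf '%s\n' "$2" | grep -E -q -e "$1"
}

# Unanchored pattern: hits the spammer's hostname by accident.
matches 'ac.com' 'drac.complete.viagra.sales.com' && echo "loose pattern: false hit"

# Anchored, dot-escaped pattern: only the intended host matches.
matches '^billing\.ac\.com$' 'drac.complete.viagra.sales.com' || echo "anchored pattern: no hit"
matches '^billing\.ac\.com$' 'billing.ac.com' && echo "anchored pattern: real hit"
```

Trying each candidate whitelist entry against a few known-bad strings
this way is a cheap test before putting it into whitelist.mfp.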

Lastly (well, actually firstly, because prio-listing happens before
whitelisting or blacklisting) any mail that matches any regex in
priolist.mfp is marked according to that entry.  The format of
priolist.mfp is that the first character on the line is a + or a -,
which indicates "whitelist" or "blacklist", and the rest of the line
is a regex.  These regexes are tested in the order given in the file.
An empty file is perfectly acceptable.
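
To get a feel for the first-match-wins ordering, here is a rough shell
imitation of how a priolist-style file is walked.  It is a sketch
only: prio_verdict is a hypothetical name, grep -E stands in for
CRM114's regex engine, and skipping blank or malformed lines is an
assumption rather than documented mailreaver behavior.

```shell
# Sketch: walk a priolist-style file in order and report the first entry
# whose regex matches the message.  Prints "good", "spam", or "none".
# $1 = priolist file, $2 = message file
prio_verdict () {
    while IFS= read -r line; do
        sign=${line%"${line#?}"}     # first character of the line
        regex=${line#?}              # everything after it
        case "$sign" in
            +) verdict=good ;;
            -) verdict=spam ;;
            *) continue ;;           # assumption: ignore anything else
        esac
        [ -n "$regex" ] || continue
        if grep -E -q -e "$regex" "$2"; then
            echo "$verdict"
            return 0
        fi
    done < "$1"
    echo none
}
```

Because the entries are tried top to bottom, a "+" line placed above a
"-" line wins when both match the same message.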

For examples of how to set up the whitelist, blacklist, and priolist
files, see the included "whitelist.mfp.example", "blacklist.mfp.example",
and "priolist.mfp.example".

Note: for my accuracy tests, I *turn off* whitelists, blacklists,
and prio-lists.  

Be sure to test any whitelist, blacklist, or other list that you 
add, otherwise you may get a rude surprise some day.


----------------------------------------------------------------


        Step 8: Useful Utilities

You don't _need_ to know the stuff in this section to set up and use
CRM114 and mailfilter or mailreaver, but it might be useful to you- or
at least satisfy some of your curiosity.

There are three utilities for dealing with the .css files (these
are the files that contain the "learned information").

The utilities are:

    cssutil - gives you a readout of the characteristics of the
	      information in a .css file  

    cssdiff - gives you a summary of the differences between two
	      .css files (handy for seeing learning!)

    cssmerge - merges two .css files into one; handy for importing
	      new data into a .css file.  Note that this is 
	      a destructive operation on the first .css file named!


                   The cssutil utility
                   -------------------


Usage is
    
    cssutil somefile.css

which will give you statistics on the file somefile.css.  You can 
then rescale, clip, and otherwise manage your .css files.  It is especially
useful to check the "Average Packing Density" of the .css files
you use; when it approaches .7 to .8, you may want to consider enlarging your 
.css file.  To do that, see below on "Enlarging a .css file".

Here's the -h help:

      Usage: cssutil [-b -r] [-s css-size] cssfile
                -h   - print this help
                -b   - brief; print only summary
                -r   - report then exit (no menu)
                -s css-size  - if no cssfile found, create new
                               cssfile with this many buckets.
                -S css-size  - same as -s, but round up to next
                               2^n + 1 boundary.




	   	    The cssdiff utility
                    -------------------

To get the difference between two .css files, use

    ./cssdiff somefile.css anotherfile.css

which writes out a summary of how different the two .css files are.



                    The cssmerge utility
                    --------------------

To merge two .css files, use cssmerge:

    ./cssmerge outfile.css infile.css

Note that this is _destructive_ to outfile.css, so make a copy
somewhere else first.  You _CAN_ merge two .css files of different
length.  You can also expand (or contract) a .css file this way:
rename the old file, and allow a new one to be created with learnspam
or learnnonspam while using the '-s nnnnnnnnn' s(lots) flag to set the
number of feature slots desired in the new file.  Then cssmerge your
old file into the fresh new file, and all is well.

Here's the cssmerge help:

   Usage: cssmerge <out-cssfile> <in-cssfile> [-v] [-s]
    <out-cssfile> will be created if it doesn't exist.
    <in-cssfile> must already exist.
     -v           -verbose reporting
     -s NNNN      -new file length, if needed
	

   


		Enlarging a .css file
                ---------------------

One of the advantages of CRM114 is that the .css files are relatively
small and of fixed size; they don't grow out of control and never need
trimming if you use <microgroom>, which is the default.

The disadvantage of this is that if your spam/nonspam discrimination
is too convoluted, it won't be able to sort them out ( in trek-speak
this is a high-order nonlinearity in the discrimination function ).
The fix in this situation is to increase the dimensionality of the
feature space.  The number of dimensions is about 1/12 the number of
bytes in the .css files; this works well at about a million dimensions
(12 megabytes) for most people.
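
The arithmetic behind that figure is simple enough to sketch.  The
helper name css_megabytes is made up, the 12 bytes per slot is taken
from the "about 1/12" rule of thumb above, and any file header
overhead is ignored.

```shell
# Sketch: approximate size of a .css file in megabytes (10^6 bytes),
# assuming roughly 12 bytes per feature slot.
# $1 = number of slots
css_megabytes () {
    awk -v slots="$1" 'BEGIN { printf "%.1f\n", slots * 12 / 1000000 }'
}
```

So "css_megabytes 1000001" gives 12.0, and doubling the slot count to
2000001 gives 24.0.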

But if you're not most people, you may need to (eventually) increase it.
You can tell when this is necessary- running
    
    cssutil

will give you a utilization and percentage of slots full; when that gets
up near 95 percent, you may be running low on space and old features
will be erased to make room for new features (that is, your feature set
will dynamically evolve in real time to find what works.)  However, that's
slow and may cause a slight loss of accuracy.

One way to fix this is to "increase the dimensionality of the
discrimination hyperspace" (no, I am not making that phrase up).  
It means to add new slots to the .css files.

The easiest way to do this is to 
    1) use cssutil to create a temporary, empty, larger .css file 
    2) merge the data from the old, small .css file onto the new big file.
    3) copy the new big file over the old, small file.

You can even combine steps 1 and 2, because newer versions of cssmerge
will create a new file if needed (the -s N flag sets the number of slots
in the new file; -S N does the same thing but rounds up to a 2^N+1
boundary, which is recommended).

For example, here's how to increase the size of the spam.css file
from 1,000,001 slots (the default) to 2,000,001 slots.  Just type:

   cssmerge temporary.css spam.css -s 2000001
   mv temporary.css spam.css

The newly replaced spam.css will have all of the features of the old
spam.css file, but will be 2000001 slots long instead of the default
1000001 slots.
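
If you'd rather not type that by hand, the two commands above can be
wrapped up with a backup copy.  This is only a sketch: enlarge_css is a
hypothetical name (nothing like it ships with CRM114), and it assumes
cssmerge is on your PATH and that no mail is being filtered while it
runs.

```shell
#!/bin/sh
# enlarge_css is a HYPOTHETICAL wrapper, not part of the CRM114 distribution.
# It runs the same two commands as above but keeps a backup of the original.
# Usage: enlarge_css spam.css 2000001
enlarge_css() {
    css="$1"
    slots="$2"
    tmp="$css.enlarge.$$"

    [ -f "$css" ] || { echo "no such file: $css" >&2; return 1; }

    cp -p "$css" "$css.bak" || return 1     # keep the old statistics around

    # Same argument order as shown above: output file, input file, -s slots.
    if cssmerge "$tmp" "$css" -s "$slots"; then
        mv "$tmp" "$css"
    else
        echo "cssmerge failed; $css left untouched" >&2
        rm -f "$tmp"
        return 1
    fi
}
```

Once you're happy with the enlarged file, delete the .bak copy.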

--------------------------------------------------------------------
--------------------------------------------------------------------

  		  APPENDIX 1
               Using mailtrainer.crm


New (as of 20060117) is the training program mailtrainer.crm .  This
program will take directories of spam and nonspam files, and iterate over
them to build (or improve) a set of .css files for you.


                 ***** WARNING WARNING WARNING *****
    Mailtrainer.crm (and the documentation for it) is BETA QUALITY.
    There are very likely some very amusing bugs.  Be warned !!!
    Archive your data and your .css files before using mailtrainer.crm.
    Really!  
                 ***** WARNING WARNING WARNING *****


Mailtrainer by default uses whatever settings are in your current
mailfilter.cf file, so you'll get .css files that are optimized for
your standard setup including mime decoding, normalization, classifier
flags, etc.

However, this means you *must* set up your mailfilter.cf file FIRST,
before you run mailtrainer.

Mailtrainer.crm uses DSTTTTR (Double Sided Thick Threshold Training with
Testing Refutation), which is something I didn't come up with (Fidelis
is on my list of suspects for this).  The good news is that this can
more than double the accuracy of OSB and similar classifiers.  

It is safe to run mailtrainer.crm repeatedly on a .css fileset and
training data; if the data doesn't need to be trained in, it won't be
(unlike the old "make cssfiles" command, which forces everything in
whether it is useful or not).  This is a big improvement and minimizes
.css file bloating.  "make cssfiles" has now been fixed to use
mailtrainer.crm.

The example files in each of the spam and good directories need to be
one example per file.  The closer these files are to what
mailfilter.crm will see in real life the better your training will be.
Preferably the headers and text will be complete, intact, and
unmutilated.  The closer these examples are to what SMTP will
show "on the wire" the better.

If you use a mail reader that puts your "good" and "spam" emails 
as separate files in two different directories (or can hack up a script
to do that) then you could even run mailtrainer.crm automatically 
every night to optimize the .css files to your current profile.  
If you do this, your script needs to guard against situations where
you haven't checked your mail in a few days and errors crept in; for
safety your script should add files to the training directories only
after you have hand-checked them (or at least tacitly agreed).  
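
One way such a guard might look is sketched below.  Everything here is
an assumption for illustration: the "pending" directories, the stamp
file that you touch by hand after checking your mail, and the helper
name promote_reviewed are all made up; none of it ships with CRM114.

```shell
#!/bin/sh
# promote_reviewed: HYPOTHETICAL helper.  Moves files from a pending
# directory into a training directory, but only those that are not newer
# than a "reviewed" stamp file which you touch by hand after checking mail.
# Usage: promote_reviewed PENDING_DIR TRAINING_DIR STAMP_FILE
promote_reviewed() {
    pending="$1"
    training="$2"
    stamp="$3"

    # No stamp file means nothing has been reviewed yet: promote nothing.
    [ -f "$stamp" ] || return 0

    # Anything that arrived after the last review stays in pending.
    find "$pending" -type f ! -newer "$stamp" -exec mv {} "$training/" \;
}

# A nightly job could then look like this (all paths are made-up examples):
#   promote_reviewed ~/mail/pending/good ~/mail/train/good ~/mail/.reviewed
#   promote_reviewed ~/mail/pending/spam ~/mail/train/spam ~/mail/.reviewed
#   crm mailtrainer.crm --good=$HOME/mail/train/good/ --spam=$HOME/mail/train/spam/
```

The point of the stamp file is that mail which showed up after your
last look never reaches the training directories on its own.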

If you find you've made a mistake, don't worry.  It's recoverable.
Just put the misplaced files into the correct directory and rerun 
mailtrainer.crm .  That will re-optimize the .css files (though some
low-value features may be swept away).  

Alternatively, if you start out keeping each and every file that
you've trained, you can just delete the erroneous spam.css and nonspam.css
files and re-run mailtrainer.crm to get correct .css files.

It's OK to have the spam and good directories just be full of links
(either symlinks or hardlinks) to the actual spam and good mail files
(that's what I do).
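
A minimal sketch of filling a training directory with symlinks follows.
The helper name link_examples and the directory layout are assumptions
for illustration; adapt them to wherever your mail reader keeps its
files.

```shell
#!/bin/sh
# link_examples: HYPOTHETICAL helper that fills a training directory with
# symlinks to mail files stored elsewhere, one link per message file.
# Usage: link_examples SOURCE_DIR TRAINING_DIR
link_examples() {
    # Resolve the source to an absolute path so the links don't dangle.
    src=$(cd "$1" && pwd) || return 1
    dest="$2"

    mkdir -p "$dest" || return 1
    for f in "$src"/*; do
        [ -f "$f" ] || continue              # skip subdirectories and the like
        ln -sf "$f" "$dest/$(basename "$f")"
    done
}

# Example (made-up paths):
#   link_examples ~/Mail/archive/good ~/mail/train/good
```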

NOTE: mailfilter.crm doesn't (yet) understand how to build and maintain
the spam and good email directories.  

NOTE 2: It is at this point unknown whether it's a good idea or a bad
idea to run mailtrainer on the probably good and probably bad emails
(which end up in the reaver cache as .../prob_good/whatever and
.../prob_spam/whatever), or just on those that are in the thick
threshold zone.  If anyone gets good data on this, please let me know.


   -----  Mailtrainer Options ---

The mailtrainer.crm options are as follows.  You *must* provide --spam
and --good; the other flags are optional.

Required:
        --spam=/directory/full/of/spam/files/one/per/file
        --good=/directory/full/of/good/files/one/per/file
                      These define the directories or files to be
                      learned.  If a name ends with a slash, it is
                      taken as a directory and all of the files within
                      are used; otherwise, it's taken as a file.  If the
                      filename contains a wildcard, be sure to enclose
                      it in singlequotes 'like.this' or else BASH will
                      do bad things to it.  Note that using a wildcard
                      is (currently) incompatible with the --random
                      shuffling of training order.

Optional:
	--help      - quick synopsis of mailtrainer options.

        --thick=N   - thickness for thick-threshold training- this
                       overrides the thickness in your mailfilter.cf file.
		       Omit it if you want to use the in-file value.
		       (10 works well for most classifiers; use 0.1 
		       or less for Hyperspace)

        --streak=N  - how many successive correct classifications
                       before we conclude we're done.  Default is 10000.
		       This number should be larger than the total number
		       of sample emails.

        --repeat=N  - how many passes should we go through this corpus
                       before we conclude we're done.  Default is 1

        --worst=N   - run the entire training set, then train
                       only the N worst offenders, and repeat.  This is
                       excruciatingly slow but produces very compact
                       .css files.  Use this only if your host machine
                       is dreadfully short of memory.  Default is NOT
                       to use worst-offender training.  N=5% of your total
                       corpus works pretty well, but N=1 will produce
		       the most compact .css files.

        --random    - randomize the training set rather than taking the 
                       files in sequential alternating order (one from good,
                       then one from spam).  Note that this is (currently)
                       incompatible with a wildcard for selection of good 
                       versus spam files.

        --reload    - if we run out of one kind of file (good or spam)
                      before the other, "reload" (start from the first
                      file again) in that category.  Default is to 
                      simply use only the remaining category for the
                      remainder of the training pass.

        --verbose   - Verbose.  Print out more stuff.

	--fileprefix=directory - use the mailfilter.cf, rewrites.mfp,
                       and .css files in 'directory', rather
                       than in the current directory.

        --goodcss=somecssfile.css  -  use this 'good' cssfile instead of 
                       the default "nonspam.css"

        --spamcss=somecssfile.css  - use this 'spam' cssfile instead of
                       the default "spam.css"

        --collapse  - collapse the flying output down to scroll less on a TTY.
    
        --report_header="some text" - put this at the head of the report

        --rfile="somefilename.txt" - append (not overwrite!) log to this file.

        --validate=regex_no_slashes  - Any file with a name that matches
                       the regex supplied is not used for training; instead
                       it's held back and used for validation of the
                       new .css files.  The result will give you an
                       idea of how well your .css files will work.
                       Do NOT put slashes around the regex!


Example 1: 

  - We want to create new .css files for our mail filter
  - We already have presorted directories of good and spam email
  - We have already set up mailfilter.cf and rewrites.mfp to define
    our preferred configuration,

Then we can use the following incantation to build some nice .css
files (not perfect, but not bad).  This incantation can all be on one
line (remove the '\' backslash characters if you put it on one line),
and don't forget the trailing slash for directory names; otherwise
mailtrainer will try to train the directory listing itself (and fail,
because a directory can't be read like a normal file).

Note that you *must* set up your mailfilter.cf and rewrites.mfp files
first, before doing this, otherwise you'll generate bad .css files,
or possibly get an error!

       crm mailtrainer.crm \
            --good=/your/good/files_dir/  \
            --spam=/your/spam/files_dir/  \
            --repeat=5  \
            --random


Example 2:

 - We want to run mailtrainer.crm against a bunch of examples in the 
   directory ../SA2/spam/ and ../SA2/good/.  (This happens to be where
   the TREC test set is on my computer- your location will be different)  
 - We want to quit when we get 4000 tests in a row correct, or if 
   we go through the entire corpus 5 times.  
 - We want to use DSTTTTR, with a training thickness of 5 pR
   units.  
 - We want to "validate" our training - that is, to hold back some fraction
   of the training set as test cases.  In our case here, we decide to hold
   back any file whose name contains "3." (what a shell glob would write
   as "*3.*").  These files will be saved up and used as a test corpus
   instead of for training.

Here's the command (this can all be on one line as well; if so,
remove the backslashes):

   crm mailtrainer.crm  \
              --spam=../SA2/spam/  \
              --good=../SA2/good/  \
              --repeat=5  \
              --streak=4000   \
              --validate=[3][.]  \
              --thick=5.0

This will take about eight minutes to run on the TREC 2005 SA corpus
of about 6000 messages; 1000 messages a minute is a good estimate
for 5 passes of DSTTTTR training.


Notes:

* If the .css statistics files don't exist, they will be created for 
you, in the format set up by the mailfilter.cf file.  So- be SURE to
set up mailfilter.cf first!

* If the first test file comes back with a pR of 0.0000 exactly, it is
assumed that the .css statistics files are empty, and ONE file will be
trained into each .css file that returns a 0.0000, simply to
get the system "off center" enough that normal training can occur.  If
there is anything already in the files, this won't happen.

* When running N-fold validation, if the filenames are named as in the
SA2 corpus in a form of 00123.456789 , there's an easy trick to
partition the data into 10 roughly equal sets.  Just use a validation
regex like [0][.] for the first run, [1][.] for the second run, [2][.]
for the third, and so on.  Notice that this is a CRM114-style regex, and
_not_ BASH-style file globbing as "*3.*" would be.  If you use an
unquoted glob like *3.* , then BASH will suck it in and expand it
in-line to all of the matching filenames and that won't work.  A
regex like [chars] normally passes unscathed, because BASH leaves such a
pattern alone unless a file in the current directory happens to match
it; putting it in singlequotes is safer still.
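
Before a long training run, it can help to preview which files a given
validation regex would hold back.  The sketch below uses grep -E as a
stand-in for CRM114's own regex matcher, which is an assumption that
should be fine for simple bracket expressions like these.  It also
assumes the match is against the bare file name; preview_validation_set
is a made-up helper name.

```shell
#!/bin/sh
# preview_validation_set: HYPOTHETICAL helper.  Lists the files in a
# directory whose names match a regex, using grep -E as a stand-in for
# CRM114's matcher (fine for simple patterns such as [3][.]).
# Usage: preview_validation_set DIRECTORY 'REGEX'
preview_validation_set() {
    dir="$1"
    regex="$2"
    ls "$dir" | grep -E -- "$regex"
}

# Example: which spam files would --validate=[3][.] hold back?
#   preview_validation_set ../SA2/spam/ '[3][.]'
# Count them instead of listing them:
#   preview_validation_set ../SA2/spam/ '[3][.]' | wc -l
```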

* If you want to run N-fold validation, you must remember to delete
and rebuild a fresh set of .css files after each run, otherwise you
will not get valid results.
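
The ten runs can be scripted along the following lines.  This is a
sketch, not a tested tool: run_folds and fold_regex are made-up names,
and it assumes you run it in a scratch directory holding its own
mailfilter.cf, rewrites.mfp and mailtrainer.crm, because it deletes
spam.css and nonspam.css (the default names) before every fold.  Set
DRY_RUN=1 to print the commands instead of running them.

```shell
#!/bin/sh
# fold_regex: validation regex for fold N (0-9), e.g. 3 -> [3][.]
fold_regex() {
    printf '[%s][.]\n' "$1"
}

# run_folds: HYPOTHETICAL driver for 10-fold validation.
# Usage: run_folds /your/good/files_dir/ /your/spam/files_dir/
# WARNING: deletes spam.css and nonspam.css in the current directory.
run_folds() {
    good="$1"
    spam="$2"
    fold=0
    while [ "$fold" -le 9 ]; do
        regex=$(fold_regex "$fold")
        if [ -n "${DRY_RUN:-}" ]; then
            echo "rm -f spam.css nonspam.css"
            echo "crm mailtrainer.crm --good=$good --spam=$spam --validate=$regex --rfile=fold_${fold}.txt"
        else
            rm -f spam.css nonspam.css      # fresh statistics for each fold
            crm mailtrainer.crm --good="$good" --spam="$spam" \
                --validate="$regex" --rfile="fold_${fold}.txt"
        fi
        fold=$((fold + 1))
    done
}
```

Each fold appends its report to its own fold_N.txt file via --rfile, so
you can compare the ten results afterward.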

* N-fold validation does NOT run training at all on the validation set,
so if you decide you like the results, you can do still better by
running mailtrainer.crm once again, but DO NOT specify --validate.
That will train in the validation test set as well, and hopefully
improve your accuracy still more.



---------------------------------------------------------------------

That's all!  If you have errors or updates (or find bugs!) please
let me know; the best way is to join the CRM114-general mailing list; it's
on the webpage:
   
   http://crm114.sourceforge.net  

and ask there.  The reason for using the mailing list rather than
personal email is that personal email isn't archived, but the mailing
list _is_ both archived and read widely, so we not only create a
background archive of solutions but you will get a better answer back
faster than if you sent the email to me alone.

Enjoy, and good luck.

       -Bill Yerazunis
   
