		The CRM114 & Mailfilter HOWTO

		    -Bill Yerazunis , 2003-09-18
			(last update 2005-03-27)


This is the CRM114 Mailfilter HOWTO.  It describes how to set up CRM114 
and Mailfilter to filter your incoming mail, as of the version 
CRM114-20040328-BlameStPatrick.

This HOWTO doesn't describe _how_ CRM114 or Mailfilter works.  It just
sets you up well enough that you can start using CRM114 and Mailfilter
to filter your mail.  It assumes you are running on a Linux box;
getting the system running on *BSD, MacOS, or Windows will require
considerably more work than we describe here (and is a subject for
future HOWTOs).

Remember, CRM114 and Mailfilter are released under the GPL (license is
enclosed in any of the downloads).  There is NO WARRANTY WHATSOEVER
for this software to be useful in any way; it's going to tamper with
your incoming mail and you can easily imagine the dangers in that.

That said, I hope CRM114 and Mailfilter are useful to you; they've been
very useful to me.  They've been keeping my mailbox clear of clutter
since 2002, and I'm convinced the filter has better performance than
I-the-human at killing spam without accidentally deleting important
mail.  I've tested myself, and I-the-human is only about 99.7% or 99.8%
accurate at best; CRM114 is considerably more accurate than that -
easily two or three times more accurate.  (As of December 2003, it was
99.95% accurate (N+1 statistics) on my incoming mail stream to a
non-business account.)

-------------------------------------------------------------------

	Step 0:  Scientes Inamicae  (Know Thy Enemy)

These are the major steps in using CRM114 Mailfilter.  The steps are
pretty simple:

      1) Downloading what you need  

	 (it's just 1 or 2 .gz files)

      2) Setting up the executables

	 (not more than ten commands to type, even if you're building
	 from the fresh source)

      3) Setting Up your .css files 

	 (not more than 2 files to edit of no more than 5 lines each,
	 plus typing one or two commands)

      4) Configuring Mailfilter

         (editing one file, most likely change is ONE line, and we tell
	 you which one)

      5) Engaging Mailfilter

	 (if you are using Procmail, this is cut-and-paste about ten lines,
	 otherwise it's create one file containing one line, and typing
	 up to three commands)

      6) Training CRM114 and Mailfilter

	 (whenever you get an error, you send it back to yourself,
	 using your current mail tool.  How hard can that be?)

      7) Adding Priority Lists, Whitelists, and Blacklists

         CRM114 supports whitelists, blacklists, term rewriting, and
	 some other features.  You can use these for "guaranteed delivery"
	 from people you really trust - or really hate.

      8) Useful Utilities

         Details on the cssutil, cssdiff, and cssmerge utilities.  You
	 don't need to know this, but you may find it useful.



-------------------------------------------------------------------------

                  Step 1: Downloading.

Get yourself a copy of a CRM114 kit.  The kits can always be found by visiting
the CRM114 homepage at:

   http://crm114.sourceforge.net

You will need at least the statically-linked binary kit (compiled to
run on any i386 or better Linux box); for best performance it is
suggested you get the source kit and compile it on the processor you
will be running CRM114 on.  If you do not have root privs on the box
you will be running CRM114 on, it is suggested you stay with the
statically linked binaries (this is because the recommended "TRE" REGEX
library requires either root to install, or major workaround mojo).

The kits are named:

        crm114-<version>.i386.tar.gz   (statically linked binaries)
and
	crm114-<version>.src.tar.gz    (complete source code + tests)
and     
	crm114-<version>.i386.rpm      (statically linked .rpm package)

These kit .gz files are fairly small; usually less than one megabyte
(currently around 800 Kbytes) so they will download quickly.

You will need to decide if you will be starting off with a pre-learned
set of .css files (.css means CRM114 Sparse Spectra) or if you will be
creating your own .css files from your own samples of spam and
nonspam.

In general, the pre-learned .css files will give you an initially more
accurate filter, but after some use and training the self-created
filter files will catch up with pre-learned files, and then the
self-created filter files will become _more_ accurate in the
long-term.

If you decide that you want to start with the pre-learned .css files,
you will also need to download:

        crm114-<version>.css.tar.gz

The .css files are rather large; this download may approach 50 megabytes.
(currently it's 8+ megabytes)

Download the kits you will need (at least one of .src.tar.gz or
.i386.tar.gz or .i386.rpm) and then proceed to "Step 2: Setting Up the
Executables"
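
As a small illustration of the naming scheme above, here is a shell
sketch that prints the kit file names for a given version string.  The
helper name kit_names is made up for this example, and the version
string is just the one this HOWTO was written against; check the
download page for the names actually on offer.

```shell
# Print the three kit file names for one CRM114 version string.
# kit_names is a hypothetical helper; it only formats names and
# downloads nothing.
kit_names() {
    version=$1
    for suffix in i386.tar.gz src.tar.gz i386.rpm; do
        printf 'crm114-%s.%s\n' "$version" "$suffix"
    done
}

kit_names 20040328-BlameStPatrick
```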

--------------------------------------------------------------------------


                       Step 2: Setting Up the Executables

In this step, you will install four binaries into your system.  
The four binaries are:

    crm - the CRM114 "compute engine".  It's called "crm" because "crm114"
	  is too hard to type.
    cssutil - the .css file check/verify/edit program
    cssdiff - the .css file diff program
    cssmerge - the .css file merging program

One important point: do NOT install CRM114 or any of its utilities
setuid or sgid to root.  If you do, that's just an invitation for
someone to utterly hose your system without even trying.  We're not
talking an intentional attack; just an inadvertent command or script
gone weird could do it.

  -----

There are three ways you can set up these executables.  You can:

      a) install with a .rpm kit

      b) install with a .i386.tar.gz   (tarball of statically linked binaries)

      c) install with a .src.tar.gz    (tarball of complete source)

Note 1:
   If you do not have root on the machine you are installing on, you 
   may have some problems during the installation.  You may want to 
   consider using the statically linked binaries instead of compiling
   from sources.

  -----

  Method A: Installing from .rpm

    (note- we don't have a good RPM for the current rev, so this section is
    not really accurate) 

Become root, then type:

       rpm -ivh crm114-<version>.rpm 

and it'll all happen automagically.  

Now, you can test the install.  A quick test is to type:

     crm -v

which should report back the version of CRM114 you have just installed.

You can also run a quick "Hello, world!" by typing:

  crm '-{ output /Hello, world!  This is CRM114 version :*:_crm_version: .\n/}'

then hit ^D (end-of-file on *nix).  You'll get back a response like:

  Hello, world!  This is CRM114 version 20040118-BlameEric .

If this works, you can proceed on to the next step - "Step 3: Setting
Up Your .CSS Files"

  -----

  Method B: Installing from .i386.tar.gz        

This method takes a few more commands to perform.  First, untar the
binary release.  Type:
	
	   tar -zxvf crm114-<version>.i386.tar.gz

You should now become root.  If you do not have root on your machine,
you _can_ execute CRM114 programs directly from your home directory,
by changing your $PATH appropriately; see your shell man page for how
to do this for your particular shell (it varies with the shell, so 
I can't tell you here how to do it) and skip to the end of this step.
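
For what it's worth, here is a minimal sketch of the $PATH change for
Bourne-style shells (sh, bash, ksh, zsh); csh-family shells use a
different syntax.  The helper name prepend_path and the directory
$HOME/crm114-bin are assumptions for illustration only; use whatever
directory actually holds your unpacked binaries.

```shell
# Put a directory at the front of PATH unless it is already listed.
# prepend_path is a hypothetical helper name.
prepend_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present, nothing to do
        *) PATH="$1:$PATH" ;;
    esac
    export PATH
}

# Example: binaries unpacked under your home directory (assumed location).
prepend_path "$HOME/crm114-bin"
```

To make the change stick across logins, the same lines would normally
go into your shell's startup file.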

Once you're root, type:

	    cd crm114-<version>

	    make install_binary_only

This will install the pre-built binaries of CRM114 and the utilities
into /usr/bin.  This is the default install location for CRM114.  If
you want them installed in a different place, edit the Makefile and
change INSTALL_DIR (near the top of the Makefile) to a different
directory.

Note that if you type "make clean" you'll _delete_ your prebuilt
binaries, so don't do that!

Now, you can test your work.  Type

     crm -v

which will cause CRM114 to print out the version of itself you
just installed.  

You can also run a quick "Hello, world!" by typing:

  crm '-{ output /Hello, world!  This is CRM114 version :*:_crm_version: .\n/}'

then hit ^D (end-of-file on *nix).  You'll get back a response like:

  Hello, world!  This is CRM114 version 20040118-BlameEric .



Congratulations!  You've now completed the installation of CRM114 and
utilities from prebuilt binaries.  Proceed to "Step 3: Setting Up Your
.CSS files". 

  -----

  Method C: Compiling from .src.tar.gz (source)

This method is the most complex.  Start by uncompressing and untarring the
big .src.tar.gz with the command:

	   tar -zxvf crm114-<version>.src.tar.gz

Now cd down into the crm114-<version> directory.  You will see many files
here.  

You now have a choice: you can build CRM114 with either the GNU regex
libraries (not recommended, as GNU regex can't handle embedded NULL
bytes and has other issues), or with the TRE regex library
(recommended; this is what you get with the precompiled binary kit).

By default, you will use the TRE regex library; however, this means
you have to build and install TRE.  You can either grab the most
recent version from the TRE homepage at http://laurikari.net/tre, OR
you can use the version that is pre-packaged with your CRM114 download.
(The pre-packaged version is tested against CRM114- the fresh one
may have new features.  Take your choice- it's good stuff either
way)

Fortunately, building and installing TRE is easy.  The TRE regex
library can peacefully coexist on the same system as the GNU regex
library.

To install TRE, become root, then type this (don't forget to tell
configure to "--enable-static" ) :

	     cd crm114-<version>

	     cd tre-<tre_version_number>

	     ./configure --enable-static
	     
	     make
	     
	     make install

You have now installed the TRE regex library as /usr/local/lib/libtre .

Depending on your choices in static versus dynamic linking, you _may_
need to also add /usr/local/lib to /etc/ld.so.conf, and then run
ldconfig as root.  Or not.  If, during the next steps, you get
annoying messages on the order of "can't find ltre" then this is
the thing to try.
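
Before editing anything as root, you can check whether the directory
is already listed.  The following is only a sketch: the helper name
conf_lists_dir is invented here, and it looks only for an exact
whole-line entry, so a directory pulled in through an "include" line
in ld.so.conf will not be noticed.

```shell
# Report whether a library directory is listed in an ld.so.conf-style
# file.  conf_lists_dir is a hypothetical helper; $1 is the config file
# and $2 is the directory.  Only exact, whole-line entries are matched.
conf_lists_dir() {
    [ -r "$1" ] && grep -qxF -- "$2" "$1"
}

if conf_lists_dir /etc/ld.so.conf /usr/local/lib; then
    echo "/usr/local/lib is already listed"
else
    echo "/usr/local/lib is not listed directly in /etc/ld.so.conf"
fi
```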

Once TRE is built and installed you can compile CRM114 and the
accompanying utilities (cssutil, cssdiff, and cssmerge).  By default,
CRM114 installs into /usr/bin (_not_ /usr/local/bin -  if you want to
change this, change the definition of INSTALL_DIR near the top of the
file "Makefile").

Change directory back up to the CRM114 directory, then become root,
then (noting that no ./configure step is necessary) type:

          cd ..

	  make clean

	  make install

This will compile, link, install, and strip the executables (stripping
gets rid of unnecessary debugging information and makes the executables
load faster and use less memory).

You can test your installation of CRM114.  Just type:

	crm -v

and CRM114 will report back the version of the install.

You can also run a quick "Hello, world!" by typing:

  crm '-{ output /Hello, world!  This is CRM114 version :*:_crm_version: .\n/}'

then hit ^D (end-of-file on *nix).  You'll get back a response like:

  Hello, world!  This is CRM114 version 20040118-BlameEric .


Congratulations!  You've now completed the installation of CRM114 and
utilities from source.  Move on to the next step - "Step 3: Setting Up 
Your .CSS Files" .
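
Whichever install method you used, a quick way to confirm that all
four programs ended up somewhere on your $PATH is a loop like the one
below.  This is a sketch, not part of the CRM114 kit; the helper name
missing_binaries is made up for the example.

```shell
# List which of the four CRM114 programs cannot be found on PATH.
# missing_binaries is a hypothetical helper; it prints one name per
# missing program and prints nothing when all four are found.
missing_binaries() {
    for prog in crm cssutil cssdiff cssmerge; do
        command -v "$prog" >/dev/null 2>&1 || echo "$prog"
    done
}

missing=$(missing_binaries)
if [ -z "$missing" ]; then
    echo "all four CRM114 programs are on PATH"
else
    echo "not found on PATH:" $missing
fi
```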

  -----

If you _really_ want to test your installation, you can run it
against "megatest.sh", which attempts to test every code path in
the system (well, all of the non-error paths at least).  Coverage
is incomplete, but at least it's a strong confidence indicator.

Note that this only works if you've installed the TRE engine.  The GNU
regex engine has enough "fascinating behaviors" that it will get 
a lot of things wrong; the GNU regex package also doesn't handle 
approximate regexes at all, and since those are in the test set,
you'll error out on each of those as well.

The easy way to run megatest is:

    make megatest

which will report back any differences between what your local install
of CRM114 did and what the "known correct" results are.

If there are any differences between the supplied "megatest.log" and
your own results, OTHER than process IDs in the "MINION PROC" results,
please file a bug report to me and we'll figure out what went wrong.
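
If you want to compare two logs by hand, a sketch along these lines
filters out the "MINION PROC" lines before diffing.  The helper name
compare_megatest_logs is invented for this example, and the name of
the file holding your own results is up to you; only "megatest.log"
comes from the kit.

```shell
# Compare two megatest logs while ignoring lines that mention
# "MINION PROC", since those carry process IDs that differ on every run.
# compare_megatest_logs is a hypothetical helper; $1 is the supplied
# reference log and $2 is a log captured from your own run.
# Exit status follows diff: 0 = same, 1 = different, 2 = trouble.
compare_megatest_logs() {
    ref_tmp=$(mktemp) || return 2
    own_tmp=$(mktemp) || { rm -f "$ref_tmp"; return 2; }
    grep -v 'MINION PROC' "$1" > "$ref_tmp" || true
    grep -v 'MINION PROC' "$2" > "$own_tmp" || true
    status=0
    diff "$ref_tmp" "$own_tmp" || status=$?
    rm -f "$ref_tmp" "$own_tmp"
    return "$status"
}

# Usage (second file name is an assumption):
#   compare_megatest_logs megatest.log my_megatest.log
```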


------------------------------------------------------------------



		Step 3: Setting Up The Rewrites and .CSS files


The .css files ( CRM114 Sparse Spectra files) are the "memory" that
crm114 uses to statistically describe the words and phrases that 
characterize various kinds of mail.

The rewrites.mfp file controls how to "rewrite" incoming email so
that your incoming email conforms more closely to what might be
considered "archetypical".  The rewrites.mfp setup is optional;
if you build your own .css files (either from empty files, or 
from corpora) you can actually replace rewrites.mfp with an
empty file; you just won't be able to share your .css files with
anyone else.

       -----

       Step 3a - Setting up the Rewrites file.

Edit the file "rewrites.mfp" and replace the placeholders (in this
case, "wsy", "merl.com", and "mail.merl.com") with your corresponding
username, domain name, and mail server information.  These rewrite
rules will be used to "scrub" your sample text of user-specific
strings.  (note that this is only strictly necessary if you want to
use the pre-built .css files.  However, it is in general recommended,
so that you can "share/merge" your .css files with your friends.)

Note the "arrowheads" in the file.  They look like this:

     >->

This is a rewrite operator.  Anything that matches the regex on the
left-hand side of the arrowhead will be replaced with the text on
the right-hand side of the arrowhead.

Example: if your name was Agent Smith, your email account 
AgentSmith@the.matrix.org, and your mail router was mail.matrix.org at
IP address 192.168.10.5, then the rewrites.mfp file should look like:

  AgentSmith@the.matrix.org>->MyEmailAddress 
  [[:space:]]Agent Smith>-> MyEmailName
  mail.matrix.org>->MyLocalMailRouter
  192.168.10.5>->MyLocalMailRouterIP
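
To get a feel for what these four rules do to a header, here is a
rough imitation using sed.  This is only a stand-in to show the
effect: CRM114 applies rewrites.mfp with its own regex engine, not
sed, and the helper name scrub_headers is made up for the example.

```shell
# Approximate the four example rewrite rules with sed.  This is NOT how
# CRM114 applies rewrites.mfp; it only imitates the visible result.
# scrub_headers is a hypothetical name.
scrub_headers() {
    sed -e 's/AgentSmith@the.matrix.org/MyEmailAddress/g' \
        -e 's/[[:space:]]Agent Smith/ MyEmailName/g' \
        -e 's/mail.matrix.org/MyLocalMailRouter/g' \
        -e 's/192.168.10.5/MyLocalMailRouterIP/g'
}

printf '%s\n' \
    'Received: from mail.matrix.org [192.168.10.5]' \
    'To: Agent Smith <AgentSmith@the.matrix.org>' | scrub_headers
```

The two sample headers come out as "Received: from MyLocalMailRouter
[MyLocalMailRouterIP]" and "To: MyEmailName <MyEmailAddress>".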

The idea is to turn your email headers into headers that don't refer
to any of your own actual name, address, etc, but contain only the
strings "MyEmailAddress", "MyEmailName", "MyLocalMailRouter", and
"MyLocalMailRouterIP".

If you have more than one incoming email name, email address, server,
router, etc., add lines in rewrites.mfp for each email name, email
address, server, router, and so forth.  This is something you really
_should_ do if you have more than one email path leading to an
account that is being filtered by CRM114.  If you don't, a lot of
learning will have to be repeated for each path, which will cost you
accuracy and use up valuable feature slots in the .css files that you
could otherwise use in more valuable ways.

On the other hand, if you have multiple email addresses that all
channel through one CRM114 fileset, and the addresses receive very
different ratios of spam and nonspam, then it _might_ be to your
advantage to not use rewrites.mfp, or to replace it with an empty
file, so that the extra statistical information of the incoming email
address is not lost.

       -----

       Step 3b - Setting up the .CSS files


You have a choice here.  You can either use the pre-learned .css files
available from crm114.sourceforge.net ,  or you can build your own .css
files dynamically as spam and nonspam email come in.  We recommend the 
latter - build your own files dynamically, as that will result in the
best final accuracy.

In either case your .css files should be in the same directory as 
your mailfilter will "run" in (yes, this can be changed, but that's 
an advanced topic).  

The particular directory that the mailfilter "runs" in is variable 
and depends on your local setup.  Assuming you will use the ".forward"
hook, there are two likely situations.

If your mail service runs on your local machine (say, you have just
one machine - and I do hope you have a firewall in that case), then
mailfilter will almost certainly "run" in your home directory- the 
directory you're in when you log in.

If your mail service runs on a mail server (not your local machine),
then you will probably have a "home directory" on that machine as well,
and that's the directory that the mail filter will run in.

If neither of these is the case, you should ask your system 
administrator what the correct directory is.



  -----

  Method A - Build Your Own Empty .CSS Files

This method will give you the best final accuracy, but you will
spend more time training.  

This is the recommended method for users wanting the best accuracy.

To start from scratch, you need to create empty .css files.  The
cssutil program will do that for you.  Just type:

   cssutil -b -r spam.css
   
   cssutil -b -r nonspam.css

and you will have created _empty_ spam.css and nonspam.css files in
your current directory (that is, the files are full-size, but contain
no information.  They'll be full of binary zeroes). 
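
If you script this step, it can be handy to create the two files only
when they are not already there.  The sketch below does that; the
helper name init_css_files and the CSSUTIL override variable are
inventions for this example, while the "cssutil -b -r" invocation is
the one shown above.

```shell
# Create empty spam.css and nonspam.css in a directory, skipping any
# file that already exists.  init_css_files is a hypothetical helper;
# CSSUTIL may be overridden to point at a cssutil outside PATH.
CSSUTIL=${CSSUTIL:-cssutil}

init_css_files() {
    dir=${1:-.}
    if ! command -v "$CSSUTIL" >/dev/null 2>&1; then
        echo "cannot find $CSSUTIL; install CRM114 first" >&2
        return 1
    fi
    for name in spam.css nonspam.css; do
        if [ -e "$dir/$name" ]; then
            echo "$dir/$name already exists, leaving it alone"
        else
            "$CSSUTIL" -b -r "$dir/$name"
        fi
    done
}

# Usage: init_css_files /path/to/your/mailfilter/directory
```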

Once you have these empty files you will have a high (50% or so)
error rate for the first few hours, till you have 'taught' CRM114
what your particular mix of spam and nonspam looks like.  Proceed 
below to "Step 4: Configuring Mailfilter".

Many people want to "preload" their spam collection into CRM114.  This
is a bad idea.  CRM114 is optimized for TOE learning - "Train Only
Errors" learning; testing something like a quarter of a million test
cases has proven that it is better to train only errors, and _only_ _as_
_they_ _occur_, than to preload a bulk database into CRM114.

The statistics from the "torture test" (about 40,000 messages)
are that training _only_ errors, in realtime, will give about 2.1 times
better accuracy than force-training a big corpus, even if the messages
are the same messages and presented in the same order.  The "why" is
mathematically complicated, but there's an intuitive description in
the FAQ.

Again: you will achieve the best possible accuracy if you let CRM114
itself make errors that you correct in real time.

  -----

  Method B - Pre-LEARNed files:

This is the simplest method, but less accurate than method A.

If you choose to use the pre-learned .css files, you need to download
the appropriate crm114 .css.tar.gz file, and then you can just type:

   tar -zxvf crm114-<version_number>.css.tar.gz

and you'll get the two files "spam.css" and "nonspam.css" in your
current directory.

Note that the download is fairly large - between 8 and 50 megabytes,
and although this will give you a good starting point for your own
statistics, you will have a better (smaller, faster) final
configuration if you build your own .css files from scratch.

  -----


 Method C - Build And Preload .CSS Files From Fresh Spam and Nonspam

If you really feel you must start by preloading some sample spam, copy
your most recent 100Kbytes or so of your freshest spam and nonspam
into two files in the current directory.  These files MUST be named
"spamtext.txt" and "nonspamtext.txt".  They should NOT contain any
base64 encodes or "spammus interruptus"; straight ASCII text is
preferred.  If they do contain such encodes, decode them by hand
before you execute this procedure.

Remember- if you do it this way, you will NOT achieve the same level
of accuracy as if you use method A (training only errors, as they
occur) above.  The only reason you might ever do it this way is if you
need some spam filtering _NOW_ and accept that you are operating with
a suboptimal filter.  This filter will be worse by about a factor of
2.1 in accuracy and a factor of two worse in speed than one built in
the optimal way (that is, method A).

That said, here's how to proceed:

You should use approximately equal amounts of spam and nonspam.

Finally, type:

	 rm -rf spam.css

	 rm -rf nonspam.css

	 make cssfiles

to build your new spam.css and nonspam.css files.

Again, let me emphasize that doing this kind of "fast build" will
lead to a final filter that is _less_ accurate and learns _slower_
than a filter that is only trained on realtime spam/nonspam errors.

  -----

  CHECKING YOUR .CSS FILES

For all three methods of setting up your .css files, you can check that
the .css files are reasonable.  Use the "cssutil" utility:

    cssutil -b -r spam.css
    
    cssutil -b -r nonspam.css

You should get back a report something like this:

     Sparse spectra file spam.css statistics: 

     Total available buckets          :      1048576 
     Total buckets in use             :       506987  
     Total hashed datums in file      :      1605968
     Average datums per bucket        :         3.17
     Maximum length of overflow chain :           39  
     Average length of overflow chain :         1.84 
     Average packing density          :         0.48

Note that the packing density is 0.48; this means that this .css file
is about half full of features.  Once the packing density gets above
about 0.9, you will notice that CRM114 will take longer to process
text.  The penalty is small at packing densities below about 0.95
and only about a factor of 2 at 0.97 .
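
The packing density is simply the buckets in use divided by the total
available buckets.  As a sketch, the following recomputes it from a
cssutil report; the helper name packing_density is made up, and it
assumes the report labels look like the sample above.

```shell
# Recompute the packing density from a cssutil report read on stdin:
# buckets in use divided by total available buckets.
# packing_density is a hypothetical helper that relies on the report
# labels shown in the sample above.
packing_density() {
    awk -F: '
        /Total available buckets/ { total = $2 + 0 }
        /Total buckets in use/    { used  = $2 + 0 }
        END {
            if (total > 0) printf "%.2f\n", used / total
            else           print "unknown"
        }'
}

# With the sample figures: 506987 / 1048576, which prints 0.48.
printf '%s\n' \
    'Total available buckets          :      1048576' \
    'Total buckets in use             :       506987' | packing_density
```

On a live system the same helper could be fed directly, as in
"cssutil -b -r spam.css | packing_density".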

Note - do NOT believe "ls -la" with respect to .css files!  Because
CRM114 uses memory mapping instead of file I/O (because it's much
faster to go through the page-fault tables than through the file I/O
system), the m_time and c_time never change, only the a_time, and that
only if your file system had the proper compile-time options to keep
track of the a_time.  Believe in what cssutil tells you- if new
features show up after learning, you _are_ learning and "ls -la" is
lying to you!



You can also see how easy it will be for CRM114 to differentiate
spam from nonspam with your .css files.  The utility "cssdiff" will
compare the statistical features of two .css files.  Try it:

    cssdiff spam.css nonspam.css

and you'll get back a report like:

   Sparse spectra file spam.css has 1048577 bins total
   Sparse spectra file nonspam.css has 1048577 bins total 

   File 1 total features            :      1605968
   File 2 total features            :      1045152

   Similarities between files       :       142039
   Differences between files        :      1279964

   File 1 dominates file 2          :      1463929
   File 2 dominates file 1          :       903113

Note that there's a big difference between the two files; in this case
there are about 9 times as many differences between the two files as
there are similarities.  That's pretty much typical.
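
The ratio can be worked out from the report itself.  Here is a small
sketch; the helper name diff_ratio is invented for this example and it
keys on the labels shown in the sample report.

```shell
# Ratio of differences to similarities from a cssdiff report on stdin.
# diff_ratio is a hypothetical helper keyed on the labels in the sample.
diff_ratio() {
    awk -F: '
        /Similarities between files/ { same = $2 + 0 }
        /Differences between files/  { differ = $2 + 0 }
        END {
            if (same > 0) printf "%.1f\n", differ / same
            else          print "unknown"
        }'
}

# With the sample figures: 1279964 / 142039, which prints 9.0.
printf '%s\n' \
    'Similarities between files       :       142039' \
    'Differences between files        :      1279964' | diff_ratio
```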

Now, move on to "Step 4: Configuring Mailfilter".

------------------------------------------------------------------------

	Step 4: Configuring Mailfilter

In this step you will tell Mailfilter what you want it to do with your
mail.  All of the options are controlled by editing one file,
named "mailfilter.cf" .  

By default, Mailfilter looks for mailfilter.cf in the initial
directory.  If you use "--fileprefix=/some/where/else/" on the command
line, mailfilter.crm will look for mailfilter.cf (and the other
runtime filtering files!) in the "/some/where/else/" directory.  This
--fileprefix mode is handy when you are setting up many users.

The format of mailfilter.cf itself is pretty simple.

0) blank lines are OK.
1) comments start with a # in column 1.  
2) Anything not a comment is a var setting, in the format:

   :var_to_set: /Value_to_set_goes_here/

All of the user-settable configuration vars have setup lines in mailfilter.cf.
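
To illustrate the format, here is a sketch that pulls one value out of
a file written in this style.  The helper name cf_get and the variable
name "example_var" are made up for the example; it assumes each
setting sits alone on its line as described above.

```shell
# Print the value assigned to one variable in a mailfilter.cf-style file.
# cf_get is a hypothetical helper: $1 is the file, $2 the variable name
# without its colons.  Comment lines (# in column 1) and blank lines
# never match; if a variable appears twice this helper prints the last.
cf_get() {
    sed -n "s|^:$2:[[:space:]]*/\(.*\)/[[:space:]]*\$|\1|p" "$1" | tail -n 1
}

# Demonstration on a throwaway file with a made-up variable.
cf_demo=$(mktemp)
printf '%s\n' '# sample settings' '' ':example_var: /example_value/' > "$cf_demo"
cf_get "$cf_demo" example_var      # prints example_value
rm -f "$cf_demo"
```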

First, you MUST change the secret password.  This is defined near the top
of the file.  Your password may contain a-z, A-Z, 0-9, but no blanks
or punctuation (at least for now).  You _must_ set this password to
something not easily guessable.  If you don't set it, you won't be able
to use mailfilter's remote commanding facility.
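
If you want to double-check a candidate password against the allowed
character set before putting it in the file, a sketch like this will
do.  The helper name password_ok is made up for this example, and it
only checks the characters, not how guessable the password is.

```shell
# Succeed only if the candidate password is non-empty and made up of
# a-z, A-Z and 0-9, which is the character set described above.
# password_ok is a hypothetical helper; it does not judge guessability.
password_ok() {
    case $1 in
        '' | *[!a-zA-Z0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

if password_ok 'Tr0ub4dor3'; then echo "usable"; else echo "not usable"; fi
if password_ok 'has space!'; then echo "usable"; else echo "not usable"; fi
```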

At first, you will probably want to leave the "log_to_allmail.txt"
enabled while you get used to CRM114.  Likewise, leave
"log_rejections" set to yes as well; that way you can easily see (with
"less" or "tail") just what is being rejected.  Once you get more
experience with CRM114, you can set these to "no" and not use up disk
space in these "extra safety" logs.

You can skim-read the rest of mailfilter.cf .  There are three
typical cases for most users:

1) If you are using Procmail:

  --> You probably will NOT need to change any of the other options.  

2) If you are NOT using Procmail, and your mail reading program can
   sort out email into folders based on whether the SUBJECT header
   contains the telltale string "ADV:" (most mail readers can do this):

  --> You probably will NOT need to change any of the other options.

3) You are NOT using Procmail, and your mail reading program is "dumb"
   (cannot sort email into folders based on subject line):

  --> You probably will want to define a separate account that will
   receive all spam caught (otherwise, you'll just get all your spam
   delivered as usual, with additional headers telling you it was
   spam).

   To do this, look down to ":general_fails_to:".  Insert the full
   username@domainname.tld mail address where you want your spam to be
   sent.
   

  Note on mime decoders: There are a number of them available; the defaults
  given in mailfilter.cf may or may not be valid on your system.  Further,
  it may have a different path than the default given in mailfilter.cf.
  Yet further, you may want to load your own, like "normalizemime" (see
  the crm114.sourceforge.net web page for details on the download).

You can also configure the verboseness (or not) of your filtered 
results.  You can go from "no changes" (not even a statistical label
in the headers) to complete results including an expansion of any
base64 texts and HTML decommented strings.  

Feel free to change things to get the look and feel you want; after
all, what good is open source if you don't change it?  :)

HOWEVER, Please don't muck with variables that aren't in the 
mailfilter.cf file. "You make a mess, you clean it up."  :-)

After making these changes, write out "mailfilter.cf".  You may
later go back and change the configuration options, but the options
as already set are good for most users.  You do not need to do anything
to "load in" the new options, as CRM114 reads them in fresh from the 
file during initialization for each email.

Now, edit the file "rewrites.mfp".  Make the changes to insert your name,
your domain, your local mail router, and your local mail router's IP	
address as specified by the placeholders.  (again, strictly speaking
this is not absolutely necessary, but it's good hygiene and will allow
you to swap and merge .css files with your friends)

If you have more than one possible mailserver, mail router, domain, etc.
you can add extra lines to rewrites.mfp as desired.  This is very
handy for systems that have more than one IP address accepting mail.

  -----

Once you have set up mailfilter.cf and rewrites.mfp, you can
test your configuration by typing the following (The '^D' at the
end is a control-D, which is an END-OF-FILE on Linux.  Other systems
may use a different END-OF-FILE character):

    ./mailfilter.crm 
    This is a test.  Just type a few lines of text
    that you might ordinarily get, like a short rant on why
    Perl is useless for big projects, or why Linux is
    superior or inferior to NetBSD.
    ^D

If you have set up Mailfilter for Procmail-style filtering you will
always get a small report back saying something like either of these
(the actual numbers will change, but you should have something that
_vaguely_ looks like the following):

  From foo@bar  Thu Sep 18 19:20:35 2003
  X-CRM114-Status: Good  ( pR: 12.630237  )

   ** ACCEPT: CRM114 PASS SBPH/BCR TEST** 
  Probabilistic match quality: 1.000000, pR: 12.630237 
  P(succ): 1.000000e-00, P(fail): 2.342950e-13 
  Features: 336, S hits : 4313, F hits : 5901 
 
or:

  From foo@bar  Thu Sep 18 19:19:39 2003
  X-CRM114-Status: SPAM  ( pR: -2.866484  )

   ** REJECT: CRM114 FAIL SBPH/BCR TEST** 
  Probabilistic match quality: 0.001358, pR: -2.866484 
  P(succ): 1.358082e-03, P(fail): 9.986419e-01 
  Features: 144, S hits : 2337, F hits : 3313 
 
If you are using "mail to spamtrap account" filtering, then you will
either get an "accept" report back (the first report above is an
"accept") or the text you typed in will be mailed to your spamtrap
address.  If you don't get a report back, check the spamtrap address
and see if your test text ended up there.

If you don't get _either_ of the above, something is broken, either in
your installation of CRM114 or in your configuration file.  You need
to fix the problem before you engage Mailfilter.
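
If you want to script this check, the verdict can be read straight
from the X-CRM114-Status header in the report.  The following is a
sketch only; the helper name crm_verdict is made up, and it assumes
the header looks like the two samples above.

```shell
# Extract the verdict word from the first X-CRM114-Status header on
# stdin.  crm_verdict is a hypothetical helper; it prints e.g. "Good"
# or "SPAM", or nothing at all if no such header is present.
crm_verdict() {
    sed -n 's/^[[:space:]]*X-CRM114-Status:[[:space:]]*\([A-Za-z]*\).*/\1/p' |
        sed -n '1p'
}

# Fed the first sample report above, this prints Good.
printf '%s\n' \
    'From foo@bar  Thu Sep 18 19:20:35 2003' \
    'X-CRM114-Status: Good  ( pR: 12.630237  )' | crm_verdict
```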

If your installation and configuration passes the above test,
congratulations!  You have now configured mailfilter.crm .  Onward, to
"Step 5: Engaging Mailfilter".

----------------------------------------------------------------------------

	Step 5: Engaging Mailfilter

There are two common ways to engage Mailfilter.crm on your incoming
mail stream: you can use Procmail recipes and have Mailfilter run as a
procmail subprocess, or you can use the .forward hook of Sendmail (and
Sendmail clones which also support .forward).

In the first method (recommended), you use Procmail's ability to
execute a program as part of a Procmail recipe to run CRM114, which
adds headers as needed to let Procmail or your mail-reading program do
the sorting.
  
In the .forward method, you (or your system manager) must place a
command that runs crm114 in the directory /etc/smrsh.  This
is because sendmail will NOT run any program that isn't "approved" by
the system manager (by linking it into /etc/smrsh/whatever).  The output
of mailfilter is then directly appended to your /var/spool/mail file
(or possibly forwarded to your spam-bucket account).

  -----

    Method A: For Procmail Users

For Procmail users just add a procmail recipe to .procmailrc to run
CRM114 and mailfilter whenever your other procmail rules fail to
decide what to do.

Here's a sample Procmail recipe set.  Notice that we actually have TWO
recipes - one to actually run crm114 and mailfilter, the other to 
then sort the mail based on the result.

 #
 #

 :0fw: .msgid.lock
 | /usr/bin/crm -u /home/my_user_directory mailfilter.crm
  
 :0:
 * ^X-CRM114-Status: SPAM.*
 mail/crm-spam

That's all that Procmail users should need.  Mailfilter should now be
active - send yourself a test message and see where it ends up.

If you get the test message, proceed to "Step 6: Training CRM114".

( note: Sub-Method A-one)

If you use an MUA that can highlight on headers, you can use something
like this in your procmail (from Philipp Weiss):

in .procmailrc

 CRMSCORE=`$HOME/bin/crmstats.sh`
 :0fw: .formail.crm114.lock
   | formail -I "X-CRM114-Score: $CRMSCORE"

where ~/bin/crmstats.sh is a simple script:

 #!/bin/bash
 grep -a -v "^X-CRM114" | \
   /usr/bin/crm -u $HOME/.crm114 mailfilter.crm --stats_only

Advanced Topic: Huge Emails and Denial Of Service Avoidance

CRM114 has a built-in anti-Denial-of-Service (anti-DoS) feature in
that it will not grow buffers beyond a certain limit.

However, you may find that you actually receive emails bigger than
this limit.  In these cases, it is effective to simply filter on
the first few tens of kilobytes of incoming text.

This is easy to do with "head".  head -c 10000 gives the first 10,000 
characters of input, which is usually adequate for CRM114 to get a 
good decision on.  This can be directly piped in right in the procmail
command:

 :0fw: .msgid.lock
 | head -c 10000 | /usr/bin/crm -u /home/my_user_directory mailfilter.crm
  
 :0:
 * ^X-CRM114-Status: SPAM.*
 mail/crm-spam
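
To see what "head -c" does on its own, outside of procmail, here is a
small sketch.  The wrapper name first_bytes is invented for the
example; the 10000 default simply mirrors the figure used above.

```shell
# Show that "head -c" caps the number of bytes passed downstream.
# first_bytes is a hypothetical wrapper; the default of 10000 matches
# the figure used in the recipe above.
first_bytes() {
    head -c "${1:-10000}"
}

# Feed in 20000 bytes and count what comes out the other side.
count=$(head -c 20000 /dev/zero | first_bytes | wc -c)
echo "bytes passed on: $((count))"
```

This prints "bytes passed on: 10000".  Keep in mind that in a procmail
filter recipe (the "f" flag) the output of the pipeline becomes the
message, so it is worth checking on a test message that what gets
delivered is what you expect.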



  -----

    Method B: The .forward hook file

For .forward hook users: you should NOT put a direct link to crm in
/etc/smrsh; since crm can do arbitrary things, you ought to attempt to
control the damage as much as possible.  Instead, put a small wrapper
command there that can only run mailfilter.crm.

 1) create the wrapper /etc/smrsh/crmfilter, which calls crm114's
   executable binary in /usr/bin, by becoming root and typing:

   cat > /etc/smrsh/crmfilter 
   /usr/bin/crm mailfilter.crm >> /var/spool/mail/your_account_name_here
   ^D
   
 2) add a .forward file to your account by typing:

   cat > .forward
   |/etc/smrsh/crmfilter
   ^D

That's all.  The mailfilter should now be active - send yourself a test
message and see where it ends up.

  ----

Once you have engaged CRM114 mailfilter, you now get to train it to 
recognize spam and nonspam.  Proceed to "Step 6: Training CRM114".

Note: CRM114 contains a design decision that you may have to play 
with.  Instead of doing memory management games, which both consume
significant runtime CPU and present a major denial-of-service
opportunity, CRM114 has an upper limit on the window size and it simply
won't exceed that limit (it gives an error message if an incoming
message tries to exceed the limit).

You -can- change the maximum memory limit at runtime with the -w nnnnn
flag; for example, if you want 100 megabytes of memory available, you can 
set that with

    ...  -w 100000000

to set 100,000,000 bytes as the hard limit ceiling on memory usage.

---------------------------------------------------------------------------


	Step 6: Training CRM114 and Mailfilter

One of the great strengths of CRM114 Mailfilter is that it has no
preconceived notions of "spam" and "nonspam".  It _learns_ what you
consider spam, and what you consider nonspam.

For the first few days CRM114 will make a lot of mistakes sorting
spam and nonspam.  It is _very_ important that you train each mistake
back into CRM114, otherwise it will never learn what you consider spam
or nonspam.

You should train in the mistake as quickly as possible.  Start one
morning and try to train every hour for the first few hours at least.
Don't think you're training a computer; pretend you're housebreaking a
new puppy.

You train mistakes right from your mail reader.  There are two ways
to do this...

The first way is to use the built-in command feature.  Just forward
the mistake back to yourself, with full headers (except edit out any
CRM114-added headers or text).

Just before the first line of the text to be "learned" as spam or
nonspam insert a COMMAND line.  Everything from the command
line to the end of the message will be learned (so edit the text
to remove things you _don't_ want considered indicative of spam/nonspam
nature).

The command line looks like this:

	command <yoursecretpassword> spam
or
	command <yoursecretpassword> nonspam

The "c" in "command" must be in column 1, and you must put your 
secret password into the command line.  Don't use the <> brackets,
use JUST your secret password.

Examples: If your secret password was "Ihatespam", then the command line
to learn something as spam would be:

	command Ihatespam spam

and the command to learn something as nonspam would be:

        command Ihatespam nonspam
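
If you'd rather not type the command line by hand each time, the body
of the mail-to-yourself message can be assembled by a small script.
This is only a sketch: the function name training_body is invented
here, and it assumes that dropping every line that starts with
"X-CRM114" (the same grep that crmstats.sh uses) is enough to remove
the CRM114-added headers.  Any text CRM114 added elsewhere still has
to be edited out by hand.

```shell
# training_body PASSWORD CLASS: read a misfiled message on stdin and
# write the text to mail back to yourself: the command line first,
# then the message with X-CRM114 lines removed.
# Hypothetical helper, not part of mailfilter.crm.
training_body() {
    password=$1
    class=$2
    case $class in
        spam|nonspam) ;;
        *) echo "class must be spam or nonspam" >&2; return 1 ;;
    esac
    # The "c" of "command" lands in column 1, as required.
    printf 'command %s %s\n' "$password" "$class"
    grep -a -v '^X-CRM114'
}
```

For example, "training_body Ihatespam spam < the_spam.txt" writes the
text you would then mail to yourself.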

The second way to train in spam and nonspam is to use mailfilter.crm's
command line options.  When you find a spam that was mistakenly 
accepted as good mail, pipe it through mailfilter.crm with the 
"--learnspam" flag set, like this:

	 bash> mailfilter.crm --learnspam < the_spam.txt

Likewise, if you get an email that was falsely classified as a spam,
pipe it through mailfilter with the "--learnnonspam" flag set, like
this:

         bash> mailfilter.crm --learnnonspam < the_NON_spam.txt

(yes, if you have a scriptable mail reader, you can put these 
functions right on the menu bars somewhere.  Yes, that's a hint.  :) )
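
Taking the hint, here's a minimal sketch of such a wrapper, suitable
for binding to a key or a menu entry.  The names learn_flag and
train_error are made up for this example, and it assumes mailfilter.crm
can be run exactly as in the two commands above.

```shell
# learn_flag CLASS: map "spam" or "nonspam" to the matching
# mailfilter.crm flag.  Hypothetical helper.
learn_flag() {
    case $1 in
        spam)    printf '%s\n' --learnspam ;;
        nonspam) printf '%s\n' --learnnonspam ;;
        *)       echo "usage: learn_flag spam|nonspam" >&2; return 1 ;;
    esac
}

# train_error CLASS FILE: train one misclassified message, saved in
# FILE, as CLASS.
train_error() {
    flag=$(learn_flag "$1") || return 1
    mailfilter.crm "$flag" < "$2"
}
```

So "train_error spam the_spam.txt" does the same thing as the first
command above, and "train_error nonspam the_NON_spam.txt" the second.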

For both ways: try to train _approximately_ equal amounts of spam and
nonspam.  If you are within 50% one way or the other, performance will
be very good.

Train only errors!  This is called TOE training.  (TOE :== Train Only
Errors) It's not necessary to train near-misses; experiments show that
the performance increase on training near misses is minuscule at best,
and may be negative at times.

For at least the first day or so, it's best to check your mail at
least every hour or so and send training information back to CRM114.
This will help it rapidly converge on a good set of statistics for
your particular mix of spam and nonspam.

It will take several days worth of errors for CRM114's mailfilter to
approach 95% accuracy, and around two weeks to a month to reach 99+
per cent accuracy.  I usually exceed 99.9% accuracy (less than one
error per thousand).



      What To Do if CRM114 says "LEARNING UNNECESSARY..."
      ---------------------------------------------------

Occasionally, some CRM114 configurations may refuse to learn an error,
claiming that it "got it right the first time" (yes, this is a subtle bug
that is not allowing itself to be found, but there is reason to believe
it has to do with the interaction of mail clients and headers and that
some mail readers don't give you the headers, the full headers, and
nothing but the headers.)

While we applaud this self confidence, the error is still there, so
you need to "force" the learning.  You can do this either from BASH or
from the mail-to-yourself command line.  For BASH, add "--force" to
the command line; for mail-to-yourself commands, just add "force".

From BASH, add --force to the command line:

  bash> mailfilter.crm --learnspam --force < the_error_text

for mail-to-yourself, add "force" to the command line:

  command mysecretpassword spam force

(and similarly for nonspam).



         The training files "spamtext.txt" and "nonspamtext.txt"
	 ------------------------------------------------------

Whenever CRM114 learns a new spam or nonspam, it not only modifies
the .css files, but it also keeps the source text of that learning
in the files "spamtext.txt" or "nonspamtext.txt".  

These two files can be considered the "source code" of your .css
files; they're all you really need to rebuild your .css files if/when
you upgrade CRM114 and the .css file is changed but the algorithm is
similar.  For example, upgrading from Markovian filtering (the
default) to Winnow or OSBF is "incompatible", and you might want to
start with these files as a kickstart.

... but not necessarily; some filtering is radically different than
Markovian; as we add new filters as technology moves forward, 
sometimes we will be able to kickstart, and sometimes we can't.

   - for upgrades that can use the current .css files, we will say so;

   - for upgrades that cannot use the current .css files, but *can*
     get kickstarted from spamtext.txt and nonspamtext.txt, we will
     say so;

   - for upgrades that are radically different enough that you must
     relearn from scratch, we will say so (and have you rename your
     old spamtext and nonspamtext files so that they will not be
     accidentally reused).

If your mail system is so short of disk that you cannot afford to keep
these (relatively) small files, then you may either delete them or
symlink these files to /dev/null; you don't absolutely *need* them.
These files are quite small, though.  I have been running CRM114 for
nearly five years now and my *total* example text sizes are 678 Kbytes
for nonspam and 893 Kbytes for spam (that's after daily use and about
a gigabyte of email).



-----------------------------------------------------------------------



	Step 7: Adding Priority Lists, Whitelists, and Blacklists

If you really want, you can add white, black, and priority lists
to CRM114.  Most people don't need them, but there are always
exceptions.

For example, your lawyer, your boss, and your paramour all probably
rate being on your "whitelist", so whatever they send to you is always
marked "nonspam".  Likewise, your ex-girlfriend/boyfriend, your
nagging acquaintance, and the stalker from the library should all get
blacklisted.

Whitelisting, blacklisting, and prio-listing are all based on regex
matching.  If the regex you put in the file "whitelist.mfp" matches
the incoming mail _anywhere_, the mail will be marked "good" no matter
how it scores statistically.  Similarly, if the mail matches any regex
in "blacklist.mfp", the mail will be marked as "spam", no matter how
it compares statistically.


Note that sometimes this can cause considerable confusion, for example
"ac.com" in a whitelist will not just match "billing.ac.com", but also
"drac.complete.viagra.sales.com"  (the  match  being the  'ac.com'  in
"drac.complete").  To prevent this, use  ^ and $ to "anchor" the start
and end of the regex, if possible.
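
You can see the effect with grep, used here only as a stand-in for
CRM114's own regex matching (that's an assumption of this sketch).  The
two host names are the ones from the paragraph above, one per line.
The anchored pattern also escapes the dots, since an unescaped "."
matches any character.

```shell
hosts='billing.ac.com
drac.complete.viagra.sales.com'

# count_matches REGEX: how many of the sample lines match REGEX.
# Hypothetical helper for this demonstration.
count_matches() {
    printf '%s\n' "$hosts" | grep -c -E -- "$1"
}

count_matches 'ac.com'                 # unanchored: prints 2
count_matches '^billing\.ac\.com$'     # anchored and escaped: prints 1
```

Whether ^ and $ anchor at line ends or at the ends of the whole message
inside mailfilter is worth checking against your own test mail.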

Lastly (well, actually firstly, because prio-listing happens before
whitelisting or blacklisting) any mail that matches any regex in
priolist.mfp is whitelisted or blacklisted as that entry directs.  The
format of priolist.mfp is that the first character
on the line is a + or a -, which indicates "whitelist" or "blacklist",
and the rest of the line is a regex.  These regexes are tested
in the order given in the file.  An empty file is perfectly acceptable.
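
The priolist rules can be sketched in shell as follows.  This is an
illustration of the file format only, not mailfilter.crm's actual
code: it uses grep -E instead of CRM114's regex engine, and it assumes
the first matching line wins, which "tested in the order given"
suggests but does not spell out.  The function name prio_verdict is
made up.

```shell
# prio_verdict LISTFILE: read a message on stdin and print
# "whitelist", "blacklist", or "none" according to LISTFILE.
# Hypothetical helper; assumes the first matching entry decides.
prio_verdict() {
    listfile=$1
    mail=$(cat)                        # the whole message
    while IFS= read -r entry || [ -n "$entry" ]; do
        [ -n "$entry" ] || continue    # skip empty lines
        sign=${entry%"${entry#?}"}     # first character: + or -
        regex=${entry#?}               # the rest of the line
        if printf '%s\n' "$mail" | grep -E -q -- "$regex"; then
            case $sign in
                +) echo whitelist; return 0 ;;
                -) echo blacklist; return 0 ;;
            esac
        fi
    done < "$listfile"
    echo none                          # an empty file ends up here
}
```

For example, "prio_verdict priolist.mfp < one_saved_message.txt".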

For examples of how to set up the whitelist, blacklist, and priolist
files, see the included "whitelist.mfp.example", "blacklist.mfp.example",
and "priolist.mfp.example".

Note: for my accuracy tests, I *turn off* whitelists, blacklists,
and prio-lists.  

Be sure to test any whitelist, blacklist, or other list that you 
add, otherwise you may get a rude surprise some day.


----------------------------------------------------------------


           Step 8: Useful Utilities

You don't _need_ to know the stuff in this section to set up and use
CRM114 and mailfilter, but it might be useful to you, or at least 
satisfy some of your curiosity.

There are three utilities for dealing with the .css files (these
are the files that contain the "learned information").

The utilities are:

    cssutil - gives you a readout of the characteristics of the
	      information in a .css file  

    cssdiff - gives you a summary of the differences between two
	      .css files (handy for seeing learning!)

    cssmerge - merges two .css files into one; handy for importing
	      new data into a .css file.  Note that this is 
	      a destructive operation on the first .css file named!


                   The cssutil utility
                   -------------------

Usage is
    
    cssutil somefile.css

which will give you statistics on the file somefile.css.  You can 
then rescale, clip, and otherwise manage your .css files.  It is especially
useful to check the "Average Packing Density" of the .css files
you use; when it approaches .7 to .8, you may want to consider enlarging your 
.css file.  To do that, see "Enlarging a .css file" below.

Here's the -h help:

      Usage: cssutil [-b -r] [-s css-size] cssfile
                -h   - print this help
                -b   - brief; print only summary
                -r   - report then exit (no menu)
                -s css-size  - if no cssfile found, create new
                               cssfile with this many buckets.
                -S css-size  - same as -s, but round up to next
                               2^n + 1 boundary.




	   	    The cssdiff utility
                    -------------------

To get the difference between two .css files, use

    ./cssdiff somefile.css anotherfile.css

which writes out a summary of how different the two .css files are.



                    The cssmerge utility
                    --------------------

To merge two .css files, use cssmerge .

    ./cssmerge outfile.css infile.css

Note that this is _destructive_ to outfile.css, so make a copy
somewhere else first.  You _CAN_ merge two .css files of different
length.  You can also expand (or contract) a .css file this way:
rename the old file, and allow a new one to be created with learnspam
or learnnonspam while using the '-s nnnnnnnnn' s(lots) flag to set the
number of feature slots desired in the new file.  Then cssmerge your
old file into the fresh new file, and all is well.

Here's the cssmerge help:

   Usage: cssmerge <out-cssfile> <in-cssfile> [-v] [-s]
    <out-cssfile> will be created if it doesn't exist.
    <in-cssfile> must already exist.
     -v           -verbose reporting
     -s NNNN      -new file length, if needed
	

   


		Enlarging a .css file
                ---------------------

One of the advantages of CRM114 is that the .css files are relatively
small and of fixed size; they don't grow out of control and never need
trimming if you use <microgroom>, which is the default in mailfilter.crm .

The disadvantage of this is that if your spam/nonspam discrimination
is too convoluted, it won't be able to sort them out ( in trek-speak
this is a high-order nonlinearity in the discrimination function ).
The fix in this situation is to increase the dimensionality of the
feature space.  The number of dimensions is about 1/12 the number of
bytes in the .css files; this works well at about a million dimensions
(12 megabytes) for most people.

But if you're not most people, you may need to (eventually) increase it.
You can tell when this is necessary- running
    
    cssutil

will give you a utilization and percentage of slots full; when that gets
up near 95 percent, you may be running low on space and old features
will be erased to make room for new features (that is, your feature set
will dynamically evolve in real time to find what works.)  However, that's
slow and may cause a slight loss of accuracy.

One way to fix this is to "increase the dimensionality of the discrimination
hyperspace" (no, I am not making that phrase up).   It means to add 
new slots to the .css files.

The easiest way to do this is to 
    1) use cssutil to create a temporary, empty, larger .css file 
    2) merge the data from the old, small .css file onto the new big file.
    3) copy the new big file over the old, small file.

You can even combine steps 1 and 2, because newer versions of cssmerge
will create a new file if needed (the -s N flag sets the number of slots
in the new file; -S N does the same thing but rounds up to a 2^N+1 
boundary, which is recommended ).

For example, here's how to increase the size of the spam.css file
from 1,000,001 slots (the default) to 2,000,001 slots.  Just type:

   cssmerge temporary.css spam.css -s 2000001
   mv temporary.css spam.css

The newly replaced spam.css will have all of the features of the old
spam.css file, but will be 2000001 slots long instead of the default
1000001 slots.
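
If you'd like a safety net around those two commands, here is a sketch
of a wrapper that keeps a backup and only replaces the original when
cssmerge succeeds.  The function name enlarge_css and the ".bak" and
".enlarge" file names are invented for this example; the cssmerge
invocation itself is the one shown above.

```shell
# enlarge_css CSSFILE SLOTS: rebuild CSSFILE with SLOTS feature slots.
# Hypothetical wrapper; keeps the old file as CSSFILE.bak.
enlarge_css() {
    css=$1
    slots=$2
    tmp="$css.enlarge.$$"              # temporary output file
    cp -p "$css" "$css.bak" || return 1
    if cssmerge "$tmp" "$css" -s "$slots"; then
        mv "$tmp" "$css"               # swap in the bigger file
    else
        rm -f "$tmp"
        echo "cssmerge failed; $css left untouched" >&2
        return 1
    fi
}
```

For example, "enlarge_css spam.css 2000001" does the same job as the
two commands above, and leaves the old file in spam.css.bak.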


---------------------------------------------------------------------

That's all!  If you have errors or updates (or find bugs!) please
let me know; the best way is to join the CRM114-general mailing list; it's
on the webpage:
   
   http://crm114.sourceforge.net  

and ask there.  The reason for using the mailing list rather than
personal email is that personal email isn't archived, but the mailing
list _is_ both archived and read widely, so we not only create a
background archive of solutions but you will get a better answer back
faster than if you sent the email to me alone.

Enjoy, and good luck.

       -Bill Yerazunis
   
