THESEUS(1)                     Likelihood Rocks                     THESEUS(1)



NAME
       theseus - Maximum likelihood, multiple simultaneous superpositions with
       statistical analysis

SYNOPSIS
       theseus  [-aAbBcCdDeEfFgGhHiIjklLmMnNoOpPqQrRsStTuvVwWxXyYZ]   pdbfile1
       [pdbfile2 ...]

       and

       theseus_align   [-aAbBcCdDeEfFgGhHiIjklLmMnNoOpPqQrRsStTuvVwWxXyYZ]  -f
       pdbfile1 [pdbfile2 ...]

       Default usage is equivalent to:

       theseus -a0 -e2 -g1 -i200 -k-1 -p1e-7 -r theseus -v -P0 your.pdb

DESCRIPTION
       Theseus superpositions a set of  macromolecular  structures  simultane-
       ously using the method of maximum likelihood (ML), rather than the con-
       ventional least-squares criterion.  Theseus assumes that the structures
       are  distributed  according  to a matrix Gaussian distribution and that
       the eigenvalues of the atomic covariance matrix are hierarchically dis-
       tributed according to an inverse gamma distribution. This ML superposi-
       tioning model produces much more accurate results by essentially  down-
       weighting variable regions of the structures and by correcting for cor-
       relations among atoms.

       Theseus operates in two main modes, a mode for superimposing structures
       with  identical  sequences  and  a  mode  for structures with different
       sequences but similar structures:

              (1) A mode for superpositioning  macromolecules  with  identical
              sequences and numbers of residues, for instance, multiple models
              in an NMR family or multiple structures from  different  crystal
              forms of the same protein. In this mode, Theseus will read every
              model in every file on the command line and superposition  them.

              Example:

              theseus 1s40.pdb

              In the above example, 1s40.pdb is a pdb file of 10 NMR models.

              (2)  An  "alignment"  mode  for superpositioning structures with
              different sequences, for example,  multiple  structures  of  the
              cytochrome  c protein from different species or multiple mutated
              structures of hen egg white lysozyme.  This  mode  requires  the
              user to supply a sequence alignment file of the structures being
              superpositioned (see option -A and "FILE FORMATS" below).  Addi-
              tionally,  it  may  be  necessary to supply a mapfile that tells
              theseus which PDB structure files correspond to which  sequences
              in  the alignment (see option -M and "FILE FORMATS" below). When
              superpositioning based on a seqeunce alignment, theseus  uses  a
              novel maximum likelihood algorithm for superpositioning multiple
              structures that include arbitrary gaps and  insertions  relative
              to  each other.  Unlike other algorithms for simultaneous super-
              positioning of multiple structures, our Expectation-Maximization
              algorithm  uses  all  available  data  by including all residues
              aligned with gaps in the calculations.  In this mode,  if  there
              are multiple structural models in a PDB file, theseus only reads
              the first model in each file  on  the  command  line.  In  other
              words,  theseus treats the files on the command line as if there
              were only one structure per file.

              Example 1:

              theseus -A  cytc.aln  -M  cytc.filemap  d1cih__.pdb  d1csu__.pdb
              d1kyow_.pdb

              In  the above example, d1cih__.pdb, d1csu__.pdb, and d1kyow_.pdb
              are pdb files of cytochrome c domains from the SCOP database.

              Example 2:

              theseus_align -f d1cih__.pdb d1csu__.pdb d1kyow_.pdb

              In this example, the theseus_align script is called  to  do  the
              hard  work  for you.  It will calculate a sequence alignment and
              then superimpose based  on  that  alignment.   The  script  the-
              seus_align takes the same options as the theseus program.  Note,
              the first few lines of this script must  be  modified  for  your
              system,  since  it calls an external multiple sequence alignment
              program to do the alignment.  See the  examples/  directory  for
              more details, including example files.

OPTIONS
   Algorithmic options, defaults in {brackets}:
       -a [selection]
              Atoms  to  include  in the superposition.  This option takes two
              types of arguments, either (1) a number specifying a preselected
              set  of atom types, or (2) an explict PDB-style, colon-delimited
              list of the atoms to include.

              For the preselected atom type  subsets,  the  following  integer
              options are available:

               o 0, alpha carbons for proteins, C1' atoms for nucleic acids
               o 1, backbone
               o 2, all
               o 3, alpha and beta carbons
               o 4, all heavy atoms (no hydrogens)

              Note,  only  the  -a0  option is available when superpositioning
              structures with different sequences.

              To custom select an explicit set of atom types, the  atom  types
              must  be  specified  exactly  as  given  in  the PDB file field,
              including spaces, and the atom-types must encapsulated in quota-
              tion  marks.   Multiple atom types must be delimited by a colon.
              For example,

              -a' N  : CA : C  : O  '

              would specify the atom types in the peptide backbone.



       -c     Use ML  atomic  covariance  weighting  (fit  correlations,  much
              slower)

              Unless  you  have  many  different structures with few residues,
              fitting the correlation matrix is likely  unwarranted  statisti-
              cally due to a plethora of parameters and a paucity of data.


       -e [n] Embedding algorithm for initializing the average structure
               o 0 = none; use randomly chosen model
               o {2} = {ML embedded structure}


       -f     Only read the first model of a multi-model PDB file


       -g [n] Hierarchical model for variances
               o 0 = none (may not converge)
               o {1} = inverse gamma distribution


       -h     Help/usage


       -i [nnn]
              Maximum iterations, {200}


       -k [n] constant  minimum  variance  {-1} {if set to negative value, the
              minimum variance is determined empirically}


       -p [precision]
              Requested relative precision for convergence, {1e-7}


       -r [root name]
              Root name to be used in naming the output files, {theseus}


       -s [n-n:...]
              Residue selection (e.g. -s15-45:50-55), {all}


       -S [n-n:...]
              Residues to exclude (e.g. -S15-45:50-55) {none}

              The previous two options  have  the  same  format.  Residue  (or
              alignment column) ranges are indicated by beginning and end sep-
              arated by a dash.  Multiple ranges, in any arbitrary order,  are
              separated by a colon.  Chains may also be selected by giving the
              chain ID immediately preceding the residue range.  For  example,
              -sA1-20:A40-71  will  only  include residues 1 through 20 and 40
              through 70 in chain A. Chains cannot be specified when  superpo-
              sitioning structures with different sequences.


       -v     use ML variance weighting (no correlations) {default}


   Input/output options:
       -A [sequence alignment file]
              Sequence  alignment  file to use as a guide (CLUSTAL or A2M for-
              mat)

              For  use  when  superpositioning   structures   with   different
              sequences.  See "FILE FORMATS" below.


       -E     Print expert options


       -F     Print FASTA files of the sequences in PDB files and quit

              A  useful option when superpositioning structures with different
              sequences.  The files output with this  option  can  be  aligned
              with  a  multiple  sequence alignment program such as CLUSTAL or
              MUSCLE, and the resulting output alignment file used as  theseus
              input with the -A option.


       -h     Help/usage


       -I     Just calculate statistics for input file; don't superposition


       -M [mapfile]
              File that maps sequences in the alignment file to PDB files

              A  simple  two-column  formatted file; see "FILE FORMATS" below.
              Used with mode 2.


       -n     Don't write transformed pdb file


       -o [reference structure]
              Reference file to superposition on, all rotations  are  relative
              to the first model in this file

              For   example,   'theseus   -o   cytc1.pdb  cytc1.pdb  cytc2.pdb
              cytc3.pdb' will superposition  the  structures  and  rotate  the
              entire  final superposition so that the structure from cytc1.pdb
              is in the same orientation as  the  structure  in  the  original
              cytc1.pdb PDB file.


       -O     Olve's segID file

              Useful  output  when  superpositioning structures with different
              sequences (mode 2).  In 'theseus.pdb', the main output  superpo-
              sition  PDB  file,  the  segID field now holds the number of the
              sequence alignment column that  it  belongs  to.   This  number,
              divided  by  100,  is  also  echoed in the B-factor field.  When
              using O (or any other capable molecular visualization  program),
              one can then color by B-factor ranges and immediately see in the
              superposition which regions of the structure are aligned in  the
              sequence  alignment  file.   An  additional file is also output,
              called 'theseus_olve.pdb' which only  contains  the  very  atoms
              that  were  included  in the ML superposition calculation.  That
              is, it will only contain alpha carbons or phosphorous atoms, and
              it will only contain atoms from the columns selected with the -s
              or "-S" options.  Requested by Olve Peersen  of  Colorado  State
              University.


       -V     Version


   Principal components analysis:
       -C     Use covariance matrix for PCA (correlation matrix is default)


       -P [nnn]
              Number of principal components to calculate {0}


              In  both  of the above, the corresponding principal component is
              written in the B-factor field of the output  PDB  file.  Usually
              only the first few PCs are of any interest (maybe up to six).

               EXAMPLES theseus 2sdf.pdb


       theseus -l -r new2sdf 2sdf.pdb


       theseus -s15-45 -P3 2sdf.pdb


       theseus  -A  cytc.aln  -M  cytc.mapfile  -o  cytc1.pdb -s1-40 cytc1.pdb
       cytc2.pdb cytc3.pdb cytc4.pdb

ENVIRONMENT
       You can set the environment variable 'PDBDIR' to your PDB  file  direc-
       tory  and  theseus will look there after the present working directory.
       For example, in the C shell (tcsh or csh), you can put  something  akin
       to this in your .cshrc file:

       setenv PDBDIR '/usr/share/pdbs/'


FILE FORMATS
       Theseus    will    read    standard    PDB    formatted    files   (see
       <http://www.rcsb.org/pdb/>).  Every effort has been made for  the  pro-
       gram to accept nonstandard CNS and X-PLOR file formats also.

       Two  other  files deserve mention, a sequence alignment file and a map-
       file.


   Sequence alignment file
       When superpositioning  structures  with  different  residue  identities
       (where  the lengths of each the macromolecules in terms of residues are
       not necessarily equal), a sequence alignment file must be included  for
       theseus  to  use  as  a  guide  (specified  by the -A option).  Theseus
       accepts both CLUSTAL and A2M (FASTA) formatted multiple sequence align-
       ment files.


       NOTE  1:  The  residue sequence in the alignment must match exactly the
       residue sequence given in the coordinates of the  PDB  file.  That  is,
       there can be no missing or extra residues that do not correspond to the
       sequence in the PDB file. An easy way to  ensure  that  your  sequences
       exactly match the PDB files is to generate the sequences using theseus'
       -F option, which writes out a FASTA  formatted  sequence  file  of  the
       chain(s)  in  the PDB files. The files output with this option can then
       be aligned with a multiple sequence alignment program such  as  CLUSTAL
       or  MUSCLE,  and  the  resulting  output alignment file used as theseus
       input with the -A option.


       NOTE 2: Every PDB file must have a corresponding sequence in the align-
       ment.   However,  not  every  sequence in the alignment needs to have a
       corresponding PDB file. That is, there can be extra  sequences  in  the
       alignment that are not used for guiding the superposition.


   PDB -> Sequence mapfile
       If  the  names  of  the  PDB  files  and the names of the corresponding
       sequences in the alignemnt are identical, the mapfile may  be  omitted.
       Otherwise,  Theseus needs to know which sequences in the alignment file
       correspond to which PDB structure files. This information  is  included
       in  a mapfile with a very simple format (specified with the -M option).
       There are only two columns separated by whitespace:  the  first  column
       lists  the  names  of  the PDB structure files, while the second column
       lists the corresponding sequence names exactly as given in the multiple
       sequence alignment file.

       An example of the mapfile:

       cytc1.pdb    seq1
       cytc2.pdb    seq2
       cytc3.pdb    seq3


SCREEN OUTPUT
       Theseus  provides output describing both the progress of the superposi-
       tioning and several statistics for the final result:


       Least-squares <sigma>:
              The standard deviation for the superposition, based on the  con-
              ventional  assumption  of  no  correlation  and equal variances.
              Basically equal to the RMSD from the average structure.


       Classical LS pairwise <RMSD>:
              The conventional RMSD for the superposition,  the  average  RMSD
              for all pairwise combinations of structures in the ensemble.


       Maximum Likelihood <sigma>:
              The  ML  analog of the standard deviation for the superposition.
              When assuming that the correlations are zero (a diagonal covari-
              ance  matrix),  this is equal to the square root of the harmonic
              average of the variances for each atom. In contrast, the 'Least-
              squares  <sigma>'  given  above  reports  the square root of the
              arithmetic average of the variances.  The  harmonic  average  is
              always  less than the arithmetic average, and the harmonic aver-
              age downweights large values proportional  to  their  magnitude.
              This  makes  sense  statistically, because when combining values
              one should weight them  by  the  reciprocal  of  their  variance
              (which is in fact what the ML superpositioning method does).


       Log Likelihood:
              The  final  log  likelihood  of  the superposition, assuming the
              matrix Gaussian distribution of the structures and  the  hierar-
              chical  inverse  gamma  distribution  of  the eigenvalues of the
              covariance matrix.


       AIC:   The Akaike Information Criterion for  the  final  superposition.
              This  is an important statistic in likelihood analysis and model
              selection theory. It allows an objective comparison of  multiple
              theoretical models with different numbers of parameters. In this
              case, the higher the number the  better.  There  is  a  tradeoff
              between  fit to the data and the number of parameters being fit.
              Increasing the number of parameters in a model will always  give
              a  better fit to the data, but it also increases the uncertainty
              of the estimated values.  The AIC criterion finds the best  com-
              bination  by  (1) maximizing the fit to the data while (2) mini-
              mizing the uncertainty due to the number of parameters.  In  the
              superposition case, one can compare the least squares superposi-
              tion to the maximum likelihood  superposition.  The  method  (or
              model) with the higher AIC is preferred. A difference in the AIC
              of 2 or more is considered strong statistical evidence  for  the
              better model.


       BIC:   The Bayesian Information Criterion. Similar to the AIC, but with
              a Bayesian emphasis.


       Rotational, translational, covar chi^2:
              The reduced chi-squared statistic for the fit of the  structures
              to  the model.  With a good fit it should be close to 1.0, which
              indicates a perfect fit of the data to  the  statistical  model.
              In  the  case  of  least-squares,  the assumed model is a matrix
              Gaussian distribution of the structures with equal variances and
              no correlations.  For the ML fits, the assumed models can either
              be (1) unequal variances and no correlations, as calculated with
              the  -v  option  [default] or (2) unequal variances and correla-
              tions, as calculated with the -c option.  This statistic is  for
              the  superposition  only,  and  does  not include the fit of the
              covariance matrix eigenvalues to an inverse gamma  distribution.
              See 'Omnibus chi^2' below.


       Hierarchical minimum var:
              The  hierarchical  fit  of  the  inverse gamma distribution con-
              strains the variances of the atoms by making large ones  smaller
              and  small ones larger.  This statistic reports the minimum pos-
              sible variance given the inferred inverse gamma parameters.


       Hierarchical var (alpha, gamma) chi^2:
              The reduced chi-squared for the inverse gamma fit of the covari-
              ance  matrix  eigenvalues. As before, it should ideally be close
              to 1.0.  The two values in the parentheses are the ML  estimates
              of the scale and shape parameters, respectively, for the inverse
              gamma distribtuion.


       Omnibus chi^2:
              The overall reduced chi-squared statistic for  the  entire  fit,
              including  the  rotations,  translations,  covariances,  and the
              inverse gamma parameters. This is probably  the  most  important
              statistic  for  the  superposition.  In  some cases, the inverse
              gamma fit may be poor, yet the overall fit is still  very  good.
              Again, it should ideally be close to 1.0, which would indicate a
              perfect fit. However, if you think it is too large, make sure to
              compare it to the chi^2 for the least-squares fit; it's probably
              not that bad after all.  A large chi^2 often indicates a  viola-
              tion of the assumptions of the model.  The most common violation
              is when superpositioning two or more  independent  domains  that
              can  rotate  relative  to  each other. If this is the case, then
              there will likely be not just  one  Gaussian  distribution,  but
              several mixed Gaussians, one for each domain.  Then, it would be
              better to superposition each domain independently.


       skewness, skewness Z-value, kurtosis & kurtosis Z-value:
              The skewness and kurtosis of the residuals. Both should  be  0.0
              if  the  residuals  fit a Gaussian distribution perfectly.  They
              are followed by the P-value for the statistics. This is  a  very
              stringent  test;  residuals can be very non-Gaussian and yet the
              estimated rotations, translations,  and  covariance  matrix  may
              still be rather accurate.


       FP error in transformed coordinates:
              The  empirically  determined floating point error in the coordi-
              nates after rotation and translation.


       Minimum RMSD error per atom:
              The empirically determined minimum RMSD error per atom, based on
              the floating point error of the computer.


       Data pts, Free params, D/P:
              The  total  number of data points given all observed structures,
              the number of parameters being fit in the model, and  the  data-
              to-parameter ratio.


       Median structure:
              The structure that is overall most similar to the average struc-
              ture. This can be considered to be the most "typical"  structure
              in the ensemble.


       Total rounds:
              The number of iterations that the algorithm took to converge.


       Fractional precision:
              The actual precision that the algorithm converged to.


OUTPUT FILES
       Theseus writes out the following files:


       theseus_sup.pdb
              The  final  superposition,  rotated to the principle axes of the
              mean structure.


       theseus_ave.pdb
              The estimate of the mean structure.


       theseus_cor.mat, theseus_cov.mat
              The atomic correlation matrix and covariance matrices, based  on
              the  final  superposition.  The  format is suitable for input to
              GNU's octave.  These are the matrices used in the Principal Com-
              ponents Analysis.


       theseus_embed_ave.pdb
              The  average structure as calculated by S. Lele's EDMA embedding
              algorithm, used as the starting point for the maximum likelihood
              iterations.


       theseus_residuals.txt
              The normalized residuals of the superposition. These can be ana-
              lyzed for deviations from normality (whether they fit a standard
              Gaussian  distribution). E.g., the chi^2, skewness, and kurtosis
              statistics are based on these values.


       theseus_transf.txt
              The final transformation rotation matrices and translation  vec-
              tors.


       theseus_variances.txt
              The vector of estimated variances for each atom.


       When Principal Components are calculated (with the -P option), the fol-
       lowing files are also produced:


       theseus_pcvecs.txt
              The principal component vectors.


       theseus_pcstats.txt
              Simple statistics for each principle component (loadings,  vari-
              ance explained, etc.).


       theseus_pcN_ave.pdb
              The  average  structure with the Nth principal component written
              in the temperature factor field.


       theseus_pcN.pdb
              The final superposition with the Nth principal component written
              in  the  temperature  factor  field.   This file is omitted when
              superpositioning  molecules  with  different  residue  sequences
              (mode 2).



BUGS
       Please send me (DLT) reports of all problems.


RESTRICTIONS
       Theseus  is  not  a  structural alignment program.  The structure-based
       alignment problem is completely different from the structural  superpo-
       sition  problem.  In order to do a structural superposition, there must
       be a 1-to-1 mapping that associates the atoms in one structure with the
       atoms  in  the other structures.  In the simplest case, this means that
       structures must have equivalent numbers of atoms, such as the models in
       an   NMR   PDB   file.    For  structures  with  different  numbers  of
       residues/atoms, superpositioning is only possible  when  the  sequences
       have  been  aligned  previously.   Finding  the best sequence alignment
       based on only structural information is a difficult  problem,  and  one
       for which there is currently no maximum likelihood approach.  Extending
       theseus to address the  structural  alignment  problem  is  an  ongoing
       research project.


AUTHOR
       Douglas L. Theobald
       dtheobald@brandeis.edu
       dtheobald@gmail.com


CITATION
       When using theseus in publications please cite the following:


       Douglas L. Theobald and Deborah S. Wuttke (2006)
       "Empirical  Bayes models for regularizing maximum likelihood estimation
       in the matrix Gaussian Procrustes problem."
       PNAS 103(49):18521-18527


       Douglas L. Theobald and Deborah S. Wuttke (2006)
       "THESEUS: Maximum likelihood superpositioning and  analysis  of  macro-
       molecular structures."
       Bioinformatics 22(17):2171-2172


       Douglas L. Theobald and Deborah S. Wuttke (2008)
       "Accurate  structural  correlations  from maximum likelihood superposi-
       tions."
       PLoS Computational Biology 4(2):e43 in press


HISTORY
       Long, tedious, and sordid.



Brandeis University            17 November 2008                     THESEUS(1)