lastal
======

This program finds local alignments between query sequences, and
reference sequences that have been prepared using lastdb.  You can use
it like this::

  lastdb humanDb humanChromosome*.fasta
  lastal humanDb dna*.fasta > myalns.maf

The lastdb command reads files called ``humanChromosome*.fasta`` and
writes several files whose names begin with ``humanDb``.  The lastal
command reads files called ``dna*.fasta``, compares them to
``humanDb``, and writes alignments to a file called ``myalns.maf``.

You can also pipe query sequences into lastal, for example::

  zcat seqs.fasta.gz | lastal humanDb > myalns.maf

Steps in lastal
---------------

1) Find initial matches.  For each possible start position in the
   query: find the shortest match with length ≥ l that *either* occurs
   ≤ m times in the reference, *or* has length L.

2) Extend a gapless alignment from each initial match, and keep those
   with score ≥ d.

3) Define cores: find the longest run of identical matches in each
   gapless alignment.

4) Extend a gapped alignment from either side of each core, and keep
   those with score ≥ e.

5) Non-redundantize the alignments: if several alignments share an
   endpoint (same coordinates in both sequences), remove all but one
   highest-scoring one.

6) Estimate the ambiguity of each aligned column (OFF by default).

7) Redo the alignments to minimize column ambiguity, using either
   gamma-centroid or LAMA (OFF by default).

Options
-------

Cosmetic options
~~~~~~~~~~~~~~~~

  -h, --help
      Show all options and their default settings, and exit.

  -V, --version
      Show version information, and exit.

  -v  Be verbose: write messages about what lastal is doing.

  -f NAME
      Choose the output format.  The NAME is not case-sensitive.

      **MAF** format looks like this::

        a score=15 EG2=4.7e+04 E=2.6e-05
        s chr3 9 23 + 939557 TTTGGGAGTTGAAGTTTTCGCCC
        s seqA 2 21 +     25 TTTGGGAGTTGAAGGTT--GCCC

      Lines starting with "s" contain: the sequence name, the start
      coordinate of the alignment, the number of sequence letters
      spanned by the alignment, the strand, the sequence length, and
      the aligned letters.  The start coordinates are zero-based.  If
      the strand is "-", the start coordinate is in the reverse
      strand.

      The same alignment in **TAB** format looks like this::

        15 chr3 9 23 + 939557 seqA 2 21 + 25 17,2:0,4 EG2=4.7e+04 E=2.6e-05

      The "17,2:0,4" shows the sizes and offsets of gapless blocks in
      the alignment.  In this case, we have a block of size 17, then
      an offset of size 2 in the upper sequence and 0 in the lower
      sequence, then a block of size 4.

      The same alignment in **BlastTab** format looks like this::

        seqA chr3 86.96 23 1 1 3 23 10 32 2.6e-05 44.3

      The fields are: query name, reference name, percent identity,
      alignment length, mismatches, gap opens, query start, query end,
      reference start, reference end, E-value, bit score.  The start
      coordinates are one-based.  *Warning:* this is a lossy format,
      because it does not show gap positions.  *Warning:* the other
      LAST programs cannot read this format.  *Warning:* `"bit score"
      is not the same as "score" <last-evalues.html>`_.

      **BlastTab+** format is the same as BlastTab, with 2 extra
      columns at the end: length of query sequence and length of
      reference sequence.  More columns might be added in future.

      For backwards compatibility, a NAME of 0 means TAB and 1 means
      MAF.

E-value options
~~~~~~~~~~~~~~~

  -D LENGTH
      Report alignments that are expected by chance at most once per
      LENGTH query letters.  This option only affects the default
      value of -E, so if you specify -E then -D has no effect.

  -E THRESHOLD
      Maximum EG2 (`expected alignments per square giga
      <last-evalues.html>`_).  This option only affects the default
      value of -e, so if you specify -e then -E has no effect.

Score options
~~~~~~~~~~~~~

  -r SCORE
      Match score.

  -q COST
      Mismatch cost.

  -p NAME
      Specify a match/mismatch score matrix.  Options -r and -q will
      be ignored.  The built-in matrices are described in
      `<last-matrices.html>`_.

      Any other NAME is assumed to be a file name.  For an example of
      the format, see the matrix files in the data directory.  Any
      letters that aren't in the matrix will get the lowest score in
      the matrix when aligned to anything.  Asymmetric scores are
      allowed: query letters correspond to columns and reference
      letters correspond to rows.  Other options can be specified on
      lines starting with "#last", but command line options override
      them.

  -a COST
      Gap existence cost.

  -b COST
      Gap extension cost.  A gap of size k costs: a + (b × k).

  -A COST
      Insertion existence cost.  This refers to insertions in the
      query relative to the reference.  If -A is not used, the
      insertion existence cost will equal the deletion existence cost,
      which is set by -a.

  -B COST
      Insertion extension cost.

  -c COST
      This option allows use of "generalized affine gap costs" (SF
      Altschul 1998, Proteins 32(1):88-96).  Here, a "gap" may consist
      of unaligned regions of both sequences.  If these unaligned
      regions have sizes j and k, where j ≤ k, the cost is: a +
      b⋅(k-j) + c⋅j.  If c ≥ a + 2b (the default), it reduces to
      standard affine gaps.

  -F COST
      Align DNA queries to protein reference sequences, using the
      specified frameshift cost.  A value of 15 seems to be
      reasonable.  (As a special case, -F0 means DNA-versus-protein
      alignment without frameshifts, which is faster.)  The output
      looks like this::

        a score=108
        s prot 2  40 + 649 FLLQAVKLQDP-STPHQIVPSP-VSDLIATHTLCPRMKYQDD
        s dna  8 117 + 999 FFLQ-IKLWDP\STPH*IVSSP/PSDLISAHTLCPRMKSQDN

      The ``\`` indicates a forward shift by one nucleotide, and the
      ``/`` indicates a reverse shift by one nucleotide.  The ``*``
      indicates a stop codon.  The same alignment in tabular format
      looks like this::

        108 prot 2 40 + 649 dna 8 117 + 999 4,1:0,6,0:1,10,0:-1,19

      The "-1" indicates the reverse frameshift.

  -x DROP
      Maximum score drop for gapped alignments.  Gapped alignments are
      forbidden from having any internal region with score < -DROP.
      This serves two purposes: accuracy (avoid spurious internal
      regions in alignments) and speed (the smaller the faster).

  -y DROP
      Maximum score drop for gapless alignments.

  -z DROP
      Maximum score drop for final gapped alignments.  Setting z
      different from x causes lastal to extend gapless alignments
      twice: first with a maximum score drop of x, and then with a
      (presumably higher) maximum score drop of z.

  -d SCORE
      Minimum score for gapless alignments.

  -e SCORE
      Minimum alignment score.  (If you do gapless alignment with
      option -j1, then -d and -e mean the same thing.  If you set
      both, -e will prevail.)

Initial-match options
~~~~~~~~~~~~~~~~~~~~~

  -m MULTIPLICITY
      Maximum multiplicity for initial matches.  Each initial match is
      lengthened until it occurs at most this many times in the
      reference.

      If the reference was split into volumes by lastdb, then lastal
      uses one volume at a time.  The maximum multiplicity then
      applies to each volume, not the whole reference.  This is why
      voluming changes the results.

  -l LENGTH
      Minimum length for initial matches.  Length means the number of
      letters spanned by the match.

  -L LENGTH
      Maximum length for initial matches.

  -k STEP
      Look for initial matches starting only at every STEP-th position
      in each query (positions 0, STEP, 2×STEP, etc).  This makes
      lastal faster but less sensitive.

  -W SIZE
      Look for initial matches starting only at query positions that
      are "minimum" in any window of SIZE consecutive positions (see
      `<lastdb.html>`_).  By default, this parameter takes the same
      value as was used for lastdb -W.

Miscellaneous options
~~~~~~~~~~~~~~~~~~~~~

  -s STRAND
      Specify which query strand should be used: 0 means reverse only,
      1 means forward only, and 2 means both.

  -S NUMBER
      Specify which DNA strand the score matrix applies to.  This
      matters only for unusual matrices that lack strand symmetry
      (e.g. if the a:g score differs from the t:c score).  0 means
      that the matrix applies to the forward strand of the reference
      aligned to either strand of the query.  1 means that the matrix
      applies to either strand of the reference aligned to the forward
      strand of the query.

  -K LIMIT
      Omit any alignment whose query range lies in LIMIT or more other
      alignments with higher score (and on the same strand).  This is
      a useful way to get just the top few hits to each part of each
      query (P Berman et al. 2000, J Comput Biol 7:293-302).

  -C LIMIT
      Before extending gapped alignments, discard any gapless
      alignment whose query range lies in LIMIT or more others (for
      the same strand and volume) with higher score-per-length.  This
      can reduce run time and output size (MC Frith & R Kawaguchi
      2015, Genome Biol 16:106).

  -P THREADS
      Divide the work between this number of threads running in
      parallel.  0 means use as many threads as your computer claims
      it can handle simultaneously.  Single query sequences are not
      divided between threads, so you need multiple queries per batch
      for this option to take effect.

  -i BYTES
      Search queries in batches of at most this many bytes.  If a
      single sequence exceeds this amount, however, it is not split.
      You can use suffixes K, M, and G to specify KibiBytes,
      MebiBytes, and GibiBytes.  This option has no effect on the
      results (apart from their order).

      If the reference was split into volumes by lastdb, then each
      volume will be read into memory once per query batch.

  -M  Find minimum-difference alignments, which is faster but cruder.
      This treats all matches the same, and minimizes the number of
      differences (mismatches plus gaps).

      * Any substitution score matrix will be ignored.  The
        substitution scores are set by the match score (r) and the
        mismatch cost (q).
      * The gap cost parameters will be ignored.  The gap existence
        cost will be 0 and the gap extension cost will be q + r/2.
      * The match score (r) must be an even number.
      * Any sequence quality data (e.g. fastq) will be ignored.

  -T NUMBER
      Type of alignment: 0 means "local alignment" and 1 means
      "overlap alignment".  Local alignments can end anywhere in the
      middle or at the ends of the sequences.  Overlap alignments must
      extend to the left until they hit the end of a sequence (either
      query or reference), and to the right until they hit the end of
      a sequence.

      **Warning:** it's often a bad idea to use -T1.  This setting
      does not change the maximum score drop allowed inside
      alignments, so if an alignment cannot be extended to the end of
      a sequence without exceeding this drop, it will be discarded.

  -n COUNT
      Maximum number of gapless alignments per query position.  When
      lastal extends gapless alignments from initial matches that
      start at one query position, if it gets COUNT successful
      extensions, it skips any remaining initial matches starting at
      that position.

  -R DIGITS
      Specify lowercase-marking of repeats, by two digits (e.g. "-R 01"),
      with the following meanings.

      First digit:

      0. Convert the input sequences to uppercase while reading them.
      1. Keep any lowercase in the input sequences.

      Second digit:

      0. Do not check for simple repeats.
      1. Convert simple repeats (e.g. cacacacacacacacac) to lowercase.
      2. Convert simple repeats, within AT-rich DNA, to lowercase.

      Details: Tantan is applied separately to forward and reverse
      strands.  For DNA-versus-protein alignment (option -F), it is
      applied to the DNA after translation, at the protein level.

  -u NUMBER
      Specify treatment of lowercase letters when extending
      alignments:

      0. Mask them for neither gapless nor gapped extensions.
      1. Mask them for gapless but not gapped extensions.
      2. Mask them for gapless but not gapped extensions, and then
         discard alignments that lack any segment with score ≥ e when
         lowercase is masked.
      3. Mask them for gapless and gapped extensions.

      "Mask" means change their match/mismatch scores to min(unmasked
      score, 0).  This option does not affect treatment of lowercase
      for initial matches.

  -w DISTANCE
      This option is a kludge to avoid catastrophic time and memory
      usage when self-comparing a large sequence.  If the sequence
      contains a tandem repeat, we may get a gapless alignment that is
      slightly offset from the main self-alignment.  In that case, the
      gapped extension might "discover" the main self-alignment and
      extend over the entire length of the sequence.

      To avoid this problem, gapped alignments are not triggered from
      any gapless alignment that:

      * is contained, in both sequences, in the "core" of another
        alignment
      * has start coordinates offset by DISTANCE or less relative to
        this core

      Use -w0 to turn this off.

  -G FILE
      Use an alternative genetic code in the specified file.  For an
      example of the format, see vertebrateMito.gc in the examples
      directory.  By default, the standard genetic code is used.  This
      option has no effect unless DNA-versus-protein alignment is
      selected with option -F.

  -t TEMPERATURE
      Parameter for converting between scores and likelihood ratios.
      This affects the column ambiguity estimates.  A score is
      converted to a likelihood ratio by this formula: exp(score /
      TEMPERATURE).  The default value is 1/lambda, where lambda is
      the scale factor of the scoring matrix, which is calculated by
      the method of Yu and Altschul (YK Yu et al. 2003, PNAS
      100(26):15688-93).

  -g GAMMA
      This option affects gamma-centroid and LAMA alignment only.

      Gamma-centroid alignments minimize the ambiguity of paired
      letters.  In fact, this method aligns letters whose column error
      probability is less than GAMMA/(GAMMA+1).  When GAMMA is low, it
      aligns confidently-paired letters only, so there tend to be many
      unaligned letters.  When GAMMA is high, it aligns letters more
      liberally.

      LAMA (Local Alignment Metric Accuracy) alignments minimize the
      ambiguity of columns (both paired letters and gap columns).
      When GAMMA is low, this method produces shorter alignments with
      more-confident columns, and when GAMMA is high it produces
      longer alignments including less-confident columns.

      In summary: to get the most accurately paired letters, use
      gamma-centroid.  To get accurately placed gaps, use LAMA.

      Note that the reported alignment score is that of the ordinary
      gapped alignment before realigning with gamma-centroid or LAMA.

  -j NUMBER
      Output type: 0 means counts of initial matches (of all lengths);
      1 means gapless alignments; 2 means gapped alignments before
      non-redundantization; 3 means gapped alignments after
      non-redundantization; 4 means alignments with ambiguity
      estimates; 5 means gamma-centroid alignments; 6 means LAMA
      alignments; 7 means alignments with expected counts.

      If you use -j0, lastal will count the number of initial matches,
      per length, per query sequence.  Options -l and -L will set the
      minimum and maximum lengths, and -m will be ignored.  If you
      compare a large sequence to itself with -j0, it's wise to set
      option -L.

      If you use j>3, each alignment will get a "fullScore" (also
      known as "forward score" or "sum-of-paths score").  This is like
      the score, but it takes into account alternative alignments.

      If you use -j7, lastal will print an extra MAF line starting
      with "c" for each alignment.  The first 16 numbers on this line
      are the expected counts of matches and mismatches: first the
      count of reference As aligned to query As, then the count of
      reference As aligned to query Cs, and so on.  For proteins there
      will be 400 such numbers.  The final ten numbers are expected
      counts related to gaps.  They are:

      * The count of matches plus mismatches.  (This may exceed the
        total of the preceding numbers, if the sequences have non-ACGT
        letters.)
      * The count of deleted letters.
      * The count of inserted letters.
      * The count of delete opens (= count of delete closes).
      * The count of insert opens (= count of insert closes).
      * The count of adjacent pairs of insertions and deletions.

      The final four numbers are always zero, unless you use
      generalized affine gap costs.  They are:

      * The count of unaligned letter pairs.
      * The count of unaligned letter pair opens (= count of closes).
      * The count of adjacent pairs of deletions and unaligned letter pairs.
      * The count of adjacent pairs of insertions and unaligned letter pairs.

  -Q NUMBER
      This option allows lastal to use sequence quality scores, or
      PSSMs, for the queries.  0 means read queries in fasta format
      (without quality scores); 1 means fastq-sanger format; 2 means
      fastq-solexa format; 3 means fastq-illumina format; 4 means prb
      format; 5 means read PSSMs.  (*Warning*: Illumina data is not
      necessarily in fastq-illumina format; it is often in
      fastq-sanger format.)

      The fastq formats look like this::

        @mySequenceName
        TTTTTTTTGCCTCGGGCCTGAGTTCTTAGCCGCG
        +
        55555555*&5-/55*5//5(55,5#&$)$)*+$

      The "+" may optionally be followed by a name (ignored), and the
      sequence and quality codes are allowed to wrap onto more than
      one line.  For fastq-sanger, the quality scores are obtained by
      subtracting 33 from the ASCII values of the characters below the
      "+".  For fastq-solexa and fastq-illumina, they are obtained by
      subtracting 64.

      prb format stores four quality scores (A, C, G, T) per position,
      with one sequence per line, like this::

        -40  40 -40 -40      -12   1 -12  -3      -10  10 -40 -40

      Since prb does not store sequence names, lastal uses the line
      number (starting from 1) as the name.

      In fastq-sanger and fastq-illumina format, the quality scores
      are related to error probabilities like this: qScore =
      -10⋅log10[p].  In fastq-solexa and prb, however, qScore =
      -10⋅log10[p/(1-p)].  In lastal's MAF output, the quality scores
      are written on lines starting with "q".  For fastq, they are
      written with the same encoding as the input.  For prb, they are
      written in the fastq-solexa (ASCII-64) encoding.

      Finally, PSSM means "position-specific scoring matrix".  The
      format is::

        myLovelyPSSM
             A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
        1 M -2 -2 -3 -4 -2 -1 -3 -3 -2  1  2 -2  8 -1 -3 -2 -1 -2 -2  0
        2 S  0 -2  0  1  3 -1 -1 -1 -2 -3 -3 -1 -2 -3 -2  5  0 -4 -3 -2
        3 D -1 -2  0  7 -4 -1  1 -2 -2 -4 -4 -2 -4 -4 -2 -1 -2 -5 -4 -4

      The sequence appears in the second column, and columns 3 onwards
      contain the position-specific scores.  Any letters not specified
      by any column will get the lowest score in each row.  This
      format is a simplified version of PSI-BLAST's ASCII format: the
      non-simplified version is allowed too.

      *Warning*: lastal cannot directly calculate E-values for PSSMs.
      The E-values (and the default value of -y) are determined by the
      otherwise-unused match and mismatch scores (options -r -q and
      -p).  There is evidence these E-values will be accurate if the
      PSSM is "constructed to the same scale" as the match/mismatch
      scores (SF Altschul et al. 1997, NAR 25(17):3389-402).

Parallel processes and memory sharing
-------------------------------------

If you run several lastal commands (i.e. processes) at the same time
on the same computer, using the same set of reference files prepared
by lastdb, then they will share memory for the reference files.

Multiple volumes
----------------

If lastdb creates multiple volumes::

  lastdb hugeDb huge.fasta

You can either run lastal on the whole thing::

  lastal hugeDb queries.fasta > myalns.maf

Or on one volume at a time::

  lastal hugeDb0 queries.fasta > myalns0.maf
  lastal hugeDb1 queries.fasta > myalns1.maf
  lastal hugeDb2 queries.fasta > myalns2.maf

The former method reads the queries in large batches, and aligns each
batch to one volume at a time.  If you run several processes in
parallel, they will not necessarily use the same volume at the same
time.

Therefore, with parallel processes, you should either ensure you have
enough memory to hold several volumes simultaneously, or run lastal on
one volume at a time.  An efficient scheme is to use a different
computer for each volume.

lastal8
-------

lastal8 has identical usage to lastal, and is used with `lastdb8
<lastdb.html>`_.  lastal cannot read the output of lastdb8, and
lastal8 cannot read the output of lastdb.
