bwa(1)                       Bioinformatics tools                       bwa(1)

       bwa - Burrows-Wheeler Alignment Tool

       bwa index ref.fa

       bwa mem ref.fa reads.fq > aln-se.sam

       bwa mem ref.fa read1.fq read2.fq > aln-pe.sam

       bwa aln ref.fa short_read.fq > aln_sa.sai

       bwa samse ref.fa aln_sa.sai short_read.fq > aln-se.sam

       bwa sampe ref.fa aln_sa1.sai aln_sa2.sai read1.fq read2.fq > aln-pe.sam

       bwa bwasw ref.fa long_read.fq > aln.sam

       BWA is a software package for mapping low-divergent sequences against a
       large  reference genome, such as the human genome. It consists of three
       algorithms: BWA-backtrack, BWA-SW and BWA-MEM. The first  algorithm  is
       designed  for  Illumina  sequence reads up to 100bp, while the rest two
       for longer sequences ranged from 70bp to 1Mbp. BWA-MEM and BWA-SW share
       similar  features  such  as  long-read support and split alignment, but
       BWA-MEM, which is the latest, is generally recommended for high-quality
       queries  as  it  is  faster and more accurate.  BWA-MEM also has better
       performance than BWA-backtrack for 70-100bp Illumina reads.

       For all the algorithms, BWA first needs to construct the  FM-index  for
       the  reference  genome  (the  index  command). Alignment algorithms are
       invoked with different sub-commands: aln/samse/sampe for BWA-backtrack,
       bwasw for BWA-SW and mem for the BWA-MEM algorithm.

       index  bwa index [-p prefix] [-a algoType] db.fa

              Index database sequences in the FASTA format.


              -p STR    Prefix of the output database [same as db filename]

              -a STR    Algorithm  for  constructing BWT index. BWA implements
                        two algorithms for BWT  construction:  is  and  bwtsw.
                        The  first  algorithm  is  a  little  faster for small
                        database but requires large RAM and does not work  for
                        databases  with total length longer than 2GB. The sec-
                        ond algorithm is adapted from the BWT-SW source  code.
                        It  in  theory  works  with database with trillions of
                        bases. When this option is not specified,  the  appro-
                        priate algorithm will be chosen automatically.

       mem    bwa  mem  [-aCHMpP] [-t nThreads] [-k minSeedLen] [-w bandWidth]
              [-d zDropoff] [-r seedSplitRatio] [-c  maxOcc]  [-A  matchScore]
              [-B  mmPenalty]  [-O gapOpenPen] [-E gapExtPen] [-L clipPen] [-U
              unpairPen] [-R  RGline]  [-v  verboseLevel]  db.prefix  reads.fq

              Align  70bp-1Mbp  query  sequences  with  the BWA-MEM algorithm.
              Briefly, the algorithm works by seeding alignments with  maximal
              exact  matches  (MEMs) and then extending seeds with the affine-
              gap Smith-Waterman algorithm (SW).

              If mates.fq file is absent and option -p is not set,  this  com-
              mand regards input reads are single-end. If mates.fq is present,
              this command assumes the i-th read in reads.fq and the i-th read
              in  mates.fq  constitute a read pair. If -p is used, the command
              assumes the 2i-th and the (2i+1)-th read in reads.fq  constitute
              a read pair (such input file is said to be interleaved). In this
              case, mates.fq is ignored. In the paired-end mode, the mem  com-
              mand will infer the read orientation and the insert size distri-
              bution from a batch of reads.

              The BWA-MEM algorithm performs local alignment. It  may  produce
              multiple  primary  alignments  for  different  part  of  a query
              sequence. This is a crucial feature for long sequences. However,
              some  tools  such  as Picard's markDuplicates does not work with
              split alignments. One may consider to  use  option  -M  to  flag
              shorter split hits as secondary.


              -t INT    Number of threads [1]

              -k INT    Minimum  seed length. Matches shorter than INT will be
                        missed. The alignment speed is usually insensitive  to
                        this  value  unless it significantly deviates from 20.

              -w INT    Band width. Essentially, gaps longer than INT will not
                        be  found.  Note  that  the maximum gap length is also
                        affected by the scoring matrix and the hit length, not
                        solely determined by this option. [100]

              -d INT    Off-diagonal  X-dropoff  (Z-dropoff).  Stop  extension
                        when the difference between the best and  the  current
                        extension  score  is  above |i-j|*A+INT, where i and j
                        are the current positions of the query and  reference,
                        respectively,  and  A is the matching score. Z-dropoff
                        is similar to BLAST's X-dropoff except that it doesn't
                        penalize  gaps  in  one of the sequences in the align-
                        ment. Z-dropoff not only avoids unnecessary extension,
                        but  also  reduces  poor alignments inside a long good
                        alignment. [100]

              -r FLOAT  Trigger  re-seeding  for  a  MEM  longer   than   min-
                        SeedLen*FLOAT.   This is a key heuristic parameter for
                        tuning the  performance.  Larger  value  yields  fewer
                        seeds, which leads to faster alignment speed but lower
                        accuracy. [1.5]

              -c INT    Discard a MEM if it has more than INT occurence in the
                        genome. This is an insensitive parameter. [10000]

              -P        In  the  paired-end mode, perform SW to rescue missing
                        hits only but do not try  to  find  hits  that  fit  a
                        proper pair.

              -A INT    Matching score. [1]

              -B INT    Mismatch  penalty. The sequence error rate is approxi-
                        mately: {.75 * exp[-log(4) * B/A]}. [4]

              -O INT    Gap open penalty. [6]

              -E INT    Gap extension penalty. A gap of length k costs O + k*E
                        (i.e.  -O is for opening a zero-length gap). [1]

              -L INT[,INT]
                        Clipping  penalty.  When performing SW extension, BWA-
                        MEM keeps track of the best score reaching the end  of
                        query.  If this score is larger than the best SW score
                        minus the  clipping  penalty,  clipping  will  not  be
                        applied.  Note  that  in  this  case,  the  SAM AS tag
                        reports the best SW score;  clipping  penalty  is  not
                        deduced. If two numbers are provided, the first is for
                        5'-end clipping and second for 3'-end clipping. [5]

              -U INT    Penalty for an unpaired read pair. BWA-MEM  scores  an
                        unpaired  read  pair  as scoreRead1+scoreRead2-INT and
                        scores  a  paired   as   scoreRead1+scoreRead2-insert-
                        Penalty.  It  compares  these  two scores to determine
                        whether we should force pairing. A larger value  leads
                        to more aggressive read pair. [17]

              -p        Assume  the  first  input  query  file  is interleaved
                        paired-end FASTA/Q. See the  command  description  for

              -R STR    Complete  read  group header line. '\t' can be used in
                        STR and will be converted to a TAB in the output  SAM.
                        The  read  group  ID will be attached to every read in
                        the  output.  An  example  is   '@RG\tID:foo\tSM:bar'.

              -T INT    Don't  output  alignment  with  score  lower than INT.
                        This option affects output and occasionally  SAM  flag
                        2. [30]

              -a        Output all found alignments for single-end or unpaired
                        paired-end reads. These alignments will be flagged  as
                        secondary alignments.

              -C        Append  append  FASTA/Q  comment  to  SAM output. This
                        option can be used to transfer read  meta  information
                        (e.g.  barcode)  to  the  SAM  output.  Note  that the
                        FASTA/Q comment (the  string  after  a  space  in  the
                        header   line)   must   conform  the  SAM  spec  (e.g.
                        BC:Z:CGTAC). Malformated comments  lead  to  incorrect
                        SAM output.

              -M        Mark  shorter split hits as secondary (for Picard com-

              -v INT    Control the verbose level of the output.  This  option
                        has  not been fully supported throughout BWA. Ideally,
                        a value 0 for disabling all the output  to  stderr;  1
                        for outputting errors only; 2 for warnings and errors;
                        3 for all normal messages; 4 or higher for  debugging.
                        When this option takes value 4, the output is not SAM.

       aln    bwa aln [-n maxDiff] [-o maxGapO] [-e maxGapE] [-d nDelTail] [-i
              nIndelEnd]  [-k maxSeedDiff] [-l seedLen] [-t nThrds] [-cRN] [-M
              misMsc] [-O gapOsc]  [-E  gapEsc]  [-q  trimQual]  

              Find  the SA coordinates of the input reads. Maximum maxSeedDiff
              differences are allowed in the  first  seedLen  subsequence  and
              maximum maxDiff differences are allowed in the whole sequence.


              -n NUM    Maximum  edit  distance  if  the  value is INT, or the
                        fraction of missing alignments given 2%  uniform  base
                        error  rate  if FLOAT. In the latter case, the maximum
                        edit distance is automatically  chosen  for  different
                        read lengths. [0.04]

              -o INT    Maximum number of gap opens [1]

              -e INT    Maximum  number of gap extensions, -1 for k-difference
                        mode (disallowing long gaps) [-1]

              -d INT    Disallow a long deletion within  INT  bp  towards  the
                        3'-end [16]

              -i INT    Disallow an indel within INT bp towards the ends [5]

              -l INT    Take  the  first  INT  subsequence  as seed. If INT is
                        larger than the query sequence, seeding will  be  dis-
                        abled. For long reads, this option is typically ranged
                        from 25 to 35 for '-k 2'. [inf]

              -k INT    Maximum edit distance in the seed [2]

              -t INT    Number of threads (multi-threading mode) [1]

              -M INT    Mismatch penalty. BWA will not search  for  suboptimal
                        hits with a score lower than (bestScore-misMsc). [3]

              -O INT    Gap open penalty [11]

              -E INT    Gap extension penalty [4]

              -R INT    Proceed  with  suboptimal  alignments  if there are no
                        more than INT equally  best  hits.  This  option  only
                        affects  paired-end mapping. Increasing this threshold
                        helps to improve the pairing accuracy at the  cost  of
                        speed, especially for short reads (~32bp).

              -c        Reverse query but not complement it, which is required
                        for alignment in  the  color  space.  (Disabled  since

              -N        Disable  iterative  search. All hits with no more than
                        maxDiff differences will be found. This mode  is  much
                        slower than the default.

              -q INT    Parameter  for read trimming. BWA trims a read down to
                        argmax_x{\sum_{i=x+1}^l(INT-q_i)} if q_l 1.sai
                            bwa aln ref.fa -b2 reads.bam > 2.sai
                            bwa sampe ref.fa 1.sai 2.sai reads.bam reads.bam >

              -0        When -b is specified, only  use  single-end  reads  in

              -1        When  -b  is  specified,  only use the first read in a
                        read pair in mapping (skip single-end  reads  and  the
                        second reads).

              -2        When  -b  is  specified, only use the second read in a
                        read pair in mapping.

       samse  bwa samse [-n maxOcc]    > 

              Generate  alignments  in  the SAM format given single-end reads.
              Repetitive hits will be randomly chosen.


              -n INT    Maximum number of alignments to output in the  XA  tag
                        for reads paired properly. If a read has more than INT
                        hits, the XA tag will not be written. [3]

              -r STR    Specify   the   read   group   in   a   format    like
                        '@RG\tID:foo\tSM:bar'. [null]

       sampe  bwa sampe [-a maxInsSize] [-o maxOcc] [-n maxHitPaired] [-N max-
              HitDis] [-P]     

              Generate  alignments  in  the SAM format given paired-end reads.
              Repetitive read pairs will be placed randomly.


              -a INT  Maximum insert size for a read  pair  to  be  considered
                      being  mapped properly. Since 0.4.5, this option is only
                      used when there are not enough good alignment  to  infer
                      the distribution of insert sizes. [500]

              -o INT  Maximum  occurrences  of a read for pairing. A read with
                      more occurrneces will be treated as a  single-end  read.
                      Reducing this parameter helps faster pairing. [100000]

              -P      Load  the  entire  FM-index  into  memory to reduce disk
                      operations (base-space reads only). With this option, at
                      least 1.25N bytes of memory are required, where N is the
                      length of the genome.

              -n INT  Maximum number of alignments to output in the XA tag for
                      reads paired properly. If a read has more than INT hits,
                      the XA tag will not be written. [3]

              -N INT  Maximum number of alignments to output in the XA tag for
                      disconcordant  read  pairs  (excluding singletons). If a
                      read has more than INT hits, the  XA  tag  will  not  be
                      written. [10]

              -r STR  Specify    the    read    group   in   a   format   like
                      '@RG\tID:foo\tSM:bar'. [null]

       bwasw  bwa  bwasw  [-a  matchScore]  [-b  mmPen]  [-q  gapOpenPen]  [-r
              gapExtPen]  [-t nThreads] [-w bandWidth] [-T thres] [-s hspIntv]
              [-z zBest] [-N nHspRev]  [-c  thresCoef]    

              Align  query  sequences  in  the  in.fq  file.  When  mate.fq is
              present, perform paired-end alignment. The paired-end mode  only
              works  for reads Illumina short-insert libraries. In the paired-
              end mode, BWA-SW may still output split alignments but they  are
              all  marked  as not properly paired; the mate positions will not
              be written if the mate has multiple local hits.


              -a INT    Score of a match [1]

              -b INT    Mismatch penalty [3]

              -q INT    Gap open penalty [5]

              -r INT    Gap extension penalty. The penalty  for  a  contiguous
                        gap of size k is q+k*r. [2]

              -t INT    Number of threads in the multi-threading mode [1]

              -w INT    Band width in the banded alignment [33]

              -T INT    Minimum score threshold divided by a [37]

              -c FLOAT  Coefficient  for  threshold  adjustment  according  to
                        query length. Given an l-long query, the threshold for
                        a hit to be retained is a*max{T,c*log(l)}. [5.5]

              -z INT    Z-best heuristics. Higher -z increases accuracy at the
                        cost of speed. [1]

              -s INT    Maximum SA interval size for initiating a seed. Higher
                        -s increases accuracy at the cost of speed. [3]

              -N INT    Minimum  number  of  seeds  supporting  the  resultant
                        alignment to skip reverse alignment. [5]

       The output of the 'aln' command is binary  and  designed  for  BWA  use
       only.  BWA  outputs  the  final  alignment  in the SAM (Sequence Align-
       ment/Map) format. Each line consists of:

       |Col | Field |                       Description                        |
       | 1  | QNAME | Query (pair) NAME                                        |
       | 2  | FLAG  | bitwise FLAG                                             |
       | 3  | RNAME | Reference sequence NAME                                  |
       | 4  | POS   | 1-based leftmost POSition/coordinate of clipped sequence |
       | 5  | MAPQ  | MAPping Quality (Phred-scaled)                           |
       | 6  | CIAGR | extended CIGAR string                                    |
       | 7  | MRNM  | Mate Reference sequence NaMe ('=' if same as RNAME)      |
       | 8  | MPOS  | 1-based Mate POSistion                                   |
       | 9  | ISIZE | Inferred insert SIZE                                     |
       |10  | SEQ   | query SEQuence on the same strand as the reference       |
       |11  | QUAL  | query QUALity (ASCII-33 gives the Phred base quality)    |
       |12  | OPT   | variable OPTional fields in the format TAG:VTYPE:VALUE   |

       Each bit in the FLAG field is defined as:

               |Chr |  Flag  |              Description              |
               | p  | 0x0001 | the read is paired in sequencing      |
               | P  | 0x0002 | the read is mapped in a proper pair   |
               | u  | 0x0004 | the query sequence itself is unmapped |
               | U  | 0x0008 | the mate is unmapped                  |
               | r  | 0x0010 | strand of the query (1 for reverse)   |
               | R  | 0x0020 | strand of the mate                    |
               | 1  | 0x0040 | the read is the first read in a pair  |
               | 2  | 0x0080 | the read is the second read in a pair |
               | s  | 0x0100 | the alignment is not primary          |
               | f  | 0x0200 | QC failure                            |
               | d  | 0x0400 | optical or PCR duplicate              |

       The Please check  for the format spec-
       ification and the tools for post-processing the alignment.

       BWA generates the following optional fields. Tags starting with 'X' are
       specific to BWA.

            |Tag |                        Meaning                        |
            |NM  | Edit distance                                         |
            |MD  | Mismatching positions/bases                           |
            |AS  | Alignment score                                       |
            |BC  | Barcode sequence                                      |
            |SA  | Supplementary alignments                              |
            |X0  | Number of best hits                                   |
            |X1  | Number of suboptimal hits found by BWA                |
            |XN  | Number of ambiguous bases in the referenece           |
            |XM  | Number of mismatches in the alignment                 |
            |XO  | Number of gap opens                                   |
            |XG  | Number of gap extentions                              |
            |XT  | Type: Unique/Repeat/N/Mate-sw                         |
            |XA  | Alternative hits; format: /(chr,pos,CIGAR,NM;)*/      |
            |XS  | Suboptimal alignment score                            |
            |XF  | Support from forward/reverse alignment                |
            |XE  | Number of supporting seeds                            |
            |XP  | Alt primary hits; format: /(chr,pos,CIGAR,mapQ,NM;)+/ |

       Note that XO and XG are generated by BWT search while the CIGAR  string
       by  Smith-Waterman  alignment.  These two tags may be inconsistent with
       the CIGAR string. This is not a bug.

   Alignment Accuracy
       When seeding is disabled, BWA guarantees to find an alignment  contain-
       ing  maximum  maxDiff  differences including maxGapO gap opens which do
       not occur within nIndelEnd bp towards either end of the  query.  Longer
       gaps  may  be found if maxGapE is positive, but it is not guaranteed to
       find all hits. When seeding is enabled, BWA further requires  that  the
       first  seedLen  subsequence  contains  no more than maxSeedDiff differ-

       When gapped alignment is disabled, BWA is expected to generate the same
       alignment  as Eland version 1, the Illumina alignment program. However,
       as BWA change 'N' in the database sequence to random nucleotides,  hits
       to  these  random sequences will also be counted. As a consequence, BWA
       may mark a unique hit as a repeat, if the random sequences happen to be
       identical to the sequences which should be unqiue in the database.

       By  default,  if  the  best hit is not highly repetitive (controlled by
       -R), BWA also finds all hits contains one more mismatch; otherwise, BWA
       finds  all  equally  best  hits only. Base quality is NOT considered in
       evaluating hits. In the paired-end mode, BWA pairs all hits  it  found.
       It further performs Smith-Waterman alignment for unmapped reads to res-
       cue reads with a high erro rate, and for high-quality  anomalous  pairs
       to fix potential alignment errors.

   Estimating Insert Size Distribution
       BWA  estimates the insert size distribution per 256*1024 read pairs. It
       first collects pairs of reads with both ends mapped with  a  single-end
       quality  20 or higher and then calculates median (Q2), lower and higher
       quartile (Q1 and Q3). It estimates the mean and  the  variance  of  the
       insert  size  distribution  from  pairs  whose  insert sizes are within
       interval [Q1-2(Q3-Q1), Q3+2(Q3-Q1)]. The maximum distance x for a  pair
       considered  to be properly paired (SAM flag 0x2) is calculated by solv-
       ing equation Phi((x-mu)/sigma)=x/L*p0, where mu is the mean,  sigma  is
       the  standard error of the insert size distribution, L is the length of
       the genome, p0 is prior of anomalous pair and  Phi()  is  the  standard
       cumulative  distribution  function.  For  mapping Illumina short-insert
       reads to the human genome, x is about 6-7 sigma  away  from  the  mean.
       Quartiles,  mean,  variance and x will be printed to the standard error

   Memory Requirement
       With bwtsw algorithm, 5GB memory is required for indexing the  complete
       human  genome  sequences.  For short reads, the aln command uses ~3.2GB
       memory and the sampe command uses ~5.4GB.

       Indexing the human genome sequences takes 3 hours with bwtsw algorithm.
       Indexing  smaller  genomes  with  IS algorithms is faster, but requires
       more memory.

       The speed of alignment is largely determined by the error rate  of  the
       query  sequences  (r).  Firstly,  BWA runs much faster for near perfect
       hits than for hits with many differences, and it stops searching for  a
       hit with l+2 differences if a l-difference hit is found. This means BWA
       will be very slow if r is high because in this case BWA  has  to  visit
       hits  with  many  differences  and looking for these hits is expensive.
       Secondly, the alignment algorithm behind makes the speed  sensitive  to
       [k log(N)/m], where k is the maximum allowed differences, N the size of
       database and m the length of a query. In practice, we choose k w.r.t. r
       and therefore r is the leading factor. I would not recommend to use BWA
       on data with r>0.02.

       Pairing is slower for shorter reads. This  is  mainly  because  shorter
       reads  have more spurious hits and converting SA coordinates to chromo-
       somal coordinates are very costly.

       Since version 0.6, BWA has been able to work with  a  reference  genome
       longer  than 4GB.  This feature makes it possible to integrate the for-
       ward and reverse complemented genome in one FM-index, which  speeds  up
       both  BWA-short and BWA-SW. As a tradeoff, BWA uses more memory because
       it has to keep all positions and ranks in 64-bit integers, twice larger
       than 32-bit integers used in the previous versions.

       The latest BWA-SW also works for paired-end reads longer than 100bp. In
       comparison to BWA-short, BWA-SW tends to be more  accurate  for  highly
       unique  reads  and  more  robust to relative long INDELs and structural
       variants.  Nonetheless, BWA-short usually has higher power  to  distin-
       guish the optimal hit from many suboptimal hits. The choice of the map-
       ping algorithm may depend on the application.

       BWA   website   ,   Samtools    website

       Heng  Li  at  the Sanger Institute wrote the key source codes and inte-
       grated   the   following   codes   for    BWT    construction:    bwtsw
       ,  implemented by Chi-Kwong Wong at
       the       University       of       Hong       Kong       and        IS
         originally  proposed by Nong Ge
        at the  Sun  Yat-Sen  University  and
       implemented by Yuta Mori.

       The full BWA package is distributed under GPLv3 as it uses source codes
       from BWT-SW which is covered by GPL. Sorting, hash table,  BWT  and  IS
       libraries are distributed under the MIT license.

       If  you  use  the  BWA-backtrack  algorithm,  please cite the following

       Li H. and Durbin R. (2009) Fast and accurate short read alignment  with
       Burrows-Wheeler   transform.   Bioinformatics,  25,  1754-1760.  [PMID:

       If you use the BWA-SW algorithm, please cite:

       Li H. and Durbin R. (2010) Fast and accurate long-read  alignment  with
       Burrows-Wheeler   transform.   Bioinformatics,   26,   589-595.  [PMID:

       If you use BWA-MEM or the fastmap component of BWA, please cite:

       Li H. (2013) Aligning sequence reads, clone sequences and assembly con-
       tigs with BWA-MEM. arXiv:1303.3997v1 [q-bio.GN].

       It  is  likely  that  the BWA-MEM manuscript will not appear in a peer-
       reviewed journal.

       BWA is largely influenced by BWT-SW. It uses source codes  from  BWT-SW
       and  mimics its binary file formats; BWA-SW resembles BWT-SW in several
       ways. The initial idea about BWT-based alignment  also  came  from  the
       group  who  developed BWT-SW. At the same time, BWA is different enough
       from BWT-SW. The short-read alignment algorithm bears no similarity  to
       Smith-Waterman  algorithm any more. While BWA-SW learns from BWT-SW, it
       introduces heuristics that can hardly be applied to the original  algo-
       rithm.  In  all,  BWA does not guarantee to find all local hits as what
       BWT-SW is designed to do, but it is much faster  than  BWT-SW  on  both
       short and long query sequences.

       I  started to write the first piece of codes on 24 May 2008 and got the
       initial stable version on 02 June  2008.  During  this  period,  I  was
       acquainted  that  Professor  Tak-Wah  Lam,  the  first author of BWT-SW
       paper, was collaborating with Beijing Genomics Institute on SOAP2,  the
       successor  to  SOAP (Short Oligonucleotide Analysis Package). SOAP2 has
       come out in November 2008. According to the SourceForge download  page,
       the  third  BWT-based short read aligner, bowtie, was first released in
       August 2008. At the time of writing this manual, at  least  three  more
       BWT-based short-read aligners are being implemented.

       The  BWA-SW  algorithm  is  a new component of BWA. It was conceived in
       November 2008 and implemented ten months later.

       The BWA-MEM algorithm is based on an  algorithm  finding  super-maximal
       exact  matches (SMEMs), which was first published with the fermi assem-
       bler paper in 2012. I first implemented the basic SMEM algorithm in the
       fastmap command for an experiment and then extended the basic algorithm
       and added the extension part in Feburary 2013 to make BWA-MEM  a  fully
       featured mapper.

bwa-0.7.6                       31 January 2014                         bwa(1)