I would like to share a combination of techniques that produces an aligned+sorted+indexed bam in ~34m for $1.89.



BWA-MEM2meme

Here is the entire bwa-meme command which daylily executes on a 128 vCPU instance:

    JEMALLOC_CONF='oversize_threshold:1024,background_thread:true,huge:true,huge_page:1G,arena_max:4' \
    LD_PRELOAD=resources/lib/libjemalloc.so.1 \
    resources/numa/numactl \
      --cpunodebind=0,1 \
      --membind=0,1 \
      bwa-meme mem \
        -R '@RG\tID:RIH0A0\tSM:RIH0A022\tLB:NoAmpWGS\tPL:ILMN\tPU:CombinedLanes\tCN:CenterName\tPG:bwamem2meme' \
        -Y \
        -K 1204800000 \
        -k 22 \
        -t 128 \
        -Z bwa-mem2b/human_1kg_v37.fasta.gz \
        <( unpigz -c -q -- R1.fastq.gz ) \
        <( unpigz -c -q -- R2.fastq.gz ) \
    | mbuffer -m 20G \
    | samtools sort -l 0 -m 23G \
        -@ 8 -T $tdir -O SAM - \
    | samtools view \
        -b \
        -1 \
        -@ 120 \
        -O BAM \
        --write-index \
        -o sort.bam##idx##bam.bai - >> bwa2meme.sort.log;

Under ideal EC2 weather, a 30x genome can be aligned+sorted+indexed against b37 in ~34m.

BWA-MEM2meme is a fork of and improvement on BWA-MEM2, which has been around for >3 years at this point. The code is available on github, and there is a companion publication. mem2 offers:

  • a ~2.7x speed improvement over mem.
  • several public investigations of its performance; Heng Li even offered input to one such investigation that demonstrated its promise.
  • the same output as bwa, given the same data and arguments.

In the years since mem2 became available, there have been a surprising number of improvements on the original mem2, such as: [MEM2-ert](https://github.com/bwa-mem2/bwa-mem2/tree/ert), [MEM2-lisa](https://github.com/bwa-mem2/bwa-mem2/tree/bwa-mem2-lisa), MEM2-meme, and a handful of others.

Daylily Adopted BWA-MEM2meme As A Mapper

I chose meme for the common reasons one chooses informatics tools:

  • I was able to compile it.
  • The tool was backed by a publication.
  • It performed as promised: roughly a 30% improvement on top of mem2.

Optimizing bwa mem2meme – There Is A Lot Going On In This Command

I have used several approaches to squeeze more performance out of meme:

  • Run on high-memory 128 vCPU instances, vetted to benefit from the optimizations below.
  • Re-compiled from source using the Intel compiler, with compiler flags (notably the AVX-512 instruction sets) tuned for the families of Intel hardware I planned to run on.
    • Here is where some might wrinkle their nose: “you should not need to be concerned about what hardware you’re running on; it makes the system fragile” is one critique. I argue that at some level all tools are compiled to run on limited hardware, and trading some in-the-moment annoyance for substantial runtime gains seems a no-brainer to me.
  • Compiled an optimized jemalloc, which I use to override the system memory allocator via LD_PRELOAD, passing it additional tuning arguments: oversize_threshold:1024,background_thread:true,huge:true,huge_page:1G,arena_max:4 (a preflight sketch follows this list).
  • numactl is also used to pin the process to specific NUMA nodes and further streamline memory usage.
  • Avoid hitting disk at all costs by piping between processes (downside: annoying to debug… see my earlier comment on hardware-specific binaries, grin, and the debugging sketch after this list), and benefit from in-line sorting. Further, no sort I/O means no need to compress while sorting.
  • While writing out the sorted bam, use the funky (and still undocumented?) samtools feature that writes the bam and bai in parallel (illustrated after this list). Piped sorting while bwa-meme holds a large index in-memory means you often do need substantial memory.
    • mbuffer buffers the input from bwa-meme to samtools sort, which reduces runtime by ~4m.
  • Use process substitution to wrap the R1 and R2 fastq files with unpigz, feeding bwa-meme just a little faster than it would decompress on its own.
  • The -Y flag is important to set (it’s unset by default): it soft-clips rather than hard-clips supplementary alignments, so the full read sequence is retained in the bam. If this flag is not set, it is impossible to reconstitute the original fastq reads from the bam; with it set, you can toss the fq’s and use the bam as your read source if needed (see the reconstruction sketch after this list).
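
To make a few of these concrete, here are some short sketches; all file names and paths are illustrative assumptions, not part of the pipeline above. First, a preflight check of the NUMA layout and the allocator override (note: stock jemalloc reads the MALLOC_CONF environment variable; adjust if your build uses a prefixed variable):

    # Confirm the NUMA topology; on a dual-socket box expect nodes 0 and 1:
    numactl --hardware

    # Confirm jemalloc is actually being preloaded: with stock jemalloc,
    # stats_print:true dumps allocator stats to stderr when the process exits.
    LD_PRELOAD=resources/lib/libjemalloc.so.1 \
      MALLOC_CONF=stats_print:true /bin/true 2>&1 | head -5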
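
Second, since the all-pipes approach is admittedly annoying to debug, here is a generic bash sketch for pinpointing which stage of a pipeline failed (cmd_a/cmd_b/cmd_c are placeholders):

    # With pipefail set, the pipeline's exit status reflects the first failing stage:
    set -o pipefail
    cmd_a | cmd_b | cmd_c
    # PIPESTATUS holds the per-stage exit codes of the most recent pipeline,
    # e.g. "0 141 0" points at the middle stage:
    echo "${PIPESTATUS[@]}"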
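
Third, a standalone illustration of the simultaneous bam+bai write; the FILE##idx##INDEXFILE form of -o, combined with --write-index, is what makes samtools emit both files in one pass (file names are placeholders):

    # Write the bam and its index in parallel, no second indexing pass needed:
    samtools view -b -@ 8 --write-index \
        -o out.bam##idx##out.bam.bai \
        in.sam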
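
Finally, a sketch of what “use the bam as your read source” can look like for a -Y bam, assuming paired-end reads (output names are placeholders):

    # Re-group read pairs, then emit R1/R2 fastqs from the aligned bam:
    samtools collate -u -O sort.bam \
      | samtools fastq -n -1 R1.reconstituted.fastq.gz -2 R2.reconstituted.fastq.gz \
          -0 /dev/null -s /dev/null -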

Runtime Profile

For kicks, here are the usual bwa stats.

When actively processing reads, meme at its fastest handles 1,324,504 reads in 2.184 real seconds (~606K reads/sec). The mbuffer status line interleaves with the bwa log here, but the relevant log line is:

    [0000][ M::mem_process_seqs] Processed 1324504 reads in 256.213 CPU sec, 2.184 real sec


Runtime profile:

        Time taken for main_mem function: 1685.58 sec

        IO times (sec) :
        Reading IO time (reads) avg: 1589.47, (1589.47, 1589.47)
        Writing IO time (SAM) avg: 246.40, (246.40, 246.40)
        Reading IO time (Reference Genome) avg: 1.67, (1.67, 1.67)
        Index read time avg: 0.29, (0.29, 0.29)

        Overall time (sec) (Excluding Index reading time):
        PROCESS() (Total compute time + (read + SAM) IO time) : 1683.61
        MEM_PROCESS_SEQ() (Total compute time (Kernel + SAM)), avg: 1189.57, (1189.57, 1189.57)

         SAM Processing time (sec):
        --WORKER_SAM avg: 452.85, (452.85, 452.85)

        Kernels' compute time (sec):
        Total kernel (smem+sal+bsw) time avg: 692.08, (692.08, 692.08)
                MEM_ALN_CHAIN_FLT avg: 0.00, (0.00, 0.00)
                MEM_ALN_CHAIN_SEED avg: 0.00, (0.00, 0.00)
                ERT_SEED_CHAIN avg: 0.00, (0.00, 0.00)
        LEARNED_SEED_CHAIN avg: 214.09, (217.10, 196.36)
                SMEM compute avg: 0.00, (0.00, 0.00)
                SAL compute avg: 0.00, (0.00, 0.00)
                                MEM_SA avg: 0.00, (0.00, 0.00)

                BSW time, avg: 442.25, (445.32, 426.84)

Important parameter settings: 
        BATCH_SIZE: 512
        MAX_SEQ_LEN_REF: 256
        MAX_SEQ_LEN_QER: 128
        MAX_SEQ_LEN8: 128
        SEEDS_PER_READ: 500
        SIMD_WIDTH8 X: 64
        SIMD_WIDTH16 X: 32
        AVG_SEEDS_PER_READ: 64


EC2 Costs: $1.89/genome

I let ParallelCluster manage sorting out which instance type in the 128 vCPU pool I defined is the cheapest or has the most stable capacity. Recently, I have been running on the i4i.metal EC2 instance type. Its on-demand price is ~$11/hr, but the spot price is $3.29/hr! … which means at ~34.5m per genome, it works out to $3.29/hr × 34.5/60 hr ≈ $1.89 :-P

[figure: ec2_i4i_price_history - spot price history for i4i.metal]

I added each optimization over time, which represents a fair bit of time spent with bwa-meme.

fin

I hope this was interesting &| helpful!

–John Major