I would like to share a combination of techniques that produces an aligned+sorted+indexed BAM in ~34m for $1.89.
BWA-MEM2meme

Here is the entire `bwa-meme` command, which daylily executes on a 128 vCPU instance:
```bash
JEMALLOC_CONF='oversize_threshold:1024,background_thread:true,huge:true,huge_page:1G,arena_max:4' \
LD_PRELOAD=resources/lib/libjemalloc.so.1 \
resources/numa/numactl \
  --cpunodebind=0,1 \
  --membind=0,1 \
  bwa-meme mem \
  -R '@RG\tID:RIH0A0\tSM:RIH0A022\tLB:NoAmpWGS\tPL:ILMN\tPU:CombinedLanes\tCN:CenterName\tPG:bwamem2meme' \
  -Y \
  -K 1204800000 \
  -k 22 \
  -t 128 \
  -Z bwa-mem2b/human_1kg_v37.fasta.gz \
  <( unpigz -c -q -- R1.fastq.gz ) \
  <( unpigz -c -q -- R2.fastq.gz ) \
  | mbuffer -m 20G \
  | samtools sort -l 0 -m 23G \
      -@ 8 -T $tdir -O SAM - \
  | samtools view \
      -b \
      -1 \
      -@ 120 \
      -O BAM \
      --write-index \
      -o sort.bam##idx##bam.bai - >> bwa2meme.sort.log;
```
Under ideal EC2 weather, a 30x genome can be aligned, sorted, and indexed against b37 in ~34m (again: that time includes the sorting and indexing!).
BWA-MEM2meme is a fork of, and improvement on, BWA-MEM2, which has been around for > 3 years at this point. The code is available on GitHub, and there is a companion publication.

`mem2` offers:
- A ~2.7x speed improvement over `mem`.
- Several public investigations of its performance. Heng Li even offered input to one such investigation that demonstrated its promise.
- The same output as `bwa`, given the same data and arguments.
In the years since `mem2` has been available, there have been a surprising number of improvements on the original `mem2`, such as: [MEM2-ert](https://github.com/bwa-mem2/bwa-mem2/tree/ert), [MEM2-lisa](https://github.com/bwa-mem2/bwa-mem2/tree/bwa-mem2-lisa), MEM2-meme, and a handful of others.
Daylily Adopted BWA-MEM2meme As A Mapper

I chose `meme` for the common reasons one chooses informatics tools:
- I was able to compile it, the tool was backed by a publication, and it performed as promised: a roughly 30% improvement on top of `mem2`.
Optimizing bwa mem2meme – There Is A Lot Going On In This Command
I have used several approaches to squeeze more performance out of `meme`:
- Run on high-memory 128 vCPU instances, vetted to benefit from the optimizations below.
- Re-compiled from source using the Intel compiler, with compiler flags (notably the AVX-512 instruction sets) tuned for the expected families of Intel hardware I planned to run on (see the build sketch after this list).
  - Here is where some might wrinkle their nose: “you should not need to be concerned about what hardware you’re running on; it makes the system fragile” is one critique. I argue that at some level all tools are compiled to run on limited hardware, and trading some in-the-moment annoyance for substantial runtime gains seems a no-brainer to me.
- Compiled an optimized `jemalloc`, which I use to override the system memory allocator via `LD_PRELOAD`. Further, I pass some additional tuning arguments to jemalloc: `oversize_threshold:1024,background_thread:true,huge:true,huge_page:1G,arena_max:4` (a preload sanity check is sketched after this list).
- `numactl` is also used to further streamline memory usage (a quick topology check is sketched after this list).
- Avoid hitting disk at all costs by piping between processes (downside: annoying to debug… see my comment above on the topic of hardware-specific binaries, grin), and benefit from in-line sorting. Further, no sort IO means no need to compress while sorting.
- While writing out the sorted `bam`, use the funky (and still undocumented?) feature of samtools to write the `bam` and `bai` in parallel. Piped sorting while `bwa-meme` holds a large index in memory means you often do need substantial memory. `mbuffer` buffers the input from `bwa-meme` to `samtools sort`, which reduces runtime by ~4m.
- Use process substitution to wrap the `R1` and `R2` fastq files with `unpigz`, which feeds bwa-meme just a little faster than it would read on its own.
- The `-Y` flag is important to set (it’s unset by default). It uses soft clipping for supplementary alignments instead of hard clipping, so clipped segments of reads are not thrown away. If this flag is not set, it is impossible to reconstitute the original fastq reads from the alignments (read: with it set, you can toss the fq’s and use the `bam` as your read source if needed; see the recovery sketch after this list).
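For reference, here is a minimal sketch of the kind of hardware-targeted build I mean. It assumes the BWA-MEME fork keeps bwa-mem2’s Makefile conventions (the `arch=` knob and the `CXX=` override) and that the Intel oneAPI compiler (`icpx`) is installed; check the repo’s README for the exact targets your checkout supports.

```bash
# Hypothetical build sketch: clone BWA-MEME and compile with the Intel
# compiler, pinning the SIMD code path to AVX-512.
git clone --recursive https://github.com/kaist-ina/BWA-MEME.git
cd BWA-MEME
make clean
# CXX= and arch= follow bwa-mem2 Makefile conventions (an assumption here);
# icpx is the Intel oneAPI C++ compiler.
make -j"$(nproc)" CXX=icpx arch=avx512
```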
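A quick way to sanity-check that the preloaded jemalloc is actually in use: stock jemalloc reads the `MALLOC_CONF` environment variable (the `JEMALLOC_CONF` name in the command above suggests a custom-prefixed build, so adjust to match yours).

```bash
# stats_print:true makes jemalloc dump allocator stats when the process
# exits; if the version banner and stats land on stderr, the preload worked.
# /bin/true is used (not the shell builtin) so the library actually loads.
LD_PRELOAD=resources/lib/libjemalloc.so.1 \
MALLOC_CONF=stats_print:true \
/bin/true
```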
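Likewise, the `numactl` pinning only helps if the instance actually exposes the NUMA nodes you bind to, which is easy to confirm up front:

```bash
# Show the NUMA topology; on a two-node machine, expect a line like
# "available: 2 nodes (0-1)" plus per-node CPU and memory listings.
resources/numa/numactl --hardware
```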
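Finally, a minimal sketch of what “`-Y` lets you toss the fq’s” looks like in practice, using standard samtools (the output filenames here are illustrative):

```bash
# Group mates together, then emit paired fastqs from the BAM; samtools
# fastq skips secondary/supplementary records by default, so each read
# is emitted once, with its soft-clipped bases intact.
samtools collate -u -O sort.bam \
  | samtools fastq -n \
      -1 recovered_R1.fastq.gz \
      -2 recovered_R2.fastq.gz \
      -0 /dev/null -s /dev/null -
```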
Runtime Profile

For kicks, here are the usual `bwa` stats. When actively processing reads, `meme` at its fastest processes 1,324,504 reads in 2.184s:

```
in @ 0.0 KiB/s, out @ 0.0 KiB/s, 277 GiB total, buffer 0% full
[0000][ M::mem_process_seqs] Processed 1324504 reads in 256.213 CPU sec, 2.184 real sec
```
`bwa-meme` profile:

```
Runtime profile:

    Time taken for main_mem function: 1685.58 sec

    IO times (sec):
      Reading IO time (reads) avg: 1589.47, (1589.47, 1589.47)
      Writing IO time (SAM) avg: 246.40, (246.40, 246.40)
      Reading IO time (Reference Genome) avg: 1.67, (1.67, 1.67)
      Index read time avg: 0.29, (0.29, 0.29)

    Overall time (sec) (Excluding Index reading time):
      PROCESS() (Total compute time + (read + SAM) IO time) : 1683.61
      MEM_PROCESS_SEQ() (Total compute time (Kernel + SAM)), avg: 1189.57, (1189.57, 1189.57)

    SAM Processing time (sec):
      --WORKER_SAM avg: 452.85, (452.85, 452.85)

    Kernels' compute time (sec):
      Total kernel (smem+sal+bsw) time avg: 692.08, (692.08, 692.08)
      MEM_ALN_CHAIN_FLT avg: 0.00, (0.00, 0.00)
      MEM_ALN_CHAIN_SEED avg: 0.00, (0.00, 0.00)
      ERT_SEED_CHAIN avg: 0.00, (0.00, 0.00)
      LEARNED_SEED_CHAIN avg: 214.09, (217.10, 196.36)
      SMEM compute avg: 0.00, (0.00, 0.00)
      SAL compute avg: 0.00, (0.00, 0.00)
      MEM_SA avg: 0.00, (0.00, 0.00)
      BSW time, avg: 442.25, (445.32, 426.84)

    Important parameter settings:
      BATCH_SIZE: 512
      MAX_SEQ_LEN_REF: 256
      MAX_SEQ_LEN_QER: 128
      MAX_SEQ_LEN8: 128
      SEEDS_PER_READ: 500
      SIMD_WIDTH8 X: 64
      SIMD_WIDTH16 X: 32
      AVG_SEEDS_PER_READ: 64
```
EC2 Costs: $1.89/genome

I let ParallelCluster manage sorting out which instance type in the 128 vCPU pool I defined is the cheapest or has the most stable capacity. Recently, I have been running on the i4i.metal EC2 instance type. Its on-demand price is ~$11/hr, but the spot price is $3.29/hr! … which means at ~34m per genome, it is costing $1.89 :-P
I added each optimization over time, which represents a fair bit of time spent with `bwa-meme`.
fin
I hope this was interesting &| helpful!
–John Major