K-mer in biomcmc-lib
This notebook contains a few details of how kmers are implemented in the low-level phylogenetic library biomcmc-lib.
K-mer in biomcmc-lib (2019.11.29)
This notebook contains a few details of how kmers are implemented in the low-level phylogenetic library biomcmc-lib. This library is opaque to the user, and contains functions that are common to several projects. The functions discussed here relate to the creation of the k-mers (and their hash values) for a given DNA sequence.
Please have in mind that these functions are likely to change or at least change names soon — they are general enough to have allowed me to play with kmers, but are too complex for production-level software. Two missing functionalities are:
- inexact (masked) k-mers.
- rolling hashes
- using 32 bits per hash value (64 bits may be overkill)
- homopolymer compression
- fine-grained control over k-mer sizes
Some code shown here depends on local libraries, which must be linked at compilation time (//%cflags:-lm
etc.)
1.1 Parameters of a kmer set
Since it may be onerous to scan through the DNA sequence several times, I decided to generate kmers for several choices of k
in one pass.
The library uses 128 bits to store each kmer, and immediately transforms it into its 64 bits hash value.
Furthermore there are 3 possible encodings for each base: 1, 2, and 4 bits per base.
The standard used by most kmer algorithms is equivalent to 2 bits per base (A->00
, C->01
, G->10
, T->11
).
The other two are a compact, GC-content-like one bit per base(A,T->0
, and C,G->1
), and one common in the phylogenetic likelihood structure, that can accommodate ambiguous characters.
For instance, since M
represents an amino group (C
or A
), then we can have A->0001
, C->0010
, M->0011
, and N->1111
.
The code below is implemented in biomcmc-lib (2019.11.29), and is responsible for selecting the sets of k-mer sizes depending on an integer mode
given by the user.
This code is subject to change.
static uint64_t _tbl_mask[] = {0xffffUL, 0xffffffUL, 0xffffffffUL, 0xffffffffffUL, 0xffffffffffffUL, 0xffffffffffffUL, 0xffffffffffffffffUL};
static uint8_t _tbl_shift[] = { 48, 40, 32, 24, 16, 8, 0};
static uint8_t _tbl_nbyte[] = { 2, 3, 4, 5, 6, 7, 8};
static uint32_t _tbl_seed[] = {0x9040a6, 0x10bea992, 0x50edd67d, 0xb05a4f09, 0xf07046c5, 0x9c9445ab, 0xb2500f29};
/* i1[] and i2[] (i.e. elements above to be used on first and second 64bits, respectively */
static uint8_t _idx_mode[][2][7] = { // contains list of elements from _tbl above to be used, from 1st and 2nd 64bit blocks
{ {2,6,0,0,0,0,0}, {0,0,0,0,0,0,0} },
{ {2,6,0,0,0,0,0}, {2,6,0,0,0,0,0} },
{ {0,2,4,6,0,0,0}, {2,6,0,0,0,0,0} },
{ {0,1,2,4,6,0,0}, {0,2,6,0,0,0,0} },
{ {0,1,2,3,4,5,6}, {0,0,0,0,0,0,0} },
{ {0,1,2,3,4,5,6}, {0,1,2,6,0,0,0} }
};
static uint8_t _n_idx[][2] = { {2,0}, {2,2}, {4,2}, {5,3}, {7,0}, {7,4} }; // how many elems from _idx_mode[] are used
kmer_params
new_kmer_params (int mode)
{
uint8_t i, j, row, bases_per_byte, _ba_pe_by[] = {2,4,8}; // bases_per_byte is 4 if dense
kmer_params p = (kmer_params) biomcmc_malloc (sizeof (struct kmer_params_struct));
p->ref_counter = 1;
p->hashfunction = &biomcmc_xxh64;
if (dna_in_4_bits[0][0] == 0xff) initialize_dna_to_bit_tables (); // run once per program
p->kmer_class_mode = mode;
switch (mode) { // map each choice to a set of kmers and bitstring encoding (row relates to _idx_mode[] above)
case 0: row = 0; p->dense = 1; break;
case 1: row = 2; p->dense = 1; break;
case 2: row = 3; p->dense = 0; break;
case 3:
default: row = 4; p->dense = 1; break;
case 4: row = 5; p->dense = 0; break;
case 5: row = 1; p->dense = 2; break;
};
bases_per_byte = _ba_pe_by[p->dense];
p->n1 = (uint8_t) _n_idx[row][0]; p->n2 = (uint8_t) _n_idx[row][1];
for (j=0; j < p->n1; j++) {
i = _idx_mode[row][0][j];
p->mask1[j] = _tbl_mask[i];
p->shift1[j] = _tbl_shift[i];
p->seed[j] = _tbl_seed[i];
p->nbytes[j] = _tbl_nbyte[i];
p->size[j] = _tbl_nbyte[i] * bases_per_byte;
}
for (j=0; j < p->n2; j++) {
i = _idx_mode[row][1][j];
p->mask2[j] = _tbl_mask[i];
p->shift2[j] = _tbl_shift[i];
p->seed[j] = (_tbl_seed[i] >> 2) + 0x420314a1d; // very noise, much random
p->nbytes[j+p->n1] = _tbl_nbyte[i] + 8;
p->size[j+p->n1] = (_tbl_nbyte[i] + 8) * bases_per_byte;
}
return p;
}
(out of curiosity, Liquid cannot handle the consecutive curly braces (https://github.com/Shopify/liquid/issues/927) thus I need to separate as in { {
)
The small program below prints the resulting set of parameters for each choice of mode
.
Notice that the code must be linked to the libbiomcmc.a
local library (some programs, like guenomu still install a global version of this library, but newer programs should use it locally, to avoid version conflicts and since it's smaller than one MB...)
//%cflags:-lm
//%cflags: -I/usr/users/QIB_fr005/deolivl/Academic/Quadram/009.supersptree/biomcmc-lib/lib
//%cflags: -I/usr/users/QIB_fr005/deolivl/Academic/Quadram/009.supersptree/build/biomcmc-lib/lib
//%cflags: /usr/users/QIB_fr005/deolivl/Academic/Quadram/009.supersptree/build/biomcmc-lib/lib/.libs/libbiomcmc.a
#include <biomcmc.h>
int main (){
uint8_t i, j;
for (j = 0; j < 6; j++) {
kmer_params kp = new_kmer_params (j);
printf ("<<%d>>\n", j);
for (i=0;i<kp->n1;i++) printf("\t %3d %3d < 1 >\n", kp->nbytes[i], kp->size[i]);
for (i=0;i<kp->n2;i++) printf("\t %3d %3d < 2 >\n", kp->nbytes[i + kp->n1], kp->size[i + kp->n1]);
del_kmer_params (kp);
}
}
biomcmc-lib commit: 5975331