K-mer in biomcmc-lib (2019.11.29)

This notebook contains a few details of how kmers are implemented in the low-level phylogenetic library biomcmc-lib. This library is opaque to the user, and contains functions that are common to several projects. The functions discussed here relate to the creation of the k-mers (and their hash values) for a given DNA sequence.

Please have in mind that these functions are likely to change or at least change names soon — they are general enough to have allowed me to play with kmers, but are too complex for production-level software. Two missing functionalities are:

  1. inexact (masked) k-mers.
  2. rolling hashes
  3. using 32 bits per hash value (64 bits may be overkill)
  4. homopolymer compression
  5. fine-grained control over k-mer sizes

Some code shown here depends on local libraries, which must be linked at compilation time (//%cflags:-lm etc.)

1.1 Parameters of a kmer set

Since it may be onerous to scan through the DNA sequence several times, I decided to generate kmers for several choices of k in one pass. The library uses 128 bits to store each kmer, and immediately transforms it into its 64 bits hash value.

Furthermore there are 3 possible encodings for each base: 1, 2, and 4 bits per base. The standard used by most kmer algorithms is equivalent to 2 bits per base (A->00, C->01, G->10, T->11). The other two are a compact, GC-content-like one bit per base(A,T->0, and C,G->1), and one common in the phylogenetic likelihood structure, that can accommodate ambiguous characters. For instance, since M represents an amino group (C or A), then we can have A->0001, C->0010, M->0011, and N->1111.

The code below is implemented in biomcmc-lib (2019.11.29), and is responsible for selecting the sets of k-mer sizes depending on an integer mode given by the user. This code is subject to change.

static uint64_t _tbl_mask[] = {0xffffUL, 0xffffffUL, 0xffffffffUL, 0xffffffffffUL, 0xffffffffffffUL, 0xffffffffffffUL, 0xffffffffffffffffUL};
static uint8_t _tbl_shift[] = {      48,         40,           32,             24,               16,                8,                    0};
static uint8_t _tbl_nbyte[] = {       2,          3,            4,              5,                6,                7,                    8};
static uint32_t _tbl_seed[] = {0x9040a6, 0x10bea992,   0x50edd67d,     0xb05a4f09,       0xf07046c5,       0x9c9445ab,           0xb2500f29};

/* i1[] and i2[] (i.e. elements above to be used on first and second 64bits, respectively */
static uint8_t _idx_mode[][2][7] = { // contains list of elements from _tbl above to be used, from 1st and 2nd 64bit blocks
   { {2,6,0,0,0,0,0}, {0,0,0,0,0,0,0} },  
   { {2,6,0,0,0,0,0}, {2,6,0,0,0,0,0} },  
   { {0,2,4,6,0,0,0}, {2,6,0,0,0,0,0} }, 
   { {0,1,2,4,6,0,0}, {0,2,6,0,0,0,0} },
   { {0,1,2,3,4,5,6}, {0,0,0,0,0,0,0} },
   { {0,1,2,3,4,5,6}, {0,1,2,6,0,0,0} }
};   
static uint8_t _n_idx[][2] = { {2,0}, {2,2}, {4,2}, {5,3}, {7,0}, {7,4} }; // how many elems from _idx_mode[] are used


kmer_params
new_kmer_params (int mode)
{
  uint8_t  i, j, row, bases_per_byte, _ba_pe_by[] = {2,4,8}; // bases_per_byte is 4 if dense
  kmer_params p = (kmer_params) biomcmc_malloc (sizeof (struct kmer_params_struct));
  p->ref_counter = 1;
  p->hashfunction = &biomcmc_xxh64;
  if (dna_in_4_bits[0][0] == 0xff) initialize_dna_to_bit_tables (); // run once per program
  p->kmer_class_mode = mode;
  switch (mode) { // map each choice to a set of kmers and bitstring encoding (row relates to _idx_mode[] above)
    case 0:  row = 0; p->dense = 1; break;
    case 1:  row = 2; p->dense = 1; break;
    case 2:  row = 3; p->dense = 0; break;
    case 3:
    default: row = 4; p->dense = 1; break;
    case 4:  row = 5; p->dense = 0; break;
    case 5:  row = 1; p->dense = 2; break;
  };
  bases_per_byte = _ba_pe_by[p->dense];

  p->n1 = (uint8_t) _n_idx[row][0]; p->n2 = (uint8_t) _n_idx[row][1];
  for (j=0; j < p->n1; j++) {
    i = _idx_mode[row][0][j];
    p->mask1[j] = _tbl_mask[i];
    p->shift1[j] = _tbl_shift[i];
    p->seed[j] = _tbl_seed[i];
    p->nbytes[j] = _tbl_nbyte[i];
    p->size[j] = _tbl_nbyte[i] * bases_per_byte;
  }
  for (j=0; j < p->n2; j++) {
    i = _idx_mode[row][1][j];
    p->mask2[j] = _tbl_mask[i];
    p->shift2[j] = _tbl_shift[i];
    p->seed[j] = (_tbl_seed[i] >> 2) + 0x420314a1d; // very noise, much random
    p->nbytes[j+p->n1] = _tbl_nbyte[i] + 8;
    p->size[j+p->n1] = (_tbl_nbyte[i] + 8) * bases_per_byte;
  }
  return p;
}

(out of curiosity, Liquid cannot handle the consecutive curly braces (https://github.com/Shopify/liquid/issues/927) thus I need to separate as in { {)

The small program below prints the resulting set of parameters for each choice of mode. Notice that the code must be linked to the libbiomcmc.a local library (some programs, like guenomu still install a global version of this library, but newer programs should use it locally, to avoid version conflicts and since it's smaller than one MB...)

//%cflags:-lm
//%cflags: -I/usr/users/QIB_fr005/deolivl/Academic/Quadram/009.supersptree/biomcmc-lib/lib
//%cflags: -I/usr/users/QIB_fr005/deolivl/Academic/Quadram/009.supersptree/build/biomcmc-lib/lib
//%cflags: /usr/users/QIB_fr005/deolivl/Academic/Quadram/009.supersptree/build/biomcmc-lib/lib/.libs/libbiomcmc.a
#include <biomcmc.h>

int main (){
    uint8_t i, j;
    for (j = 0; j < 6; j++) {
        kmer_params kp = new_kmer_params (j);
        printf ("<<%d>>\n", j);
        for (i=0;i<kp->n1;i++) printf("\t %3d %3d < 1 >\n", kp->nbytes[i], kp->size[i]);
        for (i=0;i<kp->n2;i++) printf("\t %3d %3d < 2 >\n", kp->nbytes[i + kp->n1], kp->size[i + kp->n1]);
        del_kmer_params (kp);
    }
}
<<0>>
	   4  16 < 1 >
	   8  32 < 1 >
<<1>>
	   2   8 < 1 >
	   4  16 < 1 >
	   6  24 < 1 >
	   8  32 < 1 >
	  12  48 < 2 >
	  16  64 < 2 >
<<2>>
	   2   4 < 1 >
	   3   6 < 1 >
	   4   8 < 1 >
	   6  12 < 1 >
	   8  16 < 1 >
	  10  20 < 2 >
	  12  24 < 2 >
	  16  32 < 2 >
<<3>>
	   2   8 < 1 >
	   3  12 < 1 >
	   4  16 < 1 >
	   5  20 < 1 >
	   6  24 < 1 >
	   7  28 < 1 >
	   8  32 < 1 >
<<4>>
	   2   4 < 1 >
	   3   6 < 1 >
	   4   8 < 1 >
	   5  10 < 1 >
	   6  12 < 1 >
	   7  14 < 1 >
	   8  16 < 1 >
	  10  20 < 2 >
	  11  22 < 2 >
	  12  24 < 2 >
	  16  32 < 2 >
<<5>>
	   4  32 < 1 >
	   8  64 < 1 >
	  12  96 < 2 >
	  16 128 < 2 >

biomcmc-lib commit: 5975331