k-mer

For a broader coverage related to this topic, see n-gram.

The term k-mer typically refers to all the possible substrings of length k that are contained in a string. In computational genomics, k-mers refer to all the possible subsequences (of length k) from a read obtained through DNA Sequencing. The amount of k-mers possible given a string of length, L, is whilst the number of possible k-mers given n possibilities (4 in the case of DNA e.g. ACTG) is . K-mers are typically used during sequence assembly,[1] but can also be used in sequence alignment. In the context of the human genome, k-mers of various lengths have been used to explain variability in mutation rates.[2][3]

Sequence Assembly

Overview

In sequence assembly, k-mers are typically used during the construction of De Bruijn graphs. In order to create a De Bruijn Graph, the strings stored in each edge with length, , must overlap another string in another edge by in order to create a vertex. Reads generated from next-generation sequencing will typically have different read lengths being generated. For example, reads by Illumina’s sequencing technology capture reads of 100-mers. However, the problem with the sequencing is that only small fractions out of all the possible 100-mers that are present in the genome are actually generated. This is due to read errors, but more importantly, just simple coverage holes that occur during sequencing. The problem is though, that these small fractions of the possible k-mers violates the key assumption of de Bruijn graphs such that all the k-mer reads must overlap its adjoining k-mer in the genome by (which can’t occur when all the possible k-mers aren’t present). The solution to this problem is to break these k-mer sized reads into smaller k-mers, such that the resulting smaller k-mers will represent all the possible k-mers of that smaller size that are present in the genome.[1] Furthermore, splitting the k-mers into smaller sizes also helps alleviate the problem of different initial read lengths. An example of the solution of splitting the reads into smaller k-mers is shown in figure 1. In this example the 5 reads do not account for all the possible 7-mers of the genome, and as such, a de Bruijn graph cannot be created. But when they are split into 4-mers, the resultant subsequences are enough to reconstruct the genome using a de Bruijn graph.

blah blah blah.
This figure shows the process of splitting reads into smaller k-mers (4-mers in this case) in order to be able to be used in a de Bruijn graph. (A) Shows the initial segment of DNA being sequenced. (B) Shows the reads that were outputted from sequencing and also shows how they align. The problem with this alignment though is that they overlap by k-2 not k-1 (which is needed in de Bruijn graphs). (C) Shows the reads being split into smaller 4-mers. (D) Discards the repeated 4-mers and then shows the alignment of them. Note that these k-mers overlap by k-1 and can then be used in a de Bruijn graph.

Choice of k-mer

The choice of the k-mer size has many different effects on the sequence assembly. These effects vary greatly between lower sized and larger sized k-mers. Therefore, an understanding of the different k-mer sizes must be known in order to choose a suitable size that balances the effects. The effects of the sizes are outlined below.

Lower k-mer sizes

Higher k-mer sizes

Relevance to the human genome

By taking into account k-mers of sizes three, five and seven, it has been possible to better predict the probability of polymorphism mutations within human populations. The sequence context around a mutation, as represented by these k-mers, explains known greater rates (for example, at CpG sites) and predicts new motifs more or less likely to be mutated. [REFERENCE PENDING]

Pseudocode

Determining the possible k-mers of a read can be done by simply cycling over the string length by one and taking out each substring of length, k. The pseudocode to achieve this is as follows:

procedure k-mer(String, k : length of each k-mer)

     n = length(String)
     
     /* cycle over the length of String till k-mers of length, k, can still be made */
     for i = 1 to  n-k+1 inclusive do
          /* output each k-mer of length k, from i to i+k in String*/
          output String[i:i+k]
     end for

end procedure

Implementations

A number of implementations in different languages to find all the k-mers in a string are listed below.

Python

def find_kmers(string, k):
    
      kmers = []
      n = len(string)

      for i in range(0, n-k+1):
           kmers.append(string[i:i+k])

      return kmers

R

find_kmers <- function(string, k){
  n <- nchar(string) - k + 1
  kmers <- substring(string, 1:n, 1:n + k - 1)
  return(kmers)
}

Ruby

class String
  # Iterate over each k-mer of this string
  def each_kmer k
    return enum_for(:each_kmer, k) unless block_given?
    (0 .. length - k).each { |i|
      yield self[i, k]
    }
  end
end

Java

private void getKmers(String seq)
{
	int seqLength = seq.length();
	if(seqLength > length)
	{
		for(int i = 0; i < seqLength - length + 1; i++)
		{
			System.out.println(seq.substring(i,length + i));
		}
	} else 
	{
		System.out.println(seq);
	}
}

Perl 6

my $seq = 'AGCTTTTCATTCTGACTGCAACGGG';
my $n   = $seq.chars - $k + 1;
my @kmers = gather for 0..^$n -> $i {
    take $seq.substr($i, $k);
}

Or:

my $k = 10;
dd 'AGCTTTTCATTCTGACTGCAACGGG'.comb.rotor($k => -1 * ($k - 1)).map(*.join);

C#

public static List<string> find_kmers(string Text, int k)
{
    var kmers = new List<string>();
    int n = Text.Length;
   
    for (int i = 0; i < n-k+1; i++)
    {                
        kmers.Add(Text.Substring(i, k));               
    }

    return kmers;
 }

Examples

Here are some examples showing the possible k-mers (given a specified k value) from DNA sequences:

Read:     AGATCGAGTG
3-mers: AGA GAT ATC TCG CGA GAG AGT GTG
Read:     GTAGAGCTGT
5-mers: GTAGA TAGAG AGAGC GAGCT AGCTG GCTGT

References

  1. 1 2 Compeau, P., Pevzner, P. & Teslar, G. How to apply de Bruijn graphs to genome assembly”. Nature Biotechnology, 2011. doi:10.1038/nbt.2023.
  2. Samocha, Kaitlin E; Robinson, Elise B; Sanders, Stephan J; Stevens, Christine; Sabo, Aniko; McGrath, Lauren M; Kosmicki, Jack A; Rehnström, Karola; Mallick, Swapan; Kirby, Andrew; Wall, Dennis P; MacArthur, Daniel G; Gabriel, Stacey B; DePristo, Mark; Purcell, Shaun M; Palotie, Aarno; Boerwinkle, Eric; Buxbaum, Joseph D; Cook, Edwin H; Gibbs, Richard A; Schellenberg, Gerard D; Sutcliffe, James S; Devlin, Bernie; Roeder, Kathryn; Neale, Benjamin M; Daly, Mark J (2014). "A framework for the interpretation of de novo mutation in human disease". Nature Genetics. 46 (9): 944–950. doi:10.1038/ng.3050. ISSN 1061-4036.
  3. Aggarwala, Varun; Voight, Benjamin F (2016). "An expanded sequence context model broadly explains variability in polymorphism levels across the human genome". Nature Genetics. 48 (4): 349–355. doi:10.1038/ng.3511. ISSN 1061-4036.
  4. 1 2 Zerbino, Daniel R.; Birney, Ewan (2008). "Velvet: algorithms for de novo short read assembly using de Bruijn graphs". Genome Research. 18 (5): 821–829. doi:10.1101/gr.074492.107. PMC 2336801Freely accessible. PMID 18349386.
This article is issued from Wikipedia - version of the 11/29/2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.