Tutorial¶
Before following this tutorial, make sure kPAL is installed properly:
$ kpal -h
This should print a help message. If it does not, follow Installation.
We work with an artificial dataset consisting of 200 read pairs from four different samples. They are randomly generated so have no biological relevance.
Note
Download the data: tutorial.zip
Now unzip the file and go to the resulting directory:
$ unzip -q tutorial.zip
$ cd tutorial
$ ls
a_1.fa a_2.fa b_1.fa b_2.fa c_1.fa c_2.fa d_1.fa d_2.fa
We’ll create k-mer profiles for these samples and try to compare them.
k-mer counting¶
kPAL can count k-mers in any number of fasta files and store the results in one k-mer profile file. By default, the profiles in the file are named according to the original fasta filenames.
Let’s count 8-mers in the first read for all samples and write the profiles to
reads_1.k8
:
$ kpal count -k 8 *_1.fa reads_1.k8
Using the info command, we can get an overview of our profiles:
$ kpal info reads_1.k8
File format version: 1.0.0
Produced by: kPAL 2.0.0
Profile: a_1
- k-mer length: 8 (65536 k-mers)
- Zero counts: 49395
- Non-zero counts: 16141
- Sum of counts: 18600
- Mean of counts: 0.284
- Median of counts: 0.000
- Standard deviation of counts: 0.535
Profile: b_1
- k-mer length: 8 (65536 k-mers)
- Zero counts: 49348
- Non-zero counts: 16188
- Sum of counts: 18600
- Mean of counts: 0.284
- Median of counts: 0.000
- Standard deviation of counts: 0.533
Profile: c_1
- k-mer length: 8 (65536 k-mers)
- Zero counts: 49388
- Non-zero counts: 16148
- Sum of counts: 18600
- Mean of counts: 0.284
- Median of counts: 0.000
- Standard deviation of counts: 0.534
Profile: d_1
- k-mer length: 8 (65536 k-mers)
- Zero counts: 49345
- Non-zero counts: 16191
- Sum of counts: 18600
- Mean of counts: 0.284
- Median of counts: 0.000
- Standard deviation of counts: 0.533
Merging profiles¶
For completeness, we also want to include k-mer counts for the second read in our analysis. We can do so using the merge command:
$ kpal count -k 8 *_2.fa reads_2.k8
$ kpal merge reads_1.k8 reads_2.k8 merged.k8
Note
Merging two k-mer profiles this way is equivalent to first concatenating both fasta files and counting in the result.
By default, profiles from both files are merged pairwise in alphabetical order. If you need another pairing, you can provide profile names to use for both files. For example, the following is a more explicit version of the previous command:
$ kpal merge reads_1.k8 reads_2.k8 merged.k8 -l a_1 b_1 c_1 d_1 -r a_2 b_2 c_2 d_2
We can check that, indeed, the total k-mer count has doubled compared to our previous numbers:
$ kpal info merged.k8 -p c_1_c_2
File format version: 1.0.0
Produced by: kPAL 2.0.0
Profile: c_1_c_2
- k-mer length: 8 (65536 k-mers)
- Zero counts: 37138
- Non-zero counts: 28398
- Sum of counts: 37200
- Mean of counts: 0.568
- Median of counts: 0.000
- Standard deviation of counts: 0.753
Distance between profiles¶
We can compare two profiles by using a distance function. By default, distance uses the multiset distance parameterised by the prod pairwise distance function (\(f_2\) in Distance metrics):
$ kpal distance reads_1.k8 reads_2.k8 -l c_1 -r c_2
c_1 c_2 0.456
All profiles in a file can be compared pairwise to produce a distance matrix
with the matrix command. It first writes the number of profiles compared
followed by their names, and then the distance matrix itself. Here we ask it
to print the result to standard output (using -
for the output filename):
$ kpal matrix merged.k8 -
4
a_1_a_2
b_1_b_2
c_1_c_2
d_1_d_2
0.415
0.416 0.416
0.414 0.413 0.414
Enforcing strand balance¶
Todo.
Custom merge functions¶
Todo.