API reference

This part of the documentation covers the interfaces of kPAL’s Python library.

k-mer profiles

class kpal.klib.Profile(counts, name=None)

A k-mer profile provides k-mer counts and operations on them.

Instead of using the Profile constructor directly, you should generally use one of the profile construction methods, such as from_fasta, from_fasta_by_record, from_file, or from_sequences.

Parameters:
  • counts (numpy.ndarray) – Array of integers where each element is the count for one k-mer. Counts are ordered alphabetically by k-mer.
  • name (str) – Profile name.
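
To illustrate the alphabetical ordering of the counts array, the following sketch lists all k-mers of a given length in the order assumed for the array indices. The helper kmer_index_table is hypothetical and not part of kPAL; it only assumes the four-letter alphabet A, C, G, T.

```python
from itertools import product


def kmer_index_table(length):
    """Return all k-mers of the given length in alphabetical order.

    Hypothetical helper: position i in the returned list is assumed to be
    the index of that k-mer in the counts array.
    """
    return [''.join(letters) for letters in product('ACGT', repeat=length)]


table = kmer_index_table(2)
print(table[:5])   # the first five 2-mers in alphabetical order
print(len(table))  # 4 ** 2 possible 2-mers
```
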
balance()

Add the counts of the reverse complement of a k-mer to the k-mer and vice versa.
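
A minimal sketch of what balancing could look like on a plain list of counts, assuming alphabetical k-mer ordering. The function below is an illustration only, not kPAL's implementation; in particular, how palindromic k-mers are treated here (their own count is added to itself) is an assumption.

```python
from itertools import product

COMPLEMENT = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}


def reverse_complement(kmer):
    """Return the reverse complement of a k-mer given as a string."""
    return ''.join(COMPLEMENT[base] for base in reversed(kmer))


def balance(counts, length):
    """Return new counts where each k-mer also receives the count of its
    reverse complement (illustrative, operates on a plain list)."""
    kmers = [''.join(p) for p in product('ACGT', repeat=length)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    return [counts[i] + counts[index[reverse_complement(kmer)]]
            for i, kmer in enumerate(kmers)]


# Counts for the 1-mers A, C, G, T (in that order).
balanced = balance([5, 1, 2, 3], 1)
print(balanced)  # A and T now share one value, as do C and G
```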

binary_to_dna(number)

Convert an integer to a DNA string.

Parameters:number (int) – Binary representation of a DNA sequence.
Returns:DNA string corresponding to number.
Return type:str
copy()

Create a copy of the k-mer profile. This returns a deep copy, so modifying the copy’s k-mer counts will not affect the original and vice versa.

Returns:Deep copy of profile.
Return type:Profile
dna_to_binary(sequence)

Convert a string of DNA to an integer.

Parameters:sequence (str) – DNA sequence.
Returns:Binary representation of sequence.
Return type:int
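
The two conversions can be sketched with a 2-bit encoding per nucleotide. This assumes A=0, C=1, G=2, T=3, which is consistent with the alphabetical ordering of counts but is not confirmed here as kPAL's exact encoding. Unlike the Profile method, the standalone binary_to_dna below takes the k-mer length explicitly, since leading A's are encoded as zero bits.

```python
BASES = 'ACGT'  # assumed encoding: A=0, C=1, G=2, T=3


def dna_to_binary(sequence):
    """Encode a DNA string as an integer, two bits per nucleotide."""
    number = 0
    for base in sequence:
        number = (number << 2) | BASES.index(base)
    return number


def binary_to_dna(number, length):
    """Decode an integer back into a DNA string of the given length."""
    letters = []
    for _ in range(length):
        letters.append(BASES[number & 3])  # lowest two bits = last base
        number >>= 2
    return ''.join(reversed(letters))


encoded = dna_to_binary('ACGT')
decoded = binary_to_dna(encoded, 4)
print(encoded, decoded)
```
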
classmethod from_fasta(handle, length, name=None)

Create a k-mer profile from a FASTA file by counting all k-mers in each line.

Parameters:
  • handle (file-like object) – Open readable FASTA file handle.
  • length (int) – Length of the k-mers.
  • name (str) – Profile name.
Returns:

A k-mer profile.

Return type:

Profile

classmethod from_fasta_by_record(handle, length, prefix=None)

Create k-mer profiles from a FASTA file by counting all k-mers per record. Profiles are named by the record names.

Parameters:
  • handle (file-like object) – Open readable FASTA file handle.
  • length (int) – Length of the k-mers.
  • prefix (str) – If provided, the names of the k-mer profiles are prefixed with this.
Returns:

A generator yielding the created k-mer profiles.

Return type:

iterator(Profile)

classmethod from_file(handle, name=None)

Load the k-mer profile from a file.

Parameters:
  • handle (h5py.File) – Open readable k-mer profile file handle.
  • name (str) – Profile name.
Returns:

A k-mer profile.

Return type:

Profile

classmethod from_file_old_format(handle, name=None)

Load the k-mer profile from a file in the old plaintext format.

Parameters:
  • handle (file-like object) – Open readable k-mer profile file handle (old format).
  • name (str) – Profile name.
Returns:

A k-mer profile.

Return type:

Profile

classmethod from_sequences(sequences, length, name=None)

Create a k-mer profile from sequences by counting all k-mers in each sequence.

Parameters:
  • sequences (iterator(str)) – An iterable of string sequences.
  • length (int) – Length of the k-mers.
  • name (str) – Profile name.
Returns:

A k-mer profile.

Return type:

Profile
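
As a rough illustration of counting all k-mers in each sequence, the sketch below fills a plain list of counts in alphabetical k-mer order. It does not use kPAL; the function names are hypothetical, and skipping k-mers that contain symbols other than A, C, G, T is an assumption.

```python
BASES = 'ACGT'


def kmer_index(kmer):
    """Return the alphabetical index of a k-mer, or None if it contains
    a symbol other than A, C, G or T."""
    index = 0
    for base in kmer:
        digit = BASES.find(base)
        if digit < 0:
            return None
        index = index * 4 + digit
    return index


def count_kmers(sequences, length):
    """Count every k-mer of the given length in each sequence."""
    counts = [0] * (4 ** length)
    for sequence in sequences:
        for start in range(len(sequence) - length + 1):
            index = kmer_index(sequence[start:start + length])
            if index is not None:  # assumption: ignore ambiguous k-mers
                counts[index] += 1
    return counts


counts = count_kmers(['ACGT', 'AAN'], 2)
print(sum(counts))  # number of counted 2-mers
```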

mean

Mean of k-mer counts.

median

Median of k-mer counts.

merge(profile, merger=<function <lambda>>)

Merge two profiles.

Parameters:
  • profile (Profile) – Another k-mer profile.
  • merger (function) – A pairwise merge function.

Note that the merger function must be vectorized, i.e., it is called directly on NumPy arrays instead of on their pairwise elements. If your function only works on individual elements, convert it to a NumPy ufunc first. For example:

>>> f = np.vectorize(f, otypes=['int64'])
name

Profile name.

non_zero

Number of k-mers with a non-zero count.

number

Number of possible k-mers with this length.

print_counts()

Print the k-mer counts.

reverse_complement(number)

Calculate the reverse complement of a DNA sequence in a binary representation.

Parameters:number (int) – Binary representation of a DNA sequence.
Returns:Binary representation of the reverse complement of the sequence corresponding to number.
Return type:int
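
A sketch of the reverse complement on the binary representation, assuming two bits per nucleotide with A=0, C=1, G=2, T=3 (so the complement of a digit d is 3 - d). The k-mer length is passed explicitly here, which differs from the method signature above; the Profile method presumably takes it from the profile itself.

```python
def reverse_complement(number, length):
    """Reverse complement of a k-mer encoded with two bits per base."""
    result = 0
    for _ in range(length):
        # Take the last base, complement it, and append it to the result.
        result = (result << 2) | (3 - (number & 3))
        number >>= 2
    return result


# ACG is encoded as 0*16 + 1*4 + 2 = 6; its reverse complement is CGT.
rc = reverse_complement(6, 3)
print(rc)
```
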
save(handle, name=None)

Save the k-mer counts to a file.

Parameters:
  • handle (h5py.File) – Open writeable k-mer profile file handle.
  • name (str) – Profile name in the file. If not provided, the current profile name is used. If the profile has no name, the first available number is used, counting up from 1.
Returns:

Profile name in the file.

Return type:

str

shrink(factor=1)

Shrink the profile, effectively reducing the value of k.

Note that this operation may give slightly different values than counting at a lower k directly.

Parameters:factor (int) – Shrinking factor.
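
The following sketch shows one way shrinking can be understood, and why it may differ from counting at a lower k directly. It assumes that shrinking sums the counts of all k-mers sharing the same prefix; that assumption, and the helper names, are illustrative and not taken from kPAL's source.

```python
BASES = 'ACGT'


def count_kmers(sequence, length):
    """Count k-mers in one sequence, in alphabetical k-mer order."""
    counts = [0] * (4 ** length)
    for start in range(len(sequence) - length + 1):
        index = 0
        for base in sequence[start:start + length]:
            index = index * 4 + BASES.index(base)
        counts[index] += 1
    return counts


def shrink(counts, factor=1):
    """Sum each block of 4 ** factor consecutive counts, i.e. all k-mers
    that share the same prefix of length k - factor (assumed behaviour)."""
    block = 4 ** factor
    return [sum(counts[start:start + block])
            for start in range(0, len(counts), block)]


shrunk = shrink(count_kmers('ACGT', 2), 1)
direct = count_kmers('ACGT', 1)
print(shrunk)  # the final T is never the prefix of a 2-mer
print(direct)
```

In this example the last nucleotide of the sequence is missing from the shrunken profile, because it never starts a 2-mer.
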
shuffle()

Randomise the profile.

split()

Split the profile into two lists. Every position in the first list has its reverse complement in the same position in the second list, and vice versa. All counts are doubled, so that palindrome counts can be distributed equally over both lists.

Note that the returned counts are not k-mer profiles. They can be used to show the balance of the original profile by calculating the distance between them.

Returns:The doubled forward and reverse complement counts.
Return type:numpy.ndarray, numpy.ndarray
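
One possible reading of the description above, sketched on plain lists rather than NumPy arrays. It assumes two bits per nucleotide (A=0, C=1, G=2, T=3), and that a palindromic k-mer contributes its undoubled count to each list, which amounts to splitting its doubled count equally. This is an illustration, not kPAL's code.

```python
def reverse_complement(number, length):
    """Reverse complement of a k-mer encoded with two bits per base."""
    result = 0
    for _ in range(length):
        result = (result << 2) | (3 - (number & 3))
        number >>= 2
    return result


def split(counts, length):
    """Split counts into forward and reverse complement lists."""
    forward, reverse = [], []
    for i in range(4 ** length):
        i_rc = reverse_complement(i, length)
        if i < i_rc:
            # Visit each non-palindromic pair once; double both counts.
            forward.append(2 * counts[i])
            reverse.append(2 * counts[i_rc])
        elif i == i_rc:
            # Palindrome: the doubled count is shared equally.
            forward.append(counts[i])
            reverse.append(counts[i])
    return forward, reverse


counts = list(range(16))  # arbitrary counts for all 2-mers
forward, reverse = split(counts, 2)
print(len(forward), len(reverse))
```
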
std

Standard deviation of k-mer counts.

total

Sum of k-mer counts.

k-mer profile distances

class kpal.kdistlib.ProfileDistance(do_balance=False, do_positive=False, do_smooth=False, summary=<function min>, threshold=0, do_scale=False, down=False, distance_function=None, pairwise=<function <lambda>>)

Class of distance functions.

distance(left, right)

Calculate the distance between two k-mer profiles.

Parameters:left, right (kpal.klib.Profile) – Profiles to calculate distance between.
Returns:The distance between left and right.
Return type:float
dynamic_smooth(left, right)

Smooth two profiles by collapsing sub-profiles that do not meet the requirements governed by the selected summary function and the threshold.

Parameters:left, right (kpal.klib.Profile) – Profiles to smooth.
kpal.kdistlib.distance_matrix(profiles, output, precision, dist)

Make a distance matrix for any number of k-mer profiles.

Parameters:
  • profiles (list(Profile)) – List of profiles.
  • output (file-like object) – Open writable file handle.
  • precision (int) – Number of digits in the output.
  • dist (kpal.kdistlib.ProfileDistance) – A distance functions object.

Metrics

General library containing metrics and helper functions.

kpal.metrics.cosine_similarity(left, right)

Calculate the Cosine similarity between two vectors.

Parameters:left, right (array_like) – Vector.
Returns:The Cosine similarity between left and right.
Return type:float
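
For reference, the textbook definition of cosine similarity can be sketched in pure Python as follows. This is the standard formula (dot product divided by the product of the Euclidean lengths), written without NumPy; it is assumed, not verified, that kpal.metrics.cosine_similarity computes the same quantity.

```python
import math


def cosine_similarity(left, right):
    """Dot product of two vectors divided by the product of their lengths."""
    dot = sum(l * r for l, r in zip(left, right))
    norms = (math.sqrt(sum(l * l for l in left)) *
             math.sqrt(sum(r * r for r in right)))
    return dot / norms


parallel = cosine_similarity([1, 2], [2, 4])      # same direction
orthogonal = cosine_similarity([1, 0], [0, 1])    # perpendicular vectors
print(parallel, orthogonal)
```
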
kpal.metrics.distribution(vector)

Calculate the distribution of the values in a vector.

Parameters:vector (iterable(int)) – A vector.
Returns:A list of (value, count) pairs.
Return type:list(int, int)
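
A minimal sketch of such a distribution using the standard library. Sorting the pairs by value is an assumption; the description above does not state the order of the returned list.

```python
from collections import Counter


def distribution(vector):
    """Return (value, count) pairs, sorted by value (assumed order)."""
    return sorted(Counter(vector).items())


pairs = distribution([0, 2, 2, 5, 0, 0])
print(pairs)
```
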
kpal.metrics.euclidean(left, right)

Calculate the Euclidean distance between two vectors.

Parameters:left, right (array_like) – Vector.
Returns:The Euclidean distance between left and right.
Return type:float
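
The Euclidean distance is the square root of the summed squared differences. A pure-Python sketch of that definition, independent of kPAL and NumPy:

```python
import math


def euclidean(left, right):
    """Square root of the sum of squared element-wise differences."""
    return math.sqrt(sum((l - r) ** 2 for l, r in zip(left, right)))


d = euclidean([0, 0], [3, 4])
print(d)
```
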
kpal.metrics.get_scale(left, right)

Calculate scaling factors based upon total counts. One of the factors is always one (the other is either one or larger than one).

Parameters:left, right (array_like) – A vector.
Returns:A tuple of scaling factors.
Return type:float, float
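
One way to obtain factors with the stated property (one factor is one, the other at least one) is to scale the vector with the smaller total up to the larger total. The sketch below follows that assumption; the handling of empty totals is also an assumption, and neither is confirmed as kPAL's behaviour.

```python
def get_scale(left, right):
    """Return (left_factor, right_factor) so that the scaled totals match."""
    left_total = float(sum(left))
    right_total = float(sum(right))
    if left_total == 0 or right_total == 0:
        return 1.0, 1.0  # assumption: nothing sensible to scale
    if left_total < right_total:
        return right_total / left_total, 1.0
    return 1.0, left_total / right_total


factors = get_scale([1, 1], [2, 2])
print(factors)
```
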
kpal.metrics.mergers = {'int': <function>, 'sum': <function>, 'xor': <function>, 'nint': <function>}

Merge functions. Arguments should be of type numpy.ndarray.

kpal.metrics.multiset(left, right, pairwise)

Calculate the multiset distance between two vectors.

Parameters:
  • left, right (array_like) – Vector.
  • pairwise (function) – A pairwise distance function.
Returns:

The multiset distance between left and right.

Return type:

float

Note that the pairwise function must be vectorized, i.e., it is called directly on NumPy arrays instead of on their pairwise elements. If your function only works on individual elements, convert it to a NumPy ufunc first. For example:

>>> f = np.vectorize(f, otypes=['float'])
kpal.metrics.pairwise = {'sum': <function>, 'prod': <function>}

Pairwise distance functions. Arguments should be of type numpy.ndarray.

kpal.metrics.positive(vector, mask)

Set all zero positions in mask to zero in vector.

Parameters:vector, mask (array_like) – Vector.
Returns:vector with all zero positions in mask set to zero.
Return type:numpy.ndarray
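
The masking step can be sketched on plain lists as follows. The library function returns a numpy.ndarray; this illustration returns a list and is not kPAL's implementation.

```python
def positive(vector, mask):
    """Keep vector values only where the mask is non-zero."""
    return [v if m != 0 else 0 for v, m in zip(vector, mask)]


masked = positive([1, 2, 3], [0, 5, 1])
print(masked)
```
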
kpal.metrics.scale_down(left, right)

Normalise the scaling factors so that both lie between 0 and 1.

Parameters:left, right (float) – Scaling factors.
Returns:Tuple of normalised scaling factors.
Return type:float, float
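
A plausible sketch of this normalisation is to divide both factors by the larger one, so that the larger factor becomes 1. This is an assumption about the method, and it presumes both factors are positive.

```python
def scale_down(left, right):
    """Divide both scaling factors by the larger of the two."""
    factor = max(left, right)
    return left / factor, right / factor


scaled = scale_down(2.0, 1.0)
print(scaled)
```
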
kpal.metrics.summary = {'average': <function>, 'median': <function>, 'min': <function>}

Summary functions.

kpal.metrics.vector_distance = {'default': None, 'euclidean': <function euclidean>, 'cosine': <function cosine_similarity>}

Vector distance functions.

kpal.metrics.vector_length(vector)

Calculate the Euclidean length of a vector.

Parameters:vector (array_like) – A vector.
Returns:The length of vector.
Return type:float
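
The Euclidean length is the square root of the sum of squared elements. A pure-Python sketch of that definition, independent of kPAL:

```python
import math


def vector_length(vector):
    """Square root of the sum of squared elements."""
    return math.sqrt(sum(v * v for v in vector))


length = vector_length([3, 4])
print(length)
```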