KNNImpute

  • Download for Windows
  • Download for Linux
  • KNNImputer source code is available as part of the Sleipnir library for computational functional genomics.

    Troyanskaya, OG, Cantor, M, Sherlock, G, Brown, P, Hastie, T, Tibshirani, R, Botstein, D, Altman, RB. Missing value estimation methods for DNA microarrays. Bioinformatics, 17:520-5, 2001.

Usage

  • KNNImputer -i input.pcl -o output.pcl

where input.pcl is a tab-delimited PCL file containing unique gene IDs in the first column, zero or more columns of header information (default 2), and one or more columns of data. output.pcl will contain the imputed version of the data. By default, genes with more than 30% missing data are removed rather than imputed. Other command line parameters of interest are:

  • -h provides help on the command line parameters.
  • -i (default standard input) is the file containing input data to be imputed.
  • -k (default 10) indicates the number of nearest neighbors to use during imputation.
  • -o (default standard output) is the file to contain the imputed version of the data.
  • -s (default 2) indicates the number of non-data columns to skip after the initial gene IDs (e.g. NAME, GWEIGHT, and so forth).
  • -m (default 0.7) is the minimum fraction of conditions that must be present for a gene to be imputed; genes with a smaller fraction of data present are removed rather than imputed.
  • -d (default euclidean) describes what distance measure to use when computing gene pair distances. Available options include euclidean, pearson, spearman, kendalls, and kolm-smir.
  • -l (default none) indicates a limit on the number of genes to be cached in memory during imputation. If you have problems with KNNImputer consuming too much memory, try setting a small value such as -l 1000 or -l 5000.
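
The roles of -k and -m can be illustrated with a minimal Python sketch of k-nearest-neighbor imputation. This is not the Sleipnir implementation: the function names, the use of None for missing values, the distance computed only over conditions present in both genes, and the inverse-distance weighting are all assumptions made for illustration.

```python
import math

def euclidean(a, b):
    """Root-mean-square difference over conditions present in both genes."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return math.inf  # no shared conditions, so the genes are not comparable
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs) / len(pairs))

def knn_impute(rows, k=10, min_present=0.7):
    """Fill None entries from the k nearest genes (hypothetical helper).

    rows is a list of per-gene value lists; None marks a missing value.
    """
    # Drop genes with too little data (the role played by -m).
    kept = [r for r in rows
            if sum(v is not None for v in r) >= min_present * len(r)]
    imputed = []
    for i, row in enumerate(kept):
        filled = list(row)
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbors: other genes with a value in condition j.
            candidates = sorted(
                (euclidean(row, other), other[j])
                for n, other in enumerate(kept)
                if n != i and other[j] is not None
            )
            nearest = [(d, v) for d, v in candidates if d != math.inf][:k]
            if nearest:
                # Assumed weighting: closer genes contribute more.
                weights = [1.0 / (d + 1e-9) for d, _ in nearest]
                total = sum(w * v for w, (_, v) in zip(weights, nearest))
                filled[j] = total / sum(weights)
        imputed.append(filled)
    return imputed
```

In this sketch, a gene missing one of three conditions has about 67% of its data present and so falls below the default threshold of 0.7; passing min_present=0 keeps every gene.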

Examples

Given the input file:

GID     GWEIGHT  Condition 1  Condition 2  Condition 3
Gene 1  1        0.3                       0.6
Gene 2  1        -0.5         -0.8         -0.6
Gene 3  1        1.3          1.5
...

KNNImputer might be run as:

  • KNNImputer -i input.pcl -o output.pcl -s 1
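
To see which fields -s 1 treats as data in the example file, the layout can be parsed with a short Python sketch. The read_pcl helper is hypothetical and not part of Sleipnir; it assumes that missing values appear as empty tab-delimited fields and that the file has a single header row.

```python
def read_pcl(lines, skip=2):
    """Split tab-delimited PCL lines into (conditions, {gene_id: values}).

    skip mirrors -s: the number of non-data columns after the gene ID.
    Empty fields are read as None (assumed missing-value convention).
    """
    header = lines[0].rstrip("\n").split("\t")
    first_data = 1 + skip  # gene ID column plus the skipped columns
    conditions = header[first_data:]
    genes = {}
    for line in lines[1:]:
        fields = line.rstrip("\n").split("\t")
        # Pad short rows so trailing missing values are not lost.
        fields += [""] * (len(header) - len(fields))
        genes[fields[0]] = [
            float(f) if f.strip() else None for f in fields[first_data:]
        ]
    return conditions, genes

# The example input, with tabs written explicitly.
example = [
    "GID\tGWEIGHT\tCondition 1\tCondition 2\tCondition 3",
    "Gene 1\t1\t0.3\t\t0.6",
    "Gene 2\t1\t-0.5\t-0.8\t-0.6",
    "Gene 3\t1\t1.3\t1.5\t",
]
conditions, genes = read_pcl(example, skip=1)
```

With skip=1 only GWEIGHT is skipped, so the three Condition columns are the data; Gene 1 is missing Condition 2 and Gene 3 is missing Condition 3.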

To impute all genes (regardless of how much data is missing), use more than the default number of nearest neighbors, and compute distances using Pearson correlation:

  • KNNImputer -i input.pcl -o output.pcl -s 1 -k 12 -d pearson -m 0
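
For intuition about what a correlation-based distance measures, here is a minimal Python sketch. Converting the correlation r to a distance as 1 - r, and skipping conditions missing in either gene, are assumptions of this sketch; the source does not state how KNNImputer maps Pearson correlation to a distance.

```python
import math

def pearson_distance(a, b):
    """Return 1 - Pearson correlation over conditions present in both genes."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return math.inf  # correlation is undefined with fewer than two points
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return math.inf  # a constant profile has no defined correlation
    return 1.0 - cov / math.sqrt(var_x * var_y)
```

Unlike Euclidean distance, this measure treats two genes whose profiles rise and fall together as close even when their magnitudes differ.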

Implementation

KNNImputer is implemented using the Sleipnir library for computational functional genomics (Huttenhower et al., 2008).