Getting Started

The short tutorial below explains how to run kallisto using a small example distributed with the program. This tutorial covers how to use kallisto for processing bulk RNA sequencing data. If you want to process single-cell RNA sequencing data, see the kallisto bus workflow.

Download and installation

Begin by downloading and installing the program by following instructions on the download page. The files needed to confirm that kallisto is working are included with the binaries downloadable from the download page.

After downloading and installing kallisto you should be able to type kallisto and see:

kallisto 0.44.0

Usage: kallisto <CMD> [arguments] ..

Where <CMD> can be one of:

    index         Builds a kallisto index 
    quant         Runs the quantification algorithm 
    pseudo        Runs the pseudoalignment step 
    h5dump        Converts HDF5-formatted results to plaintext
    inspect       Inspects and gives information about an index
    version       Prints version information
    cite          Prints citation information

Running kallisto <CMD> without arguments prints usage information for <CMD>

Building an index

kallisto quantifies read files directly without the need for read alignment, but it does perform a procedure called pseudoalignment. Pseudoalignment requires processing a transcriptome file to create a “transcriptome index”. To begin, first change directories to where the test files distributed with the kallisto executable are located:

cd kallisto/tests

Next, build an index by typing:

kallisto index -i transcripts.idx transcripts.fasta.gz

Single-cell RNA-Seq quantification

The analysis of single-cell RNA-Seq data involves a series of steps that include: (1) pre-processing of reads to associate them with their cells of origin, (2) possible collapsing of reads according to unique molecular identifiers (UMIs), (3) generation of feature counts from the reads to generate a feature-cell matrix and (4) analysis of the matrix to compare and contrast cells.

Some of these steps are procedurally straightforward but computationally demanding. Others are statistical in nature and require technology-specific models. We have recently introduced a format for single-cell RNA-seq data called the BUS (Barcode, UMI, Set) format that facilitates the development of modular workflows to address the complexities of these challenges. It is described in P. Melsted, V. Ntranos and L. Pachter, “The Barcode, UMI, Set format and BUStools”, bioRxiv 2018.

BUS files can be generated from single-cell RNA-seq data produced with any technology and can, in principle, be produced by any pseudoalignment software. We have implemented a command in kallisto version 0.45.0 called “bus” that allows for the efficient generation of BUS format from any single-cell RNA-seq technology. Tools for manipulating BUS files are provided as part of the bustools package. Finally, R and Python notebooks for processing and analyzing BUS files simplify the process of developing and optimizing analysis workflows.
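To make steps (2) and (3) of the list above concrete, here is a toy Python sketch of UMI collapsing followed by feature counting. It is not how kallisto or bustools work internally: the records, barcodes, UMIs and gene names are invented for illustration, and real BUS records carry an equivalence-class set rather than a gene name.

```python
from collections import defaultdict

# Hypothetical pre-processed reads: (cell barcode, UMI, feature) triples.
# All values are made up; real BUS records store a set, not a gene name.
records = [
    ("AAAC", "TTGA", "geneA"),
    ("AAAC", "TTGA", "geneA"),  # same molecule seen twice (duplicate read)
    ("AAAC", "CCGT", "geneA"),
    ("AAAC", "GGAT", "geneB"),
    ("GGTC", "TTGA", "geneA"),
]


def count_matrix(records):
    """Collapse duplicate (barcode, UMI, feature) triples, then count per cell."""
    unique_molecules = set(records)  # step 2: collapse reads sharing a UMI
    counts = defaultdict(lambda: defaultdict(int))
    for barcode, _umi, feature in unique_molecules:
        counts[barcode][feature] += 1  # step 3: feature counts per cell
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {cell: dict(features) for cell, features in counts.items()}


matrix = count_matrix(records)
for cell in sorted(matrix):
    print(cell, matrix[cell])
```

The nested dictionary plays the role of a sparse feature-cell matrix; a real workflow would use the bustools package and the notebooks mentioned above.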

For detailed tutorials, see the kallisto bus workflow website.

Bulk RNA-seq quantification

Now you can quantify abundances of the transcripts using the two read files reads_1.fastq.gz and reads_2.fastq.gz (the .gz suffix means the read files have been gzipped; kallisto can read in either plain-text or gzipped read files). To quantify abundances type:

kallisto quant -i transcripts.idx -o output -b 100 reads_1.fastq.gz reads_2.fastq.gz

You can also call kallisto with

kallisto quant -i transcripts.idx -o output -b 100 <(gzcat reads_1.fastq.gz) <(gzcat reads_2.fastq.gz)

On Linux, replace gzcat with zcat, or use any other program that writes the FASTQ to stdout. This uses an additional core to decompress the FASTQ files and speeds up the program by 10–15%.

Single end reads (bulk)

If your reads are single-end only, you can run kallisto by specifying the --single flag:

kallisto quant -i transcripts.idx -o output -b 100 --single -l 180 -s 20 reads_1.fastq.gz

In single-end mode you must also supply the mean (-l) and standard deviation (-s) of the fragment length (not the read length).
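When kallisto is driven from a script, the single-end requirement is easy to forget. The following Python sketch assembles the quant command line and refuses to build a single-end command without a fragment length and standard deviation. The function name build_quant_command and its parameters are hypothetical; only the flags already shown in this tutorial are used, and the check that paired-end input comes in pairs of files is an assumption of this sketch.

```python
import shlex


def build_quant_command(index, output_dir, reads, bootstraps=100,
                        single=False, fragment_length=None, fragment_sd=None):
    """Return the argument list for a `kallisto quant` call (hypothetical helper)."""
    cmd = ["kallisto", "quant", "-i", index, "-o", output_dir,
           "-b", str(bootstraps)]
    if single:
        # Single-end runs need the fragment length distribution from the user.
        if fragment_length is None or fragment_sd is None:
            raise ValueError("--single requires -l (mean fragment length) "
                             "and -s (fragment length standard deviation)")
        cmd += ["--single", "-l", str(fragment_length), "-s", str(fragment_sd)]
    elif len(reads) % 2 != 0:
        # Assumption of this sketch: paired-end files are given in pairs.
        raise ValueError("paired-end mode expects read files in pairs")
    cmd += list(reads)
    return cmd


single_cmd = build_quant_command("transcripts.idx", "output",
                                 ["reads_1.fastq.gz"], single=True,
                                 fragment_length=180, fragment_sd=20)
print(shlex.join(single_cmd))
```

The returned list can be passed to subprocess.run on a machine where kallisto is installed; the sketch itself only prints the command.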

Results (bulk)

The results of a kallisto run are placed in the specified output directory (the -o option), and therefore the test results should be located in the subdirectory “output”. The contents of the directory should look like this:

total 568
-rw-r--r--  1 username  staff  282480 May  3 10:10 abundance.h5
-rw-r--r--  1 username  staff     589 May  3 10:10 abundance.tsv
-rw-r--r--  1 username  staff     227 May  3 10:10 run_info.json

The results of the main quantification, i.e. the abundance estimates kallisto computed from the data, are in the abundance.tsv file. Abundances are reported in “estimated counts” (est_counts) and in Transcripts Per Million (TPM). The abundance.tsv file you get should look like this:

target_id	length	eff_length	est_counts	tpm
ENST00000513300.5	1924	1746.98	102.328	11129.2
ENST00000282507.7	2355	2177.98	1592.02	138884
ENST00000504685.5	1476	1298.98	68.6528	10041.8
ENST00000243108.4	1733	1555.98	343.499	41944.9
ENST00000303450.4	1516	1338.98	664	94221.8
ENST00000243082.4	2039	1861.98	55	5612.36
ENST00000303406.4	1524	1346.98	304.189	42908.2
ENST00000303460.4	1936	1758.98	47	5076.85
ENST00000243056.4	2423	2245.98	42	3553.05
ENST00000312492.2	1805	1627.98	228	26609.9
ENST00000040584.5	1889	1711.98	4295	476675
ENST00000430889.2	1666	1488.98	623.628	79578.2
ENST00000394331.3	2943	2765.98	85.6842	5885.85
ENST00000243103.3	3335	3157.98	962	57879.3

The file is tab-delimited so that it can be easily parsed. The output can also be analyzed with the sleuth tool.
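As an illustration of parsing the file, the Python sketch below reads the table shown above and recomputes TPM from est_counts and eff_length using the usual definition (each transcript's count per effective base, scaled so that the values sum to one million). The function names are hypothetical, and the formula is an assumption of this sketch that happens to agree with the listed values; it is not taken from the kallisto source.

```python
import csv
import io

# The abundance.tsv listing from this tutorial, written with spaces here.
RAW = """\
target_id length eff_length est_counts tpm
ENST00000513300.5 1924 1746.98 102.328 11129.2
ENST00000282507.7 2355 2177.98 1592.02 138884
ENST00000504685.5 1476 1298.98 68.6528 10041.8
ENST00000243108.4 1733 1555.98 343.499 41944.9
ENST00000303450.4 1516 1338.98 664 94221.8
ENST00000243082.4 2039 1861.98 55 5612.36
ENST00000303406.4 1524 1346.98 304.189 42908.2
ENST00000303460.4 1936 1758.98 47 5076.85
ENST00000243056.4 2423 2245.98 42 3553.05
ENST00000312492.2 1805 1627.98 228 26609.9
ENST00000040584.5 1889 1711.98 4295 476675
ENST00000430889.2 1666 1488.98 623.628 79578.2
ENST00000394331.3 2943 2765.98 85.6842 5885.85
ENST00000243103.3 3335 3157.98 962 57879.3
"""
# Rebuild the tab-delimited layout that kallisto writes to disk.
TSV = "\n".join("\t".join(line.split()) for line in RAW.splitlines())


def read_abundance(handle):
    """Parse an abundance.tsv-style table into a list of typed dicts."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        rows.append({
            "target_id": row["target_id"],
            "length": int(row["length"]),
            "eff_length": float(row["eff_length"]),
            "est_counts": float(row["est_counts"]),
            "tpm": float(row["tpm"]),
        })
    return rows


def recompute_tpm(rows):
    """TPM_i = 1e6 * (est_counts_i / eff_length_i) / sum_j(est_counts_j / eff_length_j)."""
    rates = [r["est_counts"] / r["eff_length"] for r in rows]
    total = sum(rates)
    return [1e6 * rate / total for rate in rates]


rows = read_abundance(io.StringIO(TSV))
recomputed = recompute_tpm(rows)
top = max(rows, key=lambda r: r["tpm"])
print("most abundant target:", top["target_id"])
for row, tpm in zip(rows, recomputed):
    print(row["target_id"], row["tpm"], round(tpm, 1))
```

To read a real result, replace the in-memory table with open("output/abundance.tsv", newline="").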

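The run_info.json file contains a summary of the run, including data on the number of targets used for quantification, the number of bootstraps performed, the version of the program used and how it was called. You should see something like this (the example below comes from a run with 30 bootstraps and different index and output names, so n_bootstraps, start_time and call will differ from yours):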

{
	"n_targets": 14,
	"n_bootstraps": 30,
	"n_processed": 10000,
	"n_pseudoaligned": 9413,
	"n_unique": 7174,
	"p_pseudoaligned": 94.1,
	"p_unique": 71.7,
	"kallisto_version": "0.44.0",
	"index_version": 10,
	"start_time": "Tue Jan 30 09:34:31 2018",
	"call": "kallisto quant -i transcripts.kidx -b 30 -o kallisto_out reads_1.fastq.gz reads_2.fastq.gz"
}
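The summary is plain JSON, so it is straightforward to check from a script. The Python sketch below loads the example above and recomputes the two percentages from the raw counts. That p_pseudoaligned and p_unique are percentages of n_processed rounded to one decimal is an inference from these numbers, not a documented guarantee, and the helper name percent is hypothetical.

```python
import json

# The run_info.json example from this tutorial.
RUN_INFO = """
{
    "n_targets": 14,
    "n_bootstraps": 30,
    "n_processed": 10000,
    "n_pseudoaligned": 9413,
    "n_unique": 7174,
    "p_pseudoaligned": 94.1,
    "p_unique": 71.7,
    "kallisto_version": "0.44.0",
    "index_version": 10,
    "start_time": "Tue Jan 30 09:34:31 2018",
    "call": "kallisto quant -i transcripts.kidx -b 30 -o kallisto_out reads_1.fastq.gz reads_2.fastq.gz"
}
"""


def percent(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100.0 * part / whole, 1)


info = json.loads(RUN_INFO)
p_pseudoaligned = percent(info["n_pseudoaligned"], info["n_processed"])
p_unique = percent(info["n_unique"], info["n_processed"])

print("pseudoaligned:", p_pseudoaligned, "reported:", info["p_pseudoaligned"])
print("unique:", p_unique, "reported:", info["p_unique"])
```

For a real run, load the file with json.load(open("output/run_info.json")) instead of the embedded string.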

The h5 file contains the main quantification together with the bootstraps in HDF5 format. The reason for this binary format is to compress the large output of runs with many bootstraps. The h5dump command in kallisto can be used to convert the file to plain text.

To visualize the pseudoalignments we need to run kallisto with the --genomebam option. To do this we need two additional files: a GTF file, which describes where the transcripts lie in the genome, and a text file containing the length of each chromosome. These files are part of the test directory. To run kallisto, we type:

kallisto quant -i transcripts.kidx -b 30 -o kallisto_out --genomebam --gtf transcripts.gtf.gz --chromosomes chrom.txt reads_1.fastq.gz reads_2.fastq.gz

This is the same run as above, but now we supply the GTF file with --gtf transcripts.gtf.gz and the chromosome length file with --chromosomes chrom.txt. If you built the index as transcripts.idx earlier in this tutorial, pass that file name to -i instead. For a larger transcriptome we recommend downloading the GTF file from the same release and data source as the FASTA file used to construct the index. The output now contains two additional files, pseudoalignments.bam and pseudoalignments.bam.bai. The files can be viewed and processed using Samtools or a genome browser such as IGV. There is no need to sort or index the BAM file since kallisto does that directly. For Windows users we recommend using the IGV browser, since there are no native Samtools releases (except via the Windows Subsystem for Linux on Windows 10).

That’s it.

You can now run kallisto on your dataset of choice. For convenience, we have placed some transcriptome FASTA files for human and model organisms here. Publicly available RNA-Seq data can be found on the Sequence Read Archive (a convenient mirror and interface to the SRA is available here). While kallisto cannot process .sra files, such files can be converted to FASTQ with the fastq-dump tool, which is part of the SRA Toolkit.