Getting Started Explained

0. Download and install the software

Open up your terminal and run the following commands. These are the expected outputs:

$ kallisto
kallisto 0.45.1

Usage: kallisto <CMD> [arguments] ..

Where <CMD> can be one of:

    index         Builds a kallisto index
    quant         Runs the quantification algorithm
    bus           Generate BUS files for single-cell data
    pseudo        Runs the pseudoalignment step
    merge         Merges several batch runs
    h5dump        Converts HDF5-formatted results to plaintext
    inspect       Inspects and gives information about an index
    version       Prints version information
    cite          Prints citation information

Running kallisto <CMD> without arguments prints usage information for <CMD>

$ bustools
Usage: bustools <CMD> [arguments] ..

Where <CMD> can be one of:

sort            Sort bus file by barcodes and UMI
text            Output as tab separated text file
merge           Merge bus files from same experiment
correct         Error correct bus files
count           Generate count matrices from bus file
capture         Capture reads mapping to a transcript capture list

Running bustools <CMD> without arguments prints usage information for <CMD>

If you don’t see this then you have either the programs are not installed the programs correctly, or you have not added the location of the binaries to your $PATH variable in your ~/.bashrc file in order to execute from any location. See here if you need to make a local build without root.

1. Download a reference, whitelist, gene map utility, and dataset

1a. Reference

Transcriptome FASTA

The kallisto | bustools workflow uses a standard ensembl transcriptome fasta file reference to build an index. This index makes it easy (and fast!) to pseudoalign RNA sequencing reads.

Navigate to the ensembl website http://uswest.ensembl.org/ and select your species of interest. We will be using the Mouse (Mus Musculus) reference.

Once on http://uswest.ensembl.org/Mus_musculus/Info/Index select Download Fasta under the Gene annotation section. This will take you to ftp://ftp.ensembl.org/pub/release-96/fasta/mus_musculus/. Select cdna. Right-click on Mus_musculus.GRCm38.cdna.all.fa.gz and select Copy Link Address.

On your terminal make a folder where you want to download your index and data.

$ mkdir kallisto_bustools_getting_started
$ cd kallisto_bustools_getting_started

Then download the fasta reference using the link is one that you copied.

$ wget ftp://ftp.ensembl.org/pub/release-96/fasta/mus_musculus/cdna/Mus_musculus.GRCm38.cdna.all.fa.gz

Genome annotation GTF

Next download the GTF file. Do the exact same as above but instead of clicking Download Fasta click Download GTF under the Gene annotation section. Right-click on Mus_musculus.GRCm38.96.gtf select Copy Link Address and download this file on your terminal.

$ wget ftp://ftp.ensembl.org/pub/release-96/gtf/mus_musculus/Mus_musculus.GRCm38.96.gtf.gz

1b. Barcode whitelist

Navigate to https://github.com/pachterlab/kallistobuspaper_2019/releases/tag/getting_started right-click on 10xv2_whitelist.txt selectCopy Link Address and download this file on your terminal.

$ wget https://github.com/pachterlab/kallistobuspaper_2019/releases/download/getting_started/10xv2_whitelist.txt

1c. Gene map utility

Navigate to https://github.com/pachterlab/kallistobuspaper_2019/releases/tag/getting_started right-click on t2g.py selectCopy Link Address and download this file on your terminal.

$ wget https://github.com/pachterlab/kallistobuspaper_2019/releases/download/getting_started/t2g.py

and make the script executable

$ chmod +x t2g.py

1d. Dataset

Navigate to https://github.com/pachterlab/kallistobuspaper_2019/releases/tag/getting_started, right-click on SRR8599150_S1_L001_R1_001.fastq.gz selectCopy Link Address and download this file on your terminal, and do the same for SRR8599150_S1_L001_R2_001.fastq.gz.

This is a single cell experiment on mouse retinal cells. You can find the original data (GSM3612831) deposited here: https://www.ncbi.nlm.nih.gov/sra/?term=GSM3612831

$ wget https://github.com/pachterlab/kallistobuspaper_2019/releases/download/getting_started/SRR8599150_S1_L001_R1_001.fastq.gz
$ wget https://github.com/pachterlab/kallistobuspaper_2019/releases/download/getting_started/SRR8599150_S1_L001_R2_001.fastq.gz

1e. Results

After downloading the transcriptome reference and GTF file from ensembl, the barcode whitelist and the data you should have the following file structure:

kallisto_bustools_getting_started/
├── Mus_musculus.GRCm38.96.gtf.gz
├── Mus_musculus.GRCm38.cdna.all.fa.gz
├── SRR8599150_S1_L001_I1_001.fastq.gz
├── SRR8599150_S1_L001_R1_001.fastq.gz
├── SRR8599150_S1_L001_R2_001.fastq.gz
├── t2g.py
└── 10xv2_whitelist.txt

0 directories, 6 files

2. Build the index and gene map

2a. Index

The index only needs to be built once for each species transcriptome for a given k-mer size (the default k-mer 31 is suggested). You have the option of downloading the index or building it yourself. Prebuilt indices constructed from Ensembl reference transcriptomes can be download from the kallisto transcriptome indices site. Building indices with kallisto index will often be faster in practice than downloading index files. For example, the kallisto index for the mouse transcriptome takes between 5–10 minutes to build on a standard desktop or laptop. Transcriptome fasta files for model organisms can be downloaded from the Ensembl database. We recommend using cDNA fasta, specifically the *.cdna.all.fa.gz files. kallisto can build indices directly from gzipped files.

If you wish to download the index then navigate to https://github.com/pachterlab/kallistobuspaper_2019/releases/tag/getting_started right-click on Mus_musculus.GRCm38.cdna.all.idx.gz selectCopy Link Address and download this file on your terminal.

$ wget https://github.com/pachterlab/kallistobuspaper_2019/releases/download/getting_started/Mus_musculus.GRCm38.cdna.all.idx.gz

And the decompress (unzip) the index we just downloaded:

$ gunzip Mus_musculus.GRCm38.cdna.all.idx.gz

Building the index yourself

We first need to decompress (unzip) the reference fasta file we downloaded.

$ gunzip Mus_musculus.GRCm38.cdna.all.fa.gz

Now we can build the kallisto index. We recommend naming the index Mus_musculus.GRCm38.cdna.all.idx and using a kmer size of 31. Note that a kmer size must always be odd and defaults to 31.

$ kallisto index -i Mus_musculus.GRCm38.cdna.all.idx -k 31 Mus_musculus.GRCm38.cdna.all.fa

[build] loading fasta file Mus_musculus.GRCm38.cdna.all.fa
[build] k-mer length: 31
[build] warning: clipped off poly-A tail (longer than 10)
        from 641 target sequences
[build] warning: replaced 3 non-ACGUT characters in the input sequence
        with pseudorandom nucleotides
[build] counting k-mers ... done.
[build] building target de Bruijn graph ...  done
[build] creating equivalence classes ...  done
[build] target de Bruijn graph has 734746 contigs and contains 100614952 k-mers

2b. Gene map

As above, decompress (unzip) the genome GTF file we downloaded.

$ gunzip Mus_musculus.GRCm38.96.gtf.gz

Next make transcript to gene map using t2g.py script to parse the mouse GTF file

$ ./t2g.py --use_version < Mus_musculus.GRCm38.96.gtf > transcripts_to_genes.txt

2c. Results

After building the kallisto index from the reference fasta file and parting the GTF files to create the transcripts to genes map you should have the following files:

kallisto_bustools_getting_started/
├── Mus_musculus.GRCm38.96.gtf
├── Mus_musculus.GRCm38.cdna.all.fa
├── Mus_musculus.GRCm38.cdna.all.idx
├── SRR8599150_S1_L001_I1_001.fastq.gz
├── SRR8599150_S1_L001_R1_001.fastq.gz
├── SRR8599150_S1_L001_R2_001.fastq.gz
├── t2g.py
├── transcripts_to_genes.txt
└── 10xv2_whitelist.txt

0 directories, 8 files

3. Pseudoalign the reads with `kallisto bus`

The 10x Chromium v2 technology was used to generate the data we downloaded above. The technology dictates the Barcode/UMI structure and the whitelist used for barcode error correction. We have to specify the technology in the kallisto bus command and the whitelist in the bustools command.

3a. Run `kallisto bus` to pseudoalign the reads

This will create a BUS file which will be located in bus_output/.

$ kallisto bus -i Mus_musculus.GRCm38.cdna.all.idx -o bus_output/ -x 10xv2 -t 10 SRR8599150_S1_L001_R1_001.fastq.gz SRR8599150_S1_L001_R2_001.fastq.gz

[index] k-mer length: 31
[index] number of targets: 118,489
[index] number of k-mers: 100,614,952
[index] number of equivalence classes: 433,624
[quant] will process sample 1: SRR8599150_S1_L001_R1_001.fastq.gz
                               SRR8599150_S1_L001_R2_001.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 8,860,361 reads, 3,431,849 reads pseudoaligned

Note: For running kallisto bus you need an even number of fastq files and the order of the .fastq files is important, R1 comes first then R2 goes second. Please see the Tutorials page if you want to know how to process more than one set of fastq files in one go.

3b. Results

Pseudoalign the single-cell RNA-seq reads using kallisto bus. You should have the following files

kallisto_bustools_getting_started/
├── bus_output
│   ├── matrix.ec
│   ├── output.bus
│   ├── run_info.json
│   └── transcripts.txt
├── Mus_musculus.GRCm38.96.gtf
├── Mus_musculus.GRCm38.cdna.all.fa
├── Mus_musculus.GRCm38.cdna.all.idx
├── SRR8599150_S1_L001_I1_001.fastq.gz
├── SRR8599150_S1_L001_R1_001.fastq.gz
├── SRR8599150_S1_L001_R2_001.fastq.gz
├── t2g.txt
├── transcripts_to_genes.txt
└── 10xv2_whitelist.txt

1 directories, 12 files

4. Process the BUS file with `bustools`

bustools allows us to go from a BUS file, to a equivalence-class-UMI count matrix or a gene-UMI count matrix that can be loaded directly into python for analysis. We will use bustools to do the following:

Correct the barcodes using bustools correct: fix the barcodes that are within one hamming distance of the barcodes in the whitelist using whitelist.txt,
Sort the busfile using bustools sort: organize the busfile by barcode, UMI, set, and multiplicity, and
Count records in the BUS with bustools count: generate the UMI count matrix using transcripts_to_genes.txt.

First navigate to your bus output directory. Your folder should contain the following items:

$ cd bus_output/
$ ls -1
matrix.ec
output.bus
run_info.json
transcripts.txt

4a. Correct the barcodes in the busfile with `bustools correct` and the `whitelist.txt`

This makes a corrected bus file output.correct.bus

$ bustools correct -w ../10xv2_whitelist.txt -o output.correct.bus output.bus
Found 737280 barcodes in the whitelist
Number of hamming dist 1 barcodes = 20550336
Processed 3431849 bus records
In whitelist = 3281671
Corrected = 36927
Uncorrected = 113251

4b. Sort the busfile with `bustools sort`

This makes a sorted bus file output.correct.sort.bus. This step cannot be skipped. Sorting takes BUS records that are the same and rearranges them into one BUS record with multiplicity column indicating how many times those records were seen. Note that this is different than UMI collapsing and serves the purpose of making the bus file smaller and making UMI counting more efficient.

$ bustools sort -t 4 -o output.correct.sort.bus output.correct.bus
Read in 3318598 number of busrecords

4c. Count the UMIs in the busfile with `bustools count` and the `transcripts_to_genes.txt`

For organization first make the following two folders:

$ mkdir eqcount
$ mkdir genecount

To make the Transcript Compatibility Count (TCC) Matrix we want the default output of bustools count

$ bustools count -o eqcount/tcc -g ../transcripts_to_genes.txt -e matrix.ec -t transcripts.txt output.correct.sort.bus
bad counts = 0, rescued  =38627, compacted = 65899

To make the Gene Count Matrix we need to give it the --genecounts option

$ bustools count -o genecount/gene -g ../transcripts_to_genes.txt -e matrix.ec -t transcripts.txt --genecounts output.correct.sort.bus
bad counts = 0, rescued  =0, compacted = 0

Bustools will output the matrices in .mtx format, and gene names in a file ending as .genes.txt and barcodes in a file ending with .barcodes.txt. These can then be loaded into python or R for further analysis.

Results

After using bustools to correct, sort and count the entries in the BUS file, your files should be structured like this:

kallisto_bustools_getting_started/
├── bus_output
│   ├── eqclass
│   │   ├── tcc.barcodes.txt
│   │   ├── tcc.ec.txt
│   │   └── tcc.mtx
│   ├── genecount
│   │   ├── gene.barcodes.txt
│   │   ├── gene.genes.txt
│   │   └── gene.mtx
│   ├── matrix.ec
│   ├── output.bus
│   ├── output.correct.bus
│   ├── output.correct.sort.bus
│   ├── run_info.json
│   └── transcripts.txt
├── Mus_musculus.GRCm38.96.gtf
├── Mus_musculus.GRCm38.cdna.all.fa
├── Mus_musculus.GRCm38.cdna.all.idx
├── SRR8599150_S1_L001_I1_001.fastq.gz
├── SRR8599150_S1_L001_R1_001.fastq.gz
├── SRR8599150_S1_L001_R2_001.fastq.gz
├── t2g.py
├── transcripts_to_genes.txt
└── 10xv2_whitelist.txt

3 directories, 20 files

5. Load and analyze the matrices with python/R

Analysis of matrices with python: python-notebook

Analysis of matrices with R: R-notebook

Other useful tutorial notebooks on the BUStools repository include the 10x_hgmm_100 notebook which details the analysis of a small, and therefore easily downloadable dataset. Links to other tutorial notebooks are posted on the BUStools python notebook website and the BUStools R notebook website.