REF-EVAL-ESTIMATE-TRUE-ASSEMBLY: A program to estimate the “true” assembly of a set of reads, relative to a set of reference sequences

Overview

This program constructs an estimate of the “true” assembly of a set of reads, relative to a set of reference sequences, based on alignment information produced by RSEM.

As defined by DETONATE [1], the “true” assembly of a set of reads, relative to the true transcript sequences that the reads were generated from, is the set of contiguous subsequences of the transcript sequences that are covered by reads, with the reads overlapping by at least --min-overlap bases, when each read is aligned to its true location of origin.

In practice, we do not know the reads' true locations of origin; in fact, we do not even know (precisely) the true transcript sequences that the reads were generated from. Instead, we start with a set of reference sequences (e.g., an Ensembl reference), and we align each read to these reference sequences using RSEM/Bowtie. The current program chooses a subset of these alignments, according to a policy specified by --alignment-policy and --min-alignment-prob, and then builds our best guess as to the “true” assembly based on these alignments.

[1] Bo Li*, Nathanael Fillmore*, Yongsheng Bai, Mike Collins, James A. Thompson, Ron Stewart, Colin N. Dewey. Evaluation of de novo transcriptome assemblies from RNA-Seq data.

Example usage

First, use a recent version of RSEM to quantify the expression of the full-length transcripts relative to the reads. For example:

$ rsem-prepare-reference --gtf mm9.ensembl63.filtered.gtf mm9.fa ref
$ rsem-calculate-expression --num-threads 24 reads.fq ref expr

Second, use this program to estimate the “true” assembly:

$ ./ref-eval-estimate-true-assembly --reference ref --expression expr --assembly cc

This will output a file, cc_0.fa, that contains the “true” assembly.

General options

-? [ --help ]

Display this information.

Options that specify input and output

--reference arg

The prefix of the reference built by rsem-prepare-reference. Required.

--expression arg

The prefix of the expression built by rsem-calculate-expression. Required.

--paired-end

If you have paired-end data, and you want to estimate the “true” scaffolded assembly, then include the --paired-end flag. In this case, rsem-calculate-expression needs to have been run with the --paired-end flag. (However, even if rsem-calculate-expression was run with the --paired-end flag, you can omit it here in order to generate an unscaffolded assembly. In this case, each mate will be treated as an independent read.)

--assembly arg

A prefix to write the “true” assembly or sequence of assemblies to. The suffix _x.fa will be appended to this prefix, where x is the minimum overlap size. Required.

Options that change the output

--min-overlap arg

Either:

Or:

Default: 0.

--min-alignment-prob arg

A number between 0 and 1 (inclusive). Any alignment (of a read to a reference transcript) with posterior probability, as calculated by RSEM, strictly less than this value will be discarded. Noise reads, with posterior probability exactly 0, are always discarded. Default: 0.

--alignment-policy arg

The policy used to choose which alignment(s) of each read to use in constructing the “true” assembly. Options:

This policy is applied after the thresholding implied by --min-alignment-prob. For example, if “--min-alignment-prob=0.10 --alignment-policy=sample” is given, then (first) all alignments with posterior probability less than 0.10 will be discarded, and (second), for each read, an alignment will be sampled from among the remaining alignments, with the posterior distribution renormalized as appropriate. As another example, if “--min-alignment-prob=0.90 --alignment-policy=all” is given, then all alignments with posterior probability at least 0.90 will be used.

Default: sample.