REF-EVAL computes a number of reference-based scores. These scores measure the quality of a transcriptome assembly relative to a collection of reference sequences. For information about how to run REF-EVAL, see “Usage” and the sections following it below. For information about the score definitions, see “Score definitions” and the sections following it below.
As an optional first step, estimate the “true” assembly, using REF-EVAL-ESTIMATE-TRUE-ASSEMBLY. Alternatively, you can use the full-length reference transcript sequences directly as a reference.
From now on, we will call the estimated “true” assembly or the collection of full-length reference sequences (whichever you choose to use) the reference. Let's assume that the assembly of interest is in A.fa, and the reference is in B.fa.
If you want to compute alignment-based scores (see --scores below for more info), align the assembly to the reference and vice versa using Blat. We recommend fairly unrestrictive settings, in order to generate many candidate alignments.
$ blat -minIdentity=80 B.fa A.fa A_to_B.psl $ blat -minIdentity=80 A.fa B.fa B_to_A.psl
If you want to compute weighted variants of scores, use RSEM to compute the expression of the assembly and reference relative to the given reads. Let's assume that the reads are in reads.fq.
$ rsem-prepare-reference --no-polyA A.fa A_ref $ rsem-prepare-reference --no-polyA B.fa B_ref $ rsem-calculate-expression -p 24 --no-bam-output reads.fq A_ref A_expr $ rsem-calculate-expression -p 24 --no-bam-output reads.fq B_ref B_expr
Finally, run REF-EVAL. To compute everything, run:
$ ./ref-eval --scores=nucl,pair,contig,kmer,kc \ --weighted=both \ --A-seqs A.fa \ --B-seqs B.fa \ --A-expr A_expr.isoforms.results \ --B-expr B_expr.isoforms.results \ --A-to-B A_to_B.psl \ --B-to-A B_to_A.psl \ --num-reads 5000000 \ --readlen 76 \ --kmerlen 76 \ | tee scores.txt
To only compute the kmer compression score (and its dependencies), run:
$ ./ref-eval --scores=kc \ --A-seqs A.fa \ --B-seqs B.fa \ --B-expr B_expr.isoforms.results \ --num-reads 5000000 \ --readlen 76 \ --kmerlen 76 \ | tee scores.txt
To only compute the unweighted reference-based scores, run:
$ ./ref-eval --scores=nucl,pair,contig \ --weighted=no \ --A-seqs A.fa \ --B-seqs B.fa \ --A-to-B A_to_B.psl \ --B-to-A B_to_A.psl \ | tee scores.txt
To only compute the scores discussed in the DETONATE paper, run:
$ ./ref-eval --paper \ --A-seqs A.fa \ --B-seqs B.fa \ --B-expr B_expr.isoforms.results \ --A-to-B A_to_B.psl \ --B-to-A B_to_A.psl \ --num-reads 5000000 \ --readlen 76 \ --kmerlen 76 \ | tee scores.txt
The scores will be written to standard output (hence, above, to scores.txt). Progress information is written to standard error. Further details about the arguments to REF-EVAL are described below. Further details about the scores themselves are given under “Score definitions” below.
The groups of scores to compute, separated by commas (e.g., --scores=nucl,contig,kc). It is more efficient to compute all the scores you are interested in using one invocation of REF-EVAL instead of using multiple invocations that each compute one score. The available score groups are as follows:
Alignment-based score groups:
Alignment-free score groups:
Required unless --paper is given.
A string indicating whether to compute weighted or unweighted variants of scores, or both (e.g., --weighted=yes):
In weighted variants, the expression levels (TPM) of the assembly and reference sequences are taken into account, and hence need to be specified using --A-expr and --B-expr. Unweighted variants are equivalent to weighted variants with uniform expression.
The distinction between weighted and unweighted variants doesn't make sense for the KC score, so this option is ignored by the KC score.
Required unless --paper or only --score=kc is given.
As an alternative to the above, if you are only interested in computing the scores described in the main text of our paper [1], you can pass the --paper flag instead of the --scores and --weighted options. In that case, the following scores will be computed:
Alignment-based scores:
Alignment-free score groups:
For obvious reasons, the --scores and --weighted options are incompatible with this flag.
[1] Bo Li*, Nathanael Fillmore*, Yongsheng Bai, Mike Collins, James A. Thompson, Ron Stewart, Colin N. Dewey. Evaluation of de novo transcriptome assemblies from RNA-Seq data.
The assembly sequences, in FASTA format. Required.
The reference sequences, in FASTA format. Required.
The assembly expression, for use in weighted scores, as produced by RSEM in a file called *.isoforms.results. Required for weighted variants of scores.
The reference expression, for use in weighted scores, as produced by RSEM in a file called *.isoforms.results. Required for weighted variants of scores.
The alignments of the assembly to the reference. The file format is specified by --alignment-type. Required for alignment-based scores.
The alignments of the reference to the assembly. The file format is specified by --alignment-type. Required for alignment-based scores.
The type of alignments used, either blast or psl. Default: psl. Currently BLAST support is experimental, not well tested, and not recommended.
If this flag is present, it is assumed that all the assembly and reference sequences have the same orientation. Thus, alignments or kmer matches that are to the reverse strand are ignored.
This option only applies to the KC scores. The read length of the reads used to build the assembly, used in the denominator of the ICR. Required for KC scores.
This option only applies to the KC scores. The number of reads used to build the assembly, used in the denominator of the ICR. Required for KC scores.
This option only applies to the kmer and KC scores. This is the length (“k”) of the kmers used in the definition of the KC and kmer scores. Required for KC and kmer scores.
This option only applies to contig scores. Alignments with fraction identity less than this threshold are ignored. The fraction identity of an alignment is min(x/y, x/z), where
Default: 0.99.
This option only applies to contig scores. Alignments with fraction indel greater than this threshold are ignored. For psl alignments, the fraction indel of an alignment is $\max(w/y, x/z)$, where
For blast alignments, the fraction indel of an alignment is $\max(x/y, x/z)$, where
Default: 0.01.
This option only applies to nucleotide and pair scores. Alignment segments that contain fewer than this number of bases will be discarded. Default: 100. In the DETONATE paper, this was set to the read length.
The type of hash table to use, either “sparse” or “dense”. This is only relevant for KC and kmer scores. The sparse table is slower but uses less memory. The dense table is faster but uses more memory. Default: “sparse”.
The numeric type to use to store values in the hash table, either “double” or “float”. This is only relevant for KC and kmer scores. Using single-precision floating point numbers (“float”) requires less memory than using double-precision (“double”), but may also result in more numerical error. Note that we use double-precision numbers throughout our calculations even if single-precision numbers are stored in the table, so the additional error should be minimal. Default: “double”.
This is only relevant for KC and kmer scores. When the hash table is created, its initial capacity is set as the total worst-case number of possible kmers in the assembly and reference, based on each sequence's length, divided by the fudge factor. The default, 2.0, is often reasonable because (1) most kmers should be shared by the assembly and the reference, and (2) many kmers will be repeated several times. However, if you have a lot of memory or a really bad assembly, you could try a smaller number. Default: 2.0.
If given, the prefix for additional output that provides details about the REF-EVAL scores; if not given, no such output is produced. Currently, the only such output is as follows.
Display this information.
In the next few sections, we define the scores computed by REF-EVAL. Throughout, $A$ denotes the assembly, and $B$ denotes the reference. (As discussed under “Usage” above, the reference can be either an estimate of the “true” assembly or a collection of full-length reference transcripts.) Both $A$ and $B$ are thought of as sets of sequences. $A$ is a set of contigs, and $B$ is a set of reference sequences.
The contig recall is defined as follows:
The contig precision is defined as follows: Interchange the assembly and the reference, and compute the contig recall.
The contig F1 is the harmonic mean of the precision and recall.
The nucleotide recall is defined as follows:
The actual implementation uses a more complicated and efficient algorithm than the one above.
If --weighted=yes, then (i) “the number of identical bases” above is replaced by “the number of identical bases, times $\tau(b)$”, in the definition of the priority and the numer, and (ii) “total number of bases in the reference $B$” is replaced by “$\sum_{b \in B} \tau(b) length(b)$”. In other words, each base (of a reference sequence), throughout the computation, is weighted by the expression level of its parent sequence.
The nucleotide precision is defined as follows: Interchange the assembly and the reference, and compute the nucleotide recall.
The nucleotide F1 is the harmonic mean of the precision and recall.
Alignment subtraction is defined as follows.
A couple of examples of the above are as follows. The comments in test_re_matched.cpp contain even more examples.
As a first example, consider alignments of an assembly $A = \{a_0, a_1, a_2\}$ to a reference $B = \{b_0, b_1\}$. In the pictures below, each alignment segment is indicated by a pair diagonal or vertical lines, with its name (initially $x$, $y$, $z$, $w$) in between the two lines.
b0 b1 B ----------------- ------------- / \ / \ / | / | / \ / \ / |/ | / x / y \ z / w | / / \ / \ /| | A --------- --------- --------- a0 a1 a2
Assume:
Step 1: Process alignment $x$, resulting in
b0 b1 B ------------------ -------------- / /\ \ / | / | / / \ \ / |/ | / x / \*\ z / w | * = y - x / / / \ /| | A --------- --------- --------- a0 a1 a2
Step 2: Process alignment $z$, resulting in
b0 b1 B ----------------- -------------- / / / /|| / / / / || / x / / z / *| * = w - z / / / / || A --------- --------- --------- a0 a1 a2
Now we have a 1-1 mathing. The intervals of $B$ used to compute recall are as follows:
b0 b1 B -----[--------]-- [----][]------ / / / /|| ...
As a second example, we start with the same initial set of alignments:
b0 b1 B ----------------- ------------- / \ / \ / | / | / \ / \ / |/ | / x / y \ z / w | / / \ / \ /| | A --------- --------- --------- a0 a1 a2
But we make a slightly different set of assumptions:
Step 1: Process alignment $x$, resulting in
b0 b1 B ------------------ -------------- / /\ \ / | / | / / \ \ / |/ | / x / \*\ z / w | * = y - x / / / \ /| | A --------- --------- --------- a0 a1 a2
Step 2: Process alignment w, resulting in
b0 b1 B ------------------ -------------- / /\ \ / | | / / \ \ / /| | / x / \*\ +/ | w | * = y - x / / / \/ | | + = z - w A --------- --------- --------- a0 a1 a2
Step 3: Process alignment $y - x$, resulting in
b0 b1 B ------------------ -------------- / /\ \ +/ | | / / \*\// | | / x / \ \ | w | * = y - x / / // \ | | + = (z - w) - (y - x) A --------- --------- --------- a0 a1 a2
Now we have a 1-1 mathing. The intervals of $B$ used to compute recall are:
b0 b1 B -----[--------][-] []-[---]------ / /\ \ // | | * = y - x x * + w + = (z - w) - (y - x) ...
The definitions for pair precision, recall, and F1 are exactly the same as for nucleotide precision, recall, and F1, except that instead of bases, we operate on pairs of bases.
For example, consider the reference $B$ (with one transcript) and assembly $A$ (with two contigs), where horizontal position indicates alignment.
B: b = AGCTCGACGT A: a_1 = AGCT a_2 = CGACGT
Here, the transcript recall is 0 (because neither $a_1$ nor $a_2$ covers $b$ to $\geq$ 99 percent), but the nucleotide recall is 1 (because $a_1$ and $a_2$ jointly cover $b$ completely). The pair recall is somewhere in between, because the following pairs of $b$ are correctly predicted (represented by an upper triangular indicator matrix):
First base S AGCTCGACGT e A 1111 c G 111 o C 11 n T 1 d C 111111 G 11111 b A 1111 a C 111 s G 11 e T 1
The kmer compression score (KC score) is a combination of two measures, weighted kmer recall (WKR) and inverse compression rate (ICR), and is simply
$$ KC = WKR - ICR. $$The WKR measures the
To compute the WKR, the relative abundances of the reference elements are required, as specified by --B-expr. Given the reference sequences and their abundances, a kmer occurrence frequency profile, $p$, is computed, with individual kmer occurrences weighted by their parent sequences' abundances: for each kmer $r$, we define
$$ p(r) = \frac{ \sum_{b \in B} n(r,b) \tau(b) } { \sum_{b \in B} n(b) \tau(b) } $$where $B$ is the set of reference sequences, and for each reference sequence $b \in B$:
Letting $R(A)$ be the set of all kmers in the assembly $A$, the weighted kmer recall (WKR) is defined as
$$ WKR = \sum_{r \in R(A)} p(r). $$REF-EVAL currently uses --readlen as the kmer length.
Since recall measures only tell half of the story regarding accuracy, the KC score includes a second term, the ICR, which serves to penalize large assemblies. We define the inverse compression rate (ICR) of an assembly as
$$ ICR = n_A/(N L), $$where
If --weighted=yes, we construct a kmer occurrence frequency profile $p_B$ for $B$ exactly as described in the previous section (about the KC score). We construct a kmer occurrence frequency profile $p_A$ for $A$ similarly. The relative abundances are specified by --A-expr and --B-expr.
If --weighted=no, we construct the kmer occurrence frequency profiles $p_A$ and $p_B$ in the same way, except that uniform relative abundances are used, i.e., $\tau(a) = 1/|A|$ for all $a$ in $A$, and $\tau(b) = 1/|B|$ for all $b$ in $B$, where $|A|$ is the number of contigs in $A$, and $|B|$ is the number of reference sequences in $B$.
Let $m$ be the “mean” profile of $p_A$ and $p_B$:
$$ m(r) = (1/2) (p_A(r) + p_B(r)) \qquad\hbox{for every kmer $r$}. $$The Jensen-Shannon divergence between $p_A$ and $p_B$ is defined in terms of the KL divergence between $p_A$ and the mean, and $p_B$ and the mean, as follows:
In the output file, these three scores are denoted (un)weighted_kmer_KL_A_to_M, (un)weighted_kmer_KL_B_to_M, and (un)weighted_kmer_jensen_shannon, respectively.
The Hellinger distance between $p_A$ and $p_B$ is defined as
$$ \sqrt{ (1/2) \sum_r (\sqrt{p_A(r)} - \sqrt{p_B(r)})^2 } $$The total variation distance between $p_A$ and $p_B$ is defined as
$$ (1/2) \sum_r |p_A(r) - p_B(r)|, $$where $|\cdot|$ denotes absolute value. Above, $\sum_r$ denotes a sum over all possible kmers $r$ (most of which will have $p_A(r) = p_B(r) = 0$).