README for the paper "RNA-Seq gene expression estimation with read mapping uncertainty"

README for the paper "RNA-Seq gene expression estimation with read mapping uncertainty"

Bo Li(bli at cs dot wisc dot edu)

Table of Contents

Overview

Usage

RSPD

Simulation

License

Overview

RSEM is a package for inferring gene expression levels via RNA-Seq data.To run this program, you need to download our zip file first. Also, you need to install bowtie aligner.

In current version, we provide 4 functions. You can:

Infer gene/isoform expression levels from your own data
Estimate gene expression levels' mean and standard error via bootstrap
Learn RSPD from your own data
Generate simulated data set by yourself

In order to infer gene/isoform expression levels, you should follow these steps:

Prepare reference genome sequences
Build indices for bowtie & use bowtie to produce alignments
Calculate nus & taus by using EM algorithm

We provide a program for Step III.

For the whole document, I take mouse genome as an example.

For all the following applications, you need to configurate a configuration file(.config) first. The format is:

READTLEN READELEN READMLEN MISMATCH ALIGNLIM

READTLEN: the length of reads in the read file
READELEN: the length of read which you want to align via bowtie
READMLEN: the minimum length of nucleotides in a read aligned within the reference sequences. For polyA+ samples, We add polyA tails at the end of reference sequences, which are not within reference sequences; For polyA- samples, READMLEN should be set as 1.
MISMATCH: the maximum mismatches allowed
ALIGNLIM: The program will discard all reads which align to bigger than or eqaul ALIGNLIM genes and mask the corresponding positions.

Please make sure that READTLEN == READELEN

We provide a makefile in the zip file!

Caution:

RSEM expects a certain amount of unalignable reads in the data (which is often true in real case).

EM_double & EM_RSPD are modified to be able to run with 0 unalignable reads, but the results are not tested.

bootstrap may fail to finish if it encounters 0 unalignable reads.

Usage

We introduce how to use RSEM to infer Nus and Taus here.

You should have a file contains all the reads(We call it "readFN.fa". It must end with a suffix ".fa". "readFN" will use as the task name). Also, you should set up a configuration file, "readFN.config" and have reference sequences ready."readFN.fa" must be in FASTA format

For the mouse example, we have "mmliver.fa" & "mmliver.config". "mmliver.config" may be like this:

25 25 10 2 100

I. Prepare Reference Genome Sequences
(Related Source Codes: preRefG_1_i.pl, prepareReferenceGenome.cpp)

1. For the species interested, download all known genes' sequences from UCSC table browser. Also download the knownGenes file and knownIsoforms file from UCSC.For the example, we denote the reference sequences' file as "mouse.fasta", knownGene file as "knownGenes.txt", knownIsoforms file as "knownIsoforms.txt"

In order to get reference sequences file, you first go to UCSC Genome Browser website, then find the "Tables" in the top line. You'll redirect to the following link:
http://genome.ucsc.edu/cgi-bin/hgTables?command=start

You set "genome" option as you want.
You set "group" as "Genes and Gene Prediction Tracks".
You set "track" as "UCSC Genes".
You set "table" as "knownGene".
You set "output format" as "sequence".
You set "output file" as the file name you want to store all the sequences.
You keep all other settings as default.

Then you click "get output" button. Choose "genomic" option and then click "submit". Then under "Sequence Retrieval Region Options:", you select only "5' UTR Exons", "CDS Exons" and "3' UTR Exons" and "One FASTA record per gene". Under "Sequence Formatting Options", choose "Exons in upper case, everthing else in lower case", select "Mask repeats" and choose "to lower case". Then click get sequence

2. Prepare reference sequences , Stage I. Use command:

perl preRefG_1_i.pl refFile.fasta knownIsoforms outFileName prefix

refFile.fasta: the fasta file for reference sequences
knownIsoforms: the knownIsoforms file
outFileName: the name for output files. It will generate three output files. "outFileName.raw.fa", the stage 1 reference sequences; "outFileName.bas", some basic information from input fasta file; "outFileName.grp", record the isoforms belong to each gene
prefix: the tag of the fasta file downloaded contains the isoform id. However, there is some useless string before the isoform id. You should figure it out so that the program can ignore it. E.g., for mouse.fasta, the first line is "mm9_knownGene_uc007aet.1 range=..." , uc007aet.1 is the isoform id, and prefix should be "mm9_knownGene_". Please do not contain ">" in prefix. ">" will be eliminated by program automatically.

Please NOTE that you need to install BIOPERL in order to use the above script.

Example:

perl preRefG_1_i.pl mouse.fasta knownIsoforms.txt mouse mm9_knownGene_

3. Prepare reference sequences, Stage II. Use command:

preRefG_2 readFN refFN hasPolyA

readFN: the file name of read file (excluding file suffix)
refFN: the file name of reference file (excluding file suffix)
hasPolyA: 0, samples do not contain polyA tails; 1, samples contain polyA tails.

It will first load readFN.config and then output two files. "refFN.fa" & "refFN.ref". "refFN.fa" is the file used by bowtie aligner. "refFN.ref" is the reference file used by our program.

Example:

preRefG_2 mmliver mouse 1

//Output mouse.fa & mouse.ref

II. Build indices for bowtie & Use bowtie to produce alignments
(Related Source Codes: buildBowtie.cpp, runBowtie.cpp)

Assume that you can use bowtie on your current folder. We provide two wrappers over bowite.

1. To build bowtie index, use the following command:

buildBowtie offrate ftabchars reference indexName

For the meaning of offrate & ftabchars, please refer to bowtie manual
reference: the reference file needed, it is the "refFN.fa" produced by preRefG_2
indexName: the name of the index generated

Example:

buildBowtie 2 12 mouse.fa mouse

You'll get a banch of .ebwt files, which are the indices built by bowtie aligner.

2. To align reads by bowtie, use the following command:

runBowtie mismatch trimlen readF(input) bowtieFName(output) #core indexName

mismatch: # of mismatch allowed
trimlen: # of letters trimmed at the end of the read, it should be equal to READTLEN - READELEN
readF: the file contains all reads need to align
bowtieFName: the output file's name (excluding suffix)
#core: How many CPUs you want to use
indexName: the name of index built by bowtie

Example:

runBowtie 2 0 mmliver.fa mmliver 8 mouse

You'll get a file named bowtieFName.bowtie. for the above example , it will be "mmliver.bowtie"

Notice: runBowtie does not work on bowtie version 0.10.0 or later. User can consider using "bowtie" command provided by bowtie aligner directly

III. Calculate Nus & Taus by Using EM Algorithm
(Related Source Codes : consts.h, functions.h, ReadReader.h, RefSeq.h, Wiggle.h, EvalError.h, bowtieParser_C.cpp, estParam_bowtie.cpp, calcReadProb_double.cpp, EM_double.cpp, runPipeLine.cpp)

1. Parse the alignments obtained from bowtie:

bowtieParser_C readFN refFN readF alignF

readFN & refFN defined previously.
readF: is the full name of the fasta file containing all reads (including suffix)
alignF: is the full name of the alignment file obtained by bowtie (including suffix)

It will produce three files, "readFN.dat", "readFN.rsq", "readFN.cnt".

"readFN.dat": a summary of bowtie file, which is used by our programs.
"readFN.rsq": a masked version of reference sequences. It should be used instead of ".ref" in following phases.
"readFN.cnt": records some important number from alignment file. Its format is as follows:

#_of_unmapped_reads_after_filtering     #_of_mappable_reads_after_filtering     #_of_total_reads
#_of_unique_mapped_reads_after_filtering     #_of_multi-reads_after_filtering
#_of_unmapped_reads_before_filtering     #_of_mappable_reads_before_filtering     #_of_unique_mapped_reads_before_filtering     #_of_multi-reads_before_filtering
#_of_alignments_dropped     #_of_reads_filtered
#_of_alignments_after_filtering

Example:

bowtieParser_C mmliver mouse mmliver.fa mmliver.bowtie

It produces mmliver.dat, mmliver.rsq, mmliver.cnt

2. Estimate parameters other than the theta vector:

estParam_bowtie readFN refFN readF

readFN, refFN & readF are described previously.

It produces one output file, "readFN.param". The first line describes the "noise" gene emission probability distribution. There are 5 values, standing for A, C, G, T, N respectively. The second line is READTLEN. Then follows READTLEN blocks. They describe the profile matrices, w_t(a,b). t indeices block, b indices rows in each block and a indices columns in each block. The order is A, C, G, T, N for both a & b.

Example:

estparam_bowtie mmliver mouse mmliver.fa

//It produces mmliver.param

3. Calculate P(R_n = rho | Z_nijk = 1):

calcReadProb_double readFN refFN

readFN & refFN are described before.

It produces "readFN_double.rp". _double means that this program uses double directly (without taking log).

Example:

calcReadProb_double mmliver mouse

//It produces mmliver_double.rp

4. EM algorithm. To infer Nu & Tao, use the following command:

EM_double readFN refFN reportFN_double [-g gtFN] [-w] [-t]

readFN & refFN are the same as before.
reportFN_double is the file name for reported results (excluding suffix)

It will produce 6 files:

reportFN_double.genes.nu: the Nus for genes. First line is the number of genes. The second line contains nu for each gene.
reportFN_double.genes.tau: the Taus for genes. First line is the number of genes. The second line contains tau for each gene.
reportFN_double.isoforms.nu: the Nus for isoforms. First line is the number of isoforms. The second line contains nu for each isoform.
reportFN_double.isoforms.tau: the Taus for isoforms. First line is the number of isoforms. The second line contains tau for each isoform.
reportFN_double.theta0: fraction of "noise" isoform among all reads after correction
reportFN_double.thetap0: fraction of "noise" isoform among all reads before correction

Options:

-g gtFN: only valid for running on simulation dataset. gtFN is the filename (excluding suffix) of ground truth. This option is used for internal test only.
-w: tells the program that you want the wiggle vectors printed out. It produces "reportFN_double.vector". The file format is as follows: The first line is the number of isoforms M. Then there are M blocks, each block describes one isoform's wiggle vector. For each block, the first line is the length of the vector(If pad polyA, it is REFLEN, the length of the original sequence; if not, it is max(0, REFLEN - READTLEN + 1)). Then followed by the expected counts for each position in forward strand. And then expected counts for each position in the reverse strand. If the length of the vector is 0, the next two lines are blank. This option is not tested for correctness, not recommended to use!
-t: tells the program that you want to print out theta' as well. It produces "reportFN_double.thetap". The first line is the length of the vector and then followed by the vector itself

PolyA- Samples: If you use polyA- samples, values in *.nu is not the Nu value defined in the paper.

Example:

EM_double mmliver mouse mmliver_deweylab

//It produces mmliver_deweylab.genes.nu, mmliver_deweylab.genes.tau, mmliver_deweylab.isoforms.nu, mmliver_deweylab.isoforms.tau, mmliver_deweylab.theta0, mmliver_deweylab.thetap

5. Pipe Line. There is a pipeline avaible for this phase. Use the following command:

runPipeLine MISMATCH READ[E|T]LEN READMLEN ALIGNLIM readFN refFN readF alignF [-g gtFN] [-w] [-t] [-path Path]

The parameters are defined the same as before. READELEN & READTLEN are tied together here. Also it assumes that reportFN_double = readFN_deweylab.

You can set the path of rsem's executable programs by using -path Path. Please do not include slash at the end of path name ("/rsem" instead of "/rsem/"). The default value is "."(current folder).

Example:

runPipeLine 2 25 10 100 mmliver mouse mmliver.fa mmliver.bowtie -w -t -path /rsem

Bootstrap

You should infer the expression levels by EM_double first.
(Related codes: consts.h RefSeq.h functions.h bootstrap.cpp calcStats.cpp; randomc.h mersenne.cpp. The last two are from http://lxnt.info/rng/randomc.htm)

I. Use bootstrap to generate bootstrap estimates. The command is:

bootstrap readFN refFN seedF reportFN(Path Included) id nProcess nTask

readFN, refFN: previous described
seedF: a file contains all the random seeds for generating bootstrap estimates. The first line is a number indicating total number of seeds(we call it nSeed. nSeed == nTask). The second line contains nSeed random seeds. Each seed is a 32bit unsigned integer
reportFN: the output file name of ith bootstrap estimates is "reportFN_i.thetap". You can include path name if you do not want to save them under current folder
id: process id, starts from 0
nProcess: number of processes used
nTask: number of bootstrap estimates wanted

This code is written for parallel purporse. But user can easily figure out how to use it in one process.

Example:

bootstrap mmliver mouse mmliver.seeds mmliver_bootstrap 0 1 1000

II. Calculate mean and standard error. The command is:

calcStats readFN refFN N thetapFN(pach included) nuF tauF

readFN, refFN: previous described
N: number of data files need to be processed, is equal to nTask above
thetapFN: has the same meaning of reportFN
nuF: output file for nu values
tauF: output file for tau values

Example:

calcStats mmliver mouse 1000 mmliver_bootstrap mmliver_nu_table.txt mmliver_tau_table.txt

RSPD

You just need to replace EM_double with EM_RSPD(Related codes: Hit.h, RSPD.h, EM_RSPD.cpp). The command is:

EM_RSPD readFN refFN reportFN RSPD_B RSPD_INTERVAL [-g gtFN]

We only need to explain RSPD_B and RSPD_INTERVAL.

RSPD_B: # of parameters. We use 10,000.
RSPD_INTERVAL: every RSPD_INTERVAL rounds, EM algorithm will re-learn the RSPD from data. We use 5.

It will produce 6 files. They are: reportFN.genes.nu, reportFN.isoforms.nu, reportFN.genes.tau, reportFN.isoforms.tau reportFN.thetap0 and reportFN.rspd

reportFN.rspd: record the RSPD parameters estimated from data. The first line in this file is RSPD_B. The second line contains RSPD_B numbers, which are pdf parameters. The third line contains RSPD_B numbers, too. They are cdf and the parameters mentioned in the paper.

PolyA- Samples: If you use polyA- samples, values in *.nu is not the Nu value defined in the paper.

Example:

EM_RSPD mmliver mouse mmliver_deweylab 10000 5

//It produces mmliver.genes.nu, mmliver.isoforms.nu, mmliver.genes.tau, mmliver.isoforms.tau, mmliver.thetap0 and mmliver.rspd

Simulation

You should finish the Phase I in Usage first. Then you'll have the .ref file and the .grp file ready.
(Related codes: consts.h, RefSeq.h, functions.h, simulation.cpp, Hit.h, RSPD.h, simulation_RSPD.cpp ; random.c, mersenne.cpp)

I. If you want to simulate with a uniform RSPD, use simulation.cpp. The command is:

simulation refFN simParamF readLen N outFN

refFN: previously described
simParamF: a file provided by user, which contains all parameters needed for simulator
readLen: the length of reads user wants to simulate
N: number of reads user wants to simulate
outFN: output file name. The simulator will produce "outFN.fa" as reads file, "outFN_[g/e].[genes.nu/genes.tau/isoforms.nu/isoforms.tau/theta0]" as ground/sample truth files.

The format of simParamF is as follows:

theta0
# of isoforms, say M
M tau values, separated by blank
P(O_n = 1 | G_n != 0)
Beta(A) Beta(C) Beta(G) Beta(T) Beta(N)
read length
w_t(a,b)

You can find a sample simParamF here.

Example:

simulation mouse accuracy99.sparam 75 30000000 mmliver_sim

II. If you want to simulate with a user-defined RSPD, use simulation_RSPD.cpp. The command is as follows:

simualtion_RSPD readFN refFN rspdF N

readFN, refFN and N are defined previously. rspdF has the same format as "readFN.rspd" learned in RSPD section.

User should provide "readFN.sparam" and also have "readFN.config" ready.

The format for "readFN.sparam" is:

P(O_n | G_n != 0)
Beta
w_t(a,b)
theta0
# of genes, say nGenes
nGenes numbers, represent genes' theta values

Simulator will split one gene's theta randomly to its isoforms. You can find a sample file here.

simulation_RSPD will produce several files. They are "readFN.fa" and "readFN_[g/e].[genes.nu/isoforms.nu/theta0]".

Example:

simulation_RSPD mmliver_sim mouse mmliver.rspd 30000000

License

RSEM is under the GNU General Public License.

(last modified on May 27, 2010)