Introduction

PSGInfer is a software package for the analysis of RNA-Seq data with probabilistic splice graph (PSG) models of gene alternative processing (splicing, transcription initiation, and polyadenylation). The software currently performs three tasks:

  1. Construction of PSG models from known transcript annotations.

  2. Maximum likelihood (ML) or maximum a posteriori (MAP) estimation of PSG parameters given an RNA-Seq data set.

  3. Prediction of genes that are differentially processed (DP) between two samples, given RNA-Seq data for each sample.

Installation

PSGInfer is currently composed of a number of Python and Java programs and therefore does not require any compilation. After unpacking the PSGInfer tarball (psginfer-X.X.X.tar.gz), simply add the unpacked directory to your PATH environment variable.

Prerequisites

PSGInfer requires that the following software packages be installed and available via your PATH environment variable.

Running PSGInfer

The tasks performed by PSGInfer are run via the following front-end scripts:

Preparing a PSG reference

All analyses with the PSGInfer package require that you first run the psg_prepare_reference.py script. This script constructs PSG models from known transcript annotations and prepares internal index files required by the other scripts. The psg_prepare_reference.py script requires that you provide the following:

The script is run from the command line as

psg_prepare_reference.py [options] reference_name

where reference_name is the name of the output directory that will be created and the name that you will use to refer to the produced set of PSG models. The script has a number of options, which can be listed by running

psg_prepare_reference.py --help

The following command line exemplifies common usage of this script

psg_prepare_reference.py -g genome_directory myreference < annotations.gtf

Outputs

The psg_prepare_reference.py script creates a number of files in the output directory reference_name. Of these, the only file of major interest to users is the index.txt file, which lists the the genes in the reference and statistics about their corresponding PSG models. In addition, the constructed PSG model for gene GENE on chromosome CHROM is provided in the file CHROM/GENE.psg. The format of the PSG file is described later in this document. Other files output by the psg_prepare_reference.py script are primarily for internal use, and include Bowtie indices for performing alignments against the PSG model sequences.

Estimating PSG parameters from an RNA-Seq sample

The estimation of the parameters of the PSG models from an RNA-Seq sample is performed by the psg_infer_frequencies.py script. This script requires as input a set of PSG models (as prepared by psg_prepare_reference.py) and FASTQ-formatted read files. The script is run from the command line as

psg_infer_frequencies.py [options] reference_name sample_name mate1_read_file [mate2_read_file]

where reference_name is the name of the PSG reference (as prepared by psg_prepare_reference.py), sample_name is the name of the output directory that will be created, and mate1_read_file and mate2_read_file are the first and (optionally) second mate FASTQ-formatted read files, respectively. The script has a number of options, which can be listed by running

psg_prepare_reference.py --help

The following command line exemplifies common usage of this script

psg_infer_frequencies.py myref mysample mysample_1.fastq mysample_2.fastq

This script first runs Bowtie to align the RNA-Seq reads against the PSG segment sequences (and junctions) and then uses the resulting alignments for PSG parameter estimation.

Outputs

The primary output of psg_infer_frequencies.py is the file gene.results.txt in the sample_name output directory. This file contains a tab-delimited table containing the results of parameter estimation for each gene’s PSG model. In this table, the columns of main interest are:

A secondary output of psg_infer_frequencies.py is the file isoform.results.txt, which provides within-gene relative frequencies of the known annotated transcripts. These frequencies are computed according to the estimated probabilities of the processing events implied by each annotated transcript. If the PSG for a gene implies additional un-annotated transcripts, then, in general, the frequencies for the known annotated transcripts will not sum to one because some probability will be assigned to the un-annotated transcripts.

Detecting DP genes between two RNA-Seq samples

The psg_infer_diff_processing.py script detects DP genes between two samples, given RNA-Seq data for each. The method it implements detects differences in estimated processing frequencies that are more extreme than one would expect given RNA-Seq technical variation. Note that it does not determine whether such differences are biologically significant, i.e., that they are more extreme than one would expect given biological and technical variation between samples from the same conditions. Handling of multiple samples per condition and biological variation will be supported in the near future.

Before running this script, psg_infer_frequencies.py must be run on each RNA-Seq sample separately. The script is run from the command line as

psg_infer_diff_splicing.py [options] ref_name comp_name sample1_name sample2_name

where comp_name is the name of the output directory to create with the results, sample1_name is the name of the first sample (as provided to psg_infer_frequencies.py), sample2_name is the name of the second sample (as provided to psg_infer_frequencies.py), and ref_name is the name of the PSG reference (as prepared by psg_prepare_reference.py).

The script has a number of options, which can be listed by running

psg_infer_diff_splicing.py --help

In general, the options provided to psg_infer_diff_splicing.py should be the same as those provided to psg_infer_frequencies.py for each sample.

Outputs

The primary output of psg_infer_diff_splicing.py is the file diff.txt in the comp_name output directory. This file contains a tab-delimited table containing the results of DP test for each gene. In this table, the columns of main interest are:

The psg_infer_diff_splicing.py script additionally outputs gene.results.txt and isoform.results.txt files that result from estimating a single set of PSG parameters for each gene with both samples combined. These files have the same format as those output by psg_infer_frequncies.py

Example usage

Suppose that we are interested in detecting DP genes between the K562 and HUVEC human cell lines. We have a paired-end 75bp RNA-Seq data set for each cell line, given in the files k562_1.fastq, k562_2.fastq, huvec_1.fastq, and huvec_2.fastq. We wish to use “line” PSG models based off of the GENCODE v16 gene annotation, which we have saved in the file gencode_16.gtf. We have FASTA files for each (hg19) human chromosome stored in the directory hg19_chroms. We will use MAP estimation (pseudocount = 1) for the DP test. The example command line workflow for this scenario is:

psg_prepare_reference.py -k 0 -g hg19_chroms gencode_16_line < gencode_16.gtf
psg_infer_frequencies.py gencode_16_line k562 k562_1.fastq k562_2.fastq
psg_infer_frequencies.py gencode_16_line huvec huvec_1.fastq huvec_2.fastq
psg_infer_diff_processing.py gencode_16_line k562-huvec k562 huvec

File Formats

PSG files

A PSG file provides the structure, sequences, and parameters for the PSG model of a gene. It has two sections, [segments] and [edges], which describe the segments (vertices) and edges of the PSG, respectively.

The [segments] section

Each line in the [segments] section is tab-delimited and has the form

segment_num sequence start_coordinate end_coordinate

where segment_num is the index of the segment, sequence is the transcript segment sequence corresponding to that segment, and start_coordinate and end_coordinate are the start end end coordinates of the segment relative to the start coordinate of the most upstream (5’) segment. In this section, segments are ordered according to segment_num. Segment 0 is the start vertex and corresponds to the empty string. For a PSG with M segments, segment M-1 is the end vertex and also corresponds to the empty string.

The [edges] section

Each line in the [edges] section is tab-delimited and has the form

edge_num source_segment target_segment probability

where edge_num is the index of the edge, source_segment is the index of the source segment, target_segment is the index of the the target segment, and probability is the conditional probability weight for the edge. To have a valid PSG model, for each vertex, the sum of the probabilities on its outgoing edges must sum to one.

Authors

The PSGInfer software package was written by Laura LeGault (Java core PSG inference algorithms) and Colin Dewey (Python front-end scripts for genome-scale analyses).

Credits

Citing PSGInfer

If you use PSGInfer as part of published work, please cite:

Laura H. LeGault and Colin N. Dewey. (2013) Inference of alternative splicing from RNA-Seq data with probabilistic splice graphs. Bioinformatics. 29(18):2300-2310.

Copyright and License

Copyright (C) 2013 Colin Dewey and Laura LeGault

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with this program. If not, see http://www.gnu.org/licenses/.