PSGInfer is a software package for the analysis of RNA-Seq data with probabilistic splice graph (PSG) models of gene alternative processing (splicing, transcription initiation, and polyadenylation). The software currently performs three tasks:
Construction of PSG models from known transcript annotations.
Maximum likelihood (ML) or maximum a posteriori (MAP) estimation of PSG parameters given an RNA-Seq data set.
Prediction of genes that are differentially processed (DP) between two samples, given RNA-Seq data for each sample.
PSGInfer is currently composed of a number of Python and Java programs and therefore does not require any compilation. After unpacking the PSGInfer tarball (psginfer-X.X.X.tar.gz), simply add the unpacked directory to your PATH environment variable.
PSGInfer requires that the following software packages be installed and available via your PATH environment variable.
The tasks performed by PSGInfer are run via the following front-end scripts:
psg_prepare_reference.py
: Constructs PSG models from known transcript
annotations and produces additional internal files (e.g., a Bowtie index for
alignment) required by the other scripts.
psg_infer_frequencies.py
: Estimates ML or MAP parameters of PSG models
given RNA-Seq data from a single sample.
psg_infer_diff_processing.py
: Predicts the genes that are DP between two
samples, given RNA-Seq data for each sample.
All analyses with the PSGInfer package require that you first run the
psg_prepare_reference.py
script. This script constructs PSG models from
known transcript annotations and prepares internal index files required by the
other scripts. The psg_prepare_reference.py
script requires that you provide
the following:
Known transcript annotations in GTF format
A directory containing FASTA-formatted chromosome sequence files, with one file per chromosome. The file for chromosome CHROM should be named CHROM.fa or CHROM.fasta.
The script is run from the command line as
psg_prepare_reference.py [options] reference_name
where reference_name
is the name of the output directory that will be
created and the name that you will use to refer to the produced set of PSG
models. The script has a number of options, which can be listed by running
psg_prepare_reference.py --help
The following command line exemplifies common usage of this script
psg_prepare_reference.py -g genome_directory myreference < annotations.gtf
The psg_prepare_reference.py
script creates a number of files in the output
directory reference_name
. Of these, the only file of major interest to users
is the index.txt
file, which lists the the genes in the reference and
statistics about their corresponding PSG models. In addition, the constructed
PSG model for gene GENE on chromosome CHROM is provided in the file
CHROM/GENE.psg
. The format of the PSG file is described later in this
document. Other files output by the psg_prepare_reference.py
script are
primarily for internal use, and include Bowtie indices for performing
alignments against the PSG model sequences.
The estimation of the parameters of the PSG models from an RNA-Seq sample is
performed by the psg_infer_frequencies.py
script. This script requires as
input a set of PSG models (as prepared by psg_prepare_reference.py
)
and FASTQ-formatted read files. The script is run from the command line as
psg_infer_frequencies.py [options] reference_name sample_name mate1_read_file [mate2_read_file]
where reference_name
is the name of the PSG reference (as prepared by
psg_prepare_reference.py
), sample_name
is the name of the output
directory that will be created, and mate1_read_file
and
mate2_read_file
are the first and (optionally) second mate
FASTQ-formatted read files, respectively. The script has a number of
options, which can be listed by running
psg_prepare_reference.py --help
The following command line exemplifies common usage of this script
psg_infer_frequencies.py myref mysample mysample_1.fastq mysample_2.fastq
This script first runs Bowtie to align the RNA-Seq reads against the PSG segment sequences (and junctions) and then uses the resulting alignments for PSG parameter estimation.
The primary output of psg_infer_frequencies.py
is the file
gene.results.txt
in the sample_name
output directory. This file contains a
tab-delimited table containing the results of parameter estimation for each
gene’s PSG model. In this table, the columns of main interest are:
num_reads
: the number of fragments (as represented by one read in
single-end data and a pair of reads in paired-end data) that
mapped to this gene.
log_likelihood
: the log probability of the reads mapped to this gene given
its PSG model and (ML or MAP) estimated parameters.
parameters
: the (ML or MAP) estimated parameters for the PSG, given as a
comma-delimited list of probabilities, in the same order as the edges to
which they correspond.
A secondary output of psg_infer_frequencies.py
is the file
isoform.results.txt
, which provides within-gene relative frequencies of the
known annotated transcripts. These frequencies are computed according to the
estimated probabilities of the processing events implied by each annotated
transcript. If the PSG for a gene implies additional un-annotated transcripts,
then, in general, the frequencies for the known annotated transcripts will not
sum to one because some probability will be assigned to the un-annotated
transcripts.
The psg_infer_diff_processing.py
script detects DP genes between two
samples, given RNA-Seq data for each. The method it implements detects
differences in estimated processing frequencies that are more extreme than one
would expect given RNA-Seq technical variation. Note that it does not
determine whether such differences are biologically significant, i.e., that
they are more extreme than one would expect given biological and technical
variation between samples from the same conditions. Handling of multiple
samples per condition and biological variation will be supported in the near
future.
Before running this script, psg_infer_frequencies.py
must be run on each
RNA-Seq sample separately. The script is run from the command line as
psg_infer_diff_splicing.py [options] ref_name comp_name sample1_name sample2_name
where comp_name
is the name of the output directory to create with the
results, sample1_name
is the name of the first sample (as provided to
psg_infer_frequencies.py
), sample2_name
is the name of the second sample
(as provided to psg_infer_frequencies.py
), and ref_name
is the name
of the PSG reference (as prepared by psg_prepare_reference.py
).
The script has a number of options, which can be listed by running
psg_infer_diff_splicing.py --help
In general, the options provided to psg_infer_diff_splicing.py
should be the
same as those provided to psg_infer_frequencies.py
for each sample.
The primary output of psg_infer_diff_splicing.py
is the file diff.txt
in
the comp_name
output directory. This file contains a tab-delimited table
containing the results of DP test for each gene. In this table, the columns of
main interest are:
sample1_num_reads
: the number of fragments (as represented by one
read in single-end data and a pair of reads in paired-end data) that
mapped to this gene in sample 1.
sample2_num_reads
: the number of fragments (as represented by one
read in single-end data and a pair of reads in paired-end data) that
mapped to this gene in sample 2.
joint_num_reads
: the total number of fragments (as represented by
one read in single-end data and a pair of reads in paired-end data)
that mapped to this gene. This is the sum of sample1_num_reads
and
sample2_num_reads
.
max_delta_psi
: the maximum difference of the estimated edge weight
parameters between the two samples.
sample1_ll
: the log likelihood for sample 1 given (ML or MAP) estimated
parameters for that sample.
sample2_ll
: the log likelihood for sample 2 given (ML or MAP) estimated
parameters for that sample.
diff_ll
: the total log likelihood for both samples, with separately
estimated parameters for each sample (the alternative hypothesis, which
corresponds to DP). This is the sum of sample1_ll
and sample2_ll
.
null_ll
: the log likelihood of both samples combined, with a single set of
estimated parameters for the combined samples (the null hypothesis, which
corresponds to no DP).
lrt_d
: the likelihood ratio test statistic for DP.
Equal to 2 * (diff_ll
- null_ll
)
df
: the number of additional degrees of freedom for the alternative
hypothesis, which, in this case, is the number of free parameters in the
PSG.
p_value
: the p-value for the likelihood ratio test that the gene is DP
between the two samples.
fdr
: the estimated false discovery rate (FDR) if this gene and all others
with p-value less than or equal to this gene’s p-value are predicted as DP.
The psg_infer_diff_splicing.py
script additionally outputs
gene.results.txt
and isoform.results.txt
files that result from
estimating a single set of PSG parameters for each gene with both samples
combined. These files have the same format as those output by
psg_infer_frequncies.py
Suppose that we are interested in detecting DP genes between the K562 and
HUVEC human cell lines. We have a paired-end 75bp RNA-Seq data set for each
cell line, given in the files k562_1.fastq
, k562_2.fastq
, huvec_1.fastq
,
and huvec_2.fastq
. We wish to use “line” PSG models based off of the GENCODE
v16 gene annotation, which we have saved in the file gencode_16.gtf
. We have
FASTA files for each (hg19) human chromosome stored in the directory
hg19_chroms
. We will use MAP estimation (pseudocount = 1) for the DP test.
The example command line workflow for this scenario is:
psg_prepare_reference.py -k 0 -g hg19_chroms gencode_16_line < gencode_16.gtf
psg_infer_frequencies.py gencode_16_line k562 k562_1.fastq k562_2.fastq
psg_infer_frequencies.py gencode_16_line huvec huvec_1.fastq huvec_2.fastq
psg_infer_diff_processing.py gencode_16_line k562-huvec k562 huvec
A PSG file provides the structure, sequences, and parameters for the PSG model
of a gene. It has two sections, [segments]
and [edges]
, which describe the
segments (vertices) and edges of the PSG, respectively.
[segments]
sectionEach line in the [segments]
section is tab-delimited and has the form
segment_num sequence start_coordinate end_coordinate
where segment_num
is the index of the segment, sequence
is the transcript
segment sequence corresponding to that segment, and start_coordinate
and
end_coordinate
are the start end end coordinates of the segment relative to
the start coordinate of the most upstream (5’) segment. In this section,
segments are ordered according to segment_num
. Segment 0 is the start vertex
and corresponds to the empty string. For a PSG with M segments, segment M-1 is
the end vertex and also corresponds to the empty string.
[edges]
sectionEach line in the [edges]
section is tab-delimited and has the form
edge_num source_segment target_segment probability
where edge_num
is the index of the edge, source_segment
is the index of
the source segment, target_segment
is the index of the the target segment,
and probability
is the conditional probability weight for the edge. To have
a valid PSG model, for each vertex, the sum of the probabilities on its
outgoing edges must sum to one.
The PSGInfer software package was written by Laura LeGault (Java core PSG inference algorithms) and Colin Dewey (Python front-end scripts for genome-scale analyses).
If you use PSGInfer as part of published work, please cite:
Laura H. LeGault and Colin N. Dewey. (2013) Inference of alternative splicing from RNA-Seq data with probabilistic splice graphs. Bioinformatics. 29(18):2300-2310.
Copyright (C) 2013 Colin Dewey and Laura LeGault
This program is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU Lesser General Public License along with this program. If not, see http://www.gnu.org/licenses/.