Bo Li (bli at cs dot wisc dot edu)
CSEM is a ChIP-Seq multi-read allocator. CSEM stands for ChIP-Seq multi-read allocation using Expectation-Maximization. It is part of the following paper:
C++ and Perl are required to be installed.
In addition, you need to have an aligner to align your reads. For example, you can install Bowtie.
To compile CSEM, simply run
make
To install, simply put the csem directory in your environment’s PATH variable.
This program handles both single-end and paired-end data in either SAM or BAM format.
You can choose whatever aligner you want if the aligner meets the following criteria:
In addition, you should make sure your aligner output only high quality alignments. CSEM treats all alignments by a same read equally.
To allocate multi-reads, you should run the ‘run-csem’ program. Type
run-csem --help
to get usage information.
To generate input files for ChIP-Seq peak callers, you should run the ‘csem-generate-input’ program. Type
csem-generate-input --help
to get usage information.
A wiggle plot representing the expected number of reads overlapping each position in the genome set can be generated from the sorted genome BAM file output. To generate the wiggle plot, run the ‘csem-bam2wig’ program on the ‘*.sorted.bam’ files you have.
Usage:
csem-bam2wig sorted_bam_input wig_output wiggle_name [--no-fractional-weight] [--extend-reads fragment_length] [--only-midpoint] [--help]
sorted_bam_input : Input BAM format file, must be sorted
wig_output : Output wiggle file’s name, e.g. output.wig
wiggle_name : The name of this wiggle plot
–no-fractional-weight : If this is set, RSEM will not look for “ZW” tag and each alignment appeared in the BAM file has weight 1. Set this if your BAM file is not generated by CSEM
–extend-reads fragment_length : Extend reads to their full fragment length. fragment_length is the average fragment length for the data set and should be positive
–only-midpoint : represent each fragment by its midpoint. This can be set only if –extend-reads is set
–help : Show help information
CSEM provides a small gadget to extract reads from an eland alignment file and write them into a FASTA file. All reads marked as “QC” or “RM” are discarded.
Usage:
extractFromEland inpF outF
inpF : an eland alignment file outF : the name of output FASTA file
We assume the reads are single-end. The reads file is in FASTQ format and named as ‘GATA1.fq’ under current directory.
We take bowtie aligner as our chosen aligner as an example.
First, you will need to download or create a Bowtie index for your genome. The Bowtie website provides pre-built indices for a number of commonly-used genomes. For example, you can download the index for M. musculus (UCSC mm9) from the Bowtie website. To create an index for some other genome, refer to the Bowtie documentation.
Here we just assume that the bowtie is appropriately installed. The bowtie indices for mm9 are under current directory and prefixed by mm9. We allow at most 2 mismatches over a whole read and discard any read with no less than 100 alignments. We also use 8 threads for bowtie and output results in SAM format as ‘GATA1.sam’:
bowtie -q -v 2 -a -m 99 -p 8 -S mm9 GATA1.fq GATA1.sam
We want to use 8 threads. The average fragment length is 200 bp and output name is set as ‘GATA1_csem’:
run-csem --sam -p 8 GATA1.sam 200 GATA1_csem
The outputs are ‘GATA1_csem.bam’, ‘GATA1_csem.sorted.bam’ and ‘GATA1_csem.sorted.bam.bai’.
We want to look at read coverages after allocating multi-reads. We also want to extend each read to 200bp:
csem-bam2wig GATA1_csem.sorted.bam GATA1_csem.wig GATA1_csem --extend-reads 200
For example, we want to generate input file for MOSAiCS:
csem-generate-input --bed GATA1_csem.bam GATA1_csem
‘GATA1_csem.bed’ will be generated and can be used as MOSAiCS’ input.
CSEM is developed by Bo Li, with substaintial technical input from Colin Dewey.
CSEM uses the Boost C++ and samtools libraries.
We thank Dongjun Chung for his help on testing the midpoint approach for CSEM.
CSEM is licensed under the GNU General Public License v3.