README for CSEM

Bo Li (bli at cs dot wisc dot edu)


Table of Contents


Introduction

CSEM is a ChIP-Seq multi-read allocator. CSEM stands for ChIP-Seq multi-read allocation using Expectation-Maximization. It is part of the following paper:

Chung, D., Kuan, P. F., Li, B., Sanalkumar, R., Liang, K., Bresnick, E., Dewey, C. N., Keles, S. (2011). Discovering transcription factor binding sites in highly repetitive regions of genomes with multi-read analysis of ChIP-seq data. PLoS Computational Biology, 7(7):e1002111.

Prerequisites

C++ and Perl are required to be installed.

In addition, you need to have an aligner to align your reads. For example, you can install Bowtie.

Compilation & Installation

To compile CSEM, simply run

make

To install, simply put the csem directory in your environment’s PATH variable.

Usage

This program handles both single-end and paired-end data in either SAM or BAM format.

I. Aligning Reads

You can choose whatever aligner you want if the aligner meets the following criteria:

  1. The output should be in either SAM or BAM format;
  2. The aligner should be able to output multiple alignments for a same read;
  3. All alignments of a same read should be grouped together;
  4. The two mates of a paired-end alignment should be adjacent to each other.

In addition, you should make sure your aligner output only high quality alignments. CSEM treats all alignments by a same read equally.

II. Allocating Multi-Reads

To allocate multi-reads, you should run the ‘run-csem’ program. Type

run-csem --help

to get usage information.

III. Generating Input Files for ChIP-Seq Peak Callers

To generate input files for ChIP-Seq peak callers, you should run the ‘csem-generate-input’ program. Type

csem-generate-input --help

to get usage information.

IV. Generating Wiggle Files

A wiggle plot representing the expected number of reads overlapping each position in the genome set can be generated from the sorted genome BAM file output. To generate the wiggle plot, run the ‘csem-bam2wig’ program on the ‘*.sorted.bam’ files you have.

Usage:

csem-bam2wig sorted_bam_input wig_output wiggle_name [--no-fractional-weight] [--extend-reads fragment_length] [--only-midpoint] [--help]

sorted_bam_input : Input BAM format file, must be sorted
wig_output : Output wiggle file’s name, e.g. output.wig
wiggle_name : The name of this wiggle plot
–no-fractional-weight : If this is set, RSEM will not look for “ZW” tag and each alignment appeared in the BAM file has weight 1. Set this if your BAM file is not generated by CSEM
–extend-reads fragment_length : Extend reads to their full fragment length. fragment_length is the average fragment length for the data set and should be positive
–only-midpoint : represent each fragment by its midpoint. This can be set only if –extend-reads is set
–help : Show help information

V. extractFromEland

CSEM provides a small gadget to extract reads from an eland alignment file and write them into a FASTA file. All reads marked as “QC” or “RM” are discarded.

Usage:

extractFromEland inpF outF

inpF : an eland alignment file outF : the name of output FASTA file

Example

We assume the reads are single-end. The reads file is in FASTQ format and named as ‘GATA1.fq’ under current directory.

Aligning Reads

We take bowtie aligner as our chosen aligner as an example.

First, you will need to download or create a Bowtie index for your genome. The Bowtie website provides pre-built indices for a number of commonly-used genomes. For example, you can download the index for M. musculus (UCSC mm9) from the Bowtie website. To create an index for some other genome, refer to the Bowtie documentation.

Here we just assume that the bowtie is appropriately installed. The bowtie indices for mm9 are under current directory and prefixed by mm9. We allow at most 2 mismatches over a whole read and discard any read with no less than 100 alignments. We also use 8 threads for bowtie and output results in SAM format as ‘GATA1.sam’:

bowtie -q -v 2 -a -m 99 -p 8 -S mm9 GATA1.fq GATA1.sam

Allocating Multi-Reads

We want to use 8 threads. The average fragment length is 200 bp and output name is set as ‘GATA1_csem’:

run-csem --sam -p 8 GATA1.sam 200 GATA1_csem

The outputs are ‘GATA1_csem.bam’, ‘GATA1_csem.sorted.bam’ and ‘GATA1_csem.sorted.bam.bai’.

Generating Wiggle Files

We want to look at read coverages after allocating multi-reads. We also want to extend each read to 200bp:

csem-bam2wig GATA1_csem.sorted.bam GATA1_csem.wig GATA1_csem --extend-reads 200

Generating Input Files for ChIP-Seq Peak Callers

For example, we want to generate input file for MOSAiCS:

csem-generate-input --bed GATA1_csem.bam GATA1_csem

‘GATA1_csem.bed’ will be generated and can be used as MOSAiCS’ input.

Authors

CSEM is developed by Bo Li, with substaintial technical input from Colin Dewey.

Acknowledgements

CSEM uses the Boost C++ and samtools libraries.

We thank Dongjun Chung for his help on testing the midpoint approach for CSEM.

License

CSEM is licensed under the GNU General Public License v3.