bioinformatics - introduction to Chip Seq

Monday, 12 September 2016

bioinformatics - introduction to Chip Seq

I hope this question is suitable for this site. I am concerned about the Chip experiment part so I think it should be okay. I am a Applied Math student starting to get into bioinformatics and so I've been looking at Chip Seq data. But to make sense of the Chip Seq data I wish to understand how the experiment is first performed.

Basically my question relates to step 1 of the following diagram: enter image description here

Now my main question is how come the reads are overlapped? Where did we get so many of the (almost the) same reads?

My "understanding" is that:

1) They take a DNA sequence and crosslink the protien of interest. Then they get rid of the DNA sequence surrounding this area of interest so now we have a "small" sequence of DNA and in this small DNA is somewhere where our TF binds. Now we make copies of this small DNA seq and run it through a sequencing machine. Is this correct?

2) They take a bunch of cells = bunch of DNA sequences. Then they do the same procedure above (by crosslinking and getting a "smaller" DNA sequence of interest). Since they had many cells to begin with, this means they had many DNA seq to begin with. Now they shear the DNA seq and we have fragments. Then we align these up with the reference genome.

Answer

The basic steps of ChIP-Seq are:

Crosslinking proteins to DNA - this fixes the proteins in their natural positions

Nuclease digestion - this removes regions that are unbound to protein; nucleases are sterically hindered from digesting protein-bound DNA

Immunoprecipitation - this allows isolation of the target protein by binding it to a selective antibody

Sequencing to determine the bound DNA sequence

Mapping the reads to the reference genome

Why so many reads? They are taking many reads from different cells. Why are the reads so short? Many modern next-gen sequencing methods are limited to ~30 bp reads. Thus, the reads you see are not actually the entire protected sequence but rather just a shorter fragment from the 5' end (where sequencing starts from). The why behind this is explained by Jiang and Pugh, 2009 (Nucleosome positioning and gene regulation: advances through genomics):

Three DNA sequencing technologies have been used to map nucleosomes. Pyrosequencing using the Roche 454 GS20/FLX sequences nucleosomel DNA end-to-end, allowing both nucleosome borders to be linked in a single sequence. This method provides the greatest mapping accuracy, particularly in genomic regions of low complexity. By contrast, other platforms, such as those provided by the Illumina-Solexa Genome Analyzer and Applied Biosystems SOLiD, generate only 25-35 by sequence tags, requiring both nucleosome borders to be inferred. Nonetheless, these short-read technologies produce >100 times the number of sequence tags at a similar cost as the long-read technologies, and so the short-read technology is currently the only practical technology for mapping nucleosomes in large genomes. The higher tag count of the short-read technology enhances mapping accuracy and thus provides a practical way of mapping nucleosomes.

Why are the reads overlapped? I can think of a few explanation for that. For starters, Protein-DNA binding is variable in its specificity. In other words, a given protein might bind at the exact same site every time or it may bind 'around' a site or it may bind randomly. A protein that binds an exact sequence would be expected to produce reads that are more in phase that one that binds more variably. Next, the nuclease digestion process is random (or at least expected to be). Thus the DNA cleavage point relative to the protein is variable and your reads start from different points. Again, from Jiang and Pugh:

At any given genomic locus, the preferential positioning of nucleosomes--called phasing--can be described. At most loci, there is an approximately Gaussian (normal) distribution of nucleosome positions around particular genomic coordinates, ranging from ~30 by for highly phased nucleosomes to a random continuous distribution throughout an array. How much of this variation is due to genuine positional heterogeneity and how much is an artefact that is caused by overtrimming or undertrimming of the DNA at nucleosome borders by micrococcal nuclease during sample preparation remains to be determined.

Does that answer your question? Basically the reads are overlapped because there are many of them and they may differ either because of actual variance in the DNA binding site (which is biologically significant) or because of variance in the DNA digestion site (which is a methodological flaw). As with any biological experiment, you must perform many replicates to statistically determine the source of the variance and this is done by taking many reads. I recommend you read the article I quoted from, it is a good overview of histone mapping with ChIP and is likely also applicable to what you're studying.

Also, do you know why there are two sets of reads (red and blue)? I would venture a guess that they are taken from opposite strands. In other words, they are reads of the same DNA region taken from different directions. That might help clarify what's going on.

To conclude:

DNA is not extracted from a single cell. In usual procedures it is always from a collection of cells or from tissues.

The overlap in TF associated reads arises because of the TF bound sequence coming from different cells in this collection.

Each DNA piece is amplified and clusters are formed in the NGS sequencers (for the ease of detection);

In a homogeneous cell collection i.e. all cells are of same type, the variability in the read position usually arises because of flexibility of binding sites (i.e TF wont exactly bind to at a certain locus). Also, as previously mentioned the variability can arise during the technical procedures.

In case of heterogeneous tissues the variability arises because of different biochemical status of different cells. Strictly speaking there is always some heterogeneity even in an apparently homogeneous tissue but tissues such as brain are highly heterogeneous

In the case of nucleosomes the positioning is not exact because the sequence contribution to the position is very minimal. Certain nucleosome positions are tightly maintained for e.g. around the TSS (transcription start sites) whereas in other regions the positioning is fuzzy because of the dynamic nature of the chromatin.

Blog

Monday, 12 September 2016

bioinformatics - introduction to Chip Seq

No comments:

Post a Comment

evolution - Are there any multicellular forms of life which exist without consuming other forms of life in some manner?