dna sequencing - Finding proteins in DNA sequence

Thursday, 2 February 2017

dna sequencing - Finding proteins in DNA sequence

I have to do a task for a university task and I need to understand some things before figuring out how to do it.

The task is the following:

Find matches of known proteins (DNA-PolyI,II,III) to the specific E.Coli DNA, sequence.

I downloaded in FASTA format the protein sequence of ~~DNA-Poly3~~ DNA-Poly1 of E.coli (strain K-12) and the entire DNA sequence of the E.Coli.

I've studied a bit on-line and using the BioRuby gem and the Ruby programming language I wrote a program that translates DNA to protein sequence. Then I tried to match the known DNA-Poly3 sequence but it did not match. After searching a bit on-line again, I learned about ORF and and the 6 possible reading ways of each frame. The longer, in terms of codons, ORF conformation is chosen but there's no way of telling for sure that the protein was made using this frame.

Then I've read about TATA boxes, but I can't use those since they can be found only in Eukaryotic and Archaea.

So how should I proceed in order to solve this problem: How can I prove that the DNA-Poly3 gets produced by a specific area (gene) in the DNA sequence?

Thanks for your time,

ps. Insights and hints are very much welcomed as this is just the tip of the iceberg for me and I'm very willing to study bioinformatics :-)

EDIT: This is an update for info requested in relevant answer

The files I have used are the following:

➜  Bioinfo  ruby dogma.rb 
----------------
DNA Length: 4639675

gi|48994873|gb|U00096.2| Escherichia coli str. K-12 substr. MG1655, complete genome
----------------
DNA Poly-1 sample: 928
gi|16131704|ref|NP_418300.1| fused DNA polymerase I 5'->3' polymerase/3'->5' exonuclease/5'->3' exonuclease [Escherichia coli str. K-12 substr. MG1655]

You can download them here: E.Coli DNA and E.Coli DNA-Poly1.

NOTE: My sample protein is DNA Polymerase I (and not 3).

Answer

IMPORTANT EDIT : In your particular case, if you are working with bacterial genes, splicing is not an issue since bacteria do not have introns. I am leaving the information here since it may be useful to someone else. However, I recommend you focus on the UTRs since they are probably what is causing you problems.

There are three things that could be causing you problems. I will briefly touch on each one. I will talk about all genes, bear in mind that bacteria have no introns so any discussion of splicing and/or introns and exons is not directly relevant to your problem.

1. UTRs

Untranslated Regions (UTRs) are sequences at the beginning and end of a gene that are not translated into protein. UTRs are regions that are part of the original genomic sequence, they are also part of the mature mRNA (indeed, UTRs are sometimes modified by splicing events, they are exons not introns) but they do not get translated into protein. To illustrate, have a look at this simplified representation of an mRNA molecule:

enter image description here

Only the green exons will make it into the final protein. Introns are spliced out and UTRs are not translated.

Therefore, if you translate the entire gene, you will not get the correct protein.

2. Reading frames

Genes are read in words of three letters (the codons). The sequence ATGTGTACCTGA has six possible reading frames (three on each strand) which can be read and translated as follows:

5'3' Frame 1
```
    ATG TGT ACC TGA
     M   C   T  Stop
```

5'3' Frame 2

    a TGT GTA CCT ga
       C   V   P

5'3' Frame 3

    at GTG TAC CTG a
        V   Y   L

3'5' Frame 1
```
TCA GGT ACA CAT
 S   G   T   H  
```

3'5' Frame 2
```
t CAG GTA CAC at
   Q   V   H    
```

3'5' Frame 3
```
tc AGG TAC ACA t
    R   Y   T    
```

DNA is double stranded. The sequence of one strand is complementary to that of the other, therefore if you have one strand you can infer the sequence of its complementary one. Genes can be found on either strand, the two are equivalent biologically. However, sequencing projects choose one of the two strands (randomly) and call it the plus (+) strand and then save all sequences with respect to that strand. This means that sometimes the genomic sequence that you download from a database might be the complement of the actual sequence you are looking for.

3. Names

I once heard someone say in a conference that

Biologists would rather share a toothbrush than a gene name.

While that might be a little exaggerated, naming conventions vary between research communities and species and databases. So, are you sure that you have downloaded the correct gene? Where did you get it from? How did you identify it? Does the sequence also contain up/downstream regulatory regions, promoters, enhancers and the like? If you post the exact sequence you are attempting to use I can give you more specific help.

For example, the first 20 hits when searching for the E. coli DNA Polymerase 3 in ncbi's nucleotide database, are whole genome shotgun sequences. These do not correspond to the gene sequence you are looking for. They are huge pieces of the genome (or even the entire genome) that will contain your gene and many others. Look at the Tools section below for suggestions on extracting your gene from the whole genome.

4. Splicing (irrelevant to bacteria)

Another possible problem is splicing. Lets start with the basics, the process of producing a eukaryotic (bacteria have no introns) protein from a genomic sequence is summarized in the image below (modified slightly from here):

enter image description here

Transcription begins at the transcription start site (TSS) but not all the transcribed sequence is translated into protein. First, the introns are spliced out of the mRNA to produce the mature mRNA (other things like capping and poly-A addition also occur but are not relevant here). So,the mature mRNA contains the exons of the coding gene. This means that a linear translation of the gene's sequence will not correspond to the protein produced. You will need to take splicing into account.

Also, bear in mind that splicing will change the reading frame.

Now, if the sequence ATGT were spliced at, for example, AT/gt (most splice events cut/join at GT/AG sites) and joined with the sequence agATTATT, the resulting (spliced) sequence would be (the splicing process will remove the gt from the first sequence and the ag from the second):

ATATTATT

As you can see, the reading frame has now changed. Where before, in the first reading frame, we had the codon ATG, the canonical translation initiation codon, we now have ATA which codes for isoleucine (I). I hope that is clear, the main point is that splicing can change the reading frame.

5. Tools

OK, that was the background. Now, what you will need to do is use existing programs that model splice sites and can correctly align a protein sequence to genomic DNA. My personal favorites are exonerate and genewise. On a Debian-based Linux distribution, you can install them with this command:

sudo apt-get install exonerate wise

Then, to align the protein to its gene do:

exonerate -m protein2genome -n 1 prot.fa dna.fa > out.txt

genewise -pep -pretty -gff -cdna prot.fa dna.fa > out.txt

In my experience exonerate is (much) faster but genewise is a little more accurate. I usually use exonerate if I am dealing with a whole genome and genewise if I only have a few kilobases of sequence. Both are very good and both will be able to align a protein to its genome of origin.

I will not explain all these options because that is beyond the scope of this site. Have a look at their documentation (which is quite good and clear) and if you still have problems, you could ask a question on our sister site, Bioinformatics Stackexchange

Alternatively, you could link your web application to the ucsc genome browser BLAT service. Click here to see the results when aligning the human DNA-directed RNA polymerase II subunit RPB1 protein.

Blog

Thursday, 2 February 2017