Is there an agreed definition as to how many nucleic acid bases constitute a gene?
If not, why not? I'm not sure I understand how the exact sizes of genes are defined.
Answer
Is there an agreed-upon definition as to how many nucleobases constitute a gene?
If not, why not?
There is no such definition. A gene is a region of DNA that is transcribed. Typically a gene has a transcription start site (TSS) dictated by a promoter and a transcription stop site marked by termination signals (such as terminators and poly-A signals).
Some small RNAs (~18 nt) are produced from the TSSs of ordinary genes, but they are probably products of failed elongation. They are not really considered genes because they are heterogeneous in size and are not delimited by any boundary.
Technically there may be a minimum gene length: the stretch of DNA needed for RNA polymerase to bind, plus the termination signals. As indicated in the comments, the smallest gene may be a tRNA gene. However, the smallest annotated gene in the GENCODE annotation is TRDD1 (just 7 nt long!). This is not based on gene prediction; it was manually annotated by the HAVANA team.
What is the average length of a gene?
I just did a rough calculation from the GENCODE human genome annotation file (version 23).
The average transcript length seems to be around 1.5 kb.
The average gene length seems to be around 29 kb.
A gene is at least as long as any of its transcripts, because introns are removed from the transcript during splicing.
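To make the gene-versus-transcript comparison concrete, here is a minimal sketch in Python. The exon coordinates are invented for illustration (they do not come from GENCODE), and the function names `gene_span` and `spliced_length` are hypothetical. It assumes 1-based, inclusive coordinates, as used in GTF files.

```python
# Hypothetical exon coordinates (start, end), 1-based and inclusive.
# These numbers are made up purely to illustrate the arithmetic.
exons = [(1000, 1200), (5000, 5150), (9000, 9400)]


def gene_span(exons):
    """Genomic span from the first exon start to the last exon end."""
    first_start = min(start for start, _ in exons)
    last_end = max(end for _, end in exons)
    return last_end - first_start + 1


def spliced_length(exons):
    """Length of the spliced transcript: exon lengths summed, introns excluded."""
    return sum(end - start + 1 for start, end in exons)


print(gene_span(exons), spliced_length(exons))  # prints: 8401 753
```

With a single exon the two numbers are equal, which is the "or equal to" case.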
I made a histogram plot of these lengths for convenience:
[Histogram: transcript length distribution]
[Histogram: gene length distribution]
Note the sharp peaks at 100bp. Quite interesting!
Remi and user19099 have mentioned that the longest gene in humans is titin. It seems to be the longest gene in many other diverse animals as well. See "What's the longest transcript known?" for more details.
Methodology (so that limitations can be identified)
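To calculate the gene length distribution: I parsed the GTF file for "gene" entries (third field, i.e. feature) and subtracted the fourth field (start) from the fifth (end). Note that GTF coordinates are 1-based and inclusive, so the exact length is end - start + 1.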
To calculate the transcript length distribution: I extracted the transcript sequences as a FASTA file from the annotated locations, calculated their lengths, and plotted the distribution.
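The two steps above can be sketched in Python roughly as follows. This is not the script used for the numbers in this answer, only a minimal reconstruction under stated assumptions: the function names are hypothetical, the input is assumed to be a standard tab-separated GTF and a plain FASTA file, and gene length is computed as end - start + 1 (1-based inclusive coordinates). The toy records at the bottom are invented so the example runs without downloading anything.

```python
import io
from statistics import mean


def gene_lengths_from_gtf(handle):
    """Return the lengths of all 'gene' features in a GTF file handle."""
    lengths = []
    for line in handle:
        if line.startswith("#"):  # skip header/comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        # GTF has 9 tab-separated fields; field 3 is the feature type.
        if len(fields) < 9 or fields[2] != "gene":
            continue
        start, end = int(fields[3]), int(fields[4])  # fields 4 and 5
        lengths.append(end - start + 1)  # 1-based, inclusive coordinates
    return lengths


def sequence_lengths_from_fasta(handle):
    """Return the length of every sequence in a FASTA file handle."""
    lengths = []
    current = None  # running length of the record being read
    for line in handle:
        line = line.strip()
        if line.startswith(">"):  # a header line starts a new record
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:  # flush the last record
        lengths.append(current)
    return lengths


# Toy data, invented for illustration only.
toy_gtf = io.StringIO(
    "##description: toy annotation\n"
    "chr1\tHAVANA\tgene\t100\t199\t.\t+\t.\tgene_id \"g1\";\n"
    "chr1\tHAVANA\ttranscript\t100\t150\t.\t+\t.\tgene_id \"g1\";\n"
    "chr2\tHAVANA\tgene\t5\t11\t.\t-\t.\tgene_id \"g2\";\n"
)
toy_fasta = io.StringIO(">t1\nACGT\nAC\n>t2\nGGG\n")

print("mean gene length:", mean(gene_lengths_from_gtf(toy_gtf)))
print("mean transcript length:", mean(sequence_lengths_from_fasta(toy_fasta)))
```

For a real annotation, the handles would be opened files instead of the `io.StringIO` objects (with `gzip.open(path, "rt")` if the files are compressed), and the returned lists can be passed to any plotting library to draw the histograms.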