Thursday 25 April 2019

bioinformatics - Why and how does UniProt list around 150,000 proteins in the human genome?


Using organism:"Homo sapiens (Human) [9606]" as a query in UniProt returns about 146,000 proteins. I was under the impression that there were only 20,000-25,000 protein-coding genes in the human genome. Does this have something to do with splice isoforms, like those from SpliceProt or another splice isoform database or tool?



Answer




Well, you are assuming one sequenced genome/proteome per NCBI taxonomy ID. That is no longer true. If you click on the proteome filter, the count drops by roughly half, which gets you into the 60,000 range. Not all of these are "different" conceptual proteins; many are artifacts of the way GenBank/EMBL/DDBJ interact with the TrEMBL section of UniProtKB, i.e. the records are not normalized, in database speak.


So for the human case you also want to add the Swiss-Prot filter to get a decent proteome. That gives you about 20,000 proteins, corresponding to the predicted/confirmed human gene count.
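The three filter steps described above can be sketched as query strings. This is a minimal sketch, not the method used by the UniProt website: it assumes the current UniProt REST search endpoint and the query fields `organism_id`, `proteome` and `reviewed`, and it assumes `UP000005640` as the human reference proteome identifier. None of these names appear in the answer itself, so check them against the UniProt documentation before relying on them. The code only builds URLs; it does not contact the server.

```python
from urllib.parse import urlencode

# Assumed endpoint of the UniProt REST API (not stated in the answer above).
BASE_URL = "https://rest.uniprot.org/uniprotkb/search"

HUMAN_TAXON = 9606               # NCBI taxonomy ID from the question
HUMAN_PROTEOME = "UP000005640"   # assumed human reference proteome ID


def build_query(taxon_id, proteome_id=None, reviewed_only=False):
    """Combine the filters into one query string joined with AND."""
    parts = [f"organism_id:{taxon_id}"]
    if proteome_id is not None:
        parts.append(f"proteome:{proteome_id}")   # restrict to one proteome
    if reviewed_only:
        parts.append("reviewed:true")             # Swiss-Prot entries only
    return " AND ".join(parts)


def build_url(query, size=1):
    """Return a search URL for the query; nothing is fetched here."""
    return BASE_URL + "?" + urlencode({"query": query, "size": size})


queries = {
    "all records for the taxon": build_query(HUMAN_TAXON),
    "proteome filter": build_query(HUMAN_TAXON, HUMAN_PROTEOME),
    "proteome + Swiss-Prot filter": build_query(
        HUMAN_TAXON, HUMAN_PROTEOME, reviewed_only=True
    ),
}

for label, query in queries.items():
    print(f"{label}: {build_url(query)}")
```

Each successive query is stricter than the one before, which mirrors the narrowing from all records for the taxon, to one proteome, to the reviewed subset.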


In all bioinformatics databases you need to pay attention to the difference between database records and biological concepts. They rarely map cleanly one to one. In this case a UniProt record is not a protein but information about a protein, and a different record could hold information about the same "protein", or at least the same under some definitions of "same".


See the announcement of the draft human proteome in UniProtKB


The known isoforms are most often stored in the alternative products section of a UniProt entry. In rare cases, when a splice variant has a completely different biological function, the isoforms are described in separate UniProt entries. For human UniProtKB/Swiss-Prot there is a nearly one-to-one relation between genes and proteins. Such splice variants and fusion proteins are the exceptions to this rule.


TrEMBL tries to reduce redundancy in INSDC automatically by merging entries in the same taxid, and now the same proteome, that have an identical reported sequence. However, variations of single gene products due to mutations limit what auto-merging can do. For example, there are 8 records for the P53 gene in TrEMBL today, many of them from mutants, i.e. cancer genomes etc.
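A toy illustration of why identity-based merging leaves mutant records behind is sketched below. It is not TrEMBL's actual pipeline: the `Record` type, the accessions and the short sequences are all hypothetical, and the only rule implemented is the one described above, grouping records that share both taxid and exact sequence.

```python
from collections import defaultdict
from typing import NamedTuple


class Record(NamedTuple):
    accession: str   # hypothetical identifiers, not real UniProt accessions
    taxid: int
    gene: str
    sequence: str    # toy sequences, far shorter than a real protein


def merge_identical(records):
    """Group accessions that share the same taxid and exact sequence."""
    groups = defaultdict(list)
    for record in records:
        groups[(record.taxid, record.sequence)].append(record.accession)
    return dict(groups)


records = [
    Record("X0001", 9606, "TP53", "MEEPQSDPSV"),
    Record("X0002", 9606, "TP53", "MEEPQSDPSV"),   # identical: merged with X0001
    Record("X0003", 9606, "TP53", "MEEPQSDLSV"),   # one substitution: kept apart
    Record("X0004", 10090, "Tp53", "MEEPQSDPSV"),  # other taxid: kept apart
]

merged = merge_identical(records)
human_entries = [key for key in merged if key[0] == 9606]

print(f"{len(records)} input records -> {len(merged)} merged entries")
print(f"human entries for the same gene after merging: {len(human_entries)}")
```

Even a single substitution produces a distinct key, so one gene can still end up with several entries after merging.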

