Saturday 10 March 2018

bioinformatics - How to selectively download 3 fields from every record in UniProtKB?


I want to download a comprehensive table of protein names from uniprot.org.


More precisely, I want to generate a tab-delimited table consisting of the id (accession), entry name, and protein names "columns" from the UniProtKB database.


I want to get these three columns/fields for all 80M records in UniProtKB, and I don't want to specify all those UniProt IDs through, e.g., a bazillion URL-encoded queries. Also, I need to do this from a host that I can access only through a text interface, which basically rules out browser-based solutions.



I just spent a couple of hours going back and forth over the UniProt site's docs, and I can find nothing useful. The Perl example given there¹ shows how to download full records, but downloading every full record from UniProtKB would be too slow and onerous to be considered².


Does anyone know how to modify the Perl example (or any other way) to download only the three desired columns from UniProtKB?




¹ You need to click on the phrase "Perl example" to see the code.
² I downloaded a small test sample of 1000 full records and found that the information I actually want from those records is only 0.2% of the total size. IOW, downloading the full records would take ~500x as long as downloading only the desired information.
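
One possible direction, sketched here rather than confirmed against the UniProt docs: ask the server for a tab-delimited result restricted to the three columns, and stream the response straight to a file so nothing has to be held in memory. The endpoint URL, the `format=tab`, `columns` and `limit` parameters, and the column names `id`, `entry name` and `protein names` are all assumptions taken from the column labels mentioned above; they may be named differently in the version of the service you are using. The helper names `build_url` and `download` are made up for this illustration.

```python
import shutil
import urllib.parse
import urllib.request

# Assumed endpoint; check the UniProt documentation for the current one.
BASE_URL = "https://www.uniprot.org/uniprot/"


def build_url(query="*", columns=("id", "entry name", "protein names"), limit=None):
    """Build a query URL asking for a tab-delimited table with only `columns`.

    The parameter names (query, format, columns, limit) are assumptions.
    """
    params = {
        "query": query,                # "*" is meant to match every record
        "format": "tab",               # tab-delimited output
        "columns": ",".join(columns),  # only the fields we care about
    }
    if limit is not None:
        params["limit"] = str(limit)   # handy for a small trial run
    return BASE_URL + "?" + urllib.parse.urlencode(params)


def download(url, out):
    """Stream the HTTP response body into the binary file object `out`."""
    with urllib.request.urlopen(url) as response:
        shutil.copyfileobj(response, out)
```

A small trial run would be something like `download(build_url(limit=10), sys.stdout.buffer)` (after `import sys`); for the full table, open an output file in binary mode and pass that instead. Only the standard library is used, so this should work on a text-only host.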



