Wednesday, 25 October 2017

molecular biology - basic programming and bioinformatics



As a molecular biology graduate student, I have decided to learn some basic programming and bioinformatics, since everybody says it is crucial. For example, what would you learn if you needed to work with RNA-Seq data, and to compare and interpret it?


Thanks!



Answer



Indeed, the question is broad and quite hard to answer, I think. I'll give it a try. Edits to improve this answer are very welcome.


Bioinformatics is a big field. Bioinformaticians need basic knowledge of



  • biology


  • molecular genetics

  • population genetics

  • programming

  • statistics


You may find courses on statistics applied to bioinformatics here (R-language) and here (I haven't watched these sources).


How to start programming? - Python


You seem to be mostly interested in programming. I think Python is a very good way to get started. Programming might look a bit scary when you don't really know what it is about, but within a few days you can acquire basic knowledge and already solve some pretty neat problems. Many people actually have lots of fun learning how to program, and you'll probably be amazed by the power this tool offers. I personally really enjoyed learning to program in Python. I did it in a day or two (I was mostly interested in object-oriented programming; you'll learn what that means) with a very good source which, unfortunately, is not available in English. But there are tons of introductory documents, and you'll have no difficulty finding a good one. I'd advise you to download Python right away and to look at online courses on Khan Academy or edX (I haven't watched them).
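To give a flavour of what those first few days can get you, here is a minimal sketch of two classic beginner exercises on DNA strings. The function names and the example sequence are made up for illustration; for real work you would probably reach for an existing library instead.

```python
# Translation table mapping each base to its complement (A<->T, C<->G).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def gc_content(seq: str) -> float:
    """Return the fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0  # avoid dividing by zero on an empty sequence
    return (seq.count("G") + seq.count("C")) / len(seq)


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    dna = "ATGCGCTA"  # hypothetical example sequence
    print(gc_content(dna))          # 0.5
    print(reverse_complement(dna))  # TAGCGCAT
```

Save it as a file, run it with `python`, and then try changing the sequence or adding a function of your own.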


Data analysis - R


While Python is very popular, I think that, as a biologist, it is very important that you know about R. R is a programming language that can be slow (compared to C, Java, …) but is very useful for statistical analysis and visual display of data. Also, many people use R in bioinformatics (typically for phylogenetic analysis). I think that acquiring basic knowledge of R takes more time than for Python: R is mostly used for its huge number of existing functions, so you have to learn many of these functions before realizing that R can indeed be much more useful than Python for some tasks.



Command line - Shell script


Shell scripting (Bash, for example) is very specific and very important too. It is very useful for manipulating and transferring files, managing processes, or pretty much anything else that happens on your computer.


Other


C and C++ are very fast and widely used as well. Perl is commonly used for genomic sequence analysis (although Perl is slowly losing users to Python).


Usefulness of programming


You also ask about the usefulness of programming. Well, it is used in pretty much all areas of biology: analyzing empirical data, computer simulations in population genetics, graph theory, annotating DNA sequences, … I would guess that most biologists have at least some basic programming knowledge. The main point of programming is that a computer performs calculations much faster than anything you could ever do by hand or with a calculator. Typically, in bioinformatics, the analysis of DNA sequences involves very intensive calculations and requires a lot of computing power. Constructing phylogenetic trees, determining the goodness of fit of evolutionary models, annotating DNA, aligning DNA sequences, analyzing microarrays and many other things are all tasks that require programming.
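As a toy illustration of that kind of repetitive calculation, the sketch below counts the differing positions between every pair of already-aligned sequences. The species names and sequences are invented, and real phylogenetic software uses far more sophisticated distance models; this only shows the sort of bookkeeping a program takes off your hands.

```python
from itertools import combinations


def hamming_distance(a: str, b: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def pairwise_distances(seqs: dict) -> dict:
    """Return {(name1, name2): distance} for every pair of sequences."""
    return {
        (n1, n2): hamming_distance(s1, s2)
        for (n1, s1), (n2, s2) in combinations(seqs.items(), 2)
    }


if __name__ == "__main__":
    # Hypothetical aligned sequences, made up for this example.
    aligned = {
        "sp_A": "ACGTACGT",
        "sp_B": "ACGTTCGT",
        "sp_C": "TCGTTCGA",
    }
    for pair, dist in pairwise_distances(aligned).items():
        print(pair, dist)
```

With three short sequences you could do this by hand; with thousands of sequences that are millions of bases long, you could not.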

