Sunday, 18 December 2016

theoretical biology - Do biological phenomena follow Gaussian statistics?


I have recently entered the life sciences (from physics). I am concerned about the use of p values in the life sciences literature. For example, in this article, they test 9 - 12 rats in a control group and compare it to an experimental group. They use p values to claim that their results are statistically significant. This type of use of p values seems to be very common in the literature.


So here are my concerns:





  1. Why is it so often assumed that biological measurements follow a normal distribution? To my knowledge, this isn't known a priori.




  2. From my physical intuition, it seems quite challenging to claim "statistical significance" when using such small sample sizes.





Answer



kmm's answer is correct; I just want to add a few points on what kinds of data should follow a Gaussian distribution.






"Unless you know from observation that a process doesn't follow a Gaussian distribution (e.g., Poisson, binomial, etc.), then it probably does at least well enough for statistical purposes."



I won't fault kmm for this statement, because it describes what is prevalent in practice. This is what practically all biologists do, but it is an incorrect approach.


The Gaussian should not be treated as a default distribution; doing so may lead to incorrect inferences. Usually the experimenter has an idea of what kind of data they are measuring and what distribution the data are likely to follow. If you are unsure of the underlying distribution, you should use non-parametric statistical tests.
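As an illustration of a non-parametric approach for small groups, here is a minimal sketch of a Monte Carlo permutation test. The function name `permutation_test`, the choice of the difference in means as the test statistic, and the measurements are all assumptions made up for this example; they do not come from the article mentioned in the question.

```python
import random

def permutation_test(a, b, n_resamples=10_000, seed=0):
    """Two-sided Monte Carlo permutation test on the difference in group means."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the group labels are exchangeable,
        # so reshuffling them gives an equally likely data set.
        rng.shuffle(pooled)
        new_a, new_b = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(new_a) / len(new_a) - sum(new_b) / len(new_b))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)

# Hypothetical measurements, invented for illustration only.
control = [4.1, 3.8, 5.0, 4.4, 4.7, 3.9, 4.2, 4.6, 4.0]
treated = [5.2, 4.9, 6.1, 5.5, 4.8, 5.9, 5.3, 5.0, 5.7]
p_value = permutation_test(control, treated)
print(f"p = {p_value:.4f}")
```

The test makes no assumption about the shape of the distribution; it only assumes that, under the null hypothesis, the group labels could be swapped without changing anything.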




What kind of data follow Gaussian distribution?


According to the Central Limit Theorem (CLT), the mean or sum of a large number of independent and identically distributed (IID) random variables with finite variance approximately follows a Gaussian distribution. The random variable itself can follow any such distribution, but if you measure the sample mean many times through repeated experimentation, the distribution of those means will be approximately Gaussian.


From the Wolfram site:




Let $X_1,X_2,...,X_N$ be a set of N independent random variates and each $X_i$ have an arbitrary probability distribution $P(x_1,...,x_N)$ with mean $\mu_i$ and a finite variance $\sigma_i^2$. Then the normal form variate:


$$X_{norm}=\frac{\displaystyle\sum_{i=1}^N x_i-\sum_{i=1}^N \mu_i}{\sqrt{\displaystyle\sum_{i=1}^N \sigma_i^2}}$$


has a limiting cumulative distribution function which approaches a normal distribution.



The Wikipedia page on the CLT is also quite good; you can have a look at it too.
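The theorem can be checked numerically. The following is a small simulation sketch, assuming an exponential distribution (rate 1) as the skewed starting point; the function names and the use of skewness as a rough measure of non-normality are choices made for this example. For a mean of n IID draws the theoretical skewness is the skewness of a single draw divided by $\sqrt{n}$, so it should shrink towards the Gaussian value of 0 as n grows.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

def sample_means(n_per_mean, n_means=5000):
    """Return n_means sample means, each over n_per_mean exponential draws."""
    return [sum(rng.expovariate(1.0) for _ in range(n_per_mean)) / n_per_mean
            for _ in range(n_means)]

def skewness(xs):
    """Population skewness; 0 for a symmetric distribution such as the Gaussian."""
    m = sum(xs) / len(xs)
    sd = (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5
    return sum(((x - m) / sd) ** 3 for x in xs) / len(xs)

# n = 1 is the raw variable; larger n averages more draws per mean.
skews = {n: skewness(sample_means(n)) for n in (1, 5, 50)}
for n, s in skews.items():
    print(f"n = {n:2d}  skewness of sample means = {s:.2f}")
```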


Usually in biological experiments we measure some property, let's say the expression of some gene. When you do several replicates, and there is no specific underlying mechanism that would generate variation (i.e. the errors are purely random), the sample means will be approximately normally distributed. Note that the CLT guarantees this only for the sample means, not for the individual values. In certain cases, we assume that the variation in the value of a variable is due to random fluctuation and therefore consider the variable itself (not its mean) to be normally distributed; e.g. the weights of mice which are fed and raised equally. This is only an assumption, and it forms part of the null hypothesis.


Another point to note is that a variable expected to follow a normal distribution should be continuous in nature. Some discrete variables can be approximated as continuous, but one should have a good reason for doing so. For example, population sizes, though discrete, can be treated as continuous if the sizes are large.




The Poisson distribution is a distinct, discrete distribution. It arises only for certain kinds of phenomena, namely those that are Poisson processes. See this post for details. The Poisson distribution models the probability of N events in a given time interval for some given rate of events ($\lambda$). This rate is also called the intensity of the distribution.
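For reference, the standard form of this probability can be written as follows, where the symbol $t$ for the length of the time interval is introduced here for illustration:

$$P(N = k) = \frac{(\lambda t)^k e^{-\lambda t}}{k!}, \qquad k = 0, 1, 2, \dots$$

Both the mean and the variance of $N$ are equal to $\lambda t$.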





The binomial is another discrete distribution. Genotype counts resulting from Mendelian segregation of genes, for example, follow this distribution. It models the probability of N events (successes) in M independent trials, each of which has only two possible outcomes. The multinomial distribution is a generalization of the binomial distribution to more than two outcomes.
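As a worked example of the binomial case, here is a short sketch for a hypothetical monohybrid cross (Aa x Aa), where each offspring is assumed to be homozygous recessive (aa) with probability 1/4, independently of the others. The cross, the litter size of 12, and the function name are assumptions chosen for illustration; `k` plays the role of N and `n` the role of M in the text.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Assumed example: 12 offspring, each aa with probability 1/4.
n_offspring, p_recessive = 12, 0.25
probs = [binomial_pmf(k, n_offspring, p_recessive) for k in range(n_offspring + 1)]
for k, prob in enumerate(probs):
    print(f"P({k:2d} aa offspring) = {prob:.4f}")
```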




Since both the Poisson and the binomial are discrete distributions, they should not be confused with the normal distribution. However, under certain conditions they can be approximated by a Gaussian with the same mean and variance. For the binomial, this works when the number of trials is large and the success probability is not too close to 0 or 1; the approximation is best when the probability is 0.5, because the distribution is then symmetric. Similarly, if the intensity (rate) of the Poisson distribution is high or the time interval is long, the distribution of the Poisson RV can be approximated by a Gaussian. In these cases the expected count becomes large compared with the unit spacing between possible values, which is what permits a continuous approximation.
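The quality of these approximations can be checked directly. The sketch below compares each exact probability mass function with a Gaussian density of the same mean and variance and sums the absolute differences over the support; the function names, the gap measure, and the parameter values are assumptions chosen for this example.

```python
from math import exp, lgamma, log, pi, sqrt

def normal_pdf(x, mu, var):
    return exp(-((x - mu) ** 2) / (2 * var)) / sqrt(2 * pi * var)

def binomial_pmf(k, n, p):
    # Log-space evaluation avoids overflow for large n (requires 0 < p < 1).
    log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log(1 - p))
    return exp(log_pmf)

def poisson_pmf(k, lam):
    return exp(k * log(lam) - lam - lgamma(k + 1))

def binomial_gap(n, p):
    """Sum over k of |exact pmf - Gaussian density with the same mean and variance|."""
    mu, var = n * p, n * p * (1 - p)
    return sum(abs(binomial_pmf(k, n, p) - normal_pdf(k, mu, var))
               for k in range(n + 1))

def poisson_gap(lam):
    """Same comparison for a Poisson with mean lam, up to mean + 10 standard deviations."""
    upper = int(lam + 10 * sqrt(lam)) + 10
    return sum(abs(poisson_pmf(k, lam) - normal_pdf(k, lam, lam))
               for k in range(upper + 1))

for n in (10, 100, 1000):
    print(f"binomial n={n:4d}: p=0.5 gap={binomial_gap(n, 0.5):.4f}  "
          f"p=0.1 gap={binomial_gap(n, 0.1):.4f}")
for lam in (2, 20, 200):
    print(f"poisson mean={lam:3d}: gap={poisson_gap(lam):.4f}")
```

A gap near 0 means the Gaussian curve passes almost exactly through the discrete probabilities; the gap shrinks as the number of trials or the expected count grows.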


Many datasets show power-law-like or otherwise skewed distributions, and people often make the mistake of assuming them to be normal. An example (from my experience) is the expression of all the genes in a cell: very few genes have high expression and many genes have low expression. The same applies to the degree distribution of nodes in some real networks, such as gene regulatory networks.
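A quick way to spot this kind of skew is to compare the mean with the median. The sketch below uses simulated log-normal values as a stand-in for expression levels; the distribution, its parameters, and the variable names are assumptions for illustration, not real expression data.

```python
import random
from math import log
from statistics import mean, median

rng = random.Random(1)
# Assumption: log-normal values stand in for expression levels of 10,000 genes.
expression = [rng.lognormvariate(0.0, 1.5) for _ in range(10_000)]
log_expression = [log(x) for x in expression]

# In a symmetric (e.g. Gaussian) sample the mean and median nearly coincide;
# a few very large values pull the mean well above the median.
above_mean = sum(x > mean(expression) for x in expression) / len(expression)
print(f"raw:  mean = {mean(expression):.2f}, median = {median(expression):.2f}")
print(f"log:  mean = {mean(log_expression):.2f}, median = {median(log_expression):.2f}")
print(f"fraction of values above the raw mean = {above_mean:.2f}")
```

In this simulated case the values look roughly symmetric only after a log transform, so a test that assumes normality of the raw values would be misapplied.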




In summary, you can reasonably assume a Gaussian distribution when:



  • The variable is a measurement that is repeated several times on identical samples (replicates)

  • The variability is expected to be purely random in the control case (in a t-test, when you reject the null hypothesis you are actually saying that a certain variable does not follow the normal distribution assumed under the null hypothesis)

  • The variable is continuous, or discrete with values large enough to be treated as continuous


