Monday 18 January 2016

biochemistry - What does it mean to "write an image and GIF into the DNA of bacteria"?


BBC News recently published an article saying that:



An image and short film has been encoded in DNA, using the units of inheritance as a medium for storing information ... The team sequenced the bacterial DNA to retrieve the gif and the image, verifying that the microbes had indeed incorporated the data as intended.



This is the image:


The news article shows an image of a hand (shown above) and a short film (not shown here) of a horse rider that was encoded into the DNA "using a genome editing tool known as Crispr [sic]".


My question is, what does this mean? Did the scientists break down an image into 0's and 1's and (install?) it into bacteria? How does a scientist (download?) an image into bacteria and then (redownload?) the image later? How does DNA hold information of a picture that can be (downloaded)?



Answer




Just to add what might have been missing in the beautiful answer by @iayork. I just want to give a more simple picture of the encoding done in the E. coli DNA.




  • First for the rigid strategy in which 4 pixel colors were each specified by a different base, suppose we have a sequence:


    AAGCCCTGGTCAGCT


    Ignore the first AAG and start with C. Now, each base of DNA can represent a 2-digit binary number, and each number then corresponds to a color, like:


    C = 00


    T = 01


    A = 10


    G = 11



    With this strategy in mind, the sequence CCCT would give 00000001 pixet (or pixel set), and so on as the sequence grows. This pixet would define the color of four pixels in the image. Thus, each base corresponds to a pixel in the image, and the base defines the color of the pixel in a 4-color image.




  • Now, lets come to the flexible strategy. To begin with, see the table again:


    flexibe strategy table


    Here we are using standard 3-base codons. From the predefined value for each color (1 to 21), we can find the color using the codon. For example, from the same sequence:


    AAGCCCTGGTCAGCT


    Ignore AAG again and start with CCC. From the table, CCC encodes a value of 1. Move to next, TGG encodes a value of 16, TCA encodes 10 and GCT encodes 7, and so on for longer sequences. So, now we get an image with 4 pixels i.e. 2 x 2 with the pixels having color code 1, 16, 10, 7. In this way, each pixel can have a color from predefined values. On extracting this data, the image comes out as (from gizmodo):





image


The above part talked mostly about the single image of a hand. Now, talking about the horse-riding GIF, the process is almost the same. Here, we have to encode 5 images instead of one. Scientists encoded these 5 images in 5 different cells. After culturing them for some generations, they extracted the information of all images (using standard bioinformatics tools) and compiled them to get the GIF back. The initial and final GIFs look like this (from wired.com):


GIF


What do these rigid and flexible mean?


In this technique, the terms rigid and flexible are more about individual base rather than the codon. In the rigid strategy, the value of each base is fixed i.e. rigid. For example, in any sequence, C will encode the value '00', whatever the next or previous base is. This means that in both CCCT and GGTC, C has its rigid value '00'. So, for a 4-color image, where each base rigidly corresponds to the color of a pixel, we get as many pixels as the bases in the sequence.


On the other hand, in the flexible strategy, the individual bases do not have a fixed value, and the overall value of a pixet is defined by all the bases encoding that pixet. For example, TCC encodes a value of 6 while CCC encodes 1. The value of individual base is degenerate (or flexible), hence the name flexible strategy.


Thus, in a nutshell, while the rigid strategy is more efficient since one pixel is defined by one base (whereas in flexible strategy, one pixel is defined by one codon), the flexible strategy is better suited for getting more colored images since you get more color options by increasing the number of bases in a codon (whereas you only get 4 colors in rigid strategy, defined by 4 bases).


Why are we ignoring AAG?


As @canadianer points out in their answer, AAG is a PAM i.e. Protospacer Adjacent Motif. According to Wikipedia:




Protospacer adjacent motif (PAM) is a 2-6 base pair DNA sequence immediately following the DNA sequence targeted by the Cas9 nuclease in the CRISPR bacterial adaptive immune system. PAM is a component of the invading virus or plasmid, but is not a component of the bacterial CRISPR locus.



In simple terms (avoiding technical details), PAM is required for the CRISPR to function, but is not a part of the sequence itself. Much like a punctuation, it is necessary for proper functioning of CRISPR, but it is not to be read for encoding/decoding purpose. For the Cas9 found in E. coli (and is the most popular one), the sequence AAG serves as a PAM and is thus not used for encoding purpose here. Scientists also avoided to use AAG in their pixets so that there wouldn't be more than one recognition site for integration (ignore this point if you're unaware of the working of CRISPR).


Reference: Shipman, S., Nivala, J., Macklis, J. and Church, G. (2017). CRISPR-Cas encoding of a digital movie into the genomes of a population of living bacteria. Nature. http://dx.doi.org/10.1038/nature23017


No comments:

Post a Comment

evolution - Are there any multicellular forms of life which exist without consuming other forms of life in some manner?

The title is the question. If additional specificity is needed I will add clarification here. Are there any multicellular forms of life whic...