Think Gene

a bio blog about genetics, genomics, and biotechnology

How Much Data is a Human Genome? Not Much.

I recently noted in Napster of Medicine that an entire human genome would fit on a music CD.

How much data IS a human genome?

  • 2 bits per base (4 bases = 2^2)
  • 3,080.4 Mb per human genome [1]
  • 700 MB per CD-ROM
(1 human genome) *
(3,080,400,000 bases / 1 human genome) *
(2 bits / 1 base) *
(1 byte / 8 bits) *
(1 MB / 1,048,576 bytes) =

734.4 MB per uncompressed human genome. With basic file compression like gzip, that should easily fit on a 700 MB CD.
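The arithmetic above is easy to check in a few lines of Python. This is only a sketch: the helper name `genome_size_mb` is made up for illustration, and the base count is the figure from [1].

```python
# Size of a genome stored at a fixed number of bits per base.
# genome_size_mb is a hypothetical helper name, not from any library.

BASES_HAPLOID = 3_080_400_000   # bases per haploid human genome, from [1]
BYTES_PER_MB = 1_048_576        # 1 MB = 2**20 bytes, as in the post


def genome_size_mb(bases: int, bits_per_base: int = 2) -> float:
    """Return the uncompressed size in MB."""
    total_bits = bases * bits_per_base
    total_bytes = total_bits / 8
    return total_bytes / BYTES_PER_MB


haploid_mb = genome_size_mb(BASES_HAPLOID)
print(f"haploid: {haploid_mb:.1f} MB")       # 734.4 MB
print(f"diploid: {2 * haploid_mb:.1f} MB")   # twice the haploid figure
```

Changing `bits_per_base` shows how quickly the total grows if each base needs more than 2 bits.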

Actually, while writing this post I invented a technique to get the file size down to about 10MB, but I need to file a patent before disclosing. Sorry. (yes, 10MB, as in, the size of an mp3 song)

NOTE: Commenter “neandrothal” noted that this is the size of a haploid human genome. Humans are diploid: they have two of each autosome and two sex chromosomes. So this is the size of a reference haploid human genome, not a complete human individual genome, which would be twice as much data (2 music CDs). Thanks, neandrothal!

[1] Scherer, Stewart. 2007. A Short Guide to the Human Genome. 6.

MB = megabytes
Mb = megabase

Viewing 45 Comments

    • ^
    • v
    3 billion is the size of a *haploid* human genome. Since we have two copies of each of our chromosomes (except for men and the sex chromosomes), technically the number of bases is 6 billion, though of course the vast majority of these will be the same between two homologous chromosomes. But I'm sure the compression algorithm would recognize that...
    • ^
    • v
    As there are 4 bases, the base ID could be represented in a 2-bit field for each base in a sequence. So, for storage, A G C T could be stored as 00 01 10 11 respectively and then retrieved and converted back to human-readable A G C T.
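A minimal sketch of that 2-bit scheme, using the A=00, G=01, C=10, T=11 mapping from the comment. The function names `pack_bases` and `unpack_bases` are hypothetical, and keeping the sequence length alongside the packed bytes is an assumption of this sketch.

```python
# Pack a DNA string into 2 bits per base and unpack it again.
# The mapping A=00, G=01, C=10, T=11 follows the comment above;
# pack_bases / unpack_bases are hypothetical names.

CODE = {"A": 0b00, "G": 0b01, "C": 0b10, "T": 0b11}
BASE = {v: k for k, v in CODE.items()}


def pack_bases(seq: str) -> bytes:
    """Pack four bases into each byte, first base in the high bits."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        # Left-align a short final chunk so unpacking stays uniform.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack_bases(data: bytes, length: int) -> str:
    """Recover `length` bases; the length must be stored separately."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASE[(byte >> shift) & 0b11])
    return "".join(bases[:length])


packed = pack_bases("GATTACA")
print(len(packed))               # 2 bytes for 7 bases
print(unpack_bases(packed, 7))   # GATTACA
```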
    • ^
    • v
    Pretty cool. Just 10MB with the right compression. But I always thought it takes 7 bits? Thanks for letting me know. Palonek @ http://www.edwardpalonekblog.ca/
    • ^
    • v
    neandrothal: Yes and yes. I will make a note of that because it's a good point.

    Palonek: Why 7 bits? That's 2^7 = 128. You may be thinking of ASCII, which is 7 bits, to write the literal letters "A G C T." If ASCII is the encoding your biotech or lab uses for massive DNA files, you are using over 3.5 times the data (so 3.5 times the bandwidth, 3.5 times the storage, and sometimes 3.5 times the processing power). That's bad.
    • ^
    • v
    As there are 4 bases. The base id could be represented in a 2-bit field for each base in a sequence. So, for storing, A G C T could be stored as 00 01 10 11 respectively and then retrieved and converted to human readable A G C T.
    Drew's assumption would be what I would do for storing this kind of data as, of course, there is a lot of it.
    • ^
    • v
    Your calculations are pretty much correct. The reference human genome still contains some unknown portions, so you need to be able to represent at least one possibility in addition to ACGT. But since you were talking probably about the real human genome, not the current unfinished data, that problem wouldn't apply.

    Using the ".2bit" format, human genome version "hg18" fits into a file listed here as 770 MB.

    http://hgdownload.cse.ucsc.edu/goldenPath/hg18/...

    The format is described here:

    http://genome.ucsc.edu/FAQ/FAQformat#format7

    There is also the earlier, and still sometimes useful, "nibble" format, which stores 2 bases per byte.
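To illustrate why an extra symbol costs more than 2 bits, here is a sketch of a 4-bits-per-base packing with room for N (unknown base). The code table is invented for illustration and is not the actual byte layout of the UCSC formats; `pack_nibbles` and `unpack_nibbles` are hypothetical names.

```python
# Four bits per base leaves room for symbols beyond A, C, G, T, such as
# N for an unknown base. The code table here is made up for illustration
# and is not the byte layout of UCSC's file formats.

NIBBLE = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}
SYMBOL = {v: k for k, v in NIBBLE.items()}


def pack_nibbles(seq: str) -> bytes:
    """Store two bases per byte: first base in the high nibble."""
    out = bytearray()
    for i in range(0, len(seq), 2):
        high = NIBBLE[seq[i]]
        # Pad an odd-length sequence with N in the final low nibble.
        low = NIBBLE[seq[i + 1]] if i + 1 < len(seq) else NIBBLE["N"]
        out.append((high << 4) | low)
    return bytes(out)


def unpack_nibbles(data: bytes, length: int) -> str:
    """Recover `length` bases from the packed bytes."""
    bases = []
    for byte in data:
        bases.append(SYMBOL[byte >> 4])
        bases.append(SYMBOL[byte & 0x0F])
    return "".join(bases[:length])


packed = pack_nibbles("ACNNGT")
print(len(packed))                  # 3 bytes for 6 bases
print(unpack_nibbles(packed, 6))    # ACNNGT
```

At two bases per byte, the same sequence takes twice the space of the 2-bit packing, which is the price of representing unknown bases inline.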

    In biology, the sequence of ACGT isn't all that contains inherited information. (There are also all the proteins you inherit along with DNA, and DNA methylation, and lots more still to discover.) But I wouldn't know where to start to compute the information content there.
    • ^
    • v
    Ed, you bring up a very good point about methylation and other proteins on the DNA. If you only care about sequence, these things don't matter, but they definitely influence which genes are active or repressed, and even how active a gene is. I suppose it depends on what you're storing the data for.
    • ^
    • v
    As for the hope of crunching 770MB down to 700MB, it should be noted that programs like gzip rarely get better compression than 2 bits per nucleotide [1].

    Also, FASTA and other file formats are not primarily used for storage but transport. Almost no bioinformatics program operates directly on ASCII-data, but transforms such exchange formats to some internal representation.

    For the 10MB I guess the author thinks in terms of working on a diff with respect to some reference genome. While that is probably workable for applications on the human genome, it's not really patentable (UNIX patch and diff being older than me and there's probably even older prior art) and impracticable on a general scale. Impracticable because an index for describing any sequence in such a relative way would be far too big, i.e. it would probably require more storage than only transferring the sequences worked on directly.

    [1]
    • ^
    • v
    Thomas: I'd say that most bioinformatics scripts and programs operate on ASCII data; bit-packing the data beforehand is rather the exception, reserved for a few hard-core tools like BLAT/BLAST. Most everyday scripts still parse FASTA into strings and operate on them.
    • ^
    • v
    In truth, human genomes can be more complex than even diploid (think CNV, copy number variation). This is especially true for cancer genomes. You may also want to capture more than one genome in the reference, e.g., you may want to include the variations in dbSNP in your reference. To include just SNPs, you could expand your four-letter alphabet to include all the IUPAC DNA codes. Including indels would be even more complicated. Here are some thoughts on how to represent a complex reference genome.
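As a rough illustration of the IUPAC idea, the sketch below maps a set of alleles at one position to its single-letter ambiguity code and counts the bits a larger alphabet needs. The table is the standard IUPAC nucleotide code set; `snp_code` and `bits_needed` are hypothetical helper names.

```python
# IUPAC nucleotide codes: each letter stands for a set of possible bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}
# Reverse lookup from a set of bases to its single-letter code.
CODE_FOR = {frozenset(bases): code for code, bases in IUPAC.items()}


def snp_code(alleles: str) -> str:
    """Return the IUPAC letter covering all the given alleles."""
    return CODE_FOR[frozenset(alleles)]


def bits_needed(alphabet_size: int) -> int:
    """Smallest whole number of bits that can index the alphabet."""
    return (alphabet_size - 1).bit_length()


print(snp_code("AG"))             # R
print(bits_needed(4))             # 2 bits for A, C, G, T
print(bits_needed(len(IUPAC)))    # 4 bits for the 15 IUPAC codes
```

So a reference that records SNPs inline this way would need 4 bits per base rather than 2, before any compression.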
    • ^
    • v
    For storing the methylation information you will need another bit for each base. This makes it 3 CDs :-)
    • ^
    • v
    Now, please do take a look at your fingertips. You'll see the fine lines of your fingerprint pattern. It is unique and can be used to identify a human; so fine, and even much finer, structures are defined in your organism.
    Now, how much information would be needed just to describe the 3D positions that make up a human?
    You would need to position single cells, define the inner structure of particular cell types, describe the form of single nerve cells (dendrites), etc.
    Now, how many cells are there in the human organism?
    Without any calculation, we can see that the information needed to describe a human runs to uncounted terabytes. Human chromosomes contain, as calculated here, 740 MB.
    So why, for God's sake, do we believe that the whole of our hereditary information resides in the genes?
    • ^
    • v
    Because it's patterns. It's 1 genome per cell, and it's instructions that may say "keep making these until chemical_gradient_$c-32 falls below threshold x and then stylize them based on concentration of chemical_gradient_^f-03".

    Well, maybe that's a bit hard to follow so let's try this instead: how many hairs are on your arm? Well, I don't really care about the particular number but what I want to know is if you had the same number of hairs on your arm when you were a child, and I mean the three foot tall variety.

    No, no you didn't. You had many fewer BUT they were about the same distance apart. Now, I'm sure you know that your arms don't just grow at the ends- there's a lot of growth in the middle and it's more or less continuous... but how could you add new hairs evenly spaced in that?

    Well it's simple. Much like our DNA you just need two values to keep track of it (though it's not really bits, it's not THAT simple). You need a protein that causes hairs to grow and you need a protein that prevents them from growing. Like a lot of things in our body, the protein that prevents hair from growing just stops cells from making the protein that promotes hair formation, but the promoting protein promotes the preventer and promotes itself. There's another trick though. The preventer moves around between cells much more easily than the promoter.

    No need to do mental gymnastics here, I'll just state the end result: cells with a high concentration of the promoter make enough of it to overcome the effects of the preventer, while cells with a low concentration are held down by the preventer... up to a point. If there aren't any hairs close enough to prevent another from growing, the cells there don't receive enough of the preventer, so the promoter takes over and gives you another hair.

    A similar set up is also used to make sure you don't grow two heads. In fact this kind of thing is used so often that we can safely say the information used to build your body is many many times smaller than the actual information it would take to record the current state of your body.

    If you're much of a programmer you know how just a few lines of code (file might end up being a few kb if you didn't want it really small,) could produce an image of many gigabytes in size, if you had some reason to let it make a large enough image.

    Don't get me wrong though. There is more to us than our DNA.
    Our DNA basically lays out the boundaries of what we can possibly grow to be and the environment we grow in narrows it down until we reach that single possibility that is ultimate "you."
    • ^
    • v
    Well, it is not hard to follow what you explain. It could be summarized as data compression of the hereditary information.
    (like in a compressed image: ... 255, write the white pixel 1244 times in this line, and then gray 110, 86 times ...)

    Data compression is the least that we can expect from the so obviously ingenious nature of living things. I certainly do not expect the structure of a skin cell to be written down as many times as there are cells, or mitochondria described x times, etc.
    One time is enough. You still have to position the cells precisely along the lines of the fingerprints, and I do not even want to mention the brain. The exact position of a single hair, 2 mm right or left, might be of low priority to the organism, but the way the nerve cells are connected certainly isn't.

    Tissues and cells have spatial relative positions and shapes. With all the compression the hereditary information is expectedly subjected to ( however and wherever stored), the size of the 'file' must still be enormous.

    The human brain alone is said to have 100 billion (10^11) cells, and a multiple of that number in dendrites that realize the complex brain circuitry through synapses with other cells. Even if we take into account the certain existence of 'typical' circuits, the amount of information needed to describe the brain remains mind-boggling.

    Even on the cell level, numerous cell types have a very complex internal life with very intricate and ingenious internal chemical regulation and metabolism. This exists in a scaled-up form on the tissue, organ and organism level, too.

    It should strain any informed credulity a bit that even the structure and functioning of the cell types in the human organism can be described with 740 MB, with the best compression methods thinkable.

    You probably mean fractals when you mention generating images with simple algorithms. Nature certainly uses fractal-like shapes (broccoli, flowers, etc.) where it suits the function (nature simply uses everything :), but try describing the wing profile of a bird, or brain circuitry for that matter, with a fractal. The exact topology and shape of the last two are crucial to the function and cannot be left to a will-o'-the-wisp fractal. That is why you cannot recognize it, the innumerable recursions of a simple form, in the design of, say, a human skull. Trying to describe it mathematically, we would end up with anything but simple fractal formulas.

    Why don't we start with something simpler, say fractalizing the shape of a ship's hull, or compressing a song recording to a fractal, before trying it on the forms of the higher organisms?

    Genetics has a problem here, and a big one, too: the location of the greatest part of the hereditary information is not known.
    Mentioning fractals to explain this looks to me akin to looking for a miracle. What cannot be explained is mysterious; rationalism abhors the mysterious and any suggestion that things unknown may exist. Interestingly, and typically for our age of reason, you search for the solution in a field we know something about: fractals. Indeed, is there anything that we, Descartes' grandchildren, do not know?

    Wouldn't it be simpler to say 'We do not know', accepting that the answer may lie in the mysterious realm of the unknown?
    Socrates would have liked that answer better, I am sure.
    Denying the ignorance, one never starts searching for the answer.
    • ^
    • v
    740MB is the size of a human haploid nucleotide base string, not the data necessary to describe a mature human.

    We believe that most of our hereditary information resides in genes because it does. However, a genome, as you say, cannot possibly fully describe a mature human. A genome is more like a brief mathematical equation used to produce beautifully complex fractal design when fed with ambient noise and interpreted as colors and coordinates on a screen.
    • ^
    • v
    I am trying to draw your attention to this: a human, just like any other organism, has its qualities determined, and their description must then reside somewhere. The amount of information needed to describe a human organism is enormous, and the amount of information carried by the genes is very limited in comparison.
    Now, let's take a look at this possible analogy.
    Imagine you are demonstrating a PC to someone who has no idea of computers whatsoever, and has never seen one. (Increasingly difficult to find, but there must still be some around :)
    OK, you show him how inputs on the keyboard produce results on the screen. Knowing nothing about the PC under the desk, our computer novice has to think that the keyboard alone causes all the fascinating happenings on the screen.
    Now our virtuous genetics has got hold of the keyboard, the genes; making changes there changes the organism. But how, for God's sake, does it follow that all the hereditary information resides there, and not on some 'HD' somewhere, away from the 'keyboard'?
    I am simply pointing out that the 'keyboard' has practically no data storage capacity for the task.

    'We believe that most of our hereditary information resides in genes because it does. '

    Oh, pardon the heresy involved, but I really don't know how you know that.
    • ^
    • v
    My PC only needs a tiny little fractal program to generate a fractal world of incredible complexity and beauty (e.g. the Mandelbrot set). Two identical program runs will produce identical outputs, unless I introduce a bit of noise. The earlier the noise, the wider the divergence. Hence identical twins have differing fingerprints.

    So I can believe the small numbers quoted.
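For a sense of how small such a program can be, here is a sketch of a text-mode Mandelbrot renderer. The function names, the viewing window, and the `max_iter=50` cutoff are arbitrary choices for illustration.

```python
# A few lines that can render the Mandelbrot set at any resolution.
# in_mandelbrot / render are illustrative names; max_iter=50 is arbitrary.

def in_mandelbrot(c: complex, max_iter: int = 50) -> bool:
    """True if z -> z*z + c stays bounded for max_iter steps."""
    z = 0j
    for _ in range(max_iter):
        z = z * z + c
        if abs(z) > 2:
            return False
    return True


def render(width: int, height: int) -> list[str]:
    """Return rows of text; '#' marks points inside the set."""
    rows = []
    for j in range(height):
        y = -1.2 + 2.4 * j / (height - 1)
        row = ""
        for i in range(width):
            x = -2.0 + 2.6 * i / (width - 1)
            row += "#" if in_mandelbrot(complex(x, y)) else "."
        rows.append(row)
    return rows


# The same short program fills 60x24 characters or, given time, far more.
print("\n".join(render(60, 24)))
```

The source stays the same size whether `render` is asked for 60 columns or 60,000; only the running time and the output grow.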
    • ^
    • v
    I believe you about the beauty of the fractals your PC creates out of a short equation. But what do you think: would it be possible to write a fractal that would replicate in 3D the face of your friend? Remember, we are speaking about the form of the organism here; a human face does not look like a cauliflower.

    Fractal pictures are generated through recursive use of an equation; it appears innumerable times in the created picture.
    In addition to the simple formula, creating such a picture needs a lot of processing power to apply the formula n times. Just imagine calculating a fractal by hand.
    The same is true of compression; in general, the greater the compression ratio, the more processing power is needed to create the compressed file or to reconstruct the original one.

    Now, chromosomes are believed to contain all the hereditary information of an organism. They contain a very small quantity of information to describe organisms of enormous complexity, and consequently some kind of data compression must be at work here, at ratios like millions to one. Let's allow even fractals as a compression method for the main forms and topologies of higher mammals, unlikely as it may seem. Whichever way such unimaginably high compression is to be achieved, enormous processing power is necessary to 'read' the stored information.

    Can any such processing power be identified in the cell, say the fertilized egg cell, or in any cells and tissues in the later embryonic stages?
    That means, if we want the fantastic compression, or even fractals, as an explanation for the 'missing memory' in the cell, we are confronted with the 'missing processor' problem :). Where is it? Wouldn't it have been easier to confess 'We don't know' in the first place?

    More broadly formulated: does anyone know, even in rough outline, how the information from the chromosomes is transferred to and realized in the concrete shapes and forms of organisms and their sub-structures? An answer describing how proteins are synthesized on the basis of the genes' information would not tell me much about what I am really asking here.
    If the answer is no, what could this idea on the fractals in genetics be but a vague hypothesis without any causal content? It is not enough to say 'Fractals do it!'. How do they do it? Or how does anything else do it? I do not think anyone could answer these questions today.

    Early in the 19th century, to explain the energy source of the sun, it was proposed that the sun is a heap of burning coal. The first idea that could come to one's mind at the time of the industrial revolution was obviously the ubiquitous coal, powering its furnaces and steam engines :). The hypothesis was taken seriously at first, only to be abandoned soon afterwards for its obvious inadequacy. Of course, nothing could have been known at the time about the nuclear processes in the sun, not even in the roughest outline. The real explanation came more than a hundred years later.

    I have a feeling that a similarly big chunk of knowledge is missing from the biology of today for a viable explanation of many important aspects of life, including the mechanisms of the transfer of hereditary information. We have to make do with coal-heap explain-it-away theories, instead of a sincere and brave 'We don't know.'
    • ^
    • v
    Neandrothal is right, but if you know half the code can't you get the second half because G attaches to C and A attaches to T?
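Base pairing does make one strand of a double helix derivable from the other, which is why sequence sizes are quoted for a single strand. A minimal sketch (the function name is hypothetical):

```python
# Derive the opposite strand of a DNA double helix from one strand.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand: str) -> str:
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


print(reverse_complement("GATTACA"))   # TGTAATC
```

This covers the two strands of a single chromosome. The two homologous chromosome copies in a diploid genome are not complements of each other, so the second copy cannot be recovered this way.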
    • ^
    • v
    How DARE you patent it. You should share that with the world freely as a show of good faith. To do otherwise is reprehensible. Patent and copyright are the bane of our legal system at the moment. They stifle community and promote individualistic gain at the expense of the greater good of the community.
    • ^
    • v
    Yes, some 750 megabytes for the haploid sequence is about right. To get the diploid sequence, you only need the "diff" file of SNPs, which will amount to a couple of megabytes, so 800 megabytes for the diploid genome would sound about right.

    As for your claim of compressing this down to 10 megabytes, this is completely unrealistic. Here is a 2008 paper estimating the entropy rate of the human genome:
    http://www.biomedcentral.com/1471-2164/9/509
    They come up with about 1.8 bits per base pair, which would mean that even with an optimal compression algorithm, the best you can hope for is compression on the order of 10%.
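The 10% figure follows directly from comparing 1.8 bits per base with the 2 bits of the plain encoding. A quick check, reusing the base count from the post (the helper name `size_mb` is made up):

```python
# Compare plain 2-bit storage with the quoted entropy estimate.
BASES = 3_080_400_000        # haploid base count used in the post
BYTES_PER_MB = 1_048_576


def size_mb(bits_per_base: float) -> float:
    """Size in MB of the haploid genome at the given bits per base."""
    return BASES * bits_per_base / 8 / BYTES_PER_MB


raw = size_mb(2.0)            # plain 2-bit encoding
best = size_mb(1.8)           # entropy estimate quoted in the comment
saving = 1 - best / raw
print(f"{raw:.0f} MB -> {best:.0f} MB, saving {saving:.0%}")
```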

    Unless, of course, your compression "algorithm" itself contains 750 megabytes of data, and will only write out the differences of your genome to some reference genome. In this case, you can hope for "compression" by 99.5%, or down to a couple of megabytes. But this isn't "compression", it is transfer of information from the "data file" to the "program file".
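A toy version of that reference-plus-differences idea, to make the trade-off concrete. This sketch only handles single-base substitutions between equal-length sequences; the function names and the short example sequences are invented for illustration.

```python
# Store a genome as substitutions relative to a reference sequence.
# Only same-length sequences with single-base differences are handled;
# indels would need a richer record format.


def diff_to_reference(reference: str, sample: str) -> list[tuple[int, str]]:
    """List (position, base) for every position where sample differs."""
    if len(reference) != len(sample):
        raise ValueError("this sketch only handles substitutions")
    return [(i, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]


def apply_diff(reference: str, variants: list[tuple[int, str]]) -> str:
    """Rebuild the sample from the reference plus its variants."""
    bases = list(reference)
    for position, base in variants:
        bases[position] = base
    return "".join(bases)


reference = "GATTACAGATTACA"
sample = "GATTACAGCTTACA"
variants = diff_to_reference(reference, sample)
print(variants)                                     # [(8, 'C')]
print(apply_diff(reference, variants) == sample)    # True
```

The variant list is tiny, but `apply_diff` is useless without the full reference, which is the commenter's point about moving data from the file into the program.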

    If you think that 800 MB is "not much", well sure, you can store your genome on your iPod nano. Your body, however, stores it in each cell nucleus. This is data storage at the molecular level, far beyond the reach of our current technology. And then the information doesn't just sit there, but is being actively processed within the cell nucleus, a structure a few micrometers in size. This is beyond any realistic scope of human-made nanotechnology and will remain so for many years.
    • ^
    • v
    The two halves of the diploid genome are complementary to each other so technically you only need to store one half of it to have all the genome's information content. Although I may be wrong in this assumption. :)

Trackbacks

  1. To Describe a Human | Think Gene
    August 31, 2008 @ 3:19 pm

    [...] Cekic writes in response to How Much Data is a Human Genome: Now, please do take a look at your fingertips. You ll see the fine lines of your fingerprint [...]
