Think Gene

a bio blog about genetics, genomics, and biotechnology

How Much Data is a Human Genome? Not Much.

I recently noted in Napster of Medicine that an entire human genome would fit on a music CD.

How much data IS a human genome?

  • 2 bits per base (4 bases = 2^2)
  • 3,080.4 Mb per human genome [1]
  • 700 MB per CD-ROM
(1 human genome) *
(3,080,400,000 bases / 1 human genome) *
(2 bits / 1 base) *
(1 byte / 8 bits) *
(1 MB / 1,048,576 bytes) =

734.4 MB per uncompressed human genome. Easily enough to fit on a 700 MB CD with basic file compression like gzip.
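The unit conversion above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the constant and function names are made up for illustration, the figures are the ones quoted in this post, and MB means 1,048,576 bytes, as in the calculation.

```python
# Back-of-the-envelope size of a haploid human genome at 2 bits per base.
# Names are illustrative; the numbers are the ones quoted in the post.
BASES_PER_GENOME = 3_080_400_000   # 3,080.4 Mb [1]
BITS_PER_BASE = 2                  # 4 bases = 2^2
BITS_PER_BYTE = 8
BYTES_PER_MB = 1_048_576           # binary megabyte, as in the calculation
CD_CAPACITY_MB = 700

def genome_size_mb(bases=BASES_PER_GENOME, bits_per_base=BITS_PER_BASE):
    """Return the packed size in MB for the given number of bases."""
    return bases * bits_per_base / BITS_PER_BYTE / BYTES_PER_MB

size = genome_size_mb()
print(f"{size:.1f} MB uncompressed")           # 734.4 MB uncompressed
print(f"{size / CD_CAPACITY_MB:.2f} CDs")      # 1.05 CDs
```

Passing a different `bits_per_base` or base count to `genome_size_mb` gives the figure for other encodings under the same assumptions.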

Actually, while writing this post I invented a technique to get the file size down to about 10MB, but I need to file a patent before disclosing. Sorry. (yes, 10MB, as in, the size of an mp3 song)

NOTE: Commenter “neandrothal” noted that this is the size of a haploid human genome. Humans are diploid: they have two of each autosome and two sex chromosomes. So this is the size of a reference haploid human genome, not a complete individual human genome, which would be twice as much data (2 music CDs). Thanks, neandrothal!

[1] Scherer, Stewart. 2007. A Short Guide to the Human Genome. 6.

MB = megabytes
Mb = megabase

Viewing 13 Comments

    • ^
    • v
    3 billion is the size of a *haploid* human genome. Since we have two copies of each of our chromosomes (except for men and the sex chromosomes), technically the number of bases is 6 billion, though of course the vast majority of these will be the same between two homologous chromosomes. But I'm sure the compression algorithm would recognize that...
    • ^
    • v
    Pretty cool. Just 10MB with the right compression. But I always thought it takes 7 bits? Thanks for letting me know. Palonek @ http://www.edwardpalonekblog.ca/
    • ^
    • v
    neandrothal: Yes and yes. I will make a note of that because it's a good point.

    Palonek: Why 7 bits? That's 2^7 = 128. You may be thinking of ASCII, which is 7 bits, to write the literal letters "A G C T." If ASCII is the encoding your biotech or lab uses for massive DNA files, you are storing over 3.5 times the data (so 3.5 times the bandwidth, 3.5 times the storage, and sometimes 3.5 times the processing power). That's bad.
    • ^
    • v
    As there are 4 bases, the base ID could be represented in a 2-bit field for each base in a sequence. So, for storing, A G C T could be stored as 00 01 10 11 respectively and then retrieved and converted back to human-readable A G C T.
    Drew's assumption would be what I would do for storing this kind of data as, of course, there is a lot of it.
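A minimal sketch of that 2-bit packing, assuming the code table given in the comment (A=00, G=01, C=10, T=11). The function names `pack` and `unpack` are made up for illustration, and the sequence length has to be stored separately because the last byte may be padded.

```python
# Pack a DNA string into 2 bits per base and unpack it again.
# Code table follows the comment above: A=00, G=01, C=10, T=11.
ENCODE = {"A": 0b00, "G": 0b01, "C": 0b10, "T": 0b11}
DECODE = {code: base for base, code in ENCODE.items()}

def pack(seq):
    """Pack an A/G/C/T string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | ENCODE[base]
        # Left-align a short final chunk so unpacking stays uniform.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)

def unpack(data, length):
    """Recover the first `length` bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(DECODE[(byte >> shift) & 0b11])
    return "".join(bases[:length])

seq = "AGCTTGCA"
packed = pack(seq)
print(len(seq), "bases ->", len(packed), "bytes")   # 8 bases -> 2 bytes
print(unpack(packed, len(seq)) == seq)              # True
```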
    • ^
    • v
    Your calculations are pretty much correct. The reference human genome still contains some unknown portions, so you need to be able to represent at least one possibility in addition to ACGT. But since you were talking probably about the real human genome, not the current unfinished data, that problem wouldn't apply.

    Using the ".2bit" format, human genome version "hg18" fits into a file listed here as 770 MB.

    http://hgdownload.cse.ucsc.edu/goldenPath/hg18/...

    The format is described here:

    http://genome.ucsc.edu/FAQ/FAQformat#format7

    The same page also describes the earlier, and still sometimes useful, "nibble" format, which used 2 bases per byte.
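To illustrate why spending 4 bits per base leaves room for an "unknown" symbol, here is a rough sketch of a two-bases-per-byte encoding. It is not the actual UCSC file layout: the code values and the function names are hypothetical, chosen only to show the idea.

```python
# Store two bases per byte, 4 bits each, so an unknown base "N" fits too.
# These code values are placeholders, not the table from the UCSC format page.
NIBBLE = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}
BASE = {code: base for base, code in NIBBLE.items()}

def pack_nibbles(seq):
    """Pack a string over A/C/G/T/N into bytes, two bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 2):
        hi = NIBBLE[seq[i]]
        # Pad an odd-length sequence with a zero nibble.
        lo = NIBBLE[seq[i + 1]] if i + 1 < len(seq) else 0
        out.append((hi << 4) | lo)
    return bytes(out)

def unpack_nibbles(data, length):
    """Recover the first `length` bases from nibble-packed bytes."""
    bases = []
    for byte in data:
        bases.append(BASE[byte >> 4])
        bases.append(BASE[byte & 0x0F])
    return "".join(bases[:length])

seq = "ACGTN"
packed = pack_nibbles(seq)
print(len(packed), unpack_nibbles(packed, len(seq)))   # 3 ACGTN
```

The price is twice the storage of a pure 2-bit encoding, which is presumably why a 2-bit format would handle unknown stretches some other way.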

    In biology, the sequence of ACGT isn't all that contains inherited information. (There are also all the proteins you inherit along with DNA, and DNA methylation, and lots more stuff still to discover.) But I wouldn't know where to start to compute the information content there.
    • ^
    • v
    Ed, you bring up a very good point about methylation and other proteins on the DNA. If you only care about sequence, these things don't matter, but they definitely influence which genes are active or repressed, and even how active a gene is. I suppose it depends on what you're storing the data for.
    • ^
    • v
    As for the hope of crunching 770MB down to 700MB, it should be noted that programs like gzip rarely get better compression than 2 bits per nucleotide [1].
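One way to get a feel for that limit is to compress a sequence of ASCII letters with zlib (the same DEFLATE algorithm gzip uses) and count the bits spent per base. The sketch below uses a uniformly random sequence as a stand-in, which is an assumption: real genomes contain repeats and skewed base frequencies, so real figures will differ.

```python
import random
import zlib

# Compress a random ASCII DNA string and measure bits per base.
# A random sequence is only a stand-in for real genomic data.
random.seed(0)
n = 200_000
seq = "".join(random.choice("ACGT") for _ in range(n))

compressed = zlib.compress(seq.encode("ascii"), 9)
bits_per_base = 8 * len(compressed) / n
print(f"ASCII input: 8.00 bits per base")
print(f"zlib output: {bits_per_base:.2f} bits per base")
```

For random input the result cannot drop below 2 bits per base, since each base carries 2 bits of information; the compressor mostly undoes the waste of the ASCII encoding.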

    Also, FASTA and other file formats are not primarily used for storage but for transport. Almost no bioinformatics program operates directly on ASCII data; most transform such exchange formats into some internal representation.

    For the 10MB I guess the author thinks in terms of working on a diff with respect to some reference genome. While that is probably workable for applications on the human genome, it's not really patentable (UNIX patch and diff being older than me and there's probably even older prior art) and impracticable on a general scale. Impracticable because an index for describing any sequence in such a relative way would be far too big, i.e. it would probably require more storage than only transferring the sequences worked on directly.
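The diff idea guessed at here can be sketched in a few lines. This is one reading of the commenter's guess, not the author's undisclosed technique. It handles only single-base substitutions between equal-length sequences, and the function names `diff` and `patch` are borrowed from the UNIX tools for illustration.

```python
# Describe an individual sequence as its differences from a reference.
# Substitutions only; insertions and deletions are out of scope here.
def diff(reference, individual):
    """Return (position, base) pairs where the individual differs."""
    if len(reference) != len(individual):
        raise ValueError("this sketch only handles substitutions, not indels")
    return [(i, b) for i, (a, b) in enumerate(zip(reference, individual)) if a != b]

def patch(reference, variants):
    """Rebuild the individual sequence from the reference and the diff."""
    seq = list(reference)
    for pos, base in variants:
        seq[pos] = base
    return "".join(seq)

reference = "ACGTACGTAC"
individual = "ACGAACGTCC"
variants = diff(reference, individual)
print(variants)                                   # [(3, 'A'), (8, 'C')]
print(patch(reference, variants) == individual)   # True
```

The storage cost then depends on how many positions differ and how compactly each position can be written down, which is exactly the index-size concern raised above.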

    [1]
    • ^
    • v
    Thomas: I'd say that most bioinformatics scripts and programs operate on ASCII data; bit-packing the data first is the exception, reserved for a few hard-core tools like BLAT and BLAST. Most everyday scripts still parse FASTA into strings and operate on those, I'd say.
    • ^
    • v
    In truth, human genomes can be more complex than even diploid (think CNV). This is especially true for cancer genomes. You may also want to capture more than one genome in the reference, e.g., you may want to include the variations in dbSNP in your reference. To include just SNPs, you could expand your four-letter alphabet to include all the IUPAC DNA codes. Including indels would be even more complicated. Here are some thoughts on how to represent a complex reference genome.
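As a sketch of the SNP part of that suggestion, the standard IUPAC nucleotide codes map each set of possible bases to a single letter. The helper names `iupac_code` and `consensus`, and the position-to-alternatives input format, are assumptions made for this example.

```python
# IUPAC nucleotide codes: one letter for each non-empty set of bases.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}

def iupac_code(alleles):
    """Return the IUPAC letter for a collection of bases."""
    return IUPAC[frozenset(alleles)]

def consensus(reference, snps):
    """Fold known SNPs into a reference string.

    snps maps a 0-based position to the alternative bases seen there
    (a hypothetical input format chosen for this sketch).
    """
    seq = list(reference)
    for pos, alts in snps.items():
        seq[pos] = iupac_code(set(alts) | {reference[pos]})
    return "".join(seq)

print(iupac_code("AG"))                        # R
print(consensus("ACGT", {1: "T", 3: "AC"}))    # AYGH
```

With 15 codes the alphabet no longer fits in 2 bits per base; 4 bits would be enough.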
    • ^
    • v
    For storing the methylation information you will need another bit for each base. This makes it 3 CDs :-)
    • ^
    • v
    Now, please do take a look at your fingertips. You'll see the fine lines of your fingerprint pattern. It is unique, and can be used to identify a human; so fine and even much finer structures are defined in your organism.
    Now, how much information would be needed just to describe the 3D positions that make up a human?
    You would need to position single cells, define the inner structure of particular cell types, describe the form of single nerve cells (dendrites), etc.
    Now how many cells are there in the human organism?
    Without any calculation, we can see that the quantity of information needed to describe a human runs to uncounted terabytes. Human chromosomes contain, as calculated here, 740 MB.
    So why, for God's sake, do we believe that the whole of our hereditary information resides in the genes?
    • ^
    • v
    740MB is the size of a human haploid nucleotide base string, not the data necessary to describe a mature human.

    We believe that most of our hereditary information resides in genes because it does. However, a genome, as you say, cannot possibly fully describe a mature human. A genome is more like a brief mathematical equation used to produce a beautifully complex fractal design when fed with ambient noise and interpreted as colors and coordinates on a screen.
    • ^
    • v
    I am trying to draw your attention to this: The human, just like any other organism, has its qualities determined, and their description must then reside somewhere. The amount of information needed to describe a human organism is enormous; the amount of information carried by the genes is very limited in comparison.
    Now, let's take a look at this possible analogy.
    Imagine you are demonstrating a PC to someone who has no idea of computers whatsoever, and has never seen one. (Increasingly difficult to find, but there must still be some around :)
    OK, you show him how inputs on the keyboard produce results on the screen. Knowing nothing about the PC under the desk, our computer novice has to think that the keyboard alone causes all the fascinating happenings on the screen.
    Now our virtuous genetics has got hold of the keyboard, the genes; making changes there changes the organism. But how, for God's sake, does it follow that all the hereditary information resides there, and not on some 'HD' somewhere, away from the 'keyboard'?
    I am simply pointing out that the 'keyboard' has practically no data storage capacity for the task.

    'We believe that most of our hereditary information resides in genes because it does. '

    Oh, pardon the heresy involved, but I really don't know how you know that.
