Think Gene

a bio blog about genetics, genomics, and biotechnology

Comp Sci Sins of Biologists

Bits per Base

A commenter mentioned that they heard nucleotide bases took 7 bits to store.

7 bits is the encoding for ASCII characters, which are used to store literal “A T C G”s in a text editor. There are 4 bases, so one only needs 2 bits (2^2 = 4 bases). These bases could be numbered like this:

00 = A
01 = G
10 = C
11 = T

This encoding further has the convenient property that the bits can be inverted to get the complementary DNA strand. Storing bases as ASCII is OK for small, human-readable files, but otherwise it’s a gross waste of storage, bandwidth, and processor resources: 7 bits per base is 350% of the 2 bits actually needed. This “data inflation” can be even worse if the files are encoded in Unicode or another wider character encoding, such as UTF-16 at 16 bits per character.
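As a rough illustration of the 2-bit numbering above, here is a minimal Python sketch. The function names (pack, unpack, complement) are made up for this example and are not from any library. It packs a sequence into an integer at 2 bits per base and takes the complementary strand by flipping every bit.

```python
# Illustrative sketch only: all names here are hypothetical.
# Encoding from the table above: 00 = A, 01 = G, 10 = C, 11 = T.
CODE = {"A": 0b00, "G": 0b01, "C": 0b10, "T": 0b11}
BASE = {value: letter for letter, value in CODE.items()}


def pack(seq):
    """Pack a DNA string into an int, 2 bits per base, first base in the highest bits."""
    bits = 0
    for base in seq:
        bits = (bits << 2) | CODE[base]
    return bits


def unpack(bits, length):
    """Recover the DNA string; the length is needed because leading A's are all-zero bits."""
    return "".join(BASE[(bits >> (2 * i)) & 0b11] for i in reversed(range(length)))


def complement(bits, length):
    """Invert every bit to get the complementary strand (A<->T, G<->C)."""
    mask = (1 << (2 * length)) - 1  # keep only the 2 * length bits in use
    return ~bits & mask


seq = "GATTACA"
packed = pack(seq)
print(unpack(complement(packed, len(seq)), len(seq)))  # CTAATGT
```

Here the seven bases fit in 14 bits, against 56 bits when each letter is stored as one 8-bit byte.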

Abbreviations

Josh, my science editor, will disagree about this because “biologists don’t use the Internet” and “base 2 is for nerds,” but PLEASE, define all acronyms and unit abbreviations in a glossary! I was reviewing “A Short Guide to the Human Genome” by Cold Spring Harbor Lab Press, and in a table of chromosome sizes, the data is measured as Mb (with no explanation).

What is “Mb?”

DATA: “Mb” is “megabits,” which is 2^20 bits. Each base is two bits, so 2^19 = 524,288 bases per unit “Mb.”

BIOLOGY: “Mb” is “megabases,” which is 1,000,000 bases per unit “Mb.”
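To make the gap between the two readings concrete, here is a small Python sketch of the arithmetic. It assumes the binary definition of a megabit used above (2^20 bits) and 2 bits per base; the function names are invented for illustration.

```python
# Illustrative arithmetic only; names are hypothetical.
BITS_PER_BASE = 2          # the 2-bit encoding for A, G, C, T
BITS_PER_MEGABIT = 2**20   # binary "mega", as in the DATA reading


def bases_if_megabits(mb):
    """Number of bases if "Mb" is read as megabits of 2-bit-encoded sequence."""
    return mb * BITS_PER_MEGABIT // BITS_PER_BASE


def bases_if_megabases(mb):
    """Number of bases if "Mb" is read as megabases."""
    return mb * 1_000_000


size_mb = 5
print(bases_if_megabits(size_mb))   # 2621440
print(bases_if_megabases(size_mb))  # 5000000
```

For the same printed "5 Mb", the biology reading is nearly double the data reading.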

Either interpretation is valid, and this is a serious problem as biology and computer science continue to collaborate. If NASA and Lockheed Martin can bungle units at the cost of a $327MM Mars Climate Orbiter extraterrestrial nose dive, you can certainly bungle a genomic experiment due to confused units, too.

There’s a joke that biologists don’t have new math, so they invent vocabulary to keep others out of their field. Please do not lend credence to this joke.


Josh (edit):

I don’t say that “biologists don’t use the internet”, but it’s generally not an issue to know whether you’re using megabytes or megabases; the context tells you. If you’re dealing with DNA, it just doesn’t make sense to measure it in computational units of storage (i.e., megabytes), because that is effectively meaningless. If a segment of DNA is, say, 5Mb, the sentence doesn’t really make sense if you read it as a 5 megabyte fragment of DNA.

I suppose some people may get confused, but I think generally it’s a non-issue. In that particular book, if it’s targeted to people familiar with the field, they will know what Mb stands for. However, if it’s an introductory book, some explanation probably should be given.


Andrew (edit edit):

The quote from my gchat logs is:

Josh: I disagree, but whatever. CS corrupted the metric system with base 2

7:33 PM well that’s cuz they used US and metric units
me (Andrew): it doesn’t matter who’s wrong
it matters that people define their acronyms and units
Josh: lol ok, you can go ahead and post it, and I’ll disagree in a comment haha
are you reading those notes? or something else?
me: what, that people can make up acronyms?
7:34 PM Josh: but they aren’t making them up….
I dunno. I always knew what Mb was referring to in bio context
me: well, I mean using them carelessly
I’m just arguing for better communication
and more precision
how can one argue against that?
7:35 PM because I’m a comp sci
and I was confused
and it could be true for anyone else, too
especially if I look it up on the internet
to help to learn the vocab
7:36 PM which says “Mb = megabits”
Josh: lol
well, I guess the thing is that bio people aren’t on the internet as much
you can’t really learn it on the internet
cuz it’s such a different field
7:37 PM me: well, now it is

The debate continues!

Josh:

well, I mean they don’t program. they would never confuse that, or really think anything of it I guess
10:20 PM me (Andrew): but comp scis will be confused
yes, it’s ok if only biologists ever only read what biologists write
Josh: not necessarily…..it depends on the context. Mb is length
10:21 PM it just doesn’t make sense to use megabytes for DNA. they are totally different things. a megabyte of DNA is meaningless in bio
also with how you say with compression. it could be compressed, it may not be. how is it stored? ascii or in the most efficient?
10:22 PM sure the book should prob say megabases….but I don’t think it’s really much of an issue to say it all the time
10:23 PM like… there is bound to be overlap between acronyms in any discipline, but I guess you just have to realize what you’re talking about and what makes sense
but I can understand you not knowing what it is if you never heard the term megabases
10:24 PM but if you knew that dna was measured in length and kilo/mega bases, then you’d see Mb or Kb and know what it was
ahh, maybe that’s what I’m trying to say
if you know kilobases and megabases are common ways to talk about the size of DNA, then if you saw the acronym in context you’d know what it was referring to
10:25 PM me: I’m saying that scientists should write to be cross-disciplinary
and that the unit of “size”
is the same abbreviation
Josh: lol you haven’t seen much of bio yet have you? EVERYTHING is acronyms
me: in both data and biology
Josh: because it’s a bitch to write it out…and it’s not usually necessary
me: I’m saying that’s particularly egregious
10:26 PM Josh: ehh. well, go ahead and try to convince people lol. but I doubt many people will change
me: lol ok, fine, I will. I’ll post this continued debate to the post, even
Josh: haha ok

7 Comments

  1. John C said,
    June 27, 2008 @ 3:34 am

    Well folks, we can up the ante a bit: codons!!
    On the way, we can note that in RNA, the nucleotide T is replaced by U. To represent all the codons (64 of them, 3 bases each) would require 6 bits, which is the same as for representing each of the 3 bases. So any effort to compress the representation stops here unless we are only going to look for the 20 amino acids; then we can take a bit off and we are down to five bits. The good stuff for what I write here is at: http://en.wikipedia.org/wiki/Genetic_code

    In that article you will find this interesting quote: “A comparison may be made with computer science, where the codon is the equivalent of a word, which is the standard “chunk” for handling data (like one amino acid of a protein), and a nucleotide for a bit.” Nooo…I am not going to defend this literally, because this communicates figuratively.

  2. sunny beach said,
    June 27, 2008 @ 4:26 am

    That’s scarcely enough to consign them to Dante’s Comp Sci Hell alongside the SOA Vendors and the Physicists. Plain-text formats have their merits, and gzip should obliterate the extra inefficiency anyway, no?

  3. Marcus Breese said,
    June 27, 2008 @ 10:32 am

    Actually, you need 4 bits to store nucleic acid sequences, if you include ambiguity codes as well… (http://www.bioinformatics.org/SMS/iupac.html)

    0001 A
    0010 C
    0100 G
    1000 T

    Now, if you don’t know the base, it’s 1111 (N), if it’s an A or a C, it’s 0011, etc…

    Using bases as a bitmap also makes comparisons much faster too… you can just AND each bitmap against each other and if the result is greater than zero, it’s a match.

  4. gwern said,
    June 27, 2008 @ 1:09 pm

    Gzip would impose a constant overhead (I don’t mean this in the algorithmic sense - I don’t actually know how gzip scales, probably O(n) or something since I would be surprised if it looks not at a fixed sliding window but the whole corpus), though, and it could disallow lots of stuff (like random access, maybe - or at least, I don’t know how to get the billionth nucleotide without decompressing the previous 999 million).

  5. Nimish said,
    June 27, 2008 @ 1:09 pm

    Generally DEFLATE won’t use a dictionary that large, and at some point it’ll stop growing.

  6. John C said,
    June 27, 2008 @ 6:02 pm

    So if we use the 20 amino acid trick (5 bits) then we have all we want, plus the start and stop codons for unprocessed sequence, plus some room for a code representing an ambiguity (a codon’s worth).

    I think that we are studying completed and processed genomes at this point, which means that all of the ambiguities have been resolved. Some of the other ideas mentioned are important for sequences that are still being assembled. So, except for the necessary start and end codon, everything else would have been corrected.

  7. Eddie Pasternack said,
    June 28, 2008 @ 1:33 am

    $327MM ?? That’s a lot of Miller Mattles!
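
The 4-bit bitmap scheme from comment 3 can be sketched in Python as follows. This is only an illustration: the table holds the four base codes from that comment plus two ambiguity codes (M for "A or C", N for "any base"), and the matches helper is a made-up name.

```python
# Illustrative sketch of the one-bit-per-base scheme; names are hypothetical.
IUPAC = {
    "A": 0b0001,
    "C": 0b0010,
    "G": 0b0100,
    "T": 0b1000,
}
IUPAC["M"] = IUPAC["A"] | IUPAC["C"]  # 0011: A or C
IUPAC["N"] = 0b1111                   # any base


def matches(x, y):
    """Two codes match if their bitmaps share at least one base (AND is nonzero)."""
    return (IUPAC[x] & IUPAC[y]) != 0


print(matches("A", "M"))  # True: M allows A
print(matches("G", "M"))  # False: M allows only A or C
print(matches("N", "T"))  # True: N allows anything
```

The cost is 4 bits per position instead of 2, in exchange for representing ambiguity and comparing with a single AND.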
