Think Gene

a bio blog about genetics, genomics, and biotechnology

Posts Tagged ‘bioinformatics’

Comp Sci Sins of Biologists

Bits per Base

A commenter mentioned that they heard nucleotide bases took 7 bits to store.

Seven bits is the width of an ASCII character, which is how literal “A T C G”s are stored in a text file (in practice, each character usually occupies a full 8-bit byte). There are only 4 bases, so one needs only 2 bits per base (2^2 = 4). These bases could be numbered like this:

00 = A
01 = G
10 = C
11 = T

This encoding further has the convenient property that inverting the bits of a base gives its complementary base (A ↔ T, G ↔ C), so inverting a whole strand gives the complementary strand. Storing bases as ASCII is OK for small, human-readable files, but otherwise it’s a gross waste of storage, bandwidth, and processor resources: 7 bits where 2 would do, or 350% of the necessary size. This “data inflation” could be much worse if the files are stored in a wider character encoding, such as UTF-16 or UTF-32.
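To make the 2-bit scheme concrete, here is a minimal Python sketch. It packs a sequence into an integer using the numbering above and recovers the complement by flipping every bit. The function names (`pack`, `unpack`, `complement`) are hypothetical, chosen for illustration rather than taken from any bioinformatics library. Note that this yields the base-by-base complement, not the reverse complement.

```python
# 2-bit numbering from the post: complementary bases are bitwise inverses.
CODE = {"A": 0b00, "G": 0b01, "C": 0b10, "T": 0b11}
BASE = {bits: base for base, bits in CODE.items()}


def pack(seq):
    """Pack a DNA string into one integer, 2 bits per base.

    Returns (bits, length); the length is needed because leading
    'A' bases (00) would otherwise be indistinguishable from padding.
    """
    bits = 0
    for base in seq:
        bits = (bits << 2) | CODE[base]
    return bits, len(seq)


def unpack(bits, length):
    """Decode `length` bases from a packed integer, most significant first."""
    return "".join(
        BASE[(bits >> (2 * (length - 1 - i))) & 0b11] for i in range(length)
    )


def complement(bits, length):
    """Invert every bit inside the 2*length-bit window."""
    mask = (1 << (2 * length)) - 1
    return bits ^ mask


seq = "GATTACA"
bits, n = pack(seq)
comp = unpack(complement(bits, n), n)
print(comp)          # CTAATGT
print(7 * n, 2 * n)  # 49 14  (ASCII bits vs. packed bits)
```

For GATTACA, 7-bit ASCII needs 49 bits where the packed form needs 14, which is the 3.5-fold difference described above.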

Abbreviations

Josh, my science editor, will disagree about this because “biologists don’t use the Internet” and “base 2 is for nerds,” but PLEASE, define all acronyms and unit abbreviations in a glossary! I was reviewing “A Short Guide to the Human Genome” by Cold Spring Harbor Lab Press, and in a table of chromosome sizes, the data is measured as Mb (with no explanation).

What is “Mb?”

DATA: “Mb” is “megabits,” which is 2^20 bits. Each base is two bits, so 2^19 = 524,288 bases per unit “Mb.”

BIOLOGY: “Mb” is “megabases,” which is 1,000,000 bases per unit “Mb.”
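The gap between the two readings is easy to quantify. The sketch below is only an illustration: the function names and the sample value of 5 “Mb” are hypothetical, and the megabit reading assumes the 2-bits-per-base encoding and the 2^20 definition used in this post.

```python
BITS_PER_BASE = 2  # assumes the 2-bit encoding described in this post


def bases_if_megabits(mb):
    """Read "Mb" as megabits (2**20 bits each), at 2 bits per base."""
    return int(mb * 2**20) // BITS_PER_BASE


def bases_if_megabases(mb):
    """Read "Mb" as megabases (1,000,000 bases each)."""
    return int(mb * 1_000_000)


value = 5  # hypothetical table entry, in "Mb"
as_data = bases_if_megabits(value)
as_biology = bases_if_megabases(value)

print(as_data)     # 2621440
print(as_biology)  # 5000000
print(as_biology / as_data)  # nearly a factor of two apart
```

Under these assumptions the two readings differ by almost a factor of two, which is more than enough to derail a calculation.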

Either interpretation is valid, and this is a serious problem as biology and computer science continue to collaborate. If NASA and Lockheed Martin can bungle units at the cost of a $327MM Mars Climate Orbiter extraterrestrial nose dive, you can certainly bungle a genomic experiment due to confused units, too.

There’s a joke that biologists don’t have new math, so they invent vocabulary to keep others out of their field. Please do not lend credence to this joke.


Josh (edit):

I don’t say that “biologists don’t use the internet”, but it’s generally not an issue to know whether you’re using megabytes or megabases; the context tells you. If you’re dealing with DNA, it just doesn’t make sense to measure it in computational units of storage (i.e., megabytes), because this is effectively meaningless. If a segment of DNA is, say, 5 Mb, reading that as a 5 megabyte fragment of DNA doesn’t really make sense.

I suppose some people may get confused, but I think generally it’s a non-issue. In that particular book, if it’s targeted to people familiar with the field, they will know what Mb stands for. However, if it’s an introductory book, some explanation probably should be given.


Andrew (edit edit):

The quote from my gchat logs is:

Josh: I disagree, but whatever. CS corrupted the metric system with base 2

7:33 PM well that’s cuz they used US and metric units
me (Andrew): it doesn’t matter who’s wrong
it matters that people define their acronyms and units
Josh: lol ok, you can go ahead and post it, and I’ll disagree in a comment haha
are you reading those notes? or something else?
me: what, that people can make up acronyms?
7:34 PM Josh: but they aren’t making them up….
I dunno. I always knew what Mb was referring to in bio context
me: well, I mean using them carelessly
I’m just arguing for better communication
and more precision
how can one argue against that?
7:35 PM because I’m a comp sci
and I was confused
and it could be true for anyone else, too
especially if I look it up on the internet
to help to learn the vocab
7:36 PM which says “Mb = megabits”
Josh: lol
well, I guess the thing is that bio people aren’t on the internet as much
you can’t really learn it on the internet
cuz it’s such a different field
7:37 PM me: well, now it is

The debate continues!

Josh:

well, I mean they don’t program. they would never confuse that, or really think anything of it I guess
10:20 PM me (Andrew): but comp scis will be confused
yes, it’s ok if only biologists ever only read what biologists write
Josh: not necessarily…..it depends on the context. Mb is length
10:21 PM it just doesn’t make sense to use megabytes for DNA. they are totally different things. a megabyte of DNA is meaningless in bio
also with how you say with compression. it could be compressed, it may not be. how is it stored? ascii or in the most efficient?
10:22 PM sure the book should prob say megabases….but I don’t think it’s really much of an issue to say it all the time
10:23 PM like… there is bound to be overlap between acronyms in any discipline, but I guess you just have to realize what you’re talking about and what makes sense
but I can understand you not knowing what it is if you never heard the term megabases
10:24 PM but if you knew that dna was measured in length and kilo/mega bases, then you’d see Mb or Kb and know what it was
ahh, maybe that’s what I’m trying to say
if you know kilobases and megabases are common ways to talk about the size of DNA, then if you saw the acronym in context you’d know what it was referring to
10:25 PM me: I’m saying that scientists should write to be cross-disciplinary
and that the unit of “size”
is the same abbreviation
Josh: lol you haven’t seen much of bio yet have you? EVERYTHING is acronyms
me: in both data and biology
Josh: because it’s a bitch to write it out…and it’s not usually necessary
me: I’m saying that’s particularly egregious
10:26 PM Josh: ehh. well, go ahead and try to convince people lol. but I doubt many people will change
me: lol ok, fine, I will. I’ll post this continued debate to the post, even
Josh: haha ok

Software developed by Boston College lab delivers speed and accuracy to genome research

It took a global corps of scientists approximately $500 million and 13 years to identify the more than 35,000 genes of the human genome. Five years later, Boston College Biologist Gabor Marth and his research team have developed software that can analyze half a million DNA sequences in 10 minutes. The Marth laboratory’s proprietary PyroBayes software is one of a new breed of computer programs able to accurately process the mountains of genome data flowing from the latest generation of gene decoding machines, which have placed a premium on computational speed and accuracy in data-crunching fields known as bioinformatics and high-throughput biology, said Marth, an associate professor of Biology.

“We’re on the edge of a real technological revolution that I think will help us understand the genetic causes of diseases in humans and how genetic materials determine traits in animals,” said Marth. “It is going to lead to less expensive technologies that will allow researchers to decode any individual.” …