Scientists are almost as susceptible to a certain type of urban myth as the rest of the population. A popular example was the claim that there are 100 000 genes in the human genome. When the first estimates of “the number of genes” based on large-scale sequencing came in (I use quotes because exactly what constitutes a single gene is subtle, complex, and controversial), lots of people who should have known better were terribly shocked at how small these newly informed guesses were: “What? Only ten thousand more than a fruit fly?!” It’s only by chasing the suspiciously round 100K to its source that you see what really happened. A crude but sensible estimate, made by a necessarily primitive method, was propagated through the literature without most of the people who propagated it bothering to check its origin. In other words: “A friend of a friend told me there were 100 000 genes in the human genome”. Eventually everybody accepted it because they had been taught it as undergraduates (or even earlier).

There is a similarly widespread, but slowly crumbling, misperception that the machinery that copies your genes tracks along your DNA like a train on rails. A better analogy would be a fixed tape head reading a tape as it spools by. (If you’re interested you can read this, one of the first articles I ever put on the Web.)

One scientific urban myth that I actually found myself quibbling with a Nobel prizewinner* about is the old “we have ninety-whatever percent of our DNA in common with gorillas/chimpanzees/monkeys” factoid, often used (not by him) as a reason why we should be much nicer to primates. We should be nicer to our biological cousins, but not because of bad and meaningless statistics about our supposedly shared genetic data. My friend Patrick (a former medic) likes to counter this one by saying that the sentences “Your husband is dead” and “Your husband is not dead” are over eighty percent identical. The contention is also wrong on other levels. I’m going to offer a few simplified reasons.

Firstly, most of your DNA is “junk”, at least in the sense that it doesn’t lead to a classical gene product: a protein that makes up your body. (I should point out that there are lots of other ways a message in DNA can express itself.) Secondly, even the messages that are read contain stretches that are very tolerant of randomness, to the extent that many scientists have argued that most mutation in DNA is not selected for in the strict Darwinian sense at all; that is, they consider most persistent genetic changes to be accidental ones. (Almost all of these scientists still believe in evolution, however.) Thirdly, the size of a difference in DNA that matters to the full development of a living thing can vary over a mind-boggling range. The difference between death and life can be a single character in an entire book of genetic information. Even more shocking, it is possible to create artificial so-called “knockout” mice that are missing one or more entire functional genes and yet get by just fine with these whole chapters ripped out of the books of their lives.
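
Patrick's two sentences make the point easy to check. The following is a minimal Python sketch, assuming a naive character-level comparison: the function name `percent_identity` and the choice to divide by the length of the longer sentence are illustrative assumptions, not a standard measure of sequence similarity and not how genome comparisons are actually scored.

```python
from difflib import SequenceMatcher


def percent_identity(a: str, b: str) -> float:
    """Percentage of characters that line up, relative to the longer string.

    A deliberately crude score (an assumption made for illustration only).
    """
    if not a and not b:
        return 100.0  # two empty strings are trivially identical
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    # Sum the sizes of all common blocks found by the matcher.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 100.0 * matched / max(len(a), len(b))


dead = "Your husband is dead"
alive = "Your husband is not dead"

score = percent_identity(dead, alive)
print(f"{score:.1f}% identical")  # prints "83.3% identical"
```

Twenty of the twenty-four characters line up, so the crude score lands comfortably above eighty percent even though the two sentences mean opposite things. A bare percentage says nothing about which differences matter.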

Insofar as I have a specialism it is protein bioinformatics. I am interested in the meanings of the messages from the genome that get out to do conspicuously useful work in living things. This recent paper suggests that, if we restrict the comparison between humans and chimpanzees to these signals alone, and if we frame our comparison in other, perfectly reasonable terms, even a casual observer of the data can reach a completely different conclusion about our relatedness to chimps, our closest relatives, than the ninety-something percent shared DNA story suggests.

Right now I’m working on something with a collaborator in Germany who downloaded one of my programs. Like most of the stuff I write, it’s very simple. I’ve got a passable sort-of-biology degree and a better sort-of-physics degree, but I’ve never been the world’s greatest mathematician or computer scientist. Being a determined plodder has its advantages, though. As he pointed out to me in an email, mine was the first implementation of a classic method for analysing some simple properties of protein structures that he had ever understood. Why? Because I didn’t understand any of the existing ones well enough to trust them for my purposes, and most of them were derived from a couple of (or perhaps just one) early programs. So I built my own version from first principles. In doing so I reminded myself why very few people bother to do these kinds of jobs properly: because it’s bloody hard work. To make matters worse, it didn’t turn out to be useful for the question that I wanted to answer, so I just put my creation out there for other people to use, though not without some arm-twisting from my boss.

Of course, I didn’t actually think anyone would use it so he’s just created a whole new pile of work for me 🙂 …

*[He used this in the context of a two-handed popular talk and might have done so at the suggestion of his co-presenter.]