A Petabase: no ordinary number
January 20, 2016
As Head of Bioinformatics at the Genome Sciences Centre, my role is to oversee the computational analysis of the DNA sequence data that we are generating. While sequencing technologies can now rapidly produce copious amounts of raw DNA sequence, computational challenges remain.
Currently, we have over nine petabytes of disk space – that’s nine million gigabytes, or the equivalent space on more than 70,300 of the best iPads. And there are over 8,000 CPUs churning away 24/7 analyzing these sequences.
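For readers who want to check the storage comparison, the arithmetic can be sketched in a few lines of Python. Two assumptions are made here that the post does not state: decimal units (one petabyte is one million gigabytes), and a 128 GB capacity for a top-end iPad, a figure inferred because it reproduces the "more than 70,300" quoted above.

```python
# Decimal units: 1 PB = 10**15 bytes and 1 GB = 10**9 bytes, so 1 PB = 10**6 GB.
PETABYTE_IN_GB = 1_000_000

disk_pb = 9      # disk space quoted in the post, in petabytes
ipad_gb = 128    # ASSUMPTION: top-end iPad capacity; not stated in the post

disk_gb = disk_pb * PETABYTE_IN_GB   # total disk space in gigabytes
ipads = disk_gb / ipad_gb            # number of such iPads needed to match it

print(f"{disk_gb:,} GB of disk")
print(f"{ipads:,.1f} iPads at {ipad_gb} GB each")
```

Under those assumptions the result is 9,000,000 GB, or about 70,312 iPads.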
The human genome contains around three billion base pairs of DNA sequence. Tumour cells accrue changes, or mutations, to this sequence, accounting for their abnormal behaviour. For most tumours, the number of changes that are causing the disease will likely be no more than a few dozen. That's a very small number of needles in a pretty big haystack. Our goal is to apply computational tools to find these mutations and understand the role they are playing in causing disease.
Knowing the repertoire of changes that cause cancer will provide the underlying information to fuel the development of the next generation of anti-cancer drugs and to understand which tumours will respond to which drugs.
No ordinary number
As I write this week, we have just celebrated a sequencing milestone. We have now sequenced one petabase. A petabase is one thousand trillion base pairs of DNA sequence – over 33,000 times more sequence than was generated by the international Human Genome Project.
This is a remarkable achievement – one that I could never have imagined achievable when the Genome Sciences Centre was first formed. It’s a testament to all the hard work of everyone at the Centre.
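The scale of a petabase can be put in terms of figures already quoted in this post. The sketch below is a rough illustration only: it uses the "around three billion base pairs" given above for the human genome, and it treats "over 33,000 times" as an upper bound on the Human Genome Project's output, which is an inference rather than a reported figure.

```python
PETABASE = 10**15             # one thousand trillion base pairs
HUMAN_GENOME_BP = 3 * 10**9   # "around three billion base pairs", as quoted in the post

# How many human-genome lengths fit into one petabase.
genome_lengths = PETABASE / HUMAN_GENOME_BP

# ASSUMPTION: reading "over 33,000 times more" as a bound, the Human Genome
# Project generated at most this many bases of sequence.
implied_hgp_bases = PETABASE / 33_000

print(f"{genome_lengths:,.0f} human-genome lengths in a petabase")
print(f"at most {implied_hgp_bases:.2e} bases implied for the Human Genome Project")
```

On these numbers a petabase is roughly a third of a million genome lengths. The implied Human Genome Project output is about ten genome lengths, which fits with a genome being read many times over during sequencing.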
While this petabase represents a number so large that it is difficult for the human brain to comprehend, within it lie many scientific firsts: from the first genomic sequences of relatively common cancers, such as breast cancer, to those of many rare tumour types, such as parathyroid cancer. Within the dataset also lie some of the largest cancer genome studies ever conducted. Almost as an aside, and represented by the tiniest sliver of a fraction of this petabase, is the first genomic sequence derived for the SARS virus. It was especially rewarding that the expertise of the Genome Sciences Centre could be called upon in 2003, when the world was desperately looking for answers.
There is one genome within this petabase that I think is particularly pertinent and important. It was derived from an uncommon cancer of the tongue, but it was the first cancer genome ever to be sequenced to aid in clinical decision making and patient care. This genome represents the vanguard of how I envisage both cancer care and research will be transformed in the near future. Such a study would not have been possible through the conservative and lengthy review process typical of granting agencies. Crucially, it was support and funding from the BC Cancer Foundation that allowed this medical first to be achieved.
* photo: Martin Krzywinski, Genome Sciences Centre (www.lumondo.com)