The Cancer Genome Atlas Project (TCGA): Understanding Glioblastoma

In 2003, Cold Spring Harbor Laboratory (CSHL) and researchers around the world celebrated the 50th anniversary of the discovery of the structure of DNA by Jim Watson and Francis Crick. I was a graduate student in the Watson School of Biological Sciences at CSHL, named after James Watson, who was the chancellor of CSHL, and in 2003 I participated in (and planned!) some of the 50th anniversary events. Coinciding with this celebration was a meeting about DNA that brought world-renowned scientists and Nobel Prize winners to CSHL to celebrate how much had been accomplished in 50 years (including sequencing the human genome) and to look to the future for what could be done next. That meeting was the first time I had heard about the Cancer Genome Atlas Project. At this point, the TCGA (as the project was affectionately called) was just a pipe dream – a proposal by the National Cancer Institute and the National Human Genome Research Institute (two institutes in the National Institutes of Health – the NIH). The idea was to use DNA sequencing and other techniques to understand different types of cancer at the genome level. The goal was to see what changes are happening in these cancer cells that might be exploited to detect or treat these cancers. I remember that there was a heated debate about whether or not this idea would work. I was actually firmly against it, but now, with the luxury of hindsight, the scientific advances of the TCGA seem to be clearly worth the time and cost.

The first part of the TCGA started in 2006 as a pilot project to study glioblastoma multiforme, lung, and ovarian cancer. In 2009, the project was expanded, and in the end, the TCGA consortium studied 33 cancer types (including 10 rare cancers). All of the data was made publicly available so that any scientist could use the results to better understand these diseases. To accomplish this goal, the TCGA created a network of institutions to provide the tissue for over 11,000 tumor and normal samples (from biobanks including the one that I currently manage). These samples were analyzed using techniques like Next Generation Sequencing, and researchers used heavy-duty computing power to put all of the data together. So what did they find? This data has contributed to hundreds of publications, but the one I’m going to talk about today reports the results from the analysis of the glioblastoma multiforme tumors.

Title: Comprehensive genomic characterization defines human glioblastoma genes and core pathways published in Nature in October 2008.

Authors: The Cancer Genome Atlas Network

Background: Glioblastoma is a fast-growing, high-grade, malignant brain tumor that is the most common brain tumor found in adults. The most common treatments are surgery, radiation therapy, and/or chemotherapy (temozolomide). Researchers are also testing new treatments such as NovoTTF, but these have not yet been approved for regular use. However, even with these treatments, the median survival for someone diagnosed with glioblastoma is only ~15 months. At the time that this study was published, little was known about the genetic cause of glioblastoma – a small handful of mutations were known, but nothing comprehensive. Because of the poor prognosis and lack of understanding of this disease, the TCGA targeted it for a full molecular analysis.

Methods: The TCGA requested tissue samples from glioblastoma patients from biobanks around the country. They received 206 samples that were of good enough quality to use for these experiments, and 143 of these also had matching blood samples. Because the DNA changes in the tumor only happen in the tumor, the blood is a good source of normal, unchanged DNA to compare the tumor DNA to. The study sites ran a number of different analyses on these samples:

  • They looked at the number of copies of each piece of DNA. This is called DNA copy number, and copy number is often changed in tumor cells (see more about what changes in the number of chromosomes can do here).
  • They looked at gene expression, meaning how actively each gene is being used. Genes are the instructions for making proteins, which do all of the stuff in your body. If a gene is used too much or too little, or if a mutation in a gene changes its protein, it could contribute to the development of cancer.
  • They also looked at DNA methylation. Methylation is a mark that can be added to the DNA telling the cell to turn off that part of the DNA. If there is methylation on a gene that normally stops a cell from growing like crazy, that methylation would turn that gene off and the cell could grow out of control.
  • In a subset of samples, they performed next generation sequencing to know the full sequence of the tumor genomes.
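To make the tumor-versus-blood comparison from the Methods a little more concrete, here is a toy sketch in Python. It is not the TCGA's actual pipeline, which has to handle sequencing errors, alignment, and statistics; the function name and the two short sequences are made up for illustration. It simply lines up a "normal" sequence against a "tumor" sequence and reports every position where they differ.

```python
def find_tumor_only_changes(normal_dna, tumor_dna):
    """Return (position, normal_base, tumor_base) for each position that differs.

    Toy assumption: both sequences are already lined up and the same length.
    """
    if len(normal_dna) != len(tumor_dna):
        raise ValueError("this toy comparison needs sequences of equal length")
    return [
        (position, normal_base, tumor_base)
        for position, (normal_base, tumor_base) in enumerate(zip(normal_dna, tumor_dna))
        if normal_base != tumor_base
    ]


# Made-up example sequences (not real patient data).
normal = "ATGGCCTTAGC"  # DNA from blood
tumor = "ATGGACTTAGT"   # DNA from the tumor

for position, normal_base, tumor_base in find_tumor_only_changes(normal, tumor):
    print(f"position {position}: {normal_base} -> {tumor_base}")
```

Positions are counted from zero here, so the two changes in this example are reported at positions 4 and 10.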

Results and Discussion: From all of this data, the researchers found  quite a bit.

  • Copy number results: There were many differences in copy number, including deletions of genes important for slowing growth and duplications of genes that told the cell to grow more.
  • Gene expression results: Genes that are responsible for cell growth, like the gene EGFR, were expressed more in glioblastoma tumor cells. This has proven to be an interesting result because there are drugs that inhibit EGFR. These drugs are currently being tested in the clinic to see whether they are a good treatment for patients with a glioblastoma that expresses a lot of EGFR.
  • Methylation results: They found that a gene called MGMT, which is responsible for fixing mutated DNA, was highly methylated. This methylation was actually beneficial to patients because it made them more sensitive to the most common chemotherapy, temozolomide. However, this result also suggests that losing MGMT methylation may cause treatment resistance.
  • Sequencing results: From all of the sequencing they created over 97 million base pairs of data! They found mutations in over 200 human genes. From statistical analysis, seven genes had significant mutations, including a gene called p53, which usually prevents damaged cells from growing; when it is mutated, the cell can more easily grow out of control.

[Image: glioblastoma pathways]

This is the summary figure from this paper that shows the three main pathways changed in glioblastoma and the evidence they found to support these genes’ involvement. Each colored circle or rectangle represents a different gene. Blue means that the gene is deleted and red means that there is more of that gene in glioblastoma tumors.

Bringing all of this data together, scientists found three main pathways that lead to cancer in glioblastoma (see the image above for these pathways). These pathways provide targets for treatment, because drugs can be aimed at specific genes in them. Scientists also identified a new glioblastoma subtype that has improved survival. This is great for patients who find out that they have this subtype! Changes in the methylation also show how patients could acquire resistance to chemotherapy. Although chemotherapy resistance is bad for the patient, understanding how it happens allows scientists to develop drugs to overcome the resistance based on these specific pathways.

Although this is where the story ended for this article, the TCGA data has been used for many more studies about glioblastoma. For example, in 2010, TCGA data was used to identify four different subtypes of glioblastoma (Proneural, Neural, Classical, and Mesenchymal) that have helped to tailor the type of treatments used for each group. For example, proneural glioblastoma does not benefit from aggressive treatment, whereas other subtypes do. Other researchers are using the information about glioblastoma mutations to develop new treatments for the disease.

To learn more about the Cancer Genome Atlas Project, check out the article “The Cancer Genome Atlas: an immeasurable source of knowledge” or watch this video about the clinical implications of the TCGA findings about glioblastoma.

How do we know the genome sequence?

Imagine someone asked you to explain how a car works. Even if you knew nothing about cars, you could take the car apart piece by piece, inspect each piece in your hand and probably draw a pretty good diagram of how a car is put together.  You wouldn’t understand how it works, but you’d have a good start in trying to figure it out.

Now what if someone asked you to figure out how the genome works? You know it’s made of DNA, but it’s the ORDER of the nucleotides that helps to understand how the genome works (remember genes and proteins?). All the time in the news, you hear about a scientist or a doctor who looked at the sequence of the human genome and from that information could conclude possible causes of the disease or a way to target the treatment. DNA sequencing forms a cornerstone of personalized medicine, but how does this sequencing actually work? How do you take apart the genome like a car so you can start to understand how it works?

As a quick reminder – DNA is made out of four different nucleotides, A, T, G, and C, that are lined up in a specific order to make up the 3 billion nucleotides in the human genome.  DNA looks like a ladder where the rungs are made up of bases that stick to one another: A always sticking to T and G always sticking to C.  Since A always sticks to T and G always sticks to C, if you know the sequence that makes up one side of the ladder, you also know the sequence of the other side.
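Here is a minimal Python sketch of that pairing rule, just to show how one side of the ladder determines the other. The function name and the example sequence are made up for illustration.

```python
# Each base and the partner it always sticks to.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def other_side(strand):
    """Return the bases on the opposite side of the ladder, rung by rung."""
    return "".join(PAIRS[base] for base in strand)


print(other_side("ATGCCG"))  # prints TACGGC
```

One detail this sketch leaves out: the two sides of real DNA run in opposite directions, so biologists usually write the partner strand reversed (the "reverse complement").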

[Image: DNA ladder]

The first commonly used sequencing method is called Sanger sequencing, named after Frederick Sanger, who invented the method in 1977. Sanger sequencing takes advantage of this DNA ladder: the method breaks the ladder in half and, using glowing (fluorescent) nucleotides of different colors, rebuilds the other side one nucleotide at a time. A detector that can detect the different fluorescent colors creates an image of these colors that a program then “reads” to give the researcher the sequence of the nucleotides (see image below to see what this looks like). These sequences are just long strings of As, Ts, Gs, and Cs that the researcher can analyze to better understand the sequence for their experiments.

[Image: Sanger sequencing]

This was a revolutionary technique, and when the Human Genome Project started in 1990, Sanger sequencing was the only technique available to scientists. However, this method can only sequence about 700 nucleotides at one time, and even the most advanced machine in 2015 only runs 96 sequencing reactions at one time. In 1990, using Sanger sequencing, scientists planned on running lots and lots of sequencing reactions at one time, and they expected this effort would take 15 years and cost $3 billion. The first draft of the human genome was announced in 2000 through a public effort and a parallel private effort by Celera Genomics that cost only $300 million and took only 3 years once they jumped into the ring in 1998 (why was it cheaper and faster, you ask? They developed a fast “shotgun” method and analysis techniques that sped up the process).

As you may imagine, for personalized medicine where sequencing a huge part of the genome may be necessary for every man, woman, and child, 3-15 years and $300M-$3B per sequence is not feasible. Fortunately, genome sequencing technology advanced in the 2000s to what’s called Next Generation Sequencing. There are a lot of different versions of Next Gen Sequencing (often abbreviated as NGS), but basically all of them run thousands and thousands of sequencing reactions all at the same time. Instead of reading 700 nucleotides at one time as in Sanger sequencing, NGS methods can read up to 3 billion bases in one experiment.
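To get a feel for the scale, here is a back-of-the-envelope calculation in Python using only the numbers mentioned in this post (3 billion nucleotides, about 700 nucleotides per Sanger read, 96 reactions per machine run). It is a rough sketch under a big simplifying assumption: every nucleotide is read exactly once with no overlap between reads. Real projects read each spot many times over, so the true numbers are much larger.

```python
GENOME_SIZE = 3_000_000_000   # nucleotides in the human genome
SANGER_READ_LENGTH = 700      # nucleotides per Sanger reaction (approximate)
REACTIONS_PER_RUN = 96        # reactions one machine runs at a time


def ceil_div(total, chunk):
    """Divide and round up, using whole numbers only."""
    return -(-total // chunk)


# Simplifying assumption: every nucleotide is read exactly once, no overlap.
reads_needed = ceil_div(GENOME_SIZE, SANGER_READ_LENGTH)
runs_needed = ceil_div(reads_needed, REACTIONS_PER_RUN)

print(f"Sanger reactions needed: {reads_needed:,}")
print(f"96-reaction machine runs needed: {runs_needed:,}")
```

Even under this generous assumption, that is more than 4 million Sanger reactions spread over tens of thousands of machine runs, compared with a single NGS experiment.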

How does this work? Short DNA sequences are stuck to a slide and replicated over and over. This makes dots of the exact same sequence, and thousands and thousands of these dots are created on one slide. Then, like Sanger sequencing, glowing nucleotides build the other side of the DNA ladder one nucleotide at a time. In this case though, the surface looks like a confetti of dots that have to be read by a sophisticated computer program to determine the millions of sequences.

[Image: NGS]

So what has this new technology allowed scientists to do? It has decreased the cost of sequencing a genome to around $1000. It has also allowed researchers to sequence large numbers of genomes to better understand the genetic differences between people, to better understand other species' genomes (including the bacteria that colonize us or the viruses that infect us), and to help determine the genetic changes in tumors to better detect and treat these diseases. Next Generation Sequencing allows doctors to actually use genome sequencing in the clinic. A version of genome sequencing has been developed called “exome sequencing” that only sequences the genes. Since genes only make up about 1-2% of the genome, NGS of the exome takes less time and money but provides lots of information about what some argue is the most important part of the genome – the part that encodes proteins. Much of the promise of personalized medicine can be found through this revolutionary DNA sequencing technique – and with the cost getting lower and lower, there may be a day soon when you too will have your genome sequence as part of your medical record.


For more information about the history of sequencing, check out this article “DNA Sequencing: From Bench to Bedside and Beyond” in the journal Nucleic Acids Research.

Here is an amusing short video about how Next Generation Sequencing works described by the most interesting pathologist in the world.