Big Data in Radiation Oncology


Highlight [306]: “big data” often refers to extensive records on a large number of patients, consisting of either structured or unstructured clinical information that can include patient characteristics, diagnostic and treatment history, genomic and molecular data, and sometimes billing history.

Highlight [306]: Since big datasets usually contain real-world information, their use could bridge gaps between clinical and translational studies.

Highlight [306]: Moreover, randomized controlled trials include, on average, 4% of all cancer patients, and are known to underrepresent minorities and other underserved populations. 3

National Population-Based Cancer Databases

Highlight [306-307]: The most notable population-based cancer databases are those assembled by NCI’s Surveillance, Epidemiology, and End Results (SEER) Program and the Centers for Disease Control and Prevention’s National Program of Cancer Registries (NPCR).

It is difficult to rely on the billing history. There are several caveats associated with it, and any data that includes it needs to be cleaned up for, say, accidental or repeat purchases.

This is highly important in this context and will become more apparent later in the paper, especially for the genetic studies.

Cancer Genomics and Other

Highlight [307]: 35% of the US population. SEER gathers patient demographics (age, gender, race, ethnicity, and birthplace); cancer characteristics (tumor cell types, biological and clinical aspects, and some biomarker and genomic information on tumors); stages of disease; treatment information (surgery, radiation, chemotherapy, hormone therapy, and immunotherapy); and patient outcomes (vital status and cause of death)

Highlight [307]: Both longitudinal follow-up information and recurrence data are not included. Furthermore, often the available end point is overall survival, but no other details are given.

Highlight [307]: American Society of Clinical Oncology launched the CancerLinQ initiative, an attempt to assemble data from every cancer patient in the United States, and make them available for analyses.

Highlight [307]: CancerLinQ aggregates data from electronic health records (EHRs) via direct feeds without needing to reformat the data source. CancerLinQ then processes and transforms the datasets through cloud-based algorithms.

Highlight [307]: Flatiron Health, which has created OncologyCloud for this purpose. 15 Flatiron’s network comprises over 250 cancer clinics with 1.5 million active patients, compiled into a single data system via a cloud-based EHR platform.

Highlight [307]: Flatiron’s system has interestingly been both integrated with clinical genomics data from Foundation Medicine and been used by the Food and Drug Administration to evaluate the role of “real world evidence.”

Highlight [307]: As of March 2019, the TCGA has sequenced and molecularly profiled tumors from over 33,000 individuals with nearly 70 different types of cancer − analyzing over 22,000 genes, and discovering 3,140,000 mutations.

Highlight [307]: The NCI’s Clinical Proteomic Tumor Analysis Consortium is a consortium of institutions and investigators that uses pan-“omic” analyses to evaluate the molecular basis of cancer.

There’s more than survival! They haven’t included the death records. In the absence of a uniform identifier, it is impossible to link the data, especially from the time of diagnosis to the time the patient succumbs to the disease.
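To make the linkage point in the note above concrete, here is a minimal sketch of what a shared identifier buys you. Everything in it is assumed for illustration: the `patient_id` field, the record layouts, and the sample rows are hypothetical and do not come from SEER, NPCR, or the paper. With a common key, joining diagnosis and death records and deriving time from diagnosis to death is a simple lookup; without one, this join is not possible as written.

```python
from datetime import date

# Hypothetical diagnosis records, keyed by an assumed shared patient_id.
diagnoses = [
    {"patient_id": "P001", "diagnosis_date": date(2015, 3, 1)},
    {"patient_id": "P002", "diagnosis_date": date(2016, 7, 15)},
    {"patient_id": "P003", "diagnosis_date": date(2017, 1, 10)},
]

# Hypothetical death records from a separate source using the same identifier.
deaths = [
    {"patient_id": "P001", "death_date": date(2017, 3, 1), "cause": "cancer"},
    {"patient_id": "P003", "death_date": date(2017, 6, 10), "cause": "other"},
]


def link_records(diagnoses, deaths):
    """Join diagnosis and death records on patient_id.

    Patients with no matching death record keep None for the death fields.
    """
    death_by_id = {d["patient_id"]: d for d in deaths}
    linked = []
    for dx in diagnoses:
        death = death_by_id.get(dx["patient_id"])
        death_date = death["death_date"] if death else None
        linked.append(
            {
                "patient_id": dx["patient_id"],
                "diagnosis_date": dx["diagnosis_date"],
                "death_date": death_date,
                "cause": death["cause"] if death else None,
                # Days from diagnosis to death; None if no death record matched.
                "days_to_death": (
                    (death_date - dx["diagnosis_date"]).days if death else None
                ),
            }
        )
    return linked


linked = link_records(diagnoses, deaths)
for row in linked:
    print(row["patient_id"], row["days_to_death"], row["cause"])
```

Keeping the cause of death alongside the interval is what would allow an end point other than overall survival, which is the limitation the highlight on page 307 describes.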

Highlight [308]: TARGET uses a “multiomic” approach and employs various sequencing and array-based methods to examine genomes and transcriptomes.

Highlight [308]: However, in some respects, we have only seen the tip of the iceberg and there remain vast amounts of data to be gathered and deciphered.

Common Myths and Misconceptions of Big Data Research

Highlight [308]: Databases containing information on molecular or other -omics data provide the greatest value and validity for big data research, whereas population registry databases containing primarily patient-level clinical information are more subject to bias.

Highlight [308]: However, tumor samples in TCGA were obtained from a highly selected population, many of whom were clinical trial participants. In addition, ethnic diversity is lacking in TCGA samples.

Highlight [308]: Big observational databases are best suited for generating hypotheses, especially studies using population-based registry databases. Big data should not be used for assessing causal risk factors or patient treatment outcomes, or conducting comparative effectiveness studies.

Highlight [308]: However, when properly designed, observational studies using real-world data can be as valid as clinical trials. It is easy to confuse internal and external validity. It is plausible that real-world observational studies, either genomic- or population-based, need not conform to results from randomized clinical trials, just as rigorously tested in vivo studies often do not agree with in vitro studies.

Highlight [309]: Quality of the data is more important than quantity of the data, especially for certain questions − more is not necessarily better. Haphazardly collected and unprocessed data, even when analyzed with great accuracy, provide limited inherent value to the users. Clean and quality-controlled data are thus far more valuable and effective.

Highlight [309]: Patients will derive the most benefits from interdisciplinary collaboration of researchers. Clinicians and data scientists will benefit from collaborations with genetic epidemiologists to properly design and carry out population-based genomic studies.

Highlight [309]: Bioinformatics uses advanced mathematical algorithms and technological platforms to store and transform data into an interpretable format.

Highlight [309]: Additionally, cognitive computing (ie, artificial intelligence) and machine learning are gaining popularity. With adequate “training,” these new technologies will be able to identify differential therapeutic outcomes of a particular therapy, develop cancer treatment pathways, discover new cancer etiologies, and help deliver personalized interventions.

Highlight [309]: Many of the current treatment paradigms are based on highly selected clinical trials encompassing only a small fraction of total patients with cancer.

Highlight [309]: The advent of big data with massive clinico-genomic variables in large patient populations will be able to dissect common malignancies into distinct subtypes.

Highlight [309]: Large prospective trials looking at a small subset of selected tumor subtypes will then be difficult to accomplish and less useful in this setting.

Highlight [309]: With meticulous study design, data quality assurance, and sound analytical strategy, meaningful clinico-genomic information and treatment outcomes can be collected to aid clinical decisions in real time.


Underline [309]: Dewdney SB, Lachance J: Electronic records, registries, and the development of “Big Data”: Crowd-sourcing quality toward knowledge. Front Oncol 6:268, 2016

Underline [309]: 7. Surveillance, Epidemiology, and End Results (SEER) Program. 2019. (Accessed 1 May 2019).

Underline [310]: Singal G, Miller PG, Agarwala V, et al: Association of patient characteristics and tumor genomics with clinical outcomes among patients with non-small cell lung cancer using a clinicogenomic database. JAMA 321:1391-1399, 2019

Underline [310]: Rudnick PA, Markey SP, Roth J, et al: A description of the clinical proteomic tumor analysis consortium (CPTAC) common data analysis pipeline. J Proteome Res 15:1023-1032, 2016

Underline [310]: Spratt DE, Chan T, Waldron L, et al: Racial/ethnic disparities in genomic sequencing. JAMA Oncol 2:1070-1074, 2016

Underline [310]: Whittemore AS, Nelson LM: Study design in genetic epidemiology: theoretical and practical considerations. J Natl Cancer Inst Monogr 1999: 61-69. PMID: 10854488

Underline [310]: Wang Q, Lu Q, Zhao H: A review of study designs and statistical methods for genomic epidemiology studies using next generation sequencing. Front Genet 6:149, 2015