Controlling the Data Tsunami

Xi (Steven) Chen, Ph.D.: Biostatistician

March 21, 2012 | Leslie Hill

“Taking control of the data tsunami” – that’s how Xi (Steven) Chen, Ph.D., describes the work of a biostatistician.

“With one patient, you can have millions of genetic variations, so in a study with hundreds or thousands of patients, you can imagine how enormous the data set is,” said Chen, assistant professor of Biostatistics.

Generally speaking, biostatisticians are responsible for the data in a research study: they help plan how the data are collected and then analyze the results. A biostatistician may help a researcher determine how many subjects are needed in experimental and control groups, build a statistical model that accounts for multiple variations simultaneously, or organize and report analysis results.
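To make the "how many subjects" question concrete, here is a minimal sketch of one textbook approach: the normal-approximation formula for comparing the means of two equal-sized groups. It is an illustration only, not a description of how Chen or his colleagues size their studies. The function name `subjects_per_group` and the default significance level and power are assumptions chosen for the example.

```python
import math
from statistics import NormalDist


def subjects_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate subjects needed in EACH group (experimental and control).

    Uses the normal-approximation formula for a two-sided, two-sample
    comparison of means:  n = 2 * ((z_{1-alpha/2} + z_{power}) / d) ** 2
    where d (effect_size) is the expected difference between group means
    divided by the common standard deviation.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    if not (0 < alpha < 1 and 0 < power < 1):
        raise ValueError("alpha and power must be between 0 and 1")

    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # two-sided test
    z_power = standard_normal.inv_cdf(power)

    # Round up: a fraction of a subject cannot be enrolled.
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)
```

With these defaults, a moderate standardized difference of 0.5 calls for 63 subjects per group, and halving the expected difference roughly quadruples that number. Exact methods based on the t distribution give slightly larger answers.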

Chen is an expert in analyzing high-dimensional data sets, meaning the millions of genetic variants and their myriad expression patterns that make up that data tsunami.

“Ten years ago to get a whole genome sequenced was a huge project. But it is cheaper and easier every day, and eventually everyone will have their DNA sequenced, and we will have huge amounts of data to deal with,” he said. “People are saying right now that this high-dimensional genomic data is a gold mine, but the gold is not easy to find.”

Chen did help researchers strike gold in a study of triple-negative breast cancer. This subtype accounts for only about 10 percent of breast cancer cases, so the first task was finding a large enough sample size for a meaningful study. Chen developed a model to sort through more than 3,000 publicly available gene expression data sets to detect the patient samples with triple-negative breast cancer, and then developed another model to separate those into six subcategories. The investigators then analyzed gene expression and drug response in cell lines that represented the six subtypes.
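The article does not describe how Chen's detection model works, so the sketch below should be read only as a simplified illustration of the underlying idea. "Triple-negative" means a tumor lacks the estrogen receptor, the progesterone receptor and HER2, which are encoded by the genes ESR1, PGR and ERBB2. The sketch flags a sample when all three genes fall below a cutoff. The cutoffs, sample IDs and expression values are invented for the example, and a fixed-cutoff rule is an assumption standing in for whatever statistical model was actually used.

```python
# Genes encoding the estrogen receptor, progesterone receptor and HER2.
RECEPTOR_GENES = ("ESR1", "PGR", "ERBB2")


def is_triple_negative(expression, cutoffs):
    """Return True if all three receptor genes are below their cutoffs.

    expression: dict mapping gene name -> expression value for one sample
    cutoffs:    dict mapping gene name -> hypothetical "negative" threshold
    """
    return all(expression[gene] < cutoffs[gene] for gene in RECEPTOR_GENES)


def select_triple_negative(samples, cutoffs):
    """Return the IDs of samples flagged as triple-negative, in input order."""
    return [
        sample_id
        for sample_id, expression in samples.items()
        if is_triple_negative(expression, cutoffs)
    ]


# Invented toy data: values and thresholds are illustrative only.
toy_cutoffs = {"ESR1": 5.0, "PGR": 5.0, "ERBB2": 6.0}
toy_samples = {
    "S1": {"ESR1": 2.1, "PGR": 1.8, "ERBB2": 3.0},  # low on all three
    "S2": {"ESR1": 9.4, "PGR": 7.2, "ERBB2": 3.5},  # hormone receptors high
    "S3": {"ESR1": 2.0, "PGR": 2.2, "ERBB2": 11.0},  # HER2 high
}

flagged = select_triple_negative(toy_samples, toy_cutoffs)
```

In this toy example only sample S1 is flagged. Separating the flagged samples into subcategories, as the study went on to do, would be a further clustering step that is not shown here.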

“This is the first study to systematically investigate triple-negative breast cancer in a genomic way,” he said. “We’re still a long way from finding a solution, but this is the beginning. We know now that even in triple-negative breast cancer, it’s not just one simple disease, and the different subcategories respond to different drugs.”

Chen is also involved in a study to better predict the prognosis of patients with colon cancer.

“With stage 2 and stage 3 colon cancer, we’ve found that patients differ dramatically in their metastatic risk, and that determines which treatment they are given.”

So researchers collected data on gene expression and environment to build an integrated prognosis model that predicts metastatic risk.

“Cancer is such a complex disease. Researchers often only look at one part of the data and fail to look at the whole picture. So what we try to do is integrate all of the data – DNA sequence, epigenetic markers, environment, gene expression, proteomics data, etc. We’re trying to bring all of that together to understand cancer.”

Chen is a native of China and, after obtaining an undergraduate degree in biochemistry at Lanzhou University, came to the United States for a master’s degree in molecular genetics. But a required class in basic statistics changed his whole course.

“I found the field fascinating and decided I could show my talents better there, so I made the switch and got a Ph.D. in statistics.”

He says working at Vanderbilt-Ingram Cancer Center is the perfect combination of his background in biology and affinity for numbers.

“You need insights in both areas to feel the research problems and look for solutions,” he said. “We are doing scientific research, just in a different fashion. Instead of a biologic experiment at a lab bench, we’re at a computer. We use different ways of thinking to solve cancer research problems.”


(Photo by Susan Urmy)