By studying changes in gene expression, researchers learn how cells function at the molecular level, which could help them understand how certain diseases develop.
But a single person has about 20,000 genes that can influence one another in complex ways, so even knowing which groups of genes to focus on is an enormously complicated problem. In addition, genes work together in modules that regulate one another.
MIT researchers have now developed theoretical foundations for methods that could identify the best way to aggregate genes into related groups, so that researchers can efficiently learn the underlying cause-and-effect relationships between many genes.
Importantly, this new method does so using only observational data. This means researchers do not need to conduct costly, and sometimes infeasible, interventional experiments to obtain the data needed to infer the underlying causal relationships.
In the long run, this technique could help scientists identify potential gene targets to induce specific behaviors more accurately and efficiently, potentially allowing them to develop precise treatments for patients.
“In genomics, it is very important to understand the mechanism underlying cell states. But cells have a multiscale structure, so the level of summarization is very important, too. If you figure out how to properly aggregate the observed data, the information you learn about the system should be more interpretable and useful,” says graduate student Jiaqi Zhang, an Eric and Wendy Schmidt Center Fellow and co-lead author of a paper on this technique.
Zhang is joined on the paper by co-lead author Ryan Welch, currently a master's student in engineering; and senior author Caroline Uhler, a professor in the Department of Electrical Engineering and Computer Science (EECS) and the Institute for Data, Systems, and Society (IDSS), who is also director of the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard and a researcher at MIT's Laboratory for Information and Decision Systems (LIDS). The research will be presented at the Conference on Neural Information Processing Systems.
Learning from observational data
The problem the researchers set out to tackle involves learning gene programs. These programs describe which genes work together to regulate other genes in a biological process, such as cell development or differentiation.
Because scientists cannot efficiently study how all 20,000 genes interact, they use a technique called causal disentanglement to learn how to combine related groups of genes into a representation that lets them efficiently explore cause-and-effect relationships.
In previous work, researchers have shown how this can be done effectively when interventional data are available, that is, data obtained by perturbing variables in the network.
However, conducting interventional experiments is often expensive, and there are scenarios where such experiments are either unethical or the technology is not good enough for the intervention to succeed.
With only observational data, researchers cannot compare genes before and after an intervention to learn how groups of genes work together.
“Most research on causal disentanglement assumes access to interventions, so it was unclear how much information could be disentangled using observational data alone,” says Zhang.
The MIT researchers developed a more general approach that uses a machine-learning algorithm to effectively identify and aggregate groups of observed variables, such as genes, using only observational data.
They can use this technique to identify causal modules and reconstruct an accurate underlying representation of the cause-and-effect mechanism. “While this research was motivated by the problem of elucidating cellular programs, we first needed to develop a new causal theory to understand what could and could not be learned from observational data. With this theory, in future work we can apply our understanding to genetic data and identify gene modules and their regulatory relationships,” says Uhler.
A layered representation
Using statistical techniques, the researchers can compute a mathematical function known as the variance for the Jacobian of each variable. Causal variables that do not affect any subsequent variables should have a variance of zero.
The researchers reconstruct the representation in a layer-by-layer structure, starting by removing the variables in the bottom layer that have a variance of zero. Then they work backward, layer by layer, removing the variables with zero variance to determine which variables, or groups of genes, are connected.
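The layer-by-layer idea can be illustrated with a small Python sketch. This is not the researchers' algorithm: it is a toy under stated assumptions. The graph `PARENTS`, the sine `mechanism`, and every function name are hypothetical. The toy also evaluates a known log density instead of estimating anything from data, treats single variables rather than aggregated groups, and assumes the quantity whose variance matters is the second derivative of the log density with respect to each variable.

```python
import math
import random
from statistics import pvariance

# Hypothetical toy graph: PARENTS[j] lists the direct causes of variable j.
PARENTS = {0: [], 1: [0], 2: [0], 3: [1, 2]}


def mechanism(parent_values):
    """Assumed nonlinear mechanism: sine of the summed parents (0 for roots)."""
    return math.sin(sum(parent_values)) if parent_values else 0.0


def sample(rng):
    """Draw one observation x_j = mechanism(parents) + standard Gaussian noise."""
    x = {}
    for j in sorted(PARENTS):  # keys happen to be in topological order here
        x[j] = mechanism([x[p] for p in PARENTS[j]]) + rng.gauss(0.0, 1.0)
    return x


def log_density(x, active):
    """Log density (up to a constant) of the variables still in `active`.

    This is a valid marginal because only childless variables are removed.
    """
    return sum(
        -0.5 * (x[j] - mechanism([x[p] for p in PARENTS[j]])) ** 2
        for j in active
    )


def second_derivative(x, i, active, h=1e-2):
    """Central finite difference for d^2 log p / d x_i^2."""
    up, down = dict(x), dict(x)
    up[i] += h
    down[i] -= h
    return (
        log_density(up, active) - 2.0 * log_density(x, active)
        + log_density(down, active)
    ) / h ** 2


def peel_layers(samples, tol=1e-6):
    """Repeatedly remove the variables whose derivative has ~zero variance."""
    active, layers = set(PARENTS), []
    while active:
        layer = []
        for i in sorted(active):
            values = [second_derivative(x, i, active) for x in samples]
            if pvariance(values) < tol:
                layer.append(i)
        if not layer:
            raise RuntimeError("no zero-variance variable found")
        layers.append(layer)
        active -= set(layer)
    return layers


rng = random.Random(0)
samples = [sample(rng) for _ in range(500)]
layers = peel_layers(samples)
print(layers)
```

On this toy graph the loop peels variable 3 first, then variables 1 and 2 together, and finally variable 0, which matches the bottom-up order described above. With real data the density is unknown and the variances are only estimates, so deciding which ones count as zero is far less clear-cut than the fixed tolerance used here.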
“Identifying the variances that are zero quickly becomes a combinatorial objective that is quite difficult to solve, so it was a major challenge to develop an efficient algorithm that could solve it,” says Zhang.
In the end, their method produces an abstracted representation of the observed data, with layers of interconnected variables, that accurately summarizes the underlying cause-and-effect structure.
Each variable represents an aggregated group of genes that function together, and the relationship between two variables represents how one group of genes regulates another. Their method effectively captures all the information used to determine each layer of variables.
After proving that their technique was theoretically sound, the researchers ran simulations to show that the algorithm could efficiently disentangle meaningful causal representations using observational data alone.
In the future, the researchers want to apply this technique in real-world genetics applications. They also want to explore how their method could provide additional insights in situations where some interventional data are available, or help scientists understand how to design effective genetic interventions. Ultimately, this method could help researchers more efficiently determine which genes work together in the same program, which could help identify drugs that target those genes to treat specific diseases.
This research is funded partly by the MIT-IBM Watson AI Lab and the US Office of Naval Research.