The Chan Zuckerberg Initiative announced Thursday the launch of rBio, the first artificial intelligence model trained to reason about cellular biology using virtual simulations rather than requiring expensive laboratory experiments, a breakthrough that could dramatically accelerate biomedical research and drug discovery.
The reasoning model, detailed in a research paper published on bioRxiv, demonstrates a novel approach called “soft verification” that uses predictions from virtual cell models as training signals instead of relying solely on experimental data. This paradigm shift could help researchers test biological hypotheses computationally before committing time and resources to costly laboratory work.
“The idea is that you have these super powerful models of cells, and you can use them to simulate outcomes rather than testing them experimentally in the lab,” said Ana-Maria Istrate, senior research scientist at CZI and lead author of the research, in an interview. “The paradigm so far has been that 90% of the work in biology is tested experimentally in a lab, while 10% is computational. With virtual cell models, we want to flip that paradigm.”
How AI finally learned to speak the language of living cells
The announcement represents a significant milestone for CZI’s ambitious goal to “cure, prevent, and manage all disease by the end of this century.” Under the leadership of pediatrician Priscilla Chan and Meta CEO Mark Zuckerberg, the $6 billion philanthropic initiative has increasingly focused its resources on the intersection of artificial intelligence and biology.
rBio addresses a fundamental challenge in applying AI to biological research. While large language models like ChatGPT excel at processing text, biological foundation models typically work with complex molecular data that can’t be easily queried in natural language. Scientists have struggled to bridge this gap between powerful biological models and user-friendly interfaces.
“Foundation models of biology, models like GREmLN and TranscriptFormer, are built on biological data modalities, which means you can’t interact with them in natural language,” Istrate explained. “You have to find complicated ways to prompt them.”
The new model solves this problem by distilling knowledge from CZI’s TranscriptFormer, a virtual cell model trained on 112 million cells from 12 species spanning 1.5 billion years of evolution, into a conversational AI system that researchers can query in plain English.
The ‘soft verification’ revolution: Teaching AI to think in probabilities, not absolutes
The core innovation lies in rBio’s training methodology. Traditional reasoning models learn from questions with unambiguous answers, like mathematical equations. But biological questions involve uncertainty and probabilistic outcomes that don’t fit neatly into binary categories.
CZI’s research team, led by Senior Director of AI Theofanis Karaletsos and Istrate, overcame this challenge by using reinforcement learning with proportional rewards. Instead of simple yes-or-no verification, the model receives rewards proportional to the likelihood that its biological predictions align with reality, as determined by virtual cell simulations.
“We applied new methods to how LLMs are trained,” the research paper explains. “Using an off-the-shelf language model as a scaffold, the team trained rBio with reinforcement learning, a common technique in which the model is rewarded for correct answers. But instead of asking a series of yes/no questions, the researchers tuned the rewards in proportion to the likelihood that the model’s answers were correct.”
This approach allows scientists to ask complex questions like “Would suppressing the actions of gene A result in an increase in activity of gene B?” and receive scientifically grounded responses about cellular changes, including shifts from healthy to diseased states.
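The difference between binary and proportional rewards can be sketched in a few lines of Python. This is only an illustration of the idea as the article describes it, not code from the rBio paper: the function names and the `p_yes` parameter are assumptions, and in rBio's setting the probability would come from a virtual cell model rather than being passed in by hand.

```python
def hard_reward(model_says_yes: bool, ground_truth: bool) -> float:
    """Binary verification: full reward for a correct yes/no answer, none otherwise."""
    return 1.0 if model_says_yes == ground_truth else 0.0


def soft_reward(model_says_yes: bool, p_yes: float) -> float:
    """Proportional ("soft") verification.

    p_yes is a hypothetical verifier's estimated probability that the true
    answer is "yes". The reward equals the probability mass the verifier
    assigns to whichever answer the model gave.
    """
    if not 0.0 <= p_yes <= 1.0:
        raise ValueError("p_yes must be between 0 and 1")
    return p_yes if model_says_yes else 1.0 - p_yes
```

Under this sketch, a model that answers “yes” when the verifier puts the probability of “yes” at 0.8 earns a reward of 0.8, while answering “no” earns roughly 0.2. An uncertain verifier therefore produces a weaker training signal instead of a misleadingly confident one.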
Beating the benchmarks: How rBio outperformed models trained on real lab data
In testing against the PerturbQA benchmark, a standard dataset for evaluating gene perturbation prediction, rBio demonstrated competitive performance with models trained on experimental data. The system outperformed baseline large language models and matched the performance of specialized biological models in key metrics.
Particularly noteworthy, rBio showed strong “transfer learning” capabilities, successfully applying knowledge about gene co-expression patterns learned from TranscriptFormer to make accurate predictions about gene perturbation effects, an entirely different biological task.
“We show that on the PerturbQA dataset, models trained using soft verifiers learn to generalize on out-of-distribution cell lines, potentially bypassing the need to train on cell-line specific experimental data,” the researchers wrote.
When enhanced with chain-of-thought prompting techniques that encourage step-by-step reasoning, rBio achieved state-of-the-art performance, surpassing the previous leading model SUMMER.
From social justice to science: Inside CZI’s controversial pivot to pure research
The rBio announcement comes as CZI has undergone significant organizational changes, refocusing its efforts from a broad philanthropic mission that included social justice and education reform to a more targeted emphasis on scientific research. The shift has drawn criticism from some former employees and grantees who saw the organization abandon progressive causes.
However, for Istrate, who has worked at CZI for six years, the focus on biological AI represents a natural evolution of long-standing priorities. “My experience and work has not changed much. I have been part of the science initiative for as long as I have been at CZI,” she said.
The focus on virtual cell models builds on nearly a decade of foundational work. CZI has invested heavily in building cell atlases (comprehensive databases showing which genes are active in different cell types across species) and in developing the computational infrastructure needed to train large biological models.
“I’m really excited about the work that’s been happening at CZI for years now, because we’ve been building up to this moment,” Istrate noted, referring to the organization’s earlier investments in data platforms and single-cell transcriptomics.
Building bias-free biology: How CZI curated diverse data to train fairer AI models
One critical advantage of CZI’s approach stems from its years of careful data curation. The organization operates CZ CELLxGENE, one of the largest repositories of single-cell biological data, where information undergoes rigorous quality control processes.
“We’ve generated some of the flagship initial data atlases for transcriptomics, and those were generated with diversity in mind to minimize bias in terms of cell types, ancestry, tissues, and donors,” Istrate explained.
This attention to data quality becomes crucial when training AI models that could influence medical decisions. Unlike some commercial AI efforts that rely on publicly available but potentially biased datasets, CZI’s models benefit from carefully curated biological data designed to represent diverse populations and cell types.
Open source vs. big tech: Why CZI is giving away billion-dollar AI technology for free
CZI’s commitment to open-source development distinguishes it from commercial competitors like Google DeepMind and pharmaceutical companies developing proprietary AI tools. All CZI models, including rBio, are freely available through the organization’s Virtual Cell Platform, complete with tutorials that can run on free Google Colab notebooks.
“I do think the open source piece is very important, because that’s a core value that we’ve had since we started CZI,” Istrate said. “One of the main goals for our work is to accelerate science. So everything we do, we want to make it open source for that purpose only.”
This strategy aims to democratize access to sophisticated biological AI tools, potentially benefiting smaller research institutions and startups that lack the resources to develop such models independently. The approach reflects CZI’s philanthropic mission while creating network effects that could accelerate scientific progress.
The end of trial and error: How AI could slash drug discovery from decades to years
The potential applications extend far beyond academic research. By enabling scientists to quickly test hypotheses about gene interactions and cellular responses, rBio could significantly accelerate the early stages of drug discovery, a process that typically takes decades and costs billions of dollars.
The model’s ability to predict how gene perturbations affect cellular behavior could prove particularly valuable for understanding neurodegenerative diseases like Alzheimer’s, where researchers need to identify how specific genetic changes contribute to disease progression.
“Answers to those questions can shape our understanding of the gene interactions contributing to neurodegenerative diseases like Alzheimer’s,” the research paper notes. “Such knowledge may lead to earlier intervention, perhaps halting these diseases altogether someday.”
The universal cell model dream: Integrating every kind of biological data into one AI brain
rBio represents the first step in CZI’s broader vision to create “universal virtual cell models” that integrate knowledge from multiple biological domains. Currently, researchers must work with separate models for different types of biological data (transcriptomics, proteomics, imaging) without easy ways to combine insights.
“One of the grand challenges in building these virtual cell models and understanding cells over the next couple of years, as I mentioned, is how to integrate knowledge from all of these super powerful models of biology,” Istrate said. “The main challenge is, how do you integrate all of this information into one space?”
The researchers demonstrated this integration capability by training rBio models that combine multiple verification sources: TranscriptFormer for gene expression data, specialized neural networks for perturbation prediction, and knowledge databases like Gene Ontology. These combined models significantly outperformed single-source approaches.
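One simple way to picture several verification sources feeding a single reward is a weighted average of their probability estimates. The article does not say how rBio actually combines its verifiers, so the averaging scheme, the stub verifiers, and every name below are assumptions made purely for illustration.

```python
from typing import Callable, Dict

# A verifier maps a (gene_a, gene_b) question to a probability that the answer is "yes".
Verifier = Callable[[str, str], float]


def expression_stub(gene_a: str, gene_b: str) -> float:
    """Hypothetical stand-in for a gene expression model's estimate."""
    return 0.9


def ontology_stub(gene_a: str, gene_b: str) -> float:
    """Hypothetical stand-in for a knowledge-base lookup."""
    return 0.5


def combined_reward(
    model_says_yes: bool,
    gene_a: str,
    gene_b: str,
    verifiers: Dict[str, Verifier],
    weights: Dict[str, float],
) -> float:
    """Average the verifiers' probabilities by weight, then reward proportionally."""
    total_weight = sum(weights[name] for name in verifiers)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    p_yes = sum(
        weights[name] * verify(gene_a, gene_b) for name, verify in verifiers.items()
    ) / total_weight
    return p_yes if model_says_yes else 1.0 - p_yes


VERIFIERS: Dict[str, Verifier] = {"expression": expression_stub, "ontology": ontology_stub}
EQUAL_WEIGHTS: Dict[str, float] = {"expression": 1.0, "ontology": 1.0}
```

With the two stubs weighted equally, a “yes” answer earns a reward of about 0.7, the midpoint of the two estimates. Raising the weight on one source pulls the reward toward that source's opinion, which is one plausible way to let a more trusted verifier dominate.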
The roadblocks ahead: What could stop AI from revolutionizing biology
Despite its promising performance, rBio faces several technical challenges. The model’s current expertise focuses primarily on gene perturbation prediction, though the researchers indicate that any biological domain covered by TranscriptFormer could theoretically be incorporated.
The team continues working on improving the user experience and implementing appropriate guardrails to prevent the model from providing answers outside its area of expertise, a common challenge in deploying large language models for specialized domains.
“While rBio is ready for research, the model’s engineering team is continuing to improve the user experience, because the flexible problem-solving that makes reasoning models conversational also poses a number of challenges,” the research paper explains.
The trillion-dollar question: How open source biology AI could reshape the pharmaceutical industry
The development of rBio occurs against the backdrop of intensifying competition in AI-driven drug discovery. Major pharmaceutical companies and technology firms are investing billions in biological AI capabilities, recognizing the potential to transform how medicines are discovered and developed.
CZI’s open-source approach could accelerate this transformation by making sophisticated tools available to the broader research community. Academic researchers, biotech startups, and even established pharmaceutical companies can now access capabilities that would otherwise require substantial internal AI development efforts.
The timing proves significant as the Trump administration has proposed substantial cuts to the National Institutes of Health budget, potentially threatening public funding for biomedical research. CZI’s continued investment in biological AI infrastructure could help maintain research momentum during periods of reduced government support.
A new chapter in the race against disease
rBio’s launch marks more than just another AI breakthrough: it represents a fundamental shift in how biological research could be conducted. By demonstrating that virtual simulations can train models as effectively as expensive laboratory experiments, CZI has opened a path for researchers worldwide to accelerate their work without the traditional constraints of time, money, and physical resources.
As CZI prepares to make rBio freely available through its Virtual Cell Platform, the organization continues expanding its biological AI capabilities with models like GREmLN for cancer detection and ongoing work on imaging technologies. The success of the soft verification approach could influence how other organizations train AI for scientific applications, potentially reducing dependence on experimental data while maintaining scientific rigor.
For an organization that began with the audacious goal of curing all diseases by the century’s end, rBio offers something that has long eluded medical researchers: a way to ask biology’s hardest questions and get scientifically grounded answers in the time it takes to type a sentence. In a field where progress has traditionally been measured in decades, that kind of speed could make all the difference between diseases that define generations and diseases that become distant memories.

