New method improves the reliability of statistical estimates

Suppose an environmental scientist is studying whether exposure to air pollution is related to lower birth weights in a specific county.

The scientist could train a machine-learning model to estimate the extent of this relationship, since machine-learning methods are particularly good at learning complex relationships.

Standard machine-learning methods are great for making predictions, and some even provide uncertainty measures, such as confidence intervals, for those predictions. However, they often don’t provide estimates or confidence intervals when the question is whether two variables are related. Other methods have been developed specifically to handle this association problem and to supply confidence intervals. But in spatial settings, MIT researchers found that these confidence intervals can be way off base.

When variables such as air pollution or precipitation vary across many locations, common methods of generating confidence intervals can claim high levels of confidence when, in fact, the estimate did not capture the true value at all. These erroneous confidence intervals can mislead users into trusting a flawed model.

After identifying this shortcoming, the researchers developed a new method designed to generate valid confidence intervals for problems involving spatially varying data. In simulations and experiments with real data, their method was the only technique that consistently generated accurate confidence intervals.

This work could help researchers in fields such as environmental science, economics, and epidemiology better understand when to trust the results of certain experiments.

“There are so many problems where people are interested in understanding phenomena in space, such as weather or forest management. We have shown that for this broad class of problems there are more appropriate methods that allow us to achieve better performance, better understanding of what is going on, and more trustworthy results,” says Tamara Broderick, associate professor in MIT’s Department of Electrical Engineering and Computer Science (EECS), a member of the Laboratory for Information and Decision Systems (LIDS) and the Institute for Data, Systems, and Society, an affiliate of the Computer Science and Artificial Intelligence Laboratory (CSAIL), and senior author of a paper on this research.

Broderick is joined on the paper by co-lead authors David R. Burt, a postdoctoral researcher, and Renato Berlinghieri, an EECS doctoral student, as well as Stephen Bates, assistant professor of EECS and a member of LIDS. The research was recently presented at the Conference on Neural Information Processing Systems.

Invalid assumptions

Spatial association examines how a variable and a particular outcome are related across a geographic area. For example, one might want to examine how tree cover is related to elevation in the United States.

To solve this kind of problem, a scientist could collect observational data from many locations and use it to estimate the relationship at another location where no data exists.

The MIT researchers realized that in this case, existing methods often generate completely incorrect confidence intervals. A model might say it is 95 percent confident that its estimate captures the true relationship between tree cover and elevation, even though it has not captured that relationship at all.

After studying this problem, the researchers found that the assumptions on which these confidence-interval methods are based don’t hold up when the data varies spatially.

Assumptions are like rules that must be followed to ensure that the results of a statistical analysis are valid. Common methods for generating confidence intervals rest on a few such assumptions.

First, they assume that the source data, i.e. the observational data collected to train the model, is independent and identically distributed. This assumption implies that whether one location is included in the data does not affect whether another location is included. But, for example, the U.S. Environmental Protection Agency’s (EPA) air sensors are placed with the locations of other air sensors in mind.

Second, existing methods often assume that the model is completely correct, but this assumption never holds in practice. Finally, they assume that the source data is similar to the target data to be estimated.

However, in spatial settings, the source data can be fundamentally different from the target data, because the target data is in a different location than where the source data was collected.

For example, a scientist could use data from EPA pollution monitors to train a machine-learning model that predicts health outcomes in a rural area where there are no monitors. But EPA pollution monitors are likely to be placed in urban areas, where there is more traffic and heavy industry, so the air-quality data they collect will be very different from air quality in rural areas.

In this case, association estimates based on the city data are biased, because the target data systematically differs from the source data.
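This failure mode is easy to reproduce. The sketch below (in Python, with entirely made-up numbers; it is not the researchers’ data or method) builds a naive 95 percent confidence interval for average pollution from monitors clustered near an urban source, then checks it against the true average in a rural region farther away:

```python
import math
import random
import statistics

random.seed(0)

# Hypothetical pollution field (illustrative numbers only): levels decay
# smoothly with distance s (in km) from an urban source at s = 0,
# plus measurement noise.
def measure(s):
    return 50.0 * math.exp(-s) + random.gauss(0.0, 1.0)

# The monitors ("source data") all sit near the urban source: s in [0, 1].
readings = [measure(random.uniform(0.0, 1.0)) for _ in range(100)]

# A standard 95 percent confidence interval for the mean, which implicitly
# assumes the monitors are representative of the region we care about.
m = statistics.mean(readings)
se = statistics.stdev(readings) / math.sqrt(len(readings))
ci = (m - 1.96 * se, m + 1.96 * se)

# True average pollution in a rural target region, s in [4, 5]:
# the integral of 50 * exp(-s) over [4, 5].
target_mean = 50.0 * (math.exp(-4.0) - math.exp(-5.0))  # about 0.58

print(ci[0] <= target_mean <= ci[1])  # False: the interval misses entirely
```

Because every monitor sits near the source, the interval is narrow and confidently wrong: the rural average lies far outside it.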

A smoother solution

The new method for generating confidence intervals explicitly takes this potential bias into account.

Instead of assuming that source and target data are similar, the researchers assume that the data varies smoothly in space.

For example, in the case of particulate air pollution, one wouldn’t expect the pollution levels on one city block to be very different from the levels on the next block. Instead, pollution levels would decrease gradually as one moves away from a pollution source.

“For these types of problems, this spatial smoothness assumption is more appropriate. It fits better with what’s actually going on in the data,” says Broderick.
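To see how a smoothness assumption can translate into more honest uncertainty, consider a hypothetical Lipschitz-style bound (illustrative only, not the procedure in the paper): if the field is assumed to change by at most a fixed amount per kilometer, an interval estimated at the nearest monitor can be widened by the largest change that assumption allows over the extrapolation distance.

```python
def widened_interval(estimate, half_width, distance_km, lip=2.0):
    """Pad a confidence interval for extrapolation to an unmonitored site.

    `lip` is an assumed (made-up) smoothness bound: the field changes by at
    most `lip` units per km, so over `distance_km` it can shift by at most
    `lip * distance_km` in either direction.
    """
    pad = lip * distance_km
    return (estimate - half_width - pad, estimate + half_width + pad)

# A naive interval of 30 +/- 2 at the monitor, carried 5 km into a rural area:
lo, hi = widened_interval(30.0, 2.0, 5.0)
print((lo, hi))  # (18.0, 42.0): wider, reflecting the extrapolation risk
```

The widened interval trades precision for validity: it can still cover the truth at the unmonitored site, where the naive interval from the previous example could not.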

When they compared their method with other popular techniques, they found it was the only one that consistently produced reliable confidence intervals for spatial analyses. Moreover, their method remains reliable even when the observational data is distorted by random errors.

In the future, the researchers would like to apply this analysis to other kinds of variables and explore other applications where it could provide more reliable results.

This research was funded, in part, by a seed grant from the MIT Social and Ethical Responsibilities of Computing (SERC), the Office of Naval Research, Generali, Microsoft, and the National Science Foundation (NSF).
