Is predicted data a viable alternative to real data?

Roy Van der Weide's picture

The primary motivation for predicting data in economics, health sciences, and other disciplines has been to deal with various forms of missing-data problems. However, one could also make a case for adopting prediction methods to obtain more cost-efficient estimates of welfare indicators when the outcome of interest is expensive to observe relative to its predictors. Consider, for example, the estimation of poverty and malnutrition rates. The conventional estimators require household- and individual-level data on expenditures and health outcomes, and collecting these data is generally costly. It is not uncommon for statistical agencies in developing countries, where poverty and poor health outcomes are most pressing, to lack the budget needed to collect these data frequently. As a result, official estimates of poverty and malnutrition are often outdated. For example, across the 26 low-income countries in Sub-Saharan Africa over the period between 1993 and 2012, the national poverty rate and the prevalence of stunting among children under five are reported in the World Development Indicators on average only once every five years and once every ten years, respectively.

In recent years, a number of studies have explored the option of predicting household expenditure data into existing secondary surveys in an effort to supplement existing poverty estimates and increase their frequency (Stifel and Christiaensen, 2007; Douidich et al., 2015). Douidich et al. (2015), for example, use the Labor Force Survey as their secondary survey, which is often available at a higher frequency than household expenditure surveys.

There is also a large literature that predicts household expenditure and individual health data into the population census; see, for example, Elbers et al. (2003), Fujii (2010), and Elbers and van der Weide (2014) and the references therein. The objective here is to obtain estimates of welfare at a high level of disaggregation, that is, at the level of small areas such as districts. These small area estimates are often presented in the form of a map known as a poverty map. Relying on the direct sample estimator would be impractical: obtaining a reliable estimate for each small area would require a survey with an extraordinarily large number of households, which would be financially infeasible.

It is then a small step to purposefully collect data on covariates that are ideally suited for the prediction of household- or individual-level outcomes of interest. If real data on the variable of interest are collected for a sub-sample of households, then this sub-sample can be used to build the model that is used for prediction. The advantage of this approach is that the prediction model will apply to the population of interest by construction. There is considerable interest in adopting such an approach in practice, in the hope that it will enable a meaningful reduction in financial costs.
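
The idea of fitting a model on a sub-sample and predicting for everyone else can be sketched on simulated data. The snippet below is only an illustration under strong assumptions, not the estimator used in any of the studies cited here: log expenditure is assumed to be linear in two made-up covariates with normally distributed errors, and the function name `predicted_poverty_rate`, the sample sizes, and the poverty line are all hypothetical.

```python
import math
import numpy as np


def normal_cdf(t):
    """Standard normal CDF, applied elementwise."""
    erf = np.vectorize(math.erf)
    return 0.5 * (1.0 + erf(np.asarray(t) / math.sqrt(2.0)))


def predicted_poverty_rate(x_all, x_sub, y_sub, poverty_line):
    """Fit OLS on the sub-sample with observed outcomes, then average the
    predicted probability of being poor over all households (assumes
    normally distributed errors; an illustrative assumption)."""
    X_sub = np.column_stack([np.ones(len(x_sub)), x_sub])
    beta, *_ = np.linalg.lstsq(X_sub, y_sub, rcond=None)
    resid = y_sub - X_sub @ beta
    sigma = np.sqrt(resid @ resid / (len(y_sub) - X_sub.shape[1]))
    X_all = np.column_stack([np.ones(len(x_all)), x_all])
    prob_poor = normal_cdf((poverty_line - X_all @ beta) / sigma)
    return float(prob_poor.mean())


# Simulated survey: covariates for all n households, outcome for m of them.
rng = np.random.default_rng(0)
n, m = 5000, 500
x = rng.normal(size=(n, 2))
y = 1.0 + x @ np.array([0.5, -0.3]) + rng.normal(scale=0.4, size=n)
z = 0.5  # hypothetical poverty line on the log-expenditure scale

sub = rng.choice(n, size=m, replace=False)
rate_pred = predicted_poverty_rate(x, x[sub], y[sub], z)
rate_true = float((y < z).mean())  # only knowable here because y is simulated
print(f"predicted: {rate_pred:.3f}  full-data: {rate_true:.3f}")
```

Because the simulated model is correctly specified, the two rates should be close. With real data the model is never known to be correct, so this agreement cannot be taken for granted.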

In our recent study (Fujii and van der Weide, 2016), we examine whether meaningful reductions in financial costs can be realized while preserving statistical precision. Specifically, we solve a cost minimization problem subject to a statistical precision constraint, as well as its dual: a variance minimization problem subject to a budget constraint. This helps us identify the conditions under which the gains from using predicted data are relatively large (and small).
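
In stylized form, the two problems might be written as follows. This is only a schematic sketch, not the formulation in the study itself, and every symbol is an assumption introduced for illustration: $n$ is the number of households for which covariates are collected, $m \le n$ is the number for which the outcome is also observed, $C(n, m)$ is the total survey cost, $V(n, m)$ is the variance of the resulting estimator, $\bar{V}$ is the largest variance that is tolerated, and $B$ is the available budget.

```latex
\begin{align*}
\text{Cost minimization:} \quad
  & \min_{n,\, m} \; C(n, m)
  && \text{subject to} \quad V(n, m) \le \bar{V}, \quad 0 \le m \le n, \\
\text{Variance minimization:} \quad
  & \min_{n,\, m} \; V(n, m)
  && \text{subject to} \quad C(n, m) \le B, \quad 0 \le m \le n.
\end{align*}
```

In this notation, setting $m = n$ corresponds to observing the outcome for every sampled household, so no prediction is needed.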

We find that the financial gains tend to be modest. When we calibrate the parameters of the prediction model and the financial cost function to real data in the context of poverty estimation, we find that the reductions in costs rarely exceed 25 percent and are often below 10 percent. There are circumstances in which the gains can be more substantial, but we conjecture that these are the exceptions rather than the rule.

Moreover, our analysis currently abstracts from model misspecification error. Ignoring this component of error obviously favors the use of predicted data. Accounting for misspecification error is not straightforward: it is hard to quantify, since the true model is inherently unknown and any given estimate of the model can be misspecified in infinitely many ways. Consequently, applied users should always bear in mind that estimates derived from predicted data are arguably less precise than conventional standard errors suggest.

Given these observations, when new data are to be collected, we recommend that the outcome variable of interest be included so that one does not have to rely on predicted data. This does not mean that there is no role for prediction estimators. Under the right circumstances, we believe they could be of great value. For one, prediction estimators provide a means of leveraging existing data, as is done in Elbers et al. (2003), Fujii (2010), and Douidich et al. (2015). Furthermore, if no previous data exist and the budget is so constrained that the choice is between predicted data and no data, then the former may be preferred. In such a data-poor environment, which is not unheard of in developing countries, prediction estimators may continue to provide a valuable option.