In low-income countries, agricultural production and productivity of family farms have direct consequences for income, food security and nutrition outcomes at the household level. The data collected on smallholder agricultural activities as part of large-scale household and farm surveys are, therefore, central to designing policies that aim to increase agricultural productivity through the promotion of modern farm inputs and climate-smart agriculture practices, among others.
Yet accurate measurement of crop yields, a key indicator of agricultural productivity, remains a challenge for smallholder farms. Farmer-reported data on crop production and cultivated areas remain the most common basis for estimating crop yields in large-scale surveys. However, recent research from Ethiopia, Mali, and Uganda has shown that farmer-reported crop yields are prone to significant and systematic measurement errors.
The objective alternative for crop yield estimation is crop cutting. This method involves demarcating a randomly selected portion of a plot (for example, a 4x4 meter area), then harvesting and weighing the crop within this area to estimate the crop yield. The adoption of crop cutting, however, remains limited in large-scale surveys implemented in low-income countries, given its logistical complexity, significant supervision requirements and the resulting high costs.
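To make the arithmetic concrete, here is a minimal sketch of how a weighed harvest from a square subplot might be scaled up to a per-hectare yield. The function name and the 4.0 kg harvest are illustrative assumptions, and the sketch ignores any adjustments (such as drying) that a survey protocol might apply.

```python
SQUARE_METERS_PER_HECTARE = 10_000


def crop_cut_yield_kg_per_ha(harvest_kg, side_m=4.0):
    """Scale the weight harvested from a square subplot up to kg per hectare.

    harvest_kg: weight of the crop harvested inside the subplot (hypothetical input).
    side_m: side length of the square subplot; 4.0 matches the 4x4 meter example.
    """
    if harvest_kg < 0 or side_m <= 0:
        raise ValueError("harvest_kg must be >= 0 and side_m must be > 0")
    subplot_area_m2 = side_m * side_m
    return harvest_kg / subplot_area_m2 * SQUARE_METERS_PER_HECTARE


# Illustrative numbers only: 4.0 kg harvested from a 4x4 meter (16 m^2) subplot.
example_yield = crop_cut_yield_kg_per_ha(4.0)
print(example_yield)  # 2500.0
```

In this made-up example, 4 kg from 16 square meters corresponds to 2,500 kg per hectare.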
The solution: Machine learning and data integration
In a new study, the Living Standards Measurement Study (LSMS), the World Bank’s flagship household survey program, explores whether machine learning (ML) and data integration can be used to impute “missing” crop cut yields on smallholder farms when a survey implementer adopts crop cutting but limits it to a subsample of plots due to budget and logistical constraints.
Imputation refers to the process of predicting missing data. In this case, the data available for crop yields is used to build a model that estimates the data that could not be collected.
Our research leverages unique data from two consecutive rounds of the national agricultural survey in Mali, which is one of the few surveys in Africa to implement crop cutting for an extensive range of crops, including millet, sorghum, maize, rice, cowpea and groundnut.
For each crop, we build a predictive ML model of observed crop cut yields by using only a portion of the plots that were subject to crop cutting during the fieldwork: the training sample. The predictor variables featured in the model include farmer-reported crop yields and plot characteristics that are elicited in the survey, as well as geospatial variables, such as rainfall and soil quality, that are derived for the georeferenced plot locations.
In turn, we obtain “imputed” crop cut yields for the remaining crop cut sample that we exclude from model training: the test sample. The comparison of observed versus imputed yields in the test sample helps us answer the research question of interest.
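The train/test workflow described above can be sketched roughly as follows. This is not the model used in the study: it swaps in a simple k-nearest-neighbors average as a stand-in predictor, runs on synthetic plots rather than survey data, and all variable names (`reported_yield`, `rainfall_mm`, `crop_cut_yield`) are hypothetical.

```python
import math
import random

FEATURES = ["reported_yield", "rainfall_mm"]  # hypothetical predictor names
TARGET = "crop_cut_yield"                     # hypothetical target name


def make_synthetic_plots(n, seed=0):
    """Generate made-up plots (NOT survey data) with a noisy farmer report."""
    rng = random.Random(seed)
    plots = []
    for _ in range(n):
        rainfall = rng.uniform(400.0, 1000.0)
        crop_cut = 200.0 + 1.5 * rainfall + rng.gauss(0.0, 100.0)
        reported = crop_cut * rng.uniform(0.7, 1.5)
        plots.append({"reported_yield": reported,
                      "rainfall_mm": rainfall,
                      "crop_cut_yield": crop_cut})
    return plots


def split_train_test(plots, train_share, seed=0):
    """Randomly assign crop cut plots to a training and a test sample."""
    shuffled = plots[:]
    random.Random(seed).shuffle(shuffled)
    n_train = round(train_share * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


def fit_scaler(train):
    """Mean and standard deviation of each predictor in the training sample."""
    scaler = {}
    for name in FEATURES:
        values = [p[name] for p in train]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scaler[name] = (mean, math.sqrt(var) or 1.0)
    return scaler


def scale(plot, scaler):
    return [(plot[n] - scaler[n][0]) / scaler[n][1] for n in FEATURES]


def impute_knn(train, targets, k=5):
    """Impute each target plot as the mean crop cut yield of its k nearest training plots."""
    scaler = fit_scaler(train)
    train_points = [(scale(p, scaler), p[TARGET]) for p in train]
    imputed = []
    for plot in targets:
        x = scale(plot, scaler)
        nearest = sorted(train_points, key=lambda tp: math.dist(tp[0], x))[:k]
        imputed.append(sum(y for _, y in nearest) / len(nearest))
    return imputed


def mean_absolute_error(observed, predicted):
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


plots = make_synthetic_plots(300)
train, test = split_train_test(plots, train_share=0.5)
imputed = impute_knn(train, test)          # model never sees test crop cut yields
observed = [p[TARGET] for p in test]       # held back only for evaluation
mae = mean_absolute_error(observed, imputed)
print(f"train={len(train)} test={len(test)} MAE={mae:.1f} kg/ha")
```

The key design point is that the test plots' observed crop cut yields are used only at the evaluation step, which mirrors the comparison of observed versus imputed yields.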
Four key findings
- Farmer-reported crop yield emerges as a key predictor, despite its shortcomings: While farmer-reported crop yield may be subject to biases, it still plays a significant role in predicting crop cut yields. Moreover, the models perform better for crops with low intercropping rates and high commercialization rates, i.e., the crops for which farmers may be better positioned to report accurate production information.
- Geospatial data boost prediction accuracy: Including geospatial predictors, such as rainfall, elevation and distance to markets, significantly improves the accuracy of imputed crop cut yields. These variables provide objective data that capture environmental and location-specific factors influencing crop productivity.
- Imputation works best within the same survey round: The imputed crop cut yields are more accurate when we predict the missing data within the same survey round. When we apply the models across survey rounds (e.g., using data from the 2017 survey to predict 2018 yields), the results are less accurate. This suggests that year-to-year variability in crop production, driven by factors such as weather and farming practices, makes it difficult to generalize predictions across seasons.
- Limiting crop cutting to a modest subsample of plots can be sufficient for model training: For most crops, machine learning models generated yield estimates that closely matched those from crop cutting, even when trained on a small subsample of the crop cut data. Conducting crop cutting for at least one-third of the sample, and ideally for 50 percent of the sample, can offer a cost-effective approach while achieving reliable ML predictions of crop cut yields. This has significant implications for reducing costs in future surveys by limiting the need for extensive crop cutting.
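As a rough illustration of the subsample finding in the last bullet, the sketch below simulates plots, "crop cuts" only a share of them, and checks how well yields for the remaining plots can be imputed. It is a toy under stated assumptions: the data are simulated, the only predictor is a biased, noisy self-report, and a one-variable least-squares line stands in for the ML models in the study.

```python
import random


def simulate_plots(n, seed=1):
    """Return made-up (reported_yield, crop_cut_yield) pairs in kg/ha."""
    rng = random.Random(seed)
    plots = []
    for _ in range(n):
        crop_cut = rng.uniform(500.0, 2500.0)
        reported = 1.2 * crop_cut + rng.gauss(0.0, 300.0)  # biased, noisy report
        plots.append((reported, crop_cut))
    return plots


def fit_line(pairs):
    """Ordinary least squares for crop_cut = slope * reported + intercept."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def holdout_error(plots, crop_cut_share, seed=0):
    """Fit on the crop cut share of plots; report imputation error on the rest."""
    shuffled = plots[:]
    random.Random(seed).shuffle(shuffled)
    n_cut = max(2, round(crop_cut_share * len(shuffled)))
    measured, unmeasured = shuffled[:n_cut], shuffled[n_cut:]
    slope, intercept = fit_line(measured)
    errors = [abs(y - (slope * x + intercept)) for x, y in unmeasured]
    return n_cut, sum(errors) / len(errors)


plots = simulate_plots(600)
results = []
for label, share in (("one-third", 1 / 3), ("one-half", 1 / 2)):
    n_cut, mae = holdout_error(plots, share)
    results.append((label, n_cut, mae))
    print(f"{label}: crop cut on {n_cut} plots, MAE on the rest = {mae:.0f} kg/ha")
```

With real survey data, the same loop could be run per crop to see where the imputation error stops improving as the crop cut share grows.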
Figure 1. Crop-cut (red), machine-learning (green) and self-reported (blue) yield means at the national and regional levels in 2017.
Implications for future survey design
Our findings have important implications for the design of agricultural surveys in low-income countries. The ability to predict crop yields using machine learning and data integration can significantly reduce the costs associated with conducting large-scale surveys.
By conducting crop cutting on a modest subsample of plots and imputing the missing data for the rest, policymakers and researchers can still obtain reliable yield statistics while conserving resources.
Finally, this approach can be particularly valuable in areas that are difficult to access, where traditional crop cutting may be impossible. In such contexts, imputation methods using machine learning offer a practical alternative for maintaining data continuity and supporting evidence-based decision-making.
The paper, Yielding Insights: Machine Learning-Driven Imputations to Filling Agricultural Data Gaps, is available to download.