Network analysis is a burgeoning sub-field in development economics as more and more attention is paid to how individual preferences and behaviors are influenced by decisions in the wider community. One example is the 2007 Kremer and Miguel paper that explores the determinants of take-up of deworming medicine by regressing take-up on the number of connections that the household has with other treated households. Network data is the representation of these connections between households or individuals. In network jargon, the study unit is a “node” in the network and the connections between the study units are called “edges”. Typically both the nodes and the edges observed in the data are only a sample of those that actually exist in the population. This is critical to note, as unique inferential problems arise when not all of the existing edges are observed by the researcher.

A very recent working paper by Arun Chandrasekhar and Randall Lewis (it is available in draft form but, as caveated on the cover, should still be considered preliminary) lays out the econometric challenges when analyzing data from sampled networks.

The typical way that networks are measured in surveys is by asking the individual node to list other nodes with which that person interacts. Usually the survey does not have the time and resources to record an exhaustive list of all connections between the individual and other nodes; instead it records a sample of them (e.g. “list up to 5 friends that you discuss life with”, “list up to 4 relatives that you go to market with”, etc.). As Chandrasekhar and Lewis determine through a review of applied network studies, the median percentage of connections sampled is 25% and two-thirds of the studies sample less than 50% of existing network connections. Therein lies the rub – analysis of these partial networks can result in substantial bias in the estimated parameters of interest, depending on the network measure utilized.

Various summary network measures capture the network influence on economic behaviors and outcomes. One is the “Degree” of a node, defined as the number of connections the node has, which makes it a summary measure of connectedness. The Kremer and Miguel paper above regresses household medicine take-up on household Degree (in this case the number of connections between the household and other treated units). “Clustering” is another network measure, defined as the fraction of pairs of a node’s neighbors that are themselves connected to each other. Clustering captures the level of interconnectedness of a group of nodes and has sometimes been taken as a measure of social capital, as in this study by Karlan, Mobius, Rosenblat, and Szeidl.
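To make the two definitions concrete, here is a minimal Python sketch that computes Degree and Clustering for a small undirected network. The four households and their links are hypothetical, and setting Clustering to 0 for nodes with fewer than two neighbors is a convention assumed here rather than something taken from the studies above.

```python
from itertools import combinations

def degree(adj, node):
    """Degree: the number of connections a node has."""
    return len(adj[node])

def clustering(adj, node):
    """Clustering: share of pairs of the node's neighbors that are linked to each other."""
    neighbors = adj[node]
    if len(neighbors) < 2:
        return 0.0  # assumed convention: no neighbor pairs, so report 0
    pairs = list(combinations(sorted(neighbors), 2))
    linked = sum(1 for a, b in pairs if b in adj[a])
    return linked / len(pairs)

# Hypothetical undirected network of four households (illustration only).
adj = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}

for household in sorted(adj):
    print(household, degree(adj, household), round(clustering(adj, household), 3))
```

Household A has three neighbors, but only one of the three possible neighbor pairs (B and C) is linked, so its Clustering is one-third. Household B has two neighbors that are linked to each other, so its Clustering is 1.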

Through numeric simulations over a wide range of scenarios, Chandrasekhar and Lewis find that the biases associated with the Degree and Clustering measures are substantial when working with a 25% network sample, i.e. when only one-quarter of all existing connections are observed by the researchers. In this case the regression coefficient on the Degree measure is biased downward by an average of 58% of the true value, while the Clustering measure coefficient is biased downward by an average of 93%. Of course, as the sampled share of the network increases – as more and more connections are observed by the researchers – the bias diminishes. But even when two-thirds of the network is sampled, the downward biases in the two coefficients are 11% and 33%, respectively.
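A toy simulation gives some intuition for why partial networks distort these measures. The sketch below is not the authors' simulation design and it looks at the measures themselves rather than at regression coefficients. It assumes a simple random graph and assumes that each existing edge is observed independently with probability 0.25, which is cruder than the capped lists used in real surveys. All function names and parameter values are hypothetical.

```python
import random
from itertools import combinations

def random_graph(n, link_prob, rng):
    """Undirected random graph: each pair of nodes is linked independently."""
    adj = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if rng.random() < link_prob:
            adj[i].add(j)
            adj[j].add(i)
    return adj

def sample_edges(adj, keep_prob, rng):
    """Observe each existing edge independently with probability keep_prob (assumption)."""
    sampled = {i: set() for i in adj}
    for i in sorted(adj):
        for j in sorted(adj[i]):
            if i < j and rng.random() < keep_prob:
                sampled[i].add(j)
                sampled[j].add(i)
    return sampled

def mean_degree(adj):
    return sum(len(nb) for nb in adj.values()) / len(adj)

def mean_clustering(adj):
    total = 0.0
    for nb in adj.values():
        if len(nb) < 2:
            continue  # assumed convention: such nodes contribute 0
        pairs = list(combinations(sorted(nb), 2))
        total += sum(1 for a, b in pairs if b in adj[a]) / len(pairs)
    return total / len(adj)

rng = random.Random(0)  # fixed seed so the run is reproducible
full = random_graph(200, 0.10, rng)
observed = sample_edges(full, 0.25, rng)

degree_ratio = mean_degree(observed) / mean_degree(full)
clustering_ratio = mean_clustering(observed) / mean_clustering(full)
print(f"observed/true mean Degree:     {degree_ratio:.2f}")
print(f"observed/true mean Clustering: {clustering_ratio:.2f}")
```

Under these assumptions the observed mean Degree shrinks roughly in proportion to the share of edges kept, and observed Clustering falls as well, since a linked pair of neighbors is only counted when all the relevant edges survive the sampling. How this mismeasurement feeds into regression coefficients is what the paper's simulations quantify.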

Fortunately, the authors develop two strategies to alleviate the bias. The first involves analytic corrections to the more commonly used network statistics. The second strategy is a two-step imputation process of missing network information they term Graphical Reconstruction. In practice this approach is analogous to poverty mapping methods and other methods of missing data imputation. Graphical Reconstruction first estimates a statistical model of network connection on the basis of the observed edges and the observable covariates in the data, and then applies this model to impute all possible connections between all observed nodes. Chandrasekhar and Lewis find that these two proposed correction strategies substantially reduce the biases in the simulation exercises, especially the Graphical Reconstruction method.
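The two-step logic of imputation can be illustrated with a deliberately simple stand-in. The sketch below is not the authors' estimator: in place of a full statistical model of link formation it uses cell means, i.e. the share of surveyed pairs that are linked, computed separately for pairs in the same group and pairs in different groups. The households, the binary `group` covariate, and the `surveyed` pairs are all hypothetical.

```python
from itertools import combinations

# Hypothetical covariate: which of two groups each household belongs to.
group = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1, "F": 1}

# Hypothetical surveyed pairs: 1 = link observed, 0 = no link.
# Pairs missing from this dict were never asked about.
surveyed = {
    ("A", "B"): 1, ("A", "C"): 1, ("D", "E"): 1, ("D", "F"): 0,
    ("A", "D"): 0, ("A", "E"): 1, ("B", "D"): 0, ("B", "E"): 0,
}

def same_group(i, j):
    return group[i] == group[j]

def fit_link_model(surveyed):
    """Step 1: estimate link probabilities from surveyed pairs (cell means).

    Assumes both cells (same group, different group) contain surveyed pairs.
    """
    counts = {True: [0, 0], False: [0, 0]}  # [links, pairs] per cell
    for (i, j), linked in surveyed.items():
        cell = counts[same_group(i, j)]
        cell[0] += linked
        cell[1] += 1
    return {key: links / pairs for key, (links, pairs) in counts.items()}

def imputed_degree(node, surveyed, link_prob):
    """Step 2: observed links plus predicted link probabilities for unsurveyed pairs."""
    total = 0.0
    for i, j in combinations(sorted(group), 2):
        if node not in (i, j):
            continue
        if (i, j) in surveyed:
            total += surveyed[(i, j)]
        else:
            total += link_prob[same_group(i, j)]
    return total

link_prob = fit_link_model(surveyed)
for household in sorted(group):
    print(household, imputed_degree(household, surveyed, link_prob))
```

Household C has only one observed link, but four of its five possible pairs were never surveyed, so its imputed Degree is considerably higher than its observed Degree. A richer model with more covariates would replace the cell means in step 1 while leaving step 2 unchanged.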

The authors conclude with an interesting guideline for optimizing the collection of sampled network data on a fixed budget – a topic of import to anyone collecting network data in the field. This exercise, akin to a more traditional power analysis, attempts to determine the optimal tradeoff between the number of networks sampled and the extent of each network sampled, given the inherent limitations of a fixed budget for data collection. Essentially the approach recommends spending some of the data budget collecting the full networks for a handful of pilot villages and then simulating the outcome data from various combinations of network sampling intensity and the total number of sampled villages and individuals. If time and resources permit, this exercise can be a critical step in ensuring a high quality network study.
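The first step of such an exercise is simply listing which designs the budget can afford. The sketch below does this under a hypothetical linear cost model, in which each village carries a fixed cost plus a cost proportional to the share of its network that is surveyed. The cost figures and the function name are assumptions for illustration, not numbers from the paper.

```python
def feasible_designs(budget, fixed_cost, full_network_cost, shares):
    """For each sampling share, return how many villages the budget can cover.

    Assumed cost model: cost per village = fixed_cost + share * full_network_cost.
    """
    designs = []
    for share in shares:
        cost_per_village = fixed_cost + share * full_network_cost
        villages = int(budget // cost_per_village)
        designs.append((share, villages))
    return designs

# Hypothetical budget and costs, in the same currency units.
designs = feasible_designs(
    budget=100_000,
    fixed_cost=500,
    full_network_cost=2_000,
    shares=[0.25, 0.5, 0.75, 1.0],
)
for share, villages in designs:
    print(f"sample {share:.0%} of each network -> {villages} villages")
```

Each candidate design would then be evaluated by simulating outcomes on the fully observed pilot networks, sampled at the corresponding intensity, to see which combination gives the most reliable estimates.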

Network analysis is a relatively new area of applied econometrics and much more needs to be understood. However, the known pitfalls are potentially severe enough to warrant considering the remedies discussed here.