# Sampled social networks and household surveys

People learn from others. Describing this process of learning is of interest to economists in answering many policy questions: How can new health or agricultural technologies be optimally disseminated? How can governments promote prosocial behaviors? More broadly, it is hard to think of a question in economics that has nothing to do with interactions (including learning) in social networks (recent applications to development economics are reviewed here). To analyze social networks, economists model them as a set of relationships between pairs of individuals, where connected pairs interact more closely than unconnected pairs. For example, in the figure below (from this review paper), individuals ("nodes") are drawn as circles, and a line (an "edge") joins each pair of individuals who are connected; pairs with no line between them are not connected.

For concreteness, we’ll work with an example where we’re studying communities of farmers, and we’re interested in understanding why some communities have widely adopted a recently introduced agricultural technology, and other communities have not. One might imagine that in communities with more social connections between farmers, information about the new technology spreads more quickly, and farmers are more likely to adopt the technology. We can then ask — are farmers in communities that are more socially connected more likely to adopt the technology?

In practice, measuring these social connections is a hard problem — in a community with 100 farmers, there are 4,950 possible connections between pairs of farmers that we would need to capture! Capturing all these possible connections is an intense demand on the surveys that are commonly used to study households in development economics. As a result, development economists studying networks resort to using sampled network data — only a fraction of the large number of possible connections may be captured from a survey (tools for the analysis of sampled network data are discussed here). The figure below (from this paper) presents a stylized representation of the two most common forms of sampled network data used in development — "induced subgraphs" (where all connections between pairs of survey respondents are measured) and "star subgraphs" (where all connections, where at least one member of the pair is a survey respondent, are measured).
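To make the two sampling schemes concrete, here is a minimal Python sketch using NumPy. The simulated community (100 farmers, each pair linked independently with probability 0.05), the 30-farmer sample, and the helper names `possible_pairs`, `induced_subgraph`, and `star_subgraph` are all hypothetical choices made for illustration, not taken from the papers cited above.

```python
import numpy as np

def possible_pairs(n):
    """Number of unordered pairs among n individuals: n * (n - 1) / 2."""
    return n * (n - 1) // 2

def induced_subgraph(adj, sampled):
    """Keep only links where BOTH members of the pair were surveyed."""
    mask = np.zeros(adj.shape[0], dtype=bool)
    mask[sampled] = True
    return adj * np.outer(mask, mask)

def star_subgraph(adj, sampled):
    """Keep links where AT LEAST ONE member of the pair was surveyed."""
    mask = np.zeros(adj.shape[0], dtype=bool)
    mask[sampled] = True
    return adj * (mask[:, None] | mask[None, :])

rng = np.random.default_rng(0)
n = 100

# Hypothetical community: each pair is linked independently with probability 0.05
upper = np.triu(rng.random((n, n)) < 0.05, k=1)
adj = (upper | upper.T).astype(int)  # symmetric adjacency matrix, no self-links

# Hypothetical survey: 30 of the 100 farmers are respondents
sampled = rng.choice(n, size=30, replace=False)
induced = induced_subgraph(adj, sampled)
star = star_subgraph(adj, sampled)

print("possible pairs:", possible_pairs(n))
print("links in full network:", adj.sum() // 2)
print("links seen in star subgraph:", star.sum() // 2)
print("links seen in induced subgraph:", induced.sum() // 2)
```

Every link in the induced subgraph also appears in the star subgraph, and both miss links between pairs of farmers who were not surveyed, which is the source of the measurement problem.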

In our example, relying on sampled network data introduces a problem: the connectedness of each community’s social network of farmers is now measured with error, because the surveys we rely on only capture some of the potential connections between farmers. This introduces attenuation bias when we try to estimate whether the technology spreads more quickly in communities that are more connected, because our right-hand-side variable (connectedness) is measured with error. AHHHH!?!?! Good news: a number of solutions have been developed to remove this bias when working with sampled social network data!

## Survey everyone!

First, one can avoid the problem altogether, and simply measure every possible social connection within the community! A recent paper by Beaman et al (2018) does exactly this — in order to study the diffusion of a new agricultural practice, they conducted a full census of 200 communities, and surveyed each household on their social connections within the community. This was particularly important in their context, as they were interested in studying whether households needed multiple connections who could introduce the practice to them in order to adopt the practice.

## Don’t worry

Sometimes social network information is needed, but not the full structure of the social network. A recent paper by BenYishay et al (2020) explores how perceptions of women farmers and, more broadly, of women in leadership roles, affect how effective women are at promoting a new agricultural practice. This requires measuring several characteristics of the social network, including how farmers perceive women’s agricultural competence and how those perceptions change when a woman takes on a leadership role. Estimating these does not require full network data: it is sufficient to elicit farmers’ perceptions of a representative sample of female farmers, or simply their perceptions of the specific woman who takes on the leadership role.

## Empirical Bayes

Alternatively, there are simple statistical approaches to correct for measurement error in a regression, including for some characteristics of networks.

The general idea is as follows: we care about a variable \( x_{i} \) (e.g., connectedness of the network), but instead observe that variable measured with error, \( x_{*i} \) (e.g., connectedness of a subset of the network). We are interested in how an outcome \( y_{i} \) varies with \( x_{i} \) (e.g., how technology adoption varies with the connectedness of the network).

\[ \begin{array}{c} y_{i} = \beta x_{i} + \epsilon_{i} \\ x_{*i} = x_{i} + \eta_{i} \\ \epsilon_{i} \perp x_{i}, \eta_{i} \perp (x_{i}, \epsilon_{i}) \end{array} \]

In this case, a regression of \( y_{i} \) on \( x_{i} \) is infeasible, and using \( x_{*i} \) instead yields a biased estimate (because \( x_{*i} \) is measured with error). To solve this, we can take an Empirical Bayes approach and instead use what we expect the true value \( x_{i} \) to be given our noisy observed value \( x_{*i} \).

\[ \beta = \text{Cov}(y_{i}, \mathbf{E}[x_{i} | x_{*i}]) / \text{Var}(\mathbf{E}[x_{i} | x_{*i}]) \]
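As a minimal simulation sketch of the formula above (not code from any of the papers cited), the Python snippet below assumes that \( x_{i} \) and \( \eta_{i} \) are both normal with known variances, so that \( \mathbf{E}[x_{i} | x_{*i}] \) is a linear shrinkage of \( x_{*i} \) toward its mean. The sample size, the variances, and the value of \( \beta \) are all hypothetical, and in a real application the shrinkage factor would have to be estimated rather than assumed known.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
beta = 2.0                      # hypothetical true coefficient
sigma_x, sigma_eta = 1.0, 1.0   # hypothetical std. dev. of signal and of noise

x = rng.normal(0.0, sigma_x, n)               # true connectedness (unobserved)
x_star = x + rng.normal(0.0, sigma_eta, n)    # connectedness measured with error
y = beta * x + rng.normal(0.0, 1.0, n)        # outcome

def slope(outcome, regressor):
    """Regression slope: Cov(outcome, regressor) / Var(regressor)."""
    c = np.cov(outcome, regressor)
    return c[0, 1] / c[1, 1]

# Naive regression of y on the noisy measure: attenuated toward zero
naive = slope(y, x_star)

# Assumption: under joint normality, E[x | x*] = mean + lam * (x* - mean),
# with lam = Var(x) / (Var(x) + Var(eta)) treated as known here
lam = sigma_x**2 / (sigma_x**2 + sigma_eta**2)
x_hat = x_star.mean() + lam * (x_star - x_star.mean())

# Regression of y on E[x | x*]
corrected = slope(y, x_hat)

print("naive slope:", naive)
print("corrected slope:", corrected)
```

In this setup the naive slope should land near half of the true \( \beta \) (the signal and noise variances are equal), while the slope on the shrunken regressor should land near \( \beta \) itself.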

Chandrasekhar & Lewis (2016), in addition to providing a general discussion of working with sampled network data, provide formulas for the measurement error in some network statistics (e.g., average number of connections, clustering, estimated peer effects). These typically involve two-step estimators, where a first step characterizes the distribution of the measurement error and a second step adjusts the right-hand side given the estimated characteristics of that distribution. Two nice recent extensions of this approach are Griffith (2020) (who applies it to estimating peer effects) and Hardy et al. (2020) (who apply it to estimating spillovers from treatment).

## Bayes

Bayesian approaches offer an alternative way to capture more complicated network characteristics. However, they require the researcher to specify prior beliefs on the full set of parameters in their model, including those governing how the network is formed. This is potentially challenging both conceptually and computationally; however, a pair of papers (Breza et al, 2019a; Breza et al, 2019b) propose feasible Bayesian approaches that realistically model a number of features of real-life networks, including heterogeneity in clustering and in the propensity to form connections across individuals with different characteristics.

Interestingly, these approaches can be implemented with a reduced version of a sampled network: aggregated relational data (ARD). With ARD, rather than needing the names of all the individuals a given person is connected to, we only need the number of individuals with certain characteristics that a given person is connected to. For example, a survey collecting ARD would ask a series of questions like "How many people do you know who own a bicycle?" and "How many people do you know who have been arrested?".

Breza et al further highlight that in many contexts, ARD can be meaningfully cheaper to collect than traditional social network data.
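To illustrate what ARD responses contain, the Python sketch below computes them from a simulated full network, which is of course exactly what a real ARD survey avoids collecting. The network, the two traits (borrowed from the example questions above), their prevalence, and the number of respondents are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50

# Hypothetical full network: each pair linked independently with probability 0.1
upper = np.triu(rng.random((n, n)) < 0.1, k=1)
adj = (upper | upper.T).astype(int)

# Hypothetical binary traits for every individual in the community
traits = {
    "owns_bicycle": rng.random(n) < 0.4,
    "been_arrested": rng.random(n) < 0.1,
}
trait_matrix = np.column_stack(list(traits.values())).astype(int)  # n x 2

# ard[i, k] = number of i's connections who have trait k,
# i.e. the answer to "How many people do you know who ...?"
ard = adj @ trait_matrix

# Only surveyed respondents report their counts
respondents = rng.choice(n, size=15, replace=False)
ard_sample = ard[respondents]

print(list(traits.keys()))
print(ard_sample[:5])
```

Each row of `ard_sample` holds one respondent's counts and reveals nothing about which specific individuals that respondent is connected to.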

## Regularization

Alternatively, one can approach this problem of making use of a sampled social network by trying to predict the missing links in the network that we don’t observe using the links that we do observe. This problem can be seen as high dimensional: for each connection between two individuals, we have a large number of possible predictors (all of the connections of each individual) that we could use to predict whether those two individuals are connected.

As with most problems that involve a large number of potential controls, machine learning solutions are possible! A recent paper (Alidaee et al, 2020), building on Negahban and Wainwright (2011), proposes machine learning tools to estimate a model of social network formation.

The key idea is to reduce the dimensionality of the problem. The tools are similar to those used for the Netflix problem: just as we can think about Netflix users as having a small number of unobserved characteristics that govern their tastes for movies (e.g., action fans and comedy fans), we can think about individuals as having a small number of unobserved characteristics that govern which other individuals they link with (e.g., football fans and running nerds).
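The dimension-reduction intuition can be sketched in a few lines of Python. This is not the estimator from Alidaee et al (2020): it uses a plain truncated singular value decomposition as a simpler stand-in, on a fully observed simulated network, and assumes the number of latent traits is known. The latent-trait model, the network size, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 200, 2   # hypothetical: 200 individuals, 2 unobserved traits

# Hypothetical latent types: each person is a mix of two unobserved traits
u = rng.dirichlet([0.5, 0.5], size=n)   # n x 2, each row sums to 1
p = 0.5 * (u @ u.T)                     # rank-2 matrix of link probabilities

# Draw one symmetric network from those probabilities (no self-links)
upper = np.triu(rng.random((n, n)) < p, k=1)
adj = (upper | upper.T).astype(float)

# Low-rank step: keep only the k largest singular values of the observed network
left, sing, right_t = np.linalg.svd(adj)
p_hat = (left[:, :k] * sing[:k]) @ right_t[:k, :]

# Compare how far the raw links and the low-rank fit are from the true probabilities
off_diag = ~np.eye(n, dtype=bool)
err_raw = np.sqrt(np.mean((adj - p)[off_diag] ** 2))
err_low_rank = np.sqrt(np.mean((p_hat - p)[off_diag] ** 2))

print("RMSE of raw links:", err_raw)
print("RMSE of rank-2 fit:", err_low_rank)
```

Because the true link probabilities depend on only two latent traits, discarding all but two singular values strips out much of the link-level noise, so the low-rank fit should sit closer to the true probabilities than the raw zero-one links do.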

Finally, Alidaee et al (2020) demonstrate that these tools can be used on the same ARD that Bayesian approaches can take advantage of, rather than needing to observe the sampled network directly.

## Victory lap

With the development of these new tools, sophisticated analyses that were once only possible with a complete mapping of a social network (very expensive! difficult to collect!) are now possible with only a sample of the network, or even with a representation that contains no link-level network data (i.e., ARD)! This makes it possible for development economists running household surveys on tight budgets to answer questions requiring social network data.
