Tools of the Trade: Intra-cluster correlations

David McKenzie

In clustered randomized experiments, random assignment occurs at the group level, with multiple units observed within each group. For example, education interventions might be assigned at the school level, with outcomes measured at the student level, or microfinance interventions might be assigned at the savings group level, with outcomes measured for individual clients.

A key parameter in these experiments is the intracluster correlation, which measures the proportion of the overall variance in the outcome that is explained by between-group variance. Consider, for example, a sample of 2000 individuals, divided into 100 groups of 20 each (e.g. 100 classes of 20 students each). When the intracluster correlation is 0, individuals within a class are no more similar than individuals in different classes, and it is as if you had assigned 2000 individuals to treatment or control. When the intracluster correlation is 1, everyone within a class acts the same, and so you effectively have only 100 independent observations. This graph (made in Optimal Design) shows how the power of a study to detect a treatment effect of 0.2 standard deviations (delta = 0.2) and of 0.5 standard deviations (delta = 0.5) varies as a function of the intracluster correlation. You can see that for the smaller effect size, power falls dramatically as the intracluster correlation increases.
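
To make the link between the intracluster correlation and power concrete, here is a rough Stata sketch for the 100-classes-of-20 example. It assumes the standard design effect 1 + (n - 1)*rho, half of the clusters assigned to treatment, and a normal approximation to power at the 5% level. These are simplifying assumptions rather than the exact calculation Optimal Design performs, so the numbers will not match the graph exactly. The list of rho values is arbitrary.

```stata
* Hypothetical design from the example: 100 clusters of 20, equal allocation
local J = 100
local n = 20
local N = `J' * `n'
* Effect size in standard deviations (0.2 is the smaller effect in the text)
local delta = 0.2

foreach rho in 0 0.05 0.13 0.25 0.5 1 {
    * Variance inflation from clustering
    local deff = 1 + (`n' - 1) * `rho'
    * Effective number of independent observations
    local neff = `N' / `deff'
    * SE of the standardized difference in means under equal allocation
    local se = sqrt(4 * `deff' / `N')
    * Normal approximation to the power of a two-sided 5% test
    local power = normal(`delta' / `se' - invnormal(0.975))
    display "rho = " %4.2f `rho' "   effective N = " %6.0f `neff' "   approx. power = " %5.3f `power'
}
```

At rho = 0 the effective sample size is the full 2000, and at rho = 1 it collapses to the 100 classes, which is the contrast described above.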

As a result, when doing power calculations for cluster randomized trials, it is important to know what the intracluster correlation is likely to be for your study. Most of my experiments have been randomized at the individual level, and in the few that I have done at a group level I haven't had any baseline data available at the time of doing power calculations, so I have typically had to rely on estimates of this correlation from other studies. However, I am currently planning a financial literacy study in which we have the individual savings balances of microfinance group members, and so for once have the opportunity to actually calculate it. I realized I had forgotten how to do this in Stata, but luckily it is very simple. Just use the loneway command. Here is an example, showing my intracluster correlation is 0.13:

. loneway savings group

          One-way Analysis of Variance for savings: Savings

                                      Number of obs =      3535
                                          R-squared =    0.1796

    Source              SS        df       MS          F    Prob > F

    Between group    77952412     194   401816.56     3.77    0.0000
    Within group    3.562e+08    3340   106635.14

    Total           4.341e+08    3534   122839.21

    Intraclass      Asy.
    correlation     S.E.      [95% Conf. Interval]
      0.13258     0.01713      0.09901     0.16616

    Estimated SD of group effect        127.6682
    Estimated SD within group           326.5504
    Est. reliability of a group mean     0.73462
        (evaluated at n=18.11)
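
For readers who want to see where the 0.13 comes from, the intraclass correlation can be recovered from the mean squares in the output above. This is a minimal sketch, assuming the usual one-way ANOVA estimator in which the between-group variance is (MSB - MSW) / n, with n the adjusted cluster size that loneway reports as 18.11. Because the inputs are rounded, the results should agree with the output only to about four digits.

```stata
* Values copied from the loneway output above
local msb = 401816.56
local msw = 106635.14
local n0  = 18.11

* Between-group variance component and the implied intraclass correlation
local var_b = (`msb' - `msw') / `n0'
local sd_b  = sqrt(`var_b')
local sd_w  = sqrt(`msw')
local rho   = `var_b' / (`var_b' + `msw')

display "Estimated SD of group effect = " %9.4f `sd_b'
display "Estimated SD within group    = " %9.4f `sd_w'
display "Intraclass correlation       = " %7.5f `rho'
```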

As a result, with 20 individuals per microfinance group, standard errors will be approximately 1.86 times as large as they would be under individual-level randomization (see equation 11 on page 3922 of the Duflo et al. randomization toolkit for the formula), which in our case should still leave sufficient power to detect the effect sizes we are interested in.
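
The 1.86 figure is easy to check by hand. A minimal sketch, assuming the design-effect formula D = sqrt(1 + (n - 1)*rho), with rho rounded to 0.13 and n = 20 as in the text:

```stata
* Design effect on standard errors with the rounded values used in the text
local rho = 0.13
local n   = 20
local D = sqrt(1 + (`n' - 1) * `rho')
display "D = " %5.3f `D'

* Same formula with the unrounded estimate and adjusted cluster size
* from the loneway output (0.13258 and 18.11)
local D2 = sqrt(1 + (18.11 - 1) * 0.13258)
display "D (loneway values) = " %5.3f `D2'

* With the data in memory, loneway leaves the estimate in r(rho), so the
* same calculation could be run as:
*   loneway savings group
*   display sqrt(1 + (20 - 1) * r(rho))
```

The first line of output should round to 1.86. The commented lines use the variable names from the example above and need the underlying data to run.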

This should at least remind me of this command next time I forget it. Let us know if there are any other practical “how to do this in Stata?” questions you might have.


Submitted by Anonymous on
G*Power is also a nice source for power calculations, as are the Stata add-in commands rdpower and pwploti (sampsi is built in).

Submitted by Michele V on
I am wondering about the following, because I have never looked at intra-cluster correlation before. Econometric practice recommends clustering the s.e. at the group level whenever my explanatory variable of interest varies only at the group level, or when I suspect that some relevant unobserved factor may be correlated at that level. Does the intra-cluster correlation still matter once I cluster the s.e. in my regression specification? Thanks

Submitted by Anne on
One econometric issue that I have been thinking about is correction for multiple inference. Can you provide guidance and assist with code for Bonferroni adjustments or other corrections when you are doing a large number of hypothesis tests? It seems for regressions of primary importance these techniques are too conservative, but it is becoming a larger issue in terms of data mining.

Submitted by Sean on
I was wondering if you had any guidance as to why there is a discrepancy between the design effect reported directly by Stata's svy commands and the one calculated from the intracluster correlation coefficient from loneway and the adjusted mean cluster size. I provide an example below, borrowed from an earlier (and unfortunately unanswered) Statalist question, to illustrate the problem.

webuse auto7
-svyset,psu(manufacturer_grp)-
-svymean mpg- reports the deff as 1.836013
-loneway mpg manufacturer_grp- reports the intraclass correlation as 0.36827

A formula to obtain the intraclass correlation coefficient from loneway is
rho = (MSB - MSW) / (MSB + (n_o - 1) MSW) = 0.36827
where
n_o = 1/(k-1) [N - sum_i (n_i^2)/N] = 3.1584767
k = number of groups = 23
N = total sample size = 74

Variance Inflation Factor = 1 + (n_o - 1) rho = 1 + (3.1584767 - 1) * 0.36827 = 1.7949

The 1.7949 derived from -loneway- does not equal the value of 1.836013 that is reported for Deff from -svymean-.

I have no clue what is driving the difference, but given how close the numbers are and issues in degrees of freedom adjustments, it seems possible that one is using (n-1) while the other uses n, or (k-1) while the other uses k, etc.
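
For anyone who wants to retrace the comparison, the two numbers can be put side by side along the following lines. This is only a sketch: it assumes the auto7 dataset has no missing values in mpg or manufacturer_grp, it uses svy: mean and estat effects (the current-syntax counterparts of the older svymean command quoted above), and it does not explain the gap between the two figures.

```stata
webuse auto7, clear

* Intraclass correlation from loneway (stored in r(rho))
loneway mpg manufacturer_grp
local rho = r(rho)

* Adjusted mean cluster size: n_o = [N - sum_i(n_i^2)/N] / (k - 1)
bysort manufacturer_grp: generate n_i = _N
by manufacturer_grp: generate first = (_n == 1)
generate n_i_sq = n_i^2 if first
quietly summarize n_i_sq
local k      = r(N)
local sum_sq = r(sum)
quietly count
local N   = r(N)
local n_o = (`N' - `sum_sq' / `N') / (`k' - 1)

* Variance inflation factor implied by rho and n_o
local vif = 1 + (`n_o' - 1) * `rho'
display "n_o = " %9.7f `n_o' "   VIF = " %6.4f `vif'

* Design effect as reported by the survey commands
svyset manufacturer_grp
svy: mean mpg
estat effects
```

If the inputs match those in the comment, the displayed n_o and VIF should reproduce 3.1584767 and 1.7949, and estat effects reports the DEFF to compare against.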

Submitted by elisabet on

Thank you David, I'd been spending some time searching for a Stata command to compute the ICC.

The formula from the Duflo toolkit you mentioned is an interesting feature as well. However, I don't understand what the 'P' stands for in the formula for the hypothetical standard error. What value did you use to compute the 1.86?

I'm running some regressions with PISA data, so this "proportion of the sample with treatment", as she calls it, is a bit vague as to what it would be in my case.


Rho is the intra-cluster correlation, and their formula for the design effect is: D = sqrt(1+(n-1)*rho)
In my case, D = sqrt(1+(20-1)*0.13) = 1.86
