Great to read this post. I've been worried about this issue since I read the interesting discussion in this post:
With only two states, one cannot separately identify treatment and state effects, as Prof Imbens points out.
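To make the identification problem concrete, here is a minimal sketch with made-up toy data (the numbers and variable names are my own illustration, not from the post). It assumes one treated state and one control state, so the treatment indicator is identical to the state dummy and the design matrix loses a rank.

```python
import numpy as np

# Hypothetical toy setup: two states, treatment assigned at the state level.
# State 1 is treated, state 0 is the control; 5 units in each state.
state = np.repeat([0, 1], 5)
treat = state.copy()  # treatment coincides exactly with the state indicator

# Design matrix with columns: intercept, treatment indicator, state dummy.
X = np.column_stack([np.ones_like(state), treat, state]).astype(float)
rank = np.linalg.matrix_rank(X)
print(X.shape[1], rank)  # three columns, but the rank is lower

# Two different splits of the same gap between "treatment" and "state" effects
# produce identical fitted values, so the data cannot tell them apart.
beta_a = np.array([0.0, 2.0, 0.0])  # all of the gap attributed to treatment
beta_b = np.array([0.0, 0.0, 2.0])  # all of the gap attributed to the state
same_fit = np.allclose(X @ beta_a, X @ beta_b)
print(same_fit)
```

Any estimate of the treatment effect here depends entirely on what is assumed about the state effect, which is the point of the remark above.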
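For more on choosing prior distributions for the state-effect variance parameters, see A. Gelman, "Prior distributions for variance parameters in hierarchical models," Bayesian Analysis, 2006, 1(3), pp. 515-533. If one wants a noninformative prior, that paper recommends a uniform prior on the group-level standard deviation rather than an inverse-gamma prior on the variance. When the number of states is very small (fewer than about five), it suggests a weakly informative half-Cauchy prior instead, because the uniform prior then tends to overstate the state-level variation.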
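As I read that paper, the setup can be written roughly as follows. This is my own transcription in generic notation, with $j$ indexing states and $A$ a scale chosen by the analyst, so treat it as a sketch rather than a quotation.

```latex
\begin{aligned}
y_{ij} &\sim \mathrm{N}(\mu + \alpha_j,\ \sigma_y^2), \qquad
\alpha_j \sim \mathrm{N}(0,\ \sigma_\alpha^2), \qquad j = 1, \dots, J, \\[4pt]
\text{uniform on the sd:} \quad & p(\sigma_\alpha) \propto 1, \qquad 0 < \sigma_\alpha < A, \\[4pt]
\text{half-Cauchy on the sd:} \quad & p(\sigma_\alpha) \propto \left(1 + \frac{\sigma_\alpha^2}{A^2}\right)^{-1}, \qquad \sigma_\alpha > 0 .
\end{aligned}
```

Both priors are placed on the standard deviation $\sigma_\alpha$, not on the variance $\sigma_\alpha^2$.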
I think that the multilevel modeling approach is to model until units are exchangeable. In other words, in Dan Killian's example, one would add levels for both the villages and districts, because villages may not be considered exchangeable without conditioning on district. I think that framing the question of "when to cluster" in terms of exchangeability is useful. This is discussed a lot in both the Gelman-Hill book cited in this post, and Bayesian Data Analysis by Gelman, Carlin, Stern, Dunson, Vehtari, and Rubin.
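A small simulation may help show why the district level matters. This is only a sketch under assumed values (50 districts, 20 villages each, and standard deviations I picked for illustration). It generates village means that share a district effect and compares the overall spread of villages with the spread inside a district.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes and standard deviations, chosen only for illustration.
n_districts, villages_per_district = 50, 20
sigma_district, sigma_village = 2.0, 1.0

# Each village mean = shared district effect + its own village-level deviation.
district_effect = rng.normal(0.0, sigma_district, size=n_districts)
district_id = np.repeat(np.arange(n_districts), villages_per_district)
village_mean = district_effect[district_id] + rng.normal(
    0.0, sigma_village, size=district_id.size
)

# Spread of villages when district membership is ignored.
total_var = village_mean.var(ddof=1)

# Average spread of villages within the same district.
within_var = np.mean(
    [village_mean[district_id == d].var(ddof=1) for d in range(n_districts)]
)
print(total_var, within_var)
```

Villages in the same district sit much closer together than villages drawn from different districts, so knowing the district tells you something about a village. Once the model conditions on district, treating the villages within it as exchangeable becomes much more plausible.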