The types of data available to development economists are proliferating. Multi-topic household surveys are almost passé today, but 25 years ago it was a rare privilege to be able to correlate economic measures of the household with other indicators such as health or community infrastructure. Not only are surveys more sophisticated (and arguably contain less error thanks to the use of field-based computers), but the digital revolution has multiplied the types of data at our disposal. For example, the Billion Prices Project at MIT collects high-frequency online retail price data from all over the world. Another data type that promises new insights is text-based data: the utility of text now extends beyond mere reading, since text can also be analyzed with ease.
One of the first examples of text-based analysis by an economist is this working paper by Martin Ravallion, who uses the Google Books Ngram Viewer to explore the evolution of public awareness of, and attention to, poverty issues. The Ngram Viewer can search for the frequency of words or phrases in over 5.2 million published books stretching from the 16th century to the present. Martin identifies two historical epochs of poverty awareness (we are in the midst of the second, which began around 1960), and he finds that only in this second epoch have writers begun to imagine and discuss the elimination of poverty.
Another digital benefit is that it is now much easier to calculate how difficult a text is to comprehend. The Flesch-Kincaid Grade Level is one standard comprehension measure that assesses the difficulty of a text, scaled to U.S. educational grade level norms. The Flesch-Kincaid measure is essentially a linear transformation of two textual indicators that convey complexity: the ratio of total words to total sentences and the ratio of total syllables to total words. Specifically, the grade level is 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59.
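For the curious, the measure is simple enough to compute yourself. Here is a minimal Python sketch using only the standard library; the syllable counter is a crude vowel-group heuristic (real implementations use pronunciation dictionaries), so treat the output as approximate.

```python
import re

def count_syllables(word: str) -> int:
    """Crude syllable estimate: count groups of consecutive vowels,
    with a small adjustment for a trailing silent 'e'. A heuristic,
    not a dictionary lookup."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and count > 1 and not word.endswith(("le", "ee")):
        count -= 1
    return max(count, 1)

def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid Grade Level:
    0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59"""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)
```

A short sentence of one-syllable words scores near (or below) zero, while long sentences packed with polysyllabic words push the grade level up, which is exactly the pattern the figures below track over time.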
How can this measure be used to help understand or inform economic development? Well … I don’t yet exactly know! But to start, I imagine it can be useful for studies of political economy. To give one simple example, I’ve constructed pictures of the change in the complexity of language in the U.S. President’s State of the Union address over the period 1934–2011. This is a major yearly political address in which the U.S. president lays out his administrative and legislative goals for the next 12 months.
Here is the Flesch-Kincaid Grade Level measure:
The black line is a fitted polynomial trend line.
And here is a related measure, the average number of words per sentence:
Again the black line is a fitted polynomial trend line.
Both measures clearly indicate a decline in the complexity of political speech (the underlying data for these figures can be downloaded here). This decline presumably reflects changes in the beliefs of political advisors and speechwriters over what constitutes effective political communication. I’ll leave it to readers to speculate on the causes of such a change, although I’ll note a profound transformation in the medium of communication over the period when these indices were falling: television use increased from 9% of all US households in 1950 to 90% by 1962, reaching near-universal ownership by 1978.
The figures convey that sentence structure in the State of the Union has been simplified (most likely, I’d imagine, through the abandonment of clauses and qualifiers) and that words with fewer syllables are used more regularly. Assuming that these speeches are representative of the complexity of general political speech, does this decline reflect changes in the quality of policy or merely changes in the form of rhetoric? Again, I don’t have an answer.
I worry, however, that this decline hinders the ability to respond to contemporary policy challenges. For example, can the complexity and nuance of what would constitute effective economic policy during a period of widespread debt deleveraging be adequately conveyed through relatively simple speech? Is the public well served by the increasing simplicity of political speech?
Here is yet one more tool of text analysis: a computerized program that determines the propositional idea density (P-density) of a text. P-density is essentially a measure of how much information is conveyed relative to the number of words used to express it. A high score reflects efficient expression, while a low score likely reflects vague and repetitive language. If education is meant to improve reasoning and produce efficient expression, perhaps P-density will be a useful outcome measure for studies of education quality (as may the other measures above).
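To make the idea concrete, here is a very rough Python sketch of the intuition behind the measure. Real P-density tools tag parts of speech and count verbs, adjectives, adverbs, prepositions, and conjunctions as propositions; the proxy below merely checks words against a small hand-picked list of prepositions and conjunctions plus an "-ly" adverb heuristic, so it is an illustration of the ratio being measured, not a substitute for the actual programs.

```python
import re

# A small, hand-picked set of closed-class "propositional" words
# (prepositions and conjunctions). Illustrative only, far from complete.
PROPOSITIONAL = {
    "in", "on", "at", "by", "with", "from", "to", "of", "for",
    "and", "but", "or", "because", "although", "if", "while",
}

def p_density_proxy(text: str) -> float:
    """Crude idea-density proxy: 'propositional' words per 10 words.
    The '-ly' check is a rough adverb heuristic and will produce
    false positives (e.g. 'only'); proper tools use POS tagging."""
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
    if not words:
        return 0.0
    props = sum(1 for w in words if w in PROPOSITIONAL or w.endswith("ly"))
    return 10.0 * props / len(words)
```

Repetitive filler adds words without adding propositions, which drives a score like this down; dense, connective-rich writing drives it up.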
Of course there are plenty of challenges with this data type. One of the foremost is that these tools are well developed for English-language texts but not yet for the dominant languages of many developing countries. However, the state of tools and analysis today most likely just scratches the surface of what is to come.