Thursday, July 16, 2020

Interview with Melissa Dell: Persistence Across History [feedly]

The new John Bates Clark winner -- a prize nearly as prestigious as the Nobel in economics. The historical data-mining technique she discusses is fascinating: its goal is textual analysis that recovers semantic and sentiment context from recorded history. The potential is huge for many disciplines in the social sciences.

Interview with Melissa Dell: Persistence Across History

https://conversableeconomist.blogspot.com/2020/07/interview-with-melissa-dell-persistence.html

Tyler Cowen interviews Melissa Dell, the most recent winner of the Clark Medal (which "is awarded annually ... to that American economist under the age of forty who is judged to have made the most significant contribution to economic thought and knowledge"). Both audio and a transcript of the one-hour conversation are available. From the overview: 
Melissa joined Tyler to discuss what's behind Vietnam's economic performance, why persistence isn't predictive, the benefits and drawbacks of state capacity, the differing economic legacies of forced labor in Indonesia and Peru, whether people like her should still be called a Rhodes scholar, if SATs are useful, the joys of long-distance running, why higher temps are bad for economic growth, how her grandmother cultivated her curiosity, her next project looking to unlock huge historical datasets, and more.
Here, I'll just mention a couple of broad points that caught my eye. Dell specializes in looking at how conditions at one point in time--say, being in an area that for a time had a strong, centralized tax-collecting government--can have persistent effects on economic outcomes decades or even centuries later. For those skeptical of such effects, Dell argues that explaining, say, 10% of a big difference between two areas is a meaningful feat for social science. She says: 
I was presenting some work that I'd done on Mexico to a group of historians. And I think that historians have a very different approach than economists. They tend to focus in on a very narrow context. They might look at a specific village, and they want to explain a hundred percent of what was going on in that village in that time period. Whereas in this paper, I was looking at the impacts of the Mexican Revolution, a historical conflict, on economic development. And this historian, who had studied it extensively and knows a ton, was saying, "Well, I kind of see what you're saying, and that holds in this case, but what about this exception? And what about that exception?"

And my response was to say my partial R-squared, which is the percent of the variation that this regression explains, is 0.1, which means it's explaining 10 percent of the variation in the data. And I think, you know, that's pretty good because the world's a complex place, so something that explains 10 percent of the variation is potentially a pretty big deal.

But that means there's still 90 percent of the variation that's explained by other things. And obviously, if you go down to the individual level, there's even more variation there in the data to explain. So I think that in these cases where we see even 10 percent of the variation being explained by a historical variable, that's actually really strong persistence. But there's a huge scope for so many things to matter.

I'll say the same thing when I teach an undergrad class about economic growth in history. We talk about the various explanations you can have: geography, different types of institutions, cultural factors. Well, there's places in sub-Saharan Africa that are 40 times poorer than the US. When you have that kind of income differential, there's just a massive amount of variation to explain.

Nathan Nunn's work on slavery and the role that that plays in explaining Africa's long-run underdevelopment — he gets pretty large coefficients, but they still leave a massive amount of difference to be explained by other things as well, because there's such large income differences between poor places in the world and rich places. I think if persistence explains 10 percent of it, that's a case where we see really strong persistence, and of course, there's other cases where we don't see much. So there's plenty of room for everybody's preferred theory of economic development to be important just because the differences are so huge.
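
To make the partial R-squared point concrete, here is a minimal simulated sketch -- my own illustration, not Dell's data or code. The idea is to compare the residual sum of squares from a regression with only control variables against one that also includes the historical variable; the share of the leftover variation that the historical variable absorbs is the partial R-squared.

```python
# Hypothetical illustration of a partial R-squared of roughly 0.1,
# using simulated data; none of the variables correspond to Dell's work.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5_000
controls = rng.normal(size=(n, 3))              # e.g. geography, climate, etc.
historical = rng.normal(size=n)                 # e.g. exposure to a past institution
outcome = controls @ np.array([0.5, -0.3, 0.2]) + 0.33 * historical + rng.normal(size=n)

X_restricted = sm.add_constant(controls)                           # controls only
X_full = sm.add_constant(np.column_stack([controls, historical]))  # controls + history

rss_restricted = sm.OLS(outcome, X_restricted).fit().ssr
rss_full = sm.OLS(outcome, X_full).fit().ssr

# Share of the variation left unexplained by the controls that the
# historical variable accounts for.
partial_r2 = (rss_restricted - rss_full) / rss_restricted
print(round(partial_r2, 2))                     # about 0.1 with these made-up coefficients
```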
Dell also discusses a project to organize historical data, like old newspapers, in ways that will make them available for empirical analysis.  She says: 
I have a couple of broad projects which are, in substance, both about unlocking data on a massive scale to answer questions that we haven't been able to look at before. If you take historical data, whether it be tables or a compendia of biographies or newspapers, and you go and you put those into Amazon Textract or Google Cloud Vision, it will output complete garbage. It's been very specifically geared towards specific things which are like single-column books and just does not do well with digitizing historical data on a large scale. So we've been really investing in methods in computer vision as well as in natural language processing to process the output so that we can take data, historical data, on a large scale. These datasets would be too large to ever digitize by hand. And we can get them into a format that can be used to analyze and answer lots of questions.

One example is historical newspapers. We have about 25 million page scans of front pages and editorial pages from newspapers across thousands and thousands of US communities. Newspapers tend to have a complex structure. They might have seven columns, and then there's headlines, and there's pictures, and there's advertisements and captions. If you just put those into Google Cloud Vision, again, it will read it like a single-column book and give you total garbage. That means that for the entire large literature using historical newspapers, unless it uses something like the New York Times or the Wall Street Journal that has been carefully digitized by a person sitting there and manually drawing boxes around the content, all you have are keywords.

You can see what words appear on the page, but you can't put those words together into sentences or into paragraphs. And that means we can't extract the sentiment. We don't understand how people are talking about things in these communities. We see what they're talking about, what words they use, but not how they're talking about it.

So, by devising methods to automatically extract that data, it gives us a potential to do sentiment analysis, to understand, across different communities in the US, how people are talking about very specific events, whether it be about the Vietnam War, whether it be about the rise of scientific medicine, conspiracy theories — name anything you want, like how are people in local newspapers talking about this? Are they talking about it at all?

We can process the images. What sort of iconic images are appearing? Are they appearing? So I think it can unlock a ton of information about news.
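
As a rough illustration of the kind of pipeline she's describing -- not her team's actual methods -- one could detect the column regions on a page scan, OCR each region separately so the text isn't scrambled across columns, and then run an off-the-shelf sentiment model over the reconstructed article. The file name, box coordinates, and model choice below are all placeholder assumptions.

```python
# Sketch only: layout-aware OCR plus sentiment on a single newspaper scan.
# In practice the column boxes would come from a vision model trained on
# newspaper layouts; here they are hard-coded purely for illustration.
from PIL import Image
import pytesseract
from transformers import pipeline

page = Image.open("front_page_scan.png")        # hypothetical file

# (left, top, right, bottom) pixel boxes for each column, in reading order.
column_boxes = [(0, 300, 600, 4000), (600, 300, 1200, 4000)]

# OCR each column on its own, then stitch the pieces into one article string.
article_text = " ".join(
    pytesseract.image_to_string(page.crop(box)) for box in column_boxes
)

# Generic pretrained sentiment model as a stand-in for a purpose-built one.
sentiment = pipeline("sentiment-analysis")
print(sentiment(article_text[:512]))            # e.g. [{'label': 'NEGATIVE', 'score': 0.98}]
```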

We're also applying these techniques to lots of firm-level and individual-level data from Japan, historically, to understand more about their economic development. We have annual data on like 40,000 Japanese firms and lots of their economic output. This is tables, very different than newspapers, but it's a similar problem of extracting structure from data, working on methods to get all of that out, to look at a variety of questions about long-run development in Japan and how they were able to be so successful. 
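
The table side of that problem is easier to picture: once OCR has produced per-cell strings in the right reading order, the remaining step is assembling them into analyzable firm-year records. A toy sketch, with invented column names and values:

```python
# Toy sketch: turn OCR'd table cells into a tidy firm-level dataset.
# Firm names, years, and figures are invented; a real pipeline would also
# have to handle mis-read digits, merged cells, and changing table layouts.
import pandas as pd

ocr_rows = [
    ["Firm A", "1935", "1,200", "Textiles"],
    ["Firm B", "1935", "850", "Steel"],
]
firms = pd.DataFrame(ocr_rows, columns=["firm", "year", "output", "industry"])
firms["output"] = firms["output"].str.replace(",", "", regex=False).astype(int)
firms["year"] = firms["year"].astype(int)
print(firms)
```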

 -- via my feedly newsfeed
