Over the last three days I have been involved in a series of workshops, which we ran right here in Aarhus. The first day was a public workshop on Digital & Computational Humanities – Tools & Thoughts.
Participants in the Digital Humanities workshop. Photograph courtesy of Lars Pallesen, Communications Officer, Interacting Minds Centre, AU
The central focus of this workshop was the use of databases in the humanities. Pieter Francois (Oxford) gave a general introduction to the topic and discussed two examples. Then Kevin Feeney and Rob Brennan (Trinity College, Dublin) described a new and powerful approach for building and publishing social-science data sets (more on this below). In my talk I urged everyone to think beyond gathering data and visualizing patterns. Much more can be done with the data, such as using them to test theories against each other, so that we can make progress by rejecting some and supporting others. The evolutionary literary scholar Joe Carroll (St. Louis) gave a talk illustrating how a statistical analysis of 200 Victorian novels can allow us to test evolutionary theories about human nature. There were also very good talks about agent-based models and about understanding horror fiction from the point of view of evolutionary theory.
According to the research by Joe Carroll and colleagues, Count Dracula scored the highest on the ‘interest scale’ among all characters of Victorian novels, good and bad (that is, who would you notice immediately, and start paying attention to, if he/she entered the room?)
It was both interesting and great fun. Around 40 people attended, and extra chairs had to be brought in to accommodate everyone. About three-quarters were humanists (according to the “raise-your-hand” poll that I conducted at the beginning of my presentation).
The next two days we met in a small group (just five people). I call this way of working ‘micro-workshops,’ and I find that they are a very efficient way of getting things done. This micro-workshop focused specifically on Seshat: the Global History Databank, about which I have already written in a series of blogs (e.g., An Imperfect Time Machine).
We have already done a lot of work on the database: writing and improving the Code Book, gathering data on a small subset of geographic regions, and building a network of historians and archaeologists, without whose help this project couldn’t possibly succeed.
Currently we are using a wiki (the kind of software on which Wikipedia is based) to collaboratively enter and improve the data. The wiki-based approach makes sense for the initial phase of building the database, because it is extremely flexible. The problem is that it’s impossible to design a perfect code book (the instructions on how to code data) until you have coded a big chunk of data. It’s like in war: no battle plan survives contact with the enemy (according to the sagacious Helmuth von Moltke the Elder). Database-building is just like war: you have to constantly modify your code book as you encounter new challenges from the different societies that you code.
Although this process actually continues indefinitely (or until you are done building the database), most of the changes take place in the early phases. Eventually the Code Book starts to settle down; in other words, revisions occur at increasingly longer intervals.
This is why I insisted that we not try to design a relational database before collecting data. We are using the wiki for now, but once we see that the Code Book is settling down, we will switch to a more formal database approach.
And this is what we have been discussing over the last two days with the database specialists from Trinity College Dublin. Initially I thought that we would move to a relational database. This is the standard approach, and it involves building a collection of tables that store the data.
The problem is that once you’ve designed your database, it’s rather difficult to change it without massive disruption and without recoding large swaths of data. Which is why I wanted the Code Book to mature before we made the move.
Fortunately, as I learned from our Irish colleagues, the state of the art in databases has now moved beyond the relational model. A much more powerful and flexible approach is what is known as RDF (the Resource Description Framework).
RDF is based on making statements in the form of subject-predicate-object triples. This sounds rather abstract, but it is actually quite natural. For example, suppose we want to code the capital of the Byzantine Empire. The Byzantine Empire is the subject (the entity we want to know something about). Capital is the predicate (the property we want to know about it). And Constantinople is the object (the value). So the statement becomes Byzantine Empire-capital-Constantinople, which is a computer representation of the sentence “Constantinople was the capital of the Byzantine Empire.”
The data are stored as such triples, and by stringing triples together it’s possible to describe much more complex datasets than anything that has to be shoehorned into a table.
Or even into a bunch of tables. Of course, as often happens in mathematics, everything you can do with RDF you can also do with relational databases, but it requires more work. In particular, if you want to modify the way you code things, it’s easier (although not costless) to do so within the RDF framework. Which is why we want to implement it only after we have worked out most of the wrinkles by hand.
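For readers who like to see what this looks like in practice, here is a minimal sketch in Python using the rdflib library. The namespace and the entity and property names are my own illustrative choices, not the actual Seshat coding scheme.

```python
# A minimal sketch of the Byzantine Empire example as RDF triples, using rdflib.
# The namespace and property names are illustrative only, not the Seshat schema.
from rdflib import Graph, Namespace, Literal

EX = Namespace("http://example.org/seshat/")  # hypothetical namespace

g = Graph()
g.bind("ex", EX)

# subject - predicate - object
g.add((EX.Byzantine_Empire, EX.capital, EX.Constantinople))
# Further facts are simply more triples strung together:
g.add((EX.Byzantine_Empire, EX.preferred_name, Literal("Byzantine Empire")))
g.add((EX.Constantinople, EX.modern_name, Literal("Istanbul")))

# The graph can be written out in a standard notation such as Turtle...
print(g.serialize(format="turtle"))

# ...and queried with SPARQL: what was the capital of the Byzantine Empire?
results = g.query(
    "SELECT ?capital WHERE { ex:Byzantine_Empire ex:capital ?capital }",
    initNs={"ex": EX},
)
for row in results:
    print(row.capital)
```

The point of the sketch is simply that nothing here forces the data into a fixed set of columns: a new kind of fact is just a new predicate.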
Our ultimate goal is to code all known human history, from the Neolithic period to the present (‘history’ here is understood broadly, and includes traditional history and archaeology, as well as historical climatology, etc.). This is a huge amount of data, and coding it all by hand, using just human intelligence, is prohibitively expensive. So we need to use artificial intelligence.
Despite many promises of how computers would eventually replace human beings, the reality is that computers remain rather dumb. So involving humans is essential, and will remain so for the foreseeable future (perhaps this is not so bad!). However, our primary resource, academic historians and archaeologists, are very busy people. We found experimentally that most will not code data for us if we just ask them to fill in big questionnaires with hundreds of empty boxes. (A few do, and we are extremely appreciative of their help and enthusiasm.) So we need to find ways to employ their scarce labor as efficiently as possible. And that is the main motivation for the scheme we are working towards.
First we send the dumb web-crawlers out to find ‘data candidates.’ For example, we want to know many things about thousands of obscure historical states, such as the Kingdom of Nan Zhao. What is known about their populations: how many people did these states rule? Anybody who has spent hours fruitlessly looking for a particular bit of information using Google, for example, knows how frustrating such an experience can be. So the idea is to automate it.
I call them ‘dumb’, but actually these web-crawlers are becoming quite intelligent and good at zeroing in on particular bits of data. Still, they will provide many ‘false positives’ and we need a human being to filter the good from the bad.
But this human being should not be an expert. We are counting on recruiting dozens of volunteers who will go through the data candidates and reject the inappropriate ones. Let’s call them ‘harvesters.’ A harvester will quickly go through a bunch of data candidates and click ‘no,’ no, no for all the bad ones. When a web-crawler yields a good tidbit, the harvester will accept it. The program will automatically try to fill in the various fields where data need to be entered, and the harvester’s job is to either accept or edit each entry.
At the next stage the data candidate is forwarded to an expert. Most data at this point will be common knowledge among scholars, so all the expert needs to do is approve it. But some proportion will come from a bad source (a harvester may not recognize that a particular book or article was written by an author with an axe to grind, but the expert will know such details). Also, the state of knowledge can change, and what was ‘common knowledge’ twenty years ago may have been replaced by another ‘common knowledge.’ In any case, the important point here is that experts do only those things that nobody else can. One such thing is telling us whether web crawlers and harvesters failed to find any information about some property we want to code because they used the wrong search methods, or because nobody really knows. Only an expert can determine this.
Experts can also become involved at an earlier stage, by telling harvesters which textbooks to use and which to ignore (when harvesting ‘common knowledge’). In addition, we will want to have a collection of articles written by experts, in which they describe particular historical states, provide a general framework tying the various coded variables together, and perhaps discuss the difficulties and uncertainties involved in assigning values to particular variables.
Finally, in addition to web-crawlers, harvesters, and experts, we have what our Irish colleagues call ‘architects.’ These are the people responsible for updating and rewriting the Code Book, the instructions on how to code various characteristics of historical societies. Both harvesters (especially the more experienced ones) and experts can, and are expected to, make suggestions about how to improve the coding scheme. Changing the Code Book has repercussions, so it should not be done lightly. But there are ways of making changes that do not require recoding the data by hand.
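To give a feel for why such changes need not mean hand-recoding: in an RDF store, many Code Book revisions can be expressed as a mechanical rewrite of the affected triples. The sketch below continues the illustrative rdflib example from earlier; the property names are again hypothetical, not the real Code Book.

```python
# Sketch: an 'architect' renames a property in the Code Book.
# Instead of hand-recoding, the existing triples are rewritten mechanically.
# Property names are hypothetical, not the real Seshat Code Book.
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/seshat/")

def rename_predicate(g: Graph, old_pred, new_pred) -> None:
    """Replace every triple using old_pred with an equivalent one using new_pred."""
    for subj, _, obj in list(g.triples((None, old_pred, None))):
        g.remove((subj, old_pred, obj))
        g.add((subj, new_pred, obj))

# e.g. the Code Book replaces 'capital' with the more precise 'administrative_capital'
g = Graph()
g.add((EX.Byzantine_Empire, EX.capital, EX.Constantinople))
rename_predicate(g, EX.capital, EX.administrative_capital)
```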
Here’s an overview of the whole process (it also adds ‘consumers’ although we are at least 2-3 years away from publishing the database):
Image courtesy of Rob Brennan and Kevin Feeney
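For the programmers among my readers, here is one way the lifecycle of a single data candidate in the workflow described above might be represented. This is purely my own sketch; the class, field, and stage names are hypothetical and do not describe the actual system our Irish colleagues are building.

```python
# Purely illustrative sketch of how one 'data candidate' might move through
# the pipeline described above (crawler -> harvester -> expert).
# Names and fields are hypothetical, not the actual implementation.
from __future__ import annotations
from dataclasses import dataclass, field
from enum import Enum, auto

class Stage(Enum):
    FOUND_BY_CRAWLER = auto()       # a web-crawler proposes a data candidate
    REJECTED_BY_HARVESTER = auto()  # a volunteer clicks 'no'
    ACCEPTED_BY_HARVESTER = auto()  # a volunteer accepts/edits the filled-in fields
    APPROVED_BY_EXPERT = auto()     # an expert confirms it (or flags a bad source)
    REJECTED_BY_EXPERT = auto()

@dataclass
class DataCandidate:
    polity: str                     # e.g. "Kingdom of Nan Zhao"
    variable: str                   # e.g. "population"
    value: str                      # the proposed coded value
    source: str                     # where the crawler found it
    stage: Stage = Stage.FOUND_BY_CRAWLER
    notes: list[str] = field(default_factory=list)

    def harvester_accepts(self, edited_value: str | None = None) -> None:
        if edited_value is not None:
            self.value = edited_value
        self.stage = Stage.ACCEPTED_BY_HARVESTER

    def expert_approves(self, note: str = "") -> None:
        if note:
            self.notes.append(note)
        self.stage = Stage.APPROVED_BY_EXPERT
```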
One big advantage of this approach is that it promises to get more and more efficient as data collection advances. As the amount of data increases, we can use it to ‘train’ web-crawlers to become better at finding good data candidates. As web-crawlers become better, we can sic them on texts that are increasingly difficult (for a computer) to read.
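To make the ‘training’ idea a bit more concrete, here is one possible (and deliberately simplistic) way to do it: treat harvester accept/reject decisions as labels for a text classifier that scores new passages. The library choice (scikit-learn) and all the example data are my own illustration, not anything we have actually built.

```python
# Deliberately simple illustration of 'training' the crawler side:
# harvester accept/reject decisions become labels for a text classifier
# that scores new passages as likely data candidates.
# This is a sketch of one possible approach, not an actual Seshat component.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Passages previously shown to harvesters, with their verdicts (toy examples)
passages = [
    "The kingdom's population was estimated at about two million.",
    "The museum gift shop closes at five o'clock.",
    "Census records suggest roughly 500,000 taxpaying households.",
    "Tickets for the exhibition can be bought online.",
]
labels = ["accept", "reject", "accept", "reject"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(passages, labels)

# Score a new passage found by a crawler
print(model.predict(["Chroniclers report that the city held 100,000 inhabitants."]))
```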
There are huge collections of digital articles (e.g., JSTOR) and academic books (e.g., Google Books). They are proprietary, but perhaps we can strike a deal that will allow us to let our web-crawlers loose on these texts and harvest them for data candidates.
It’s important to note that such advances in artificial intelligence are not going to replace expert human beings any time soon (if ever). The point here is not to replace historians, but to make their involvement more efficient. Experts are not necessarily able to remember right away the article they need to answer a question about a particular characteristic of the society they study. You remember that you read something about it, but finding the actual article can be quite frustrating. It’s much better to look through a list of 5-10 candidates to locate the article you were thinking about (or another one that answers the same question). This is where artificial intelligence can be of greatest help: working with a human, rather than against one.
Many scholars in the humanities, including historians, feel threatened by these digital developments. Some are even afraid that they will be replaced by the machines. (This is a general worry; check out the very good book by Erik Brynjolfsson and Andrew McAfee, Race Against the Machine.) Personally, I don’t think that academic historians are in danger of being replaced any time soon. But the machines don’t have to be a threat. We can race with the machines, rather than against them. We should use computers to extend ourselves, make us more productive, and enable us to do things that we wouldn’t be able to do otherwise. Such as building this massive historical database.
===========================================================
OK, to those of my readers who have read through this long and in places technical post: thank you, and I am glad that you share my enthusiasm for ideas that bridge the humanities, social sciences, and computers (which you must, if you survived to the end of this post). Eventually we will be able to achieve goals that a few decades ago seemed like pure science fiction.
Comments are always welcome on my blogs, and this one is no exception! Also, if anybody is interested in participating in Seshat, we are now in a position to accept your kind help!
===========================================================
Correction: Joe Carroll has kindly informed me that I got my characterization of Dracula wrong. In actuality, Count Dracula scores highest on the category ‘Fear of a character.’ Antagonists (bad guys and gals) in general score higher than male protagonists (good guys) on ‘Interest.’ But the highest score on Interest goes to Monsieur Paul Emmanuel, the love interest of the female protagonist in Charlotte Brontë’s Villette. Next highest was Becky Sharp, a sociopathic adventuress in Thackeray’s Vanity Fair. Next, Fitzwilliam Darcy, the female protagonist’s love interest in Pride and Prejudice. Read more about this fascinating research in Graphing Jane Austen.
I am interested in participating in Seshat, depending on time constraints. How much time per week would be necessary?
Hours are completely flexible. Obviously, there is a minimum level below which our investment in training you wouldn’t make sense, but my feeling is that any reasonable degree of involvement — 5-10 hours per week? Or more substantial involvement over limited periods? — would work. Send me a mail at peter dot turchin at uconn dot edu and we can take it from there.
Actually, thinking about it, 2-5 hours a week should also work, if this is extended over many weeks. In any case, much of it is simply reading history books and articles while keeping in mind things we want to code. So you should only agree if you enjoy reading such literature.
MIT has a dataset and supporting utilities called ConceptNet:
http://conceptnet5.media.mit.edu/
which contains content from the following sources in a format similar to an RDF triple:
“To begin with, ConceptNet 5 contains almost all the data from ConceptNet 4, created by contributors to the Open Mind Common Sense project.
Much of our knowledge comes from the English Wikipedia and its contributors, through two sources:
DBPedia extracts knowledge from the infoboxes that appear on articles.
ReVerb is a machine reading project, extracting relational knowledge from the actual text of each article.
We have also parsed a large amount of content from the English Wiktionary, including synonyms, antonyms, translations of concepts into hundreds of languages, and multiple labeled word senses for many English words.
More dictionary-style knowledge comes from WordNet.
Some knowledge about people’s intuitive word associations comes from “games with a purpose”. We learn things in English from the GWAP project’s word game Verbosity, and in Japanese from nadya.jp.”
The dataset is available for download in the hip new JSON format as well as good old-fashioned CSV:
http://conceptnet5.media.mit.edu/downloads/current/
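For anyone who wants a quick look at a dump before committing to a full triple store, a few lines of Python will do. I’m assuming tab-separated rows here, and the file name is made up; check the column layout of the actual release before relying on field positions.

```python
# Quick-and-dirty sketch for peeking at a ConceptNet-style CSV dump.
# Assumes tab-separated rows; field positions vary by release, so verify
# the column layout of the file you actually download.
import csv

def peek(path: str, n: int = 5) -> None:
    """Print the first n rows of a tab-separated dump."""
    with open(path, encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t")
        for i, row in enumerate(reader):
            if i >= n:
                break
            print(row)

# Usage (hypothetical file name):
# peek("conceptnet5_assertions.csv")
```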
I once downloaded the YAGO dataset, which overlaps ConceptNet:
http://www.mpi-inf.mpg.de/yago-naga/yago/downloads.html
and imported it into an HDF5 file:
http://en.wikipedia.org/wiki/Hierarchical_Data_Format
which is a file-based format generally used for high-performance scientific computing (its strength is arrays of integers) but which can also be used for strings or characters. It stores data in a hierarchy not dissimilar to a hierarchical UNIX filesystem.
The file is 45 GB unnormalized.
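For readers who have not met HDF5 before, here is roughly what storing such triples in a hierarchy looks like with the h5py library. The group and dataset names are made up for illustration and are not the layout used for the YAGO import described here.

```python
# Rough illustration of storing string triples in an HDF5 hierarchy with h5py.
# Group/dataset names are made up; not the layout of the actual YAGO import.
import h5py

triples = [
    ("Byzantine_Empire", "capital", "Constantinople"),
    ("Byzantine_Empire", "preferred_name", "Byzantine Empire"),
]

with h5py.File("triples.h5", "w") as f:
    grp = f.create_group("yago").create_group("facts")  # groups nest like directories
    str_dt = h5py.string_dtype(encoding="utf-8")        # variable-length UTF-8 strings
    for i, name in enumerate(["subject", "predicate", "object"]):
        grp.create_dataset(name, data=[t[i] for t in triples], dtype=str_dt)

# Reading it back:
with h5py.File("triples.h5", "r") as f:
    print(f["yago/facts/subject"][0])  # returned as bytes by default in recent h5py
```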
Though there are RDF triple stores (considered part of the “NoSQL,” or non-relational, database fashion), there are other database types that are RDF-ish and could be of interest. Graph databases are one:
http://en.wikipedia.org/wiki/Graph_database
That might encourage usage of Seshat data in a way similar to that once proposed by the British television commentator James Burke for his knowledge web:
http://k-web.org/vision/
http://www.thebrain.com/community/big-thinkers1/james-burke/
Thanks, Lynn. This is quite a lot of info to process, but I am looking forward to delving into it.