22 October 2019

data collection

by project team

As with all machine learning, data is critical. Remember, our task at hand is to train a model to generate commentary on a given passage of bible. Our dual hypothesis is that this model will be able to 1) generate useful text that will fit the accepted commentary tradition and 2) inject novelty into our understanding of bible. In a sense, this is a competing optimization task. We want our model to sound like bible commentary and at the same time to introduce new perspectives or approaches to these bible corpora.

To accomplish this competing optimization task of similarity and difference, we are collecting two related but different data sets to use in training our model. Again, thanks to the advances in generic pre-trained language models by labs at Google and OpenAI, we do not have to start from scratch with our model. Instead, we begin with a very capable generic language model in gpt-2. This generic language model can produce bible commentary without any additional training, yet two types of training will help make our model better.

General Knowledge about the Discourse

First, we are collecting as broad a set of general knowledge about bible and the discourse surrounding it as we can find. We can use sources such as Wikipedia entries related to bible, social media posts reflecting on bible, popular and academic publications reflecting on bible, and historical reflections on bible going back to antiquity. Ideally, this broad general knowledge data set would include perspectives from several different regions, cultures, people groups, and traditions. At this stage in the project, we have not had the resources to gain access to many of these generic data sources, but collecting them is a critical part of our project roadmap.

Knowledge Specific to the Task of bible Commentary

Most machine learning models perform best when tailored toward a specific task. Taking a bible passage as input and producing commentary on that passage from a broad background is a particular kind of text generation. So, in addition to tuning gpt-2 to the general discourse of reflections on bible, we are cultivating a data set of structured commentary on bible passages to help gpt-2 learn more about the specific task of producing bible commentary. We have had several discussions as a team about what constitutes commentary on bible as well as where the boundaries are between bible and bible commentary. We hope the insights of this workshop will help us more clearly define the boundaries of this task-specific data set. To keep this early phase of the project very focused, we have chosen to work with the New Testament writing of Revelation and with commentary on it from traditional Christian sources that are freely available online and written in English. None of these initial narrow filters need remain for later stages of the project. We chose to focus on Revelation because it can itself be seen as a kind of commentary on much of the bible corpus, and because its language lends itself to the creative narratives sometimes generated by early versions of trained language models.
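To make the idea of a structured commentary data set more concrete, here is a minimal sketch of how passage and commentary pairs could be flattened into plain training text. It is an illustration under stated assumptions, not a description of the project's actual pipeline: the `CommentaryPair` class, the `Passage (...)` and `Commentary:` labels, and the helper names are all hypothetical. The only detail taken from gpt-2 itself is the `<|endoftext|>` token, which is part of its vocabulary and is commonly used to separate documents.

```python
from dataclasses import dataclass
from typing import Iterable

# gpt-2's vocabulary contains this token; here it marks the end of one example.
END_OF_TEXT = "<|endoftext|>"


@dataclass
class CommentaryPair:
    """Hypothetical record: one passage and one commentary entry on it."""
    reference: str   # e.g. "Revelation 1:1"
    passage: str     # text of the passage
    commentary: str  # commentary text on that passage


def format_example(pair: CommentaryPair) -> str:
    """Render one pair as a labeled block (the labels are an assumption)."""
    return (
        f"Passage ({pair.reference}): {pair.passage.strip()}\n"
        f"Commentary: {pair.commentary.strip()}\n"
        f"{END_OF_TEXT}\n"
    )


def build_training_text(pairs: Iterable[CommentaryPair]) -> str:
    """Join all complete pairs, skipping any with an empty passage or commentary."""
    return "".join(
        format_example(p)
        for p in pairs
        if p.passage.strip() and p.commentary.strip()
    )


if __name__ == "__main__":
    # Placeholder strings only; no real commentary text is reproduced here.
    sample = [
        CommentaryPair("Revelation 1:1", "<passage text>", "<commentary text>"),
        CommentaryPair("Revelation 1:2", "<passage text>", ""),  # skipped: no commentary
    ]
    print(build_training_text(sample))
```

Putting the passage first and the commentary second means that, at generation time, a prompt ending in `Commentary:` asks the model to continue with commentary on the passage above it. Other layouts would work equally well; the point is only that the pairing is explicit and consistent.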

Our initial task-specific data comes from the SWORD project of The Crosswire Bible Society. Their list of English commentaries consists mostly of older commentaries that are now in the public domain. The United Bible Societies have also granted us access to their Translator’s Handbooks, which provide highly specific commentary on bible passages for those translating these texts around the globe. We have not yet been able to process these handbooks for incorporation into the model.

We are distinctly aware of the limitations caused by our data collection decisions at this stage of the process. Finding openly available and machine-readable corpora is always a challenge, particularly in a discourse that has been historically dominated by institutional structures. We are open to any suggestions you might have for data sources we could use, particularly sources that we can easily convert to machine-readable text.
