Topic Model

Here we discuss topic modeling as a potential example of a thinking machine. To facilitate this portion of our presentation, we interweave snippets of code, a data visualization, and discussion. Readers uninterested in the code blocks may skip over them without losing the overall point of this section (code blocks appear in purple). On the whole, we avoid technical details about the process of creating and using a topic model in favor of higher-level and intuitive explanations. Those interested in the technical details are free to visit our GitHub repository.

What is a Topic Model?

A topic model is an unsupervised machine learning technique in which the machine finds patterns of word co-occurrence across a corpus of documents. (There are also supervised machine learning techniques, in which we would give the machine guidelines about what categories to create, and the machine would organize the words in the texts into those categories.) These patterns of word co-occurrence are then grouped together into “topics.” (Technically, the model assigns a probability that a given word belongs to a given topic.) For example, if we were to take a number of blogs about pets, the topic model would notice that certain words tend to appear together across the blogs, such as “cat,” “feline,” “litter box,” “claws,” and “meow.” The topic model may also notice another group of words appearing together across the blogs, such as “dog,” “canine,” “leash,” and “bark.” The topic model would identify the first group of words as “Topic 1” and the second group as “Topic 2.” The human user of the topic model could further identify “Topic 1” as “Cats” and “Topic 2” as “Dogs.” In this way, a topic model’s understanding of a corpus of documents scales far more efficiently than a human’s. Unsupervised models also create the important possibility that the machine will identify patterns and topics that the human researcher was not expecting. Does this allow AI to be a research partner?
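
To make the pet-blog example concrete, the sketch below trains a tiny topic model with Gensim, the same library we use later in this section. The four word lists and the choice of two topics are illustrative assumptions rather than real data.

# A minimal sketch of an unsupervised topic model on a toy "pet blog" corpus
from gensim import corpora, models

toy_docs = [
    ['cat', 'feline', 'litter', 'box', 'claws', 'meow'],
    ['cat', 'meow', 'feline', 'claws', 'litter'],
    ['dog', 'canine', 'leash', 'bark'],
    ['dog', 'bark', 'leash', 'canine'],
]

# map each unique word to an integer id, then count word occurrences per document
dictionary = corpora.Dictionary(toy_docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in toy_docs]

# ask the machine to find two patterns of word co-occurrence ("topics") on its own
lda = models.LdaModel(bow_corpus, num_topics=2, id2word=dictionary, passes=10, random_state=1)

# each topic is a probability distribution over words; the human supplies labels such as "Cats" or "Dogs"
for topic_id in range(2):
    print(topic_id, lda.show_topic(topic_id, topn=5))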

Topic Model of Philosophical Journals

For the purposes of this presentation, we built a topic model using articles from these philosophical journals: Philosophical Review, Mind, Noûs, Journal of Philosophy, and Philosophy and Phenomenological Research. We obtained these articles from JSTOR Data for Research. After filtering out articles that were nothing more than volume information, we created a corpus totaling 41,600 articles dating from 1892 to 2013. The code below creates a visualization of the model. The circles on the left represent each topic in terms of how much of the corpus it accounts for; the location of the circles on the x-y plane indicates how similar a given topic is to the other topics in the corpus. The words on the right constitute a given topic. You can navigate through each topic by using the buttons in the gray area on the top left or by hovering over each circle with your cursor. Again, leaving technical details aside, we ask you to interact with this visualization and think about how you would label the topics and whether or not this model produces meaningful results.

Code:

# Here we import some Python libraries to visualize the model we already created
from gensim import corpora, models, similarities
import pyLDAvis.gensim
import spacy

# This section of code loads the saved model along with its dictionary (word-id mappings) and bag-of-words corpus
path = '../models/03/'
dictionary = corpora.Dictionary.load(path + '03.dict')
corpus = corpora.MmCorpus(path + '03.mm')
model = models.ldamodel.LdaModel.load(path + '03.model')

# These lines of code generate the visualization of the topic model.
model_viz = pyLDAvis.gensim.prepare(model, corpus, dictionary, sort_topics=False, mds='pcoa')
pyLDAvis.display(model_viz)

Discussion:

Data visualizations in general, and the above visualization specifically, are objects of interpretation in that they both interpret data and require interpretation. Here, Topic 11 and Topic 31 each appear to be topics about the philosophy of mind. Topic 11 features the term “body” among its five most prominent terms, along with “mind,” “mental,” “physical,” and “consciousness.” This suggests that Topic 11 could be narrowed from the broader category of philosophy of mind to the narrower discussion of the mind-body problem. Since Topic 11 is important for our future research, we wrote the code below to find the articles that are most strongly characterized by this topic.

Technical Note: The visualization above numbers the topics from 1 to 40 for ease of the human user. However, the model itself numbers the topics from 0 to 39. Zero-based indexing is common in programming. The consequence of this is that Topic 11 in the visualization above is actually Topic 10 in the code below.
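
As a quick check of this offset, the snippet below (assuming the model loaded earlier) prints the five most probable terms for the model’s Topics 10 and 30, which appear as Topics 11 and 31 in the visualization.

# Topics 11 and 31 in the visualization are Topics 10 and 30 in the model's zero-based numbering
for viz_number in (11, 31):
    model_index = viz_number - 1
    print(viz_number, model.show_topic(model_index, topn=5))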

Code:

# load metadata
import json
with open(path + '03.json', 'r') as json_file:
    metadata = json.load(json_file)
    
# find articles in the corpus whose first listed topic is Topic 10 (Topic 11 in the visualization)
for key, article_dict in metadata.items():
    if article_dict['topics'][0] == 10:
        author = article_dict['author']
        title = article_dict['title']
        journal = article_dict['journal']
        year = article_dict['year']
        print(key + ': ' + author.title() + ',' + '"'+ title.title() + '" ' + journal.title() + ' ' + year)

doc_984: Larrabee,"Book Review" The Journal Of Philosophy 1928
doc_1556: None,"Book Review" The Journal Of Philosophy 1936
doc_1632: None,"Book Review" The Journal Of Philosophy 1935
doc_3267: Mcg,"Book Review" The Journal Of Philosophy 1939
doc_3938: Krikorian,"An Empirical Definition Of Consciousness" The Journal Of Philosophy 1938
doc_7395: Plantinga,"Comments" The Journal Of Philosophy 1965
doc_8521: Long,"The Bodies Of Persons" The Journal Of Philosophy 1974
doc_8963: Dennett,"Content And Consciousness: Reply To Arbib And Gunderson" The Journal Of Philosophy 1972
doc_10448: None,"Erratum To "Mind", Vol. 117, Number 468, October 2008" Mind 2009
doc_11716: Kuiper,"Roy Wood Sellars On The Mind-Body Problem" Philosophy And Phenomenological Research 1954
doc_11717: Frankena,"Sellars' Theory Of Valuation" Philosophy And Phenomenological Research 1954
doc_13913: Whallon,"Unconscious Mental Events" Philosophy And Phenomenological Research 1965
doc_13914: Landesman,"Reply To Professor Whallon" Philosophy And Phenomenological Research 1965
doc_13915: Schwarz,"Professor Engel On Kant" Philosophy And Phenomenological Research 1965
doc_15140: Moody,"Distinguishing Consciousness" Philosophy And Phenomenological Research 1986
doc_15875: Rosenthal,"Multiple Drafts And Higher-Order Thoughts" Philosophy And Phenomenological Research 1993
doc_15925: Kim,"Mental Causation In Searle'S "Biological Naturalism"" Philosophy And Phenomenological Research 1995
doc_15926: Jacob,"Consciousness, Intentionality And Function. What Is The Right Order Of Explanation?" Philosophy And Phenomenological Research 1995
doc_18722: Gregory,"Mind, Body, Theism And Immortality" The Philosophical Review 1919
doc_21232: Wiener,"Book Review" The Philosophical Review 1938
doc_24316: Fodor,"Book Review" The Philosophical Review 1971
doc_24782: Arner,"Book Review" The Philosophical Review 1984
doc_24783: Stabler,,"Book Review" The Philosophical Review 1984
doc_25681: Coburn,"Book Review" The Philosophical Review 1995
doc_25682: Mele,"Book Review" The Philosophical Review 1995
doc_26093: Nakhnikian,"Abstract Of Comments" Noûs 1976
doc_28793: None,"Note" Mind 1898
doc_30078: None,"Notes And News" Mind 1917
doc_30358: Gregory,"Do We Know Other Minds Mediately Or Immediately?" Mind 1920
doc_32775: Prado,"Fragmenting Subjects" Mind 1972
doc_33098: Kim,"Cartesian Dualism And The Unity Of A Mind" Mind 1971
doc_33099: Crawford,"Conforming To Custom" Mind 1971
doc_33654: Jaeger,"Notes On The Logic Of Physicalism" Mind 1979
doc_36540: Merricks,"On Whether Being Conscious Is Intrinsic" Mind 1998
doc_37691: Patrick,"The Emergent Theory Of Mind" The Journal Of Philosophy 1922
doc_37692: Whitmore,"Two Notes On Esthetics" The Journal Of Philosophy 1922
doc_38484: Smith,"Book Review" Mind 2007
doc_39217: Shapiro,"Book Review" Mind 2005
doc_39218: Hannan,"Book Review" Mind 2005
doc_39365: Rosenthal,"Book Review" Mind 2004
doc_39945: Montero,"Consciousness Is Puzzling, But Not Paradoxical" Philosophy And Phenomenological Research 2004
doc_40966: Heil,"Book Review" The Philosophical Review 2008

Using the Topic Model to Understand a New Text

As we have seen, topic models are useful for understanding what a large corpus of documents is about. But they can also be used to see what new documents are about and to find documents similar to them. In fact, the tool we used to make our topic model, the Python library Gensim, is also used by Amazon, Cisco, and City Bank for similar purposes. In the code blocks that follow, we first use the model to identify the topics in a new document and then use the model to find similar documents. For demonstration purposes we introduce an article about the mind-body problem which the model has not yet seen:

  • McGinn, Colin. “What Constitutes the Mind-Body Problem?” Philosophical Issues 13 (2003): 148-62.

Code:

# build a similarity index over the topic distributions of every document in the corpus
index = similarities.MatrixSimilarity(model[corpus])

# recreate the text-processing pipeline used on the documents in the model
def process_text(string, custom_stops=frozenset()):
    """Process text using SpaCy"""
    nlp = spacy.load('en_core_web_sm')
    stop_words = spacy.lang.en.stop_words.STOP_WORDS
    roman_numerals = {'i', 'ii', 'iii', 'iv', 'v', 'vi', 'vii', 'viii', 'ix', 'x', 'xi', 'xii', 'xiii', 'xiv', 'xv',
                      'xvi', 'xvii', 'xviii', 'xix', 'xx', 'xxi', 'xxii', 'xxiii', 'xxiv', 'xxv', 'xxvi', 'xxvii',
                      'xxviii', 'xxix', 'xxx', 'xxxi', 'xxxii', 'xxxiii', 'xxxiv', 'xxxv', 'xxxvi', 'xxxvii', 'xxxviii',
                      'xxxix', 'xl', 'xli', 'xlii', 'xliii', 'xliv', 'xlv', 'xlvi', 'xlvii', 'xlviii', 'xlix', 'l'}
    stop_words = stop_words.union(roman_numerals)
    stop_words = stop_words.union(custom_stops)
    doc = nlp(string)
    tokens = [token for token in doc]
    lemmas_alpha = [token.lemma_ for token in tokens if token.is_alpha]
    lemmas_no_pron = [lemma for lemma in lemmas_alpha if lemma != '-PRON-']
    lemmas_final = [lemma for lemma in lemmas_no_pron if lemma not in stop_words]
    return lemmas_final

# load McGinn article
with open('./mcGinn.txt', 'r') as f:
    MCGINN_doc = f.read()

# use same list of stop words as was used for building the model
custom_stop_words = {'address', 'article', 'association', 'author', 'blackwell', 'book', 'cambridge', 'chapter',
                         'chicago', 'cit', 'cloth', 'co', 'college', 'committee', 'conference', 'david', 'de',
                         'department', 'der', 'des', 'dr', 'ed', 'edition', 'eds', 'essay', 'follow', 'introduction',
                         'john', 'journal', 'les', 'london', 'meeting', 'mit', 'mr', 'note', 'op', 'oxford', 'page',
                         'paper', 'paul', 'pp',  'president', 'press', 'prof', 'professor', 'publish', 'richard',
                         'robert', 'routledge', 'society', 'subscription', 'uk', 'und', 'vol', 'volume', 'volumne',
                         'von', 'william', 'york'}

# process the McGinn article with the same pipeline and stop words
processed_MCGINN = process_text(string=MCGINN_doc, custom_stops=custom_stop_words)

# transform the McGinn article into a bag-of-words vector
vec_MCGINN = dictionary.doc2bow(processed_MCGINN)

# get the topic distribution the model assigns to the McGinn article
results_MCGINN = model[vec_MCGINN]

# display results
print(results_MCGINN)
[(1, 0.029276928), (6, 0.020385537), (7, 0.10689293), (10, 0.12548093), (12, 0.021687701), (14, 0.12012055), (16, 0.026124151), (17, 0.04934206), (18, 0.014159113), (19, 0.013588086), (21, 0.01995634), (22, 0.021290451), (23, 0.13682355), (27, 0.115806594), (28, 0.013152655), (30, 0.05169148), (37, 0.026648736), (39, 0.056570943)]

Discussion:

The results printed above may not look informative, but the model is telling us what the new document is about. Inside each pair of parentheses are two numbers: the first identifies a topic, and the second tells us the proportion of the new document that belongs to that topic. It appears that our model could use some fine-tuning to produce better results. However, we can, at the least, take this as an example of a machine understanding a text and thinking about how it relates to a larger data set.
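
To make that output easier to read, the short sketch below reuses the model and the results from the code above, sorts the topics by their proportion, and prints each topic’s top terms; we add one to each topic number so that it matches the visualization’s numbering.

# sort the McGinn article's topics by proportion and show each topic's most probable terms
for topic_id, proportion in sorted(results_MCGINN, key=lambda pair: pair[1], reverse=True)[:5]:
    top_terms = ', '.join(term for term, _ in model.show_topic(topic_id, topn=5))
    # add 1 so the topic number matches the visualization's 1-to-40 numbering
    print('Topic {} ({:.1%}): {}'.format(topic_id + 1, proportion, top_terms))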

Using the Topic Model to Find Similar Texts

Given that the topic model can identify topics in a new document, it is no surprise that it can also be used to find documents in the model that are similar to the new document. This ability to identify similarities is what makes this kind of machine learning so useful for making recommendations. In the code below we do just that: we ask the model to find the five documents in our original corpus that are most like the new document.

Code:

index.num_best = 5  # return only the five most similar documents
matches = index[results_MCGINN]
for match in matches:
    score = str(match[1])
    key = 'doc_' + str(match[0])
    article_dict = metadata[key]
    author = article_dict['author']
    title = article_dict['title']
    journal = article_dict['journal']
    year = article_dict['year']
    print(key + ': ' + author.title() + ',' + '"'+ title.title() + '" ' + journal.title() + ' ' + year + '\n\tsimilarity score -> ' + score + '\n')
doc_41491: Price,"Book Review" The Philosophical Review 2012
	similarity score -> 0.8419407606124878

doc_6972: Olson,"Knowing What We Mean" The Journal Of Philosophy 1959
	similarity score -> 0.8052636981010437

doc_15585: Schick,,"The Epistemic Role Of Qualitative Content" Philosophy And Phenomenological Research 1992
	similarity score -> 0.7982875108718872

doc_35205: Mcginn,"Can We Solve The Mind--Body Problem?" Mind 1989
	similarity score -> 0.7979500889778137

doc_36992: Montero,"The Body Problem" Noûs 1999
	similarity score -> 0.7865279912948608

Discussion:

The similarity score in the output above measures how similar each document is to the McGinn article. Each document is represented as a vector in the model; in this case, the vector records how much of the document belongs to each of the model’s forty topics. The model measures the distance between the McGinn article’s vector and the vectors of the other documents and returns the documents that are “nearest” to it. Notice that the McGinn article under consideration here bears some similarity to an older article by the same author on a similar topic.
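
The arithmetic behind these scores can be illustrated with a toy example. The three hand-made topic vectors below are assumptions for the sake of demonstration, but the cosine measure is the same one Gensim’s similarity index uses.

from gensim import matutils

# three toy documents described as sparse (topic_id, proportion) pairs -- made-up numbers for illustration
doc_a = [(10, 0.7), (23, 0.2), (27, 0.1)]
doc_b = [(10, 0.6), (14, 0.3), (23, 0.1)]
doc_c = [(5, 0.9), (33, 0.1)]

# cosine similarity: 1.0 means the vectors point in the same direction, 0.0 means they share nothing
print(matutils.cossim(doc_a, doc_b))  # relatively high: the documents share their main topics
print(matutils.cossim(doc_a, doc_c))  # 0.0: no topics in common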

Topic Models and Anima

Topic models have been around for many years now, but we still find this kind of machine learning useful for our research. Additionally, we suggest that topic models are an example of machine learning which allows us to think about AI and anima in a practical way. Is the model, or better, the machine building and employing the model, an example of Aristotle’s anima? It may be easy to reduce a topic model to the mere reduction of texts to numbers and the application of probabilities to those numbers. However, this invites us to consider the questions “What is thinking?” and “How is thinking embodied and applied?” Next, we consider a more recent example of AI which raises similar questions and, we believe, provokes a sense of the uncanny.

Pamela Eisenbaum

Micah D. Saxton

Theodore Vial