Topic Modeling: Turning Conversations into Strategy

R Ladies Abuja

Ifeoma Egbogah

The Rise of Unstructured Data

From Numbers to Words

If you’re in the field of analytics or data science, you’re likely well aware that data is being produced continuously—and at an increasingly rapid pace. (You might even be tired of hearing this repeated!) While analysts are typically trained to work with structured, numeric data in table formats, a significant portion of today’s data boom involves unstructured, text-based information.

Unstructured data represents 80-90% of all new enterprise data, according to Gartner.

Furthermore, it’s growing three times faster than structured data.

Behind the Words

Overview

In this webinar, we will look at:

  • Understanding Topic Modeling
  • Importance of Topic Modeling
  • Topic Modeling Techniques
  • Implementation of Topic Modeling
  • Demo

Understanding Topic Modeling

What is Topic Modeling?

Topic modeling is a way to identify themes (semantic patterns) in a corpus, that is, a collection of documents.

Topic modeling finds the relationships between words in the text, thereby identifying clusters of words that represent topics.

It is like amplified reading: a way to discover themes you might not spot yourself.

Glossary:

Corpus: A group of documents

Documents: Newspapers, blog posts, tweets, articles, journals, customer reviews, etc.

Importance of Topic Modeling

Topic Modeling Techniques

Latent Dirichlet Allocation (LDA)

Source: "Introduction to Probabilistic Topic Models", paper by David M. Blei

Latent Dirichlet allocation is one of the most common algorithms for topic modeling. It is guided by two principles, that:

  • Every document is a mixture of topics
  • Every topic is a mixture of words
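To make the two principles concrete, here is a minimal base R sketch. All numbers are invented for illustration: `theta` holds each document's topic mixture and `beta` holds each topic's word mixture. LDA estimates matrices like these from data; this sketch only shows how the two mixtures combine, not how they are fitted.

```r
# Toy numbers, invented purely for illustration (not output from a fitted model).
# theta: each row is a document, written as a mixture of topics.
theta <- matrix(c(0.9, 0.1,
                  0.2, 0.8),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("doc1", "doc2"), c("topic1", "topic2")))

# beta: each row is a topic, written as a mixture of words.
beta <- matrix(c(0.5, 0.4, 0.1, 0.0,
                 0.0, 0.1, 0.4, 0.5),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("topic1", "topic2"),
                               c("refund", "delivery", "battery", "screen")))

# Word probabilities per document implied by combining the two mixtures.
word_probs <- theta %*% beta
round(word_probs, 2)
```

Each row of `word_probs` still sums to 1, so every document ends up with its own word distribution built from shared topics.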

Implementation of Topic Modeling

Step 1

Data Preparation

Collect the text data

Step 2

Preprocessing

Before modeling, we preprocess the data to put it in a tidy format by:

  • Tokenization (splitting sentences into words)

  • Removing punctuation, numbers

  • Removing stop words (like the, and, is)

  • Finding document-word counts
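The preprocessing steps above can be sketched with dplyr and tidytext. The `reviews` table and its column names are hypothetical, made up for this example; the real demo data may look different.

```r
library(dplyr)
library(tidytext)

# Hypothetical customer reviews, invented for this sketch.
reviews <- tibble(
  document = c("review_1", "review_2"),
  text = c("The delivery was late and the box arrived damaged.",
           "Battery life is great, 10 hours on a single charge!")
)

word_counts <- reviews %>%
  unnest_tokens(word, text) %>%           # tokenize; by default also lowercases and strips punctuation
  filter(!grepl("[0-9]", word)) %>%       # drop tokens that contain digits
  anti_join(stop_words, by = "word") %>%  # remove stop words such as "the", "and", "is"
  count(document, word, sort = TRUE)      # document-word counts

word_counts
```

The result has one row per document-word pair, which is the tidy shape the next step expects.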

Step 3

Create Document-term Matrix

A matrix that records how often each word (term) appears in each document.

We can cast a one-token-per-row table into a DocumentTermMatrix with tidytext’s cast_dtm().

Rows = documents; Columns = terms/words.
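As a small sketch of the casting step, assuming a count table with hypothetical columns `document`, `word` and `n` (the values are invented):

```r
library(dplyr)
library(tidytext)

# Hypothetical document-word counts, invented for this sketch.
word_counts <- tibble(
  document = c("review_1", "review_1", "review_2", "review_2"),
  word     = c("delivery", "late", "battery", "delivery"),
  n        = c(2L, 1L, 3L, 1L)
)

# Cast the tidy table into a DocumentTermMatrix (cast_dtm() relies on the tm package).
reviews_dtm <- word_counts %>%
  cast_dtm(document, word, n)

dim(reviews_dtm)  # number of documents, number of distinct terms
```

Here the matrix has one row per review and one column per distinct word.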

Step 4

Model Fitting

We can then use the LDA() function from the topicmodels package to create a topic model.
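A minimal sketch of the fitting step, using tiny invented counts. The choice of `k = 2` topics and the seed value are assumptions for illustration only; on real data, `k` has to be chosen by the analyst.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Hypothetical document-word counts, invented for this sketch.
word_counts <- tibble(
  document = rep(c("review_1", "review_2", "review_3", "review_4"), each = 3),
  word = c("delivery", "late", "courier",
           "battery", "screen", "charge",
           "delivery", "courier", "refund",
           "battery", "charge", "screen"),
  n = c(3L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 3L, 1L, 1L)
)

reviews_dtm <- cast_dtm(word_counts, document, word, n)

# Fit a two-topic model; the seed makes the fit reproducible.
reviews_lda <- LDA(reviews_dtm, k = 2, control = list(seed = 1234))
reviews_lda
```

With so few words the topics themselves are not meaningful; the point is only the shape of the call.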

Step 5

Interpret and Visualise the Result

  • Extract top keywords per topic.

  • Label the topics manually (e.g., “Customer Service Issues” or “Product Features”).

  • Visualize using tools like the ggplot2 package
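One possible way to extract and plot the top keywords per topic is sketched below. The data are invented, `k = 2` is an assumption, and taking the top 3 terms is an arbitrary choice for this tiny example.

```r
library(dplyr)
library(ggplot2)
library(tidytext)
library(topicmodels)

# Hypothetical document-word counts, invented for this sketch.
word_counts <- tibble(
  document = rep(c("review_1", "review_2", "review_3", "review_4"), each = 3),
  word = c("delivery", "late", "courier",
           "battery", "screen", "charge",
           "delivery", "courier", "refund",
           "battery", "charge", "screen"),
  n = c(3L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 3L, 1L, 1L)
)
reviews_lda <- LDA(cast_dtm(word_counts, document, word, n),
                   k = 2, control = list(seed = 1234))

# beta = per-topic word probabilities; keep the 3 most probable terms per topic.
top_terms <- tidy(reviews_lda, matrix = "beta") %>%
  group_by(topic) %>%
  slice_max(beta, n = 3, with_ties = FALSE) %>%
  ungroup()

# One bar chart panel per topic, with terms ordered within each panel.
p <- ggplot(top_terms,
            aes(beta, reorder_within(term, beta, topic), fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free_y") +
  scale_y_reordered() +
  labs(x = "Per-topic word probability (beta)", y = NULL)
p
```

Reading the top terms in each panel is what informs the manual topic labels.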

Step 6

Apply Result

Summarize the result, identify customer pain points, track emerging trends, etc.
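One way to start applying the result is to find the dominant topic of each document and count how many documents fall under each topic. This is a sketch on invented data with an assumed `k = 2`, not a prescribed workflow.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Hypothetical document-word counts, invented for this sketch.
word_counts <- tibble(
  document = rep(c("review_1", "review_2", "review_3", "review_4"), each = 3),
  word = c("delivery", "late", "courier",
           "battery", "screen", "charge",
           "delivery", "courier", "refund",
           "battery", "charge", "screen"),
  n = c(3L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 3L, 1L, 1L)
)
reviews_lda <- LDA(cast_dtm(word_counts, document, word, n),
                   k = 2, control = list(seed = 1234))

# gamma = per-document topic proportions; keep each document's strongest topic.
doc_topics <- tidy(reviews_lda, matrix = "gamma") %>%
  group_by(document) %>%
  slice_max(gamma, n = 1, with_ties = FALSE) %>%
  ungroup()

# How many documents are mainly about each topic.
topic_sizes <- count(doc_topics, topic)
topic_sizes
```

Once the topics carry labels, counts like these can point to the most common pain points, and repeating the tally over time can surface trends.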

Packages

We will make use of the following packages:

  • tidyverse
  • tidytext
  • topicmodels
  • tm
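A typical setup, assuming the packages come from CRAN, might look like this:

```r
# Install once if needed (uncomment to run):
# install.packages(c("tidyverse", "tidytext", "topicmodels", "tm"))

library(tidyverse)    # data wrangling and ggplot2
library(tidytext)     # tidy text mining helpers
library(topicmodels)  # LDA()
library(tm)           # DocumentTermMatrix class
```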

Demo