Lab 06 - Text Mining

Learning goals

  • Use tokenization and n-grams to extract features from text
  • Use pandas and plotnine to analyze and visualize text data
  • Apply sentence tokenization to analyze text structure
  • Apply topic modeling using scikit-learn and see if topics align with medical specialties

Lab description

For this lab we will be working with the medical record transcriptions from https://www.mtsamples.com/ available at https://github.com/salgadev/medical-nlp.

Deliverables

  1. Questions 1-7 answered, rendered to html and uploaded to Quercus.

  2. Add link to your github repo in your html.

Setup packages

You should have the following packages installed: pandas, numpy, plotnine, nltk, wordcloud, and scikit-learn.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
import re

# Plotting with ggplot2-style syntax (plotnine)
from plotnine import *

# Text processing
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.util import ngrams
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')

# Word cloud
from wordcloud import WordCloud

# Topic modeling
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

Read in the Medical Transcriptions

Loading in reference transcription samples from https://www.mtsamples.com/

mt_samples = pd.read_csv("https://raw.githubusercontent.com/salgadev/medical-nlp/master/data/mtsamples.csv")
mt_samples = mt_samples[['description', 'medical_specialty', 'transcription']]

mt_samples.head()
description medical_specialty transcription
0 A 23-year-old white female presents with comp... Allergy / Immunology SUBJECTIVE:, This 23-year-old white female pr...
1 Consult for laparoscopic gastric bypass. Bariatrics PAST MEDICAL HISTORY:, He has difficulty climb...
2 Consult for laparoscopic gastric bypass. Bariatrics HISTORY OF PRESENT ILLNESS: , I have seen ABC ...
3 2-D M-Mode. Doppler. Cardiovascular / Pulmonary 2-D M-MODE: , ,1. Left atrial enlargement wit...
4 2-D Echocardiogram Cardiovascular / Pulmonary 1. The left ventricular cavity size and wall ...

Question 1: What specialties do we have?

We can use value_counts() from pandas to figure out how many different medical specialties are in the data. Are these categories related? Overlapping? Evenly distributed? Make a bar plot.
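As a reminder of how value_counts() behaves, here is a minimal sketch on a made-up column (the specialty names below are illustrative, not from the dataset):

```python
import pandas as pd

# Toy column with repeated categories (illustrative values only)
toy = pd.Series(['Cardiology', 'Radiology', 'Cardiology', 'Surgery', 'Cardiology'])

# value_counts() returns counts sorted in descending order;
# reset_index() turns the result into a two-column dataframe
counts = toy.value_counts().reset_index()
counts.columns = ['medical_specialty', 'count']
print(counts)
```

The same pattern, applied to the `medical_specialty` column, produces the dataframe the bar plot below expects.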

specialty_counts = mt_samples['medical_specialty'].____().reset_index()  # Hint: count categories
specialty_counts.columns = ['medical_specialty', 'count']

(ggplot(____, aes(x='reorder(medical_specialty, count)', y='count'))  # Hint: use the counts dataframe
 + geom_bar(stat='identity', fill='steelblue')
 + coord_flip()
 + labs(x='Medical Specialty', y='Count', title='Distribution of Medical Specialties')
 + theme_minimal()
 + theme(figure_size=(12, 8))
)

Summarize the top medical specialties:

Question 2: Tokenize

  • Tokenize the words in the transcription column
  • Count the number of times each token appears
  • Visualize the top 20 most frequent words with a bar plot
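The tokenize-then-count pattern can be sketched on a toy sentence. A simple regex stands in for NLTK's word_tokenize here (the real tokenizer handles punctuation and contractions more carefully), since the point is the counting step:

```python
import re
from collections import Counter

text = "The patient was seen. The patient was stable."

# Simple regex tokenizer as a stand-in for nltk's word_tokenize
tokens = re.findall(r"[a-z]+", text.lower())

# Counter tallies frequencies; most_common(n) returns the top n pairs
token_counts = Counter(tokens)
print(token_counts.most_common(2))
```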
# Tokenize all transcriptions
def tokenize_text(text):
    if pd.isna(text):
        return []
    return ____(text.lower())  # Hint: NLTK word tokenizer

all_tokens = []
for text in mt_samples['transcription']:
    all_tokens.extend(tokenize_text(text))

# Count tokens
token_counts = ____  # Hint: count token frequencies
top_20 = token_counts.____(20)  # Hint: top N entries

# Create dataframe for plotting
top_20_df = pd.DataFrame(____, columns=['word', 'count'])  # Hint: the top-20 list

# Bar plot with plotnine
(ggplot(____, aes(x='reorder(word, count)', y='count'))  # Hint: the top-20 dataframe
 + geom_bar(stat='identity', fill='steelblue')
 + coord_flip()
 + labs(x='Word', y='Frequency', title='Top 20 Most Frequent Words')
 + theme_minimal()
 + theme(figure_size=(10, 6))
)

Summarize the top 20 tokens and explain what insights (if any) we get.

Question 3: Stopwords

  • Redo Question 2 but remove stopwords
  • Use NLTK’s stopwords list
  • Use regex to remove numbers
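The filtering step can be sketched on a hand-made token list. A tiny invented stopword set stands in for NLTK's list; the four conditions mirror the scaffold below:

```python
import re

# Tiny hand-made stopword set (stand-in for NLTK's English list)
stop_words = {'the', 'was', 'and'}

tokens = ['the', 'wound', 'was', 'closed', 'with', '3', 'sutures', 'and', 'dr']
filtered = [
    t for t in tokens
    if t not in stop_words
    and not re.match(r'^\d+$', t)   # drop pure numbers
    and len(t) > 2                  # drop very short tokens
    and t.isalpha()                 # keep alphabetic tokens only
]
print(filtered)  # ['wound', 'closed', 'with', 'sutures']
```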
# Get English stopwords
stop_words = set(stopwords.____('english'))  # Hint: get stopword list

# Add custom stopwords
custom_stopwords = {'patient', 'also', 'using', 'used'}
stop_words = stop_words.union(custom_stopwords)

# Filter tokens
filtered_tokens = [
    token for token in all_tokens
    if token not in stop_words
    and ____  # Hint: filter out pure numbers
    and len(token) > 2  # Remove very short tokens
    and token.isalpha()  # Keep only alphabetic tokens
]

# Count filtered tokens
filtered_counts = Counter(filtered_tokens)
top_20_filtered = filtered_counts.most_common(20)

# Create dataframe for plotting
top_20_filtered_df = pd.DataFrame(top_20_filtered, columns=['word', 'count'])

# Bar plot with plotnine
(ggplot(top_20_filtered_df, aes(x='reorder(word, count)', y='count'))
 + geom_bar(stat='identity', fill='steelblue')
 + coord_flip()
 + labs(x='Word', y='Frequency', title='Top 20 Most Frequent Words (Stopwords Removed)')
 + theme_minimal()
 + theme(figure_size=(10, 6))
)

# Word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(dict(top_20_filtered))
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud (Stopwords Removed)')
plt.show()

Summarize: What do you see when you remove stopwords? Does it give us a better idea of what the text is about?

Question 3b: Custom stopwords

  • Import clinical-specific stopwords from the github repository and apply additional filtering. These stopwords are tailored for medical/clinical text and include terms like “patient”, “medical”, “clinical”, etc.

  • Try further customizing your stopwords list to include 3-4 additional words of your own that do not appear informative
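The parsing loop below assumes the stopword file is one term per line, with `#` marking comment lines. That logic can be checked on a made-up snippet mimicking the file format:

```python
# Made-up snippet in the same one-term-per-line format
raw = "# clinical stopwords\npatient\nhistory\n\nexam\n"

clinical_stopwords = set()
for line in raw.split('\n'):
    line = line.strip().lower()
    if line and not line.startswith('#'):  # skip blanks and comments
        clinical_stopwords.add(line)

print(sorted(clinical_stopwords))  # ['exam', 'history', 'patient']
```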

# Import clinical stopwords from the medical-nlp repository
import urllib.request

clinical_stopwords_url = "https://raw.githubusercontent.com/salgadev/medical-nlp/master/data/clinical-stopwords.txt"
response = urllib.request.urlopen(clinical_stopwords_url)
clinical_stopwords_raw = response.read().decode('utf-8')


clinical_stopwords = set()
for line in clinical_stopwords_raw.split('\n'):
    line = line.strip().lower()
    if line and not line.startswith('#'):
        clinical_stopwords.add(line)

print(f"Number of clinical stopwords: {len(____)}")  # Hint: the clinical stopword set
print(f"Sample clinical stopwords: {list(____)[:20]}")  # Hint: the clinical stopword set
# Combine NLTK stopwords with clinical stopwords
all_stopwords = stop_words.union(clinical_stopwords)

# Add custom stopwords
additional_stopwords = {____}  # Hint: add 3-4 clinical filler words
all_stopwords = all_stopwords.____(additional_stopwords)  # Hint: set merge

# Filter tokens using combined stopwords
clinical_filtered_tokens = [
    token for token in all_tokens
    if token not in all_stopwords
    and not re.match(r'^\d+$', token)
    and len(token) > 2
    and token.isalpha()
]

# Count filtered tokens
clinical_filtered_counts = Counter(____)  # Hint: tokens after clinical filtering
top_20_clinical = clinical_filtered_counts.most_common(____)  # Hint: top N

# Create dataframe for plotting
top_20_clinical_df = pd.DataFrame(____, columns=['word', 'count'])  # Hint: top list for plotting

# Bar plot with plotnine
(ggplot(____, aes(x='reorder(word, count)', y='count'))  # Hint: clinical top-20 dataframe
 + geom_bar(stat='identity', fill='tomato')
 + coord_flip()
 + labs(x='Word', y='Frequency', title='Top 20 Most Frequent Words (Clinical Stopwords Removed)')
 + theme_minimal()
 + theme(figure_size=(10, 6))
)

Summarize: How many clinical stopwords are there? What additional words did you remove? Do the results differ significantly when using clinical stopwords vs. general stopwords?


Question 4: Bigrams

Tokenize the stopword-filtered transcriptions into bigrams and visualize the top 20 most frequent bigrams.
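A bigram is just each adjacent pair of tokens. nltk.util.ngrams(tokens, 2) produces exactly these pairs; plain zip is used in this sketch so it runs without any NLTK data:

```python
# zip over offset views yields the same pairs as nltk.util.ngrams(tokens, 2)
tokens = ['history', 'of', 'present', 'illness']
bigrams = list(zip(tokens, tokens[1:]))
print(bigrams)  # [('history', 'of'), ('of', 'present'), ('present', 'illness')]
```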

def get_bigrams_from_text(text):
    if pd.isna(____):  # Hint: current text input
        return []
    tokens = word_tokenize(text.____())  # Hint: lowercase text
    # Filter tokens using clinical stopwords
    tokens = [t for t in tokens if t not in all_stopwords and t.isalpha() and len(t) > ____]  # Hint: minimum length
    return list(____(tokens, 2))  # Hint: bigram helper


all_bigrams = []
for text in mt_samples['transcription']:
    all_bigrams.extend(get_bigrams_from_text(text))

bigram_counts = ____  # Hint: count bigrams
top_20_bigrams = bigram_counts.most_common(20)

# Create dataframe for bigrams
bigram_df = pd.DataFrame([
    {'bigram': ' '.join(bg), 'count': count}
    for bg, count in top_20_bigrams
])


(ggplot(____, aes(x='reorder(bigram, count)', y='count'))  # Hint: bigram dataframe
 + geom_bar(stat='identity', fill='steelblue')
 + coord_flip()
 + labs(x='Bigram', y='Frequency', title='Top 20 Bigrams')
 + theme_minimal()
 + theme(figure_size=(10, 6))
)

Summarize: do the bigrams make sense?

Question 5: Examining bigram words

Using the results from the bigram, pick a word and count the words that appear before and after it, and create a plot of the top 20.
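The neighbor-counting idea can be sketched with made-up bigram counts (the pairs and counts below are illustrative only): a bigram whose second word is the target contributes a "before" word, and one whose first word is the target contributes an "after" word.

```python
from collections import Counter

# Made-up bigram counts for illustration
bigram_counts = Counter({
    ('blood', 'pressure'): 3,
    ('blood', 'loss'): 2,
    ('estimated', 'blood'): 2,
})
target = 'blood'

before = Counter()
after = Counter()
for (w1, w2), count in bigram_counts.items():
    if w2 == target:
        before[w1] += count   # word appearing before the target
    if w1 == target:
        after[w2] += count    # word appearing after the target

print(before.most_common())  # [('estimated', 2)]
print(after.most_common())   # [('pressure', 3), ('loss', 2)]
```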

# Pick a word to examine (e.g., 'blood', 'operating', 'diagnosis')
target_word = '____'  # Hint: choose a common token

# Find bigrams containing the target word
before_words = []
after_words = []

for bigram, count in bigram_counts.items():
    if bigram[1] == ____:  # Hint: compare to chosen token
        before_words.extend([bigram[0]] * count)
    if bigram[0] == ____:  # Hint: compare to chosen token
        after_words.extend([bigram[1]] * count)

# Count words
before_counts = Counter(before_words).____(20)  # Hint: top N
after_counts = Counter(after_words).____(20)  # Hint: top N

# Create dataframes
before_df = pd.DataFrame(____, columns=['word', 'count'])  # Hint: counts before target
before_df['position'] = 'before'

after_df = pd.DataFrame(____, columns=['word', 'count'])  # Hint: counts after target
after_df['position'] = 'after'
# Plot words before
(ggplot(____, aes(x='reorder(word, count)', y='count'))  # Hint: dataframe of words before
 + geom_bar(stat='identity', fill='steelblue')
 + coord_flip()
 + labs(x='Word', y='Frequency', title=f'Words Before "{____}"')  # Hint: chosen word
 + theme_minimal()
 + theme(figure_size=(7, 6))
)
# Plot words after
(ggplot(____, aes(x='reorder(word, count)', y='count'))  # Hint: dataframe of words after
 + geom_bar(stat='identity', fill='coral')
 + coord_flip()
 + labs(x='Word', y='Frequency', title=f'Words After "{____}"')  # Hint: chosen word
 + theme_minimal()
 + theme(figure_size=(7, 6))
)

Briefly summarize the words that appear before and after your chosen word.

Question 6: Sentence Tokenization

Tokenize the transcriptions into sentences and analyze sentence-level statistics.

  • Count the number of sentences per transcription
  • Calculate the average sentence length (in words) per transcription
  • Plot the distribution of average sentence length
  • Make a boxplot of the average sentence length by specialty
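The per-document statistics can be sketched on a toy paragraph. A regex split stands in for NLTK's sent_tokenize here (the real tokenizer handles abbreviations better), and a whitespace split stands in for word_tokenize:

```python
import re

text = "The patient is stable. Vital signs are normal. Discharged home."

# Regex split on sentence-final punctuation, as a stand-in for sent_tokenize
sentences = [s for s in re.split(r'(?<=[.!?])\s+', text) if s]

# Words per sentence (whitespace split as a stand-in for word_tokenize)
lengths = [len(s.split()) for s in sentences]
avg_length = sum(lengths) / len(lengths)

print(len(sentences), avg_length)
```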
# Tokenize transcriptions into sentences
def get_sentence_stats(text):
    if pd.isna(text):
        return {'num_sentences': 0, 'avg_sentence_length': 0}

    sentences = ____(text)  # Hint: sentence tokenizer
    num_sentences = len(sentences)

    if num_sentences == 0:
        return {'num_sentences': 0, 'avg_sentence_length': 0}

    # Calculate average sentence length in words
    sentence_lengths = [len(word_tokenize(sent)) for sent in sentences]
    avg_length = np.mean(sentence_lengths)

    return {'num_sentences': ____, 'avg_sentence_length': ____}  # Hint: use computed values

# Apply to all transcriptions
sentence_stats = mt_samples['transcription'].apply(get_sentence_stats)
mt_samples['num_sentences'] = ____.apply(lambda x: x['num_sentences'])  # Hint: stats series
mt_samples['avg_sentence_length'] = sentence_stats.apply(lambda x: x['avg_sentence_length'])

# Summary statistics
print("Sentence Statistics Summary:")
print(mt_samples[['num_sentences', 'avg_sentence_length']].describe())

# Keep only documents that contain at least one sentence; a boxplot needs
# the per-document values within each specialty, not one mean per specialty
sentence_df = mt_samples[mt_samples['avg_sentence_length'] > 0]
(ggplot(mt_samples[mt_samples['____'] > 0], aes(x='avg_sentence_length'))  # Hint: column name
 + geom_histogram(bins=30, fill='coral', color='white', alpha=0.7)
 + labs(x='Average Sentence Length (words)', y='Count', title='Distribution of Average Sentence Length')
 + theme_minimal()
 + theme(figure_size=(10, 5))
)
(ggplot(____, aes(x='reorder(medical_specialty, avg_sentence_length)', y='avg_sentence_length'))  # Hint: filtered per-document dataframe
 + geom_boxplot(fill='coral', alpha=0.7)
 + coord_flip()
 + labs(x='Medical Specialty', y='Average Sentence Length (words)', title='Average Sentence Length by Medical Specialty')
 + theme_minimal()
 + theme(figure_size=(10, 6))
)

Summarize: Do you notice any patterns in sentence length across different medical specialties?


Question 7: Topic Models

See if there are any themes in the data by using a topic model (LDA).

  • Remove combined NLTK + clinical stopwords
  • Use scikit-learn’s CountVectorizer to create a document-term matrix
  • Use LatentDirichletAllocation for topic modeling
  • Try different k (n_components) values (try 3 and 5)
  • Create a visualization of the topics
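The CountVectorizer-then-LDA pipeline can be sketched end to end on four tiny made-up documents with two obvious themes (the words below are invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Four tiny made-up documents with two obvious themes
docs = [
    "heart valve heart artery",
    "artery heart valve",
    "knee fracture bone knee",
    "bone knee fracture",
]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)  # document-term count matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(dtm)   # per-document topic proportions
print(doc_topics.shape)               # one row per document, one column per topic
```

Each row of `doc_topics` sums to 1; the full dataset version below works the same way, just with real transcriptions and a larger vocabulary.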

# Prepare text data
texts = mt_samples['transcription'].dropna().tolist()

vectorizer = CountVectorizer(
    max_df=0.95,  # Ignore terms that appear in >95% of documents (filters common words)
    min_df=2,     # Ignore terms that appear in <2 documents (ignores very rare words)
    stop_words=list(all_stopwords),
    max_features=1000
)
dtm = vectorizer.fit_transform(texts)
n_topics = ____  # Hint: number of topics
lda = LatentDirichletAllocation(
    n_components=____,  # Hint: use topic count
    random_state=42,
    max_iter=10
)
lda.fit(dtm)

feature_names = vectorizer.get_feature_names_out()

# Extract top words for each topic into a dataframe
def get_topic_df(model, feature_names, n_top_words=10):
    rows = []
    for topic_idx, topic in enumerate(model.components_):
        top_indices = topic.argsort()[:-n_top_words - 1:-1]
        for rank, idx in enumerate(top_indices):
            rows.append({
                'topic': f'Topic {topic_idx + 1}',
                'word': feature_names[idx],
                'weight': topic[idx],
                'rank': rank
            })
    return pd.DataFrame(rows)

topic_df = get_topic_df(____, ____, ____)  # Hint: model, vocab list, top words

# Display top words per topic
for topic in topic_df['topic'].unique():
    words = topic_df[topic_df['topic'] == topic]['word'].tolist()
    print(f"{topic}: {', '.join(words)}")

# Visualize topics with faceted plot
(ggplot(____, aes(x='reorder(word, weight)', y='weight'))  # Hint: topic dataframe
 + geom_bar(stat='identity', fill='steelblue')
 + coord_flip()
 + facet_wrap('~topic', scales='free_y', ncol=3)
 + labs(x='Word', y='Weight', title='Top Words per Topic (LDA)')
 + theme_minimal()
 + theme(figure_size=(14, 8), strip_text=element_text(size=10))
)

Summarize: are the topics clearer with 3 topics or with 5?

  • Now do a cross tab of the 5 topics and the medical specialties.
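pd.crosstab just counts co-occurrences of two label columns; a minimal sketch on made-up assignments (the labels below are illustrative):

```python
import pandas as pd

# Made-up specialty and dominant-topic assignments for illustration
df = pd.DataFrame({
    'medical_specialty': ['Surgery', 'Surgery', 'Radiology', 'Radiology'],
    'dominant_topic': ['Topic 1', 'Topic 2', 'Topic 1', 'Topic 1'],
})

# Rows are specialties, columns are topics, cells are document counts
ct = pd.crosstab(df['medical_specialty'], df['dominant_topic'])
print(ct)
```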
# Assign documents to topics and compare with actual specialties
doc_topic_dist = lda.transform(dtm)
dominant_topics = doc_topic_dist.argmax(axis=1)

# Add to dataframe
mt_samples_with_topics = mt_samples.dropna(subset=['transcription']).copy()
mt_samples_with_topics['dominant_topic'] = [f'Topic {t+1}' for t in dominant_topics]

# Cross-tabulation of topics vs specialties (top 5 specialties)
top_5_specialties = mt_samples_with_topics['medical_specialty'].value_counts().head(____).index.tolist()  # Hint: number of specialties
cross_tab = pd.crosstab(
    mt_samples_with_topics[mt_samples_with_topics['medical_specialty'].isin(____)]['medical_specialty'],
    mt_samples_with_topics[mt_samples_with_topics['medical_specialty'].isin(____)]['dominant_topic']
)
print("\nCross-tabulation of Topics vs Top 5 Specialties:")
print(cross_tab)

Summarize: What themes emerge from the topic modeling? Do the topics align with the medical specialties?