Use tokenization and n-grams to extract features from text
Use pandas and plotnine to analyze and visualize text data
Apply sentence tokenization to analyze text structure
Apply topic modeling using scikit-learn and see if topics align with medical specialties
Lab description
For this lab we will be working with the medical record transcriptions from https://www.mtsamples.com/ available at https://github.com/salgadev/medical-nlp.
Deliverables
Questions 1-7 answered, rendered to html and uploaded to Quercus.
Add a link to your GitHub repo in your HTML.
Setup packages
You should have the following packages installed: pandas, numpy, plotnine, nltk, wordcloud, and scikit-learn.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
import re

# Plotting with ggplot2 syntax
from plotnine import *

# Text processing
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.util import ngrams

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')

# Word cloud
from wordcloud import WordCloud

# Topic modeling
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
[nltk_data] Downloading package punkt to /Users/meredith/nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data] /Users/meredith/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data] /Users/meredith/nltk_data...
[nltk_data] Package punkt_tab is already up-to-date!
Read in the Medical Transcriptions
Loading in reference transcription samples from https://www.mtsamples.com/
We can use value_counts() from pandas to figure out how many different medical specialties are in the data. Are these categories related? Overlapping? Evenly distributed? Make a bar plot.
specialty_counts = mt_samples['medical_specialty'].____().reset_index()  # Hint: count categories
specialty_counts.columns = ['medical_specialty', 'count']

(ggplot(____, aes(x='reorder(medical_specialty, count)', y='count'))  # Hint: use the counts dataframe
 + geom_bar(stat='identity', fill='steelblue')
 + coord_flip()
 + labs(x='Medical Specialty', y='Count', title='Distribution of Medical Specialties')
 + theme_minimal()
 + theme(figure_size=(12, 8)))
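If value_counts() is unfamiliar, it can be previewed on a toy Series before filling in the cell above; the values here are made up:

```python
import pandas as pd

# Toy stand-in for the medical_specialty column
s = pd.Series(['Surgery', 'Radiology', 'Surgery', 'Cardiology', 'Surgery'])

# value_counts() returns categories sorted by frequency, descending;
# reset_index() turns the result into a two-column dataframe for plotting
counts = s.value_counts().reset_index()
counts.columns = ['medical_specialty', 'count']
print(counts)  # 'Surgery' (3 occurrences) sorts to the top
```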
Summarize the top medical specialties:
Question 2: Tokenize
Tokenize the words in the transcription column
Count the number of times each token appears
Visualize the top 20 most frequent words with a bar plot
# Tokenize all transcriptions
def tokenize_text(text):
    if pd.isna(text):
        return []
    return ____(text.lower())  # Hint: NLTK word tokenizer

all_tokens = []
for text in mt_samples['transcription']:
    all_tokens.extend(tokenize_text(text))

# Count tokens
token_counts = ____  # Hint: count token frequencies
top_20 = token_counts.____(20)  # Hint: top N entries

# Create dataframe for plotting
top_20_df = pd.DataFrame(____, columns=['word', 'count'])  # Hint: the top-20 list

# Bar plot with plotnine
(ggplot(____, aes(x='reorder(word, count)', y='count'))  # Hint: the top-20 dataframe
 + geom_bar(stat='identity', fill='steelblue')
 + coord_flip()
 + labs(x='Word', y='Frequency', title='Top 20 Most Frequent Words')
 + theme_minimal()
 + theme(figure_size=(10, 6)))
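The tokenize-then-count pattern can be checked on a single made-up sentence. A simple regex tokenizer stands in for NLTK's word_tokenize here so the sketch needs no data downloads:

```python
import re
from collections import Counter

text = "The patient was seen. The patient was discharged."

# Regex tokenizer used as a stand-in for nltk's word_tokenize
tokens = re.findall(r"[a-z]+", text.lower())

# Counter tallies token frequencies; most_common(n) returns the top n
counts = Counter(tokens)
print(counts.most_common(3))  # [('the', 2), ('patient', 2), ('was', 2)]
```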
Summarize the top 20 tokens and explain what insights (if any) we get.
Question 3: Stopwords
Redo Question 2 but remove stopwords
Use NLTK’s stopwords list
Use regex to remove numbers
# Get English stopwords
stop_words = set(stopwords.____('english'))  # Hint: get stopword list

# Add custom stopwords
custom_stopwords = {'patient', 'also', 'using', 'used'}
stop_words = stop_words.union(custom_stopwords)

# Filter tokens
filtered_tokens = [
    token for token in all_tokens
    if token not in stop_words
    and ____  # Hint: filter out pure numbers
    and len(token) > 2  # Remove very short tokens
    and token.isalpha()  # Keep only alphabetic tokens
]

# Count filtered tokens
filtered_counts = Counter(filtered_tokens)
top_20_filtered = filtered_counts.most_common(20)

# Create dataframe for plotting
top_20_filtered_df = pd.DataFrame(top_20_filtered, columns=['word', 'count'])

# Bar plot with plotnine
(ggplot(top_20_filtered_df, aes(x='reorder(word, count)', y='count'))
 + geom_bar(stat='identity', fill='steelblue')
 + coord_flip()
 + labs(x='Word', y='Frequency', title='Top 20 Most Frequent Words (Stopwords Removed)')
 + theme_minimal()
 + theme(figure_size=(10, 6)))

# Word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(dict(top_20_filtered))
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud (Stopwords Removed)')
plt.show()
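The filtering comprehension can be tried on a tiny token list first; a hand-written stopword set stands in for NLTK's list in this sketch:

```python
from collections import Counter

tokens = ['the', 'patient', 'was', 'seen', 'in', 'the', 'clinic', '12']

# Tiny hand-written set standing in for NLTK's English stopwords
stop_words = {'the', 'was', 'in'}

filtered = [
    t for t in tokens
    if t not in stop_words
    and t.isalpha()       # drops pure numbers like '12'
    and len(t) > 2        # drops very short tokens
]
print(Counter(filtered).most_common())
# Only 'patient', 'seen', 'clinic' survive the filters
```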
Summarize: What do we see when you remove stopwords? Does it give us a better idea of what the text is about?
Question 3b: Custom stopwords
Import clinical-specific stopwords from the github repository and apply additional filtering. These stopwords are tailored for medical/clinical text and include terms like “patient”, “medical”, “clinical”, etc.
Try further customizing your stopwords list to include 3-4 additional words of your own that do not appear informative.
# Import clinical stopwords from the medical-nlp repository
import urllib.request

clinical_stopwords_url = "https://raw.githubusercontent.com/salgadev/medical-nlp/master/data/clinical-stopwords.txt"
response = urllib.request.urlopen(clinical_stopwords_url)
clinical_stopwords_raw = response.read().decode('utf-8')

clinical_stopwords = set()
for line in clinical_stopwords_raw.split('\n'):
    line = line.strip().lower()
    if line and not line.startswith('#'):
        clinical_stopwords.add(line)

print(f"Number of clinical stopwords: {len(____)}")
print(f"Sample clinical stopwords: {list(____)[:20]}")
# Combine NLTK stopwords with clinical stopwords
all_stopwords = stop_words.union(clinical_stopwords)

# Add custom stopwords
additional_stopwords = {____}  # Hint: add 3-4 clinical filler words
all_stopwords = all_stopwords.____(additional_stopwords)  # Hint: set merge

# Filter tokens using combined stopwords
clinical_filtered_tokens = [
    token for token in all_tokens
    if token not in all_stopwords
    and not re.match(r'^\d+$', token)
    and len(token) > 2
    and token.isalpha()
]

# Count filtered tokens
clinical_filtered_counts = Counter(____)  # Hint: tokens after clinical filtering
top_20_clinical = clinical_filtered_counts.most_common(____)  # Hint: top N

# Create dataframe for plotting
top_20_clinical_df = pd.DataFrame(____, columns=['word', 'count'])  # Hint: top list for plotting

# Bar plot with plotnine
(ggplot(____, aes(x='reorder(word, count)', y='count'))  # Hint: clinical top-20 dataframe
 + geom_bar(stat='identity', fill='tomato')
 + coord_flip()
 + labs(x='Word', y='Frequency', title='Top 20 Most Frequent Words (Clinical Stopwords Removed)')
 + theme_minimal()
 + theme(figure_size=(10, 6)))
Summarize: How many clinical stopwords are there? What additional words did you remove? Do the results differ significantly when using clinical stopwords vs. general stopwords?
Question 4: Bigrams
Tokenize the stopword-filtered transcriptions into bigrams and visualize the top 20 most frequent bigrams.
def get_bigrams_from_text(text):
    if pd.isna(____):  # Hint: current text input
        return []
    tokens = word_tokenize(text.____())  # Hint: lowercase text
    # Filter tokens using clinical stopwords
    tokens = [t for t in tokens
              if t not in all_stopwords and t.isalpha() and len(t) > ____]  # Hint: minimum length
    return list(____(tokens, 2))  # Hint: bigram helper

all_bigrams = []
for text in mt_samples['transcription']:
    all_bigrams.extend(get_bigrams_from_text(text))

bigram_counts = ____  # Hint: count bigrams
top_20_bigrams = bigram_counts.most_common(20)

# Create dataframe for bigrams
bigram_df = pd.DataFrame([
    {'bigram': ' '.join(bg), 'count': count}
    for bg, count in top_20_bigrams
])

(ggplot(____, aes(x='reorder(bigram, count)', y='count'))  # Hint: bigram dataframe
 + geom_bar(stat='identity', fill='steelblue')
 + coord_flip()
 + labs(x='Bigram', y='Frequency', title='Top 20 Bigrams')
 + theme_minimal()
 + theme(figure_size=(10, 6)))
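For a quick check of what bigrams look like: `nltk.util.ngrams(tokens, 2)` yields the same adjacent pairs as `zip(tokens, tokens[1:])`, so a dependency-free sketch on made-up tokens is:

```python
from collections import Counter

tokens = ['chest', 'pain', 'chest', 'pain', 'left', 'arm']

# Adjacent pairs; equivalent to list(nltk.util.ngrams(tokens, 2))
bigrams = list(zip(tokens, tokens[1:]))

counts = Counter(bigrams)
print(counts.most_common(2))  # ('chest', 'pain') appears twice
```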
Summarize: do the bigrams make sense?
Question 5: Examining bigram words
Using the results from the bigram, pick a word and count the words that appear before and after it, and create a plot of the top 20.
# Pick a word to examine (e.g., 'blood', 'operating', 'diagnosis')
target_word = '____'  # Hint: choose a common token

# Find bigrams containing the target word
before_words = []
after_words = []
for bigram, count in bigram_counts.items():
    if bigram[1] == ____:  # Hint: compare to chosen token
        before_words.extend([bigram[0]] * count)
    if bigram[0] == ____:  # Hint: compare to chosen token
        after_words.extend([bigram[1]] * count)

# Count words
before_counts = Counter(before_words).____(20)  # Hint: top N
after_counts = Counter(after_words).____(20)  # Hint: top N

# Create dataframes
before_df = pd.DataFrame(____, columns=['word', 'count'])  # Hint: counts before target
before_df['position'] = 'before'
after_df = pd.DataFrame(____, columns=['word', 'count'])  # Hint: counts after target
after_df['position'] = 'after'
# Plot words before
(ggplot(____, aes(x='reorder(word, count)', y='count'))  # Hint: dataframe of words before
 + geom_bar(stat='identity', fill='steelblue')
 + coord_flip()
 + labs(x='Word', y='Frequency', title=f'Words Before "{____}"')  # Hint: chosen word
 + theme_minimal()
 + theme(figure_size=(7, 6)))
# Plot words after
(ggplot(____, aes(x='reorder(word, count)', y='count'))  # Hint: dataframe of words after
 + geom_bar(stat='identity', fill='coral')
 + coord_flip()
 + labs(x='Word', y='Frequency', title=f'Words After "{____}"')  # Hint: chosen word
 + theme_minimal()
 + theme(figure_size=(7, 6)))
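The before/after bookkeeping can be verified on a tiny made-up bigram-count dict; the target word and counts below are invented for illustration:

```python
from collections import Counter

# Toy bigram counts; target word is 'pain'
bigram_counts = {
    ('chest', 'pain'): 3,
    ('back', 'pain'): 2,
    ('pain', 'medication'): 4,
}
target = 'pain'

before, after = [], []
for (w1, w2), n in bigram_counts.items():
    if w2 == target:                 # bigram ends with target -> w1 precedes it
        before.extend([w1] * n)
    if w1 == target:                 # bigram starts with target -> w2 follows it
        after.extend([w2] * n)

print(Counter(before).most_common())  # [('chest', 3), ('back', 2)]
print(Counter(after).most_common())   # [('medication', 4)]
```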
Briefly summarize the bigram before and after words.
Question 6: Sentence Tokenization
Tokenize the transcriptions into sentences and analyze sentence-level statistics.
Count the number of sentences per transcription
Calculate the average sentence length (in words) per transcription
Plot the distribution of average sentence length
Make a boxplot of the average sentence length by specialty
# Tokenize transcriptions into sentences
def get_sentence_stats(text):
    if pd.isna(text):
        return {'num_sentences': 0, 'avg_sentence_length': 0}
    sentences = ____(text)  # Hint: sentence tokenizer
    num_sentences = len(sentences)
    if num_sentences == 0:
        return {'num_sentences': 0, 'avg_sentence_length': 0}
    # Calculate average sentence length in words
    sentence_lengths = [len(word_tokenize(sent)) for sent in sentences]
    avg_length = np.mean(sentence_lengths)
    return {'num_sentences': ____, 'avg_sentence_length': ____}  # Hint: use computed values

# Apply to all transcriptions
sentence_stats = mt_samples['transcription'].apply(get_sentence_stats)
mt_samples['num_sentences'] = ____.apply(lambda x: x['num_sentences'])  # Hint: stats series
mt_samples['avg_sentence_length'] = sentence_stats.apply(lambda x: x['avg_sentence_length'])

# Summary statistics
print("Sentence Statistics Summary:")
print(mt_samples[['num_sentences', 'avg_sentence_length']].describe())

# Keep per-document stats (non-empty transcriptions) for the boxplot;
# a boxplot needs per-transcription values, not one mean per specialty
specialty_sentence_stats = mt_samples[mt_samples['avg_sentence_length'] > 0]

(ggplot(____, aes(x='reorder(medical_specialty, avg_sentence_length)', y='avg_sentence_length'))  # Hint: per-document stats dataframe
 + geom_boxplot(fill='coral', alpha=0.7)
 + coord_flip()
 + labs(x='Medical Specialty', y='Average Sentence Length (words)', title='Average Sentence Length by Medical Specialty')
 + theme_minimal()
 + theme(figure_size=(10, 6)))
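Before wiring up sent_tokenize, the per-text statistics can be sanity-checked on one made-up string; a naive period split stands in for NLTK's sentence tokenizer here so the sketch needs no downloads:

```python
text = "Patient presents with cough. Lungs are clear. Follow up in two weeks."

# Naive period split standing in for nltk's sent_tokenize
sentences = [s.strip() for s in text.split('.') if s.strip()]

# Average sentence length in words (whitespace split as a rough word count)
lengths = [len(s.split()) for s in sentences]
stats = {'num_sentences': len(sentences),
         'avg_sentence_length': sum(lengths) / len(lengths)}
print(stats)  # 3 sentences of 4, 3, and 5 words -> average 4.0
```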
Summarize: Do you notice any patterns in sentence length across different medical specialties?
Question 7: Topic Models
See if there are any themes in the data by using a topic model (LDA).
- Remove combined NLTK + clinical stopwords
- Use scikit-learn's CountVectorizer to create a document-term matrix
- Use LatentDirichletAllocation for topic modeling
- Try different k (n_components) values (try 3 and 5)
- Create a visualization of the topics
# Prepare text data
texts = mt_samples['transcription'].dropna().tolist()

vectorizer = CountVectorizer(
    max_df=0.95,  # Ignore terms that appear in >95% of documents (filters common words)
    min_df=2,  # Ignore terms that appear in <2 documents (ignores very rare words)
    stop_words=list(all_stopwords),
    max_features=1000
)
dtm = vectorizer.fit_transform(texts)
n_topics = ____  # Hint: number of topics
lda = LatentDirichletAllocation(
    n_components=____,  # Hint: use topic count
    random_state=42,
    max_iter=10
)
lda.fit(dtm)

feature_names = vectorizer.get_feature_names_out()

# Extract top words for each topic into a dataframe
def get_topic_df(model, feature_names, n_top_words=10):
    rows = []
    for topic_idx, topic in enumerate(model.components_):
        top_indices = topic.argsort()[:-n_top_words - 1:-1]
        for rank, idx in enumerate(top_indices):
            rows.append({
                'topic': f'Topic {topic_idx + 1}',
                'word': feature_names[idx],
                'weight': topic[idx],
                'rank': rank
            })
    return pd.DataFrame(rows)

topic_df = get_topic_df(____, ____, ____)  # Hint: model, vocab list, top words

# Display top words per topic
for topic in topic_df['topic'].unique():
    words = topic_df[topic_df['topic'] == topic]['word'].tolist()
    print(f"{topic}: {', '.join(words)}")

# Visualize topics with faceted plot
(ggplot(____, aes(x='reorder(word, weight)', y='weight'))  # Hint: topic dataframe
 + geom_bar(stat='identity', fill='steelblue')
 + coord_flip()
 + facet_wrap('~topic', scales='free_y', ncol=3)
 + labs(x='Word', y='Weight', title='Top Words per Topic (LDA)')
 + theme_minimal()
 + theme(figure_size=(14, 8), strip_text=element_text(size=10)))
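The argsort slice used in get_topic_df can be checked without fitting a model, on a made-up topic-word weight matrix (both the vocabulary and weights below are invented):

```python
import numpy as np

feature_names = np.array(['pain', 'heart', 'knee', 'blood'])

# Made-up topic-word weights, one row per topic (stand-in for lda.components_)
components = np.array([
    [0.1, 5.0, 0.2, 3.0],   # topic 1: heart, blood dominate
    [4.0, 0.3, 2.5, 0.1],   # topic 2: pain, knee dominate
])

n_top = 2
for idx, topic in enumerate(components):
    # argsort is ascending; the slice takes the last n_top indices, reversed,
    # i.e. the indices of the largest weights in descending order
    top = topic.argsort()[:-n_top - 1:-1]
    print(f"Topic {idx + 1}:", feature_names[top].tolist())
# Topic 1: ['heart', 'blood']
# Topic 2: ['pain', 'knee']
```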
Summarize: is it clearer when there are 3 or 5 topics?
Now do a cross tab of the 5 topics and the medical specialties.
# Assign documents to topics and compare with actual specialties
doc_topic_dist = lda.transform(dtm)
dominant_topics = doc_topic_dist.argmax(axis=1)

# Add to dataframe
mt_samples_with_topics = mt_samples.dropna(subset=['transcription']).copy()
mt_samples_with_topics['dominant_topic'] = [f'Topic {t+1}' for t in dominant_topics]

# Cross-tabulation of topics vs specialties (top 5 specialties)
top_5_specialties = mt_samples_with_topics['medical_specialty'].value_counts().head(____).index.tolist()  # Hint: number of specialties

cross_tab = pd.crosstab(
    mt_samples_with_topics[mt_samples_with_topics['medical_specialty'].isin(____)]['medical_specialty'],
    mt_samples_with_topics[mt_samples_with_topics['medical_specialty'].isin(____)]['dominant_topic']
)
print("\nCross-tabulation of Topics vs Top 5 Specialties:")
print(cross_tab)
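pd.crosstab itself is easy to preview on a toy dataframe with invented specialty/topic labels:

```python
import pandas as pd

# Toy document-level labels standing in for the real assignments
df = pd.DataFrame({
    'medical_specialty': ['Surgery', 'Surgery', 'Cardiology', 'Cardiology'],
    'dominant_topic':    ['Topic 1', 'Topic 1', 'Topic 2', 'Topic 1'],
})

# Rows = specialties, columns = dominant topics, cells = document counts
ct = pd.crosstab(df['medical_specialty'], df['dominant_topic'])
print(ct)
```

A strong diagonal-ish pattern in the real cross tab (each specialty concentrated in one topic column) would suggest the topics align with specialties.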
Summarize: What themes emerge from the topic modeling? Do the topics align with the medical specialties?