#### Overview of this book

Unsupervised learning is a useful and practical solution in situations where labeled data is not available. Applied Unsupervised Learning with Python guides you in learning the best practices for using unsupervised learning techniques in tandem with Python libraries and extracting meaningful information from unstructured data. The book begins by explaining how basic clustering works to find similar data points in a set. Once you are well-versed with the k-means algorithm and how it operates, you’ll learn what dimensionality reduction is and where to apply it. As you progress, you’ll learn various neural network techniques and how they can improve your model. While studying the applications of unsupervised learning, you will also understand how to mine topics that are trending on Twitter and Facebook and build a news recommendation engine for users. Finally, you will be able to put your knowledge to work through interesting activities such as performing a Market Basket Analysis and identifying relationships between different products. By the end of this book, you will have the skills you need to confidently build your own models using Python.

## Chapter 7: Topic Modeling

Solution:

1. Import the necessary libraries:

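```python
import langdetect
import matplotlib.pyplot
import nltk
import numpy
import pandas
import pyLDAvis
import pyLDAvis.sklearn  # renamed to pyLDAvis.lda_model in pyLDAvis 3.4+
import regex
import sklearn
# `import sklearn` alone does not load the submodules used later
import sklearn.decomposition
import sklearn.feature_extraction.text

# one-time NLTK data downloads, if not already present:
# nltk.download('stopwords'); nltk.download('wordnet')
```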

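2. Load the LA Times health tweet data from latimeshealth.txt:

### Note

Pay close attention to the delimiter (it is neither a comma nor a tab) and double-check the header status.

```python
path = '<Path>/latimeshealth.txt'
# the file is pipe-delimited and has no header row
df = pandas.read_csv(path, sep="|", header=None)
df.columns = ["id", "datetime", "tweettext"]
```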
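
One way to act on the note about delimiters and headers is to inspect the first few lines of a file before choosing the read options. The sketch below counts candidate delimiter characters per line; the sample lines and the `guess_delimiter` helper are invented for illustration and are not part of the activity.

```python
import io

def guess_delimiter(lines, candidates=("|", ",", "\t", ";")):
    """Return the first candidate that occurs the same non-zero number of times on every line."""
    for delim in candidates:
        counts = {line.count(delim) for line in lines}
        if len(counts) == 1 and counts != {0}:
            return delim
    return None

# made-up sample lines standing in for the start of a data file
sample = io.StringIO(
    "1|Mon Jan 05 10:00:00 +0000 2015|First made-up headline\n"
    "2|Mon Jan 05 11:00:00 +0000 2015|Second made-up headline\n"
)
first_lines = [line.rstrip("\n") for line in sample]
delimiter = guess_delimiter(first_lines)
print(repr(delimiter))
# a first line that looks like data rather than column names suggests header=None
print(first_lines[0].split(delimiter))
```
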
3. Run a quick exploratory analysis to ascertain the data size and structure:

```python
def dataframe_quick_look(df, nrows):
    print("SHAPE:\n{shape}\n".format(shape=df.shape))
    print("COLUMN NAMES:\n{names}\n".format(names=df.columns))
    print("HEAD:\n{head}\n".format(head=df.head(nrows)))

dataframe_quick_look(df, nrows=2)
```

The output is as follows:

Figure 7.54: Shape, column names, and head of data

4. Extract the tweet text and convert it to a list object:

```python
raw = df['tweettext'].tolist()
print("HEADLINES:\n{lines}\n".format(lines=raw[:5]))
print("LENGTH:\n{length}\n".format(length=len(raw)))
```

The output is as follows:

Figure 7.55: Headlines and their length

5. Write a function that performs language detection, tokenizes on whitespace, and replaces screen names and URLs with SCREENNAME and URL, respectively. The function should also remove punctuation, numbers, and the SCREENNAME and URL replacements; convert everything to lowercase, except SCREENNAME and URL; remove all stop words; perform lemmatization; and keep only words with five or more letters:

```python
def do_language_identifying(txt):
    # langdetect raises an exception on empty or undetectable text
    try:
        the_language = langdetect.detect(txt)
    except Exception:
        the_language = 'none'
    return the_language

def do_lemmatizing(wrd):
    # fall back to the original word when WordNet has no base form
    out = nltk.corpus.wordnet.morphy(wrd)
    return (wrd if out is None else out)

def do_tweet_cleaning(txt):
    # identify language of tweet
    # return None if language is not English
    lg = do_language_identifying(txt)
    if lg != 'en':
        return None
    # split the string on whitespace
    out = txt.split(' ')
    # identify screen names
    # replace with SCREENNAME
    out = ['SCREENNAME' if i.startswith('@') else i for i in out]
    # identify urls
    # replace with URL
    out = ['URL' if bool(regex.search('http[s]?://', i)) else i for i in out]
    # remove all punctuation
    out = [regex.sub('[^\\w\\s]|\n', '', i) for i in out]
    # make all non-keywords lowercase
    keys = ['SCREENNAME', 'URL']
    out = [i.lower() if i not in keys else i for i in out]
    # remove keywords
    out = [i for i in out if i not in keys]
    # remove stopwords
    list_stop_words = nltk.corpus.stopwords.words('english')
    list_stop_words = [regex.sub('[^\\w\\s]', '', i) for i in list_stop_words]
    out = [i for i in out if i not in list_stop_words]
    # lemmatizing
    out = [do_lemmatizing(i) for i in out]
    # keep words 5 or more characters long
    out = [i for i in out if len(i) >= 5]
    return out
```
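
To see what the token-level steps do without the NLTK and langdetect dependencies, the following sketch applies the screen-name, URL, punctuation, lowercase, and length filters to a made-up tweet. It uses the standard-library `re` module instead of `regex` and skips language detection, stop-word removal, and lemmatization, so it is a simplified stand-in for `do_tweet_cleaning`, not a replacement. The name `clean_tokens_sketch` and the example text are hypothetical.

```python
import re

def clean_tokens_sketch(txt, min_len=5):
    # split on single spaces, as in the activity code
    out = txt.split(' ')
    # mark screen names and URLs with placeholder keywords
    out = ['SCREENNAME' if t.startswith('@') else t for t in out]
    out = ['URL' if re.search(r'https?://', t) else t for t in out]
    # strip punctuation, then lowercase everything except the keywords
    out = [re.sub(r'[^\w\s]|\n', '', t) for t in out]
    keys = ['SCREENNAME', 'URL']
    out = [t.lower() if t not in keys else t for t in out]
    # drop the keywords and any token shorter than min_len
    out = [t for t in out if t not in keys]
    return [t for t in out if len(t) >= min_len]

# made-up example tweet
example = "@someone Doctors warn: Vitamin supplements aren't always safe http://example.com/abc"
print(clean_tokens_sketch(example))
```

Short tokens such as "warn" and "safe" fall below the five-character threshold and are dropped, which is the same trade-off the activity code makes.
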
6. Apply the function defined in step 5 to every tweet:

`clean = list(map(do_tweet_cleaning, raw))`
7. Remove elements of output list equal to None:

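```python
# an explicit identity check; filter(None.__ne__, ...) relies on the
# truthiness of NotImplemented, which newer Python versions reject
clean = [i for i in clean if i is not None]
print("LENGTH:\n{length}\n".format(length=len(clean)))
```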

The output is as follows:

Figure 7.56: Headline and length after removing None

8. Turn the elements of each tweet back into a string. Concatenate using white space:

```python
clean_sentences = [" ".join(i) for i in clean]
print(clean_sentences[0:10])
```

The first 10 elements of the output list should resemble the following:

Figure 7.57: Tweets cleaned for modeling

9. Keep the notebook open for future modeling.

### Activity 16: Latent Dirichlet Allocation and Health Tweets

Solution:

1. Specify the number_words, number_docs, and number_features variables:

```python
number_words = 10
number_docs = 10
number_features = 1000
```
2. Create a bag-of-words model and assign the feature names to another variable for use later on:

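```python
vectorizer1 = sklearn.feature_extraction.text.CountVectorizer(
    analyzer="word",
    max_df=0.95,
    min_df=10,
    max_features=number_features
)
clean_vec1 = vectorizer1.fit_transform(clean_sentences)
print(clean_vec1[0])

# get_feature_names() was removed in scikit-learn 1.2;
# on releases older than 1.0, call vectorizer1.get_feature_names() instead
feature_names_vec1 = vectorizer1.get_feature_names_out()
```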

The output is as follows:

`(0, 320)    1`
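
The printed entry is a sparse-matrix coordinate: row 0 (the first document) has a count of 1 in column 320. The sketch below builds the same kind of document-term counts in plain Python for a made-up corpus, with document-frequency filters analogous to `min_df` and `max_df`. It illustrates the idea under simplified assumptions (whitespace tokens, `min_df` as an absolute count, `max_df` as a proportion) and is not scikit-learn's implementation; `bag_of_words_sketch` is a hypothetical name.

```python
from collections import Counter

def bag_of_words_sketch(docs, min_df=1, max_df=1.0):
    tokenized = [doc.split() for doc in docs]
    # document frequency: number of documents containing each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    vocab = sorted(
        term for term, df in doc_freq.items()
        if df >= min_df and df / n_docs <= max_df
    )
    index = {term: col for col, term in enumerate(vocab)}
    # sparse rows: {column index: count} for each document
    rows = []
    for tokens in tokenized:
        counts = Counter(t for t in tokens if t in index)
        rows.append({index[t]: c for t, c in counts.items()})
    return vocab, rows

# made-up corpus
docs = [
    "health study finds health benefits",
    "study links sleep health",
    "sleep study results",
]
vocab, rows = bag_of_words_sketch(docs, min_df=2, max_df=0.95)
print(vocab)
print(rows)
```

Here "study" appears in every document and is removed by the `max_df` filter, while terms that occur in only one document are removed by `min_df`.
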
3. Identify the optimal number of topics:

```python
def perplexity_by_ntopic(data, ntopics):
    output_dict = {
        "Number Of Topics": [],
        "Perplexity Score": []
    }
    for t in ntopics:
        lda = sklearn.decomposition.LatentDirichletAllocation(
            n_components=t,
            learning_method="online",
            random_state=0
        )
        lda.fit(data)
        output_dict["Number Of Topics"].append(t)
        output_dict["Perplexity Score"].append(lda.perplexity(data))
    output_df = pandas.DataFrame(output_dict)
    index_min_perplexity = output_df["Perplexity Score"].idxmin()
    output_num_topics = output_df.loc[
        index_min_perplexity,  # index
        "Number Of Topics"  # column
    ]
    return (output_df, output_num_topics)

df_perplexity, optimal_num_topics = perplexity_by_ntopic(
    clean_vec1,
    ntopics=[i for i in range(1, 21) if i % 2 == 0]
)
print(df_perplexity)
```

The output is as follows:

Figure 7.58: Number of topics versus perplexity score data frame
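
Perplexity is the exponential of the negative average log-likelihood per word, so lower values mean the model finds the observed words less surprising. The following sketch computes it for a simple unigram model to make the definition concrete. scikit-learn's `perplexity` method works from an approximate bound on the LDA likelihood instead, and the function and sample data here are illustrative only.

```python
import math

def unigram_perplexity(tokens, word_probs):
    # average log-likelihood per word, then exponentiate its negative
    log_likelihood = sum(math.log(word_probs[t]) for t in tokens)
    return math.exp(-log_likelihood / len(tokens))

# made-up example: four equally likely words
uniform = {"sleep": 0.25, "study": 0.25, "heart": 0.25, "cancer": 0.25}
tokens = ["sleep", "study", "sleep", "heart"]
print(unigram_perplexity(tokens, uniform))  # close to 4, the vocabulary size

# a model that puts more probability on the observed words scores lower
skewed = {"sleep": 0.5, "study": 0.2, "heart": 0.2, "cancer": 0.1}
print(unigram_perplexity(tokens, skewed))
```
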

4. Fit the LDA model using the optimal number of topics:

```python
lda = sklearn.decomposition.LatentDirichletAllocation(
    n_components=optimal_num_topics,
    learning_method="online",
    random_state=0
)
lda.fit(clean_vec1)
```

The output is as follows:

Figure 7.59: LDA model

5. Create and print the word-topic table:

```python
def get_topics(mod, vec, names, docs, ndocs, nwords):
    # word to topic matrix
    W = mod.components_
    W_norm = W / W.sum(axis=1)[:, numpy.newaxis]
    # topic to document matrix
    H = mod.transform(vec)
    W_dict = {}
    H_dict = {}
    for tpc_idx, tpc_val in enumerate(W_norm):
        topic = "Topic{}".format(tpc_idx)
        # formatting w
        W_indices = tpc_val.argsort()[::-1][:nwords]
        W_names_values = [
            (round(tpc_val[j], 4), names[j])
            for j in W_indices
        ]
        W_dict[topic] = W_names_values
        # formatting h
        H_indices = H[:, tpc_idx].argsort()[::-1][:ndocs]
        H_names_values = [
            (round(H[:, tpc_idx][j], 4), docs[j])
            for j in H_indices
        ]
        H_dict[topic] = H_names_values
    W_df = pandas.DataFrame(
        W_dict,
        index=["Word" + str(i) for i in range(nwords)]
    )
    H_df = pandas.DataFrame(
        H_dict,
        index=["Doc" + str(i) for i in range(ndocs)]
    )
    return (W_df, H_df)

W_df, H_df = get_topics(
    mod=lda,
    vec=clean_vec1,
    names=feature_names_vec1,
    # rows of H follow clean_sentences, so raw lines up with them
    # only if no tweets were dropped during cleaning
    docs=raw,
    ndocs=number_docs,
    nwords=number_words
)
print(W_df)
```

The output is as follows:

Figure 7.60: Word-topic table for the health tweet data
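
The core of `get_topics` is two operations: normalizing each topic's row of word weights so that it sums to 1, and sorting to pick the highest-weighted words. The sketch below shows both on a small made-up weight matrix in plain Python; the numbers, the vocabulary, and the `top_words_per_topic` name are invented for illustration.

```python
def top_words_per_topic(weights, names, nwords):
    table = {}
    for topic_idx, row in enumerate(weights):
        total = sum(row)
        # normalize so the topic's word weights sum to 1
        normalized = [w / total for w in row]
        # indices of the largest weights, highest first
        order = sorted(range(len(row)), key=lambda j: normalized[j], reverse=True)
        table["Topic{}".format(topic_idx)] = [
            (round(normalized[j], 4), names[j]) for j in order[:nwords]
        ]
    return table

# made-up topic-by-word weights
names = ["sleep", "heart", "cancer", "study"]
weights = [
    [6.0, 1.0, 1.0, 2.0],
    [0.5, 3.0, 1.0, 0.5],
]
print(top_words_per_topic(weights, names, nwords=2))
```
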

6. Print the document-topic table:

`print(H_df)`

The output is as follows:

Figure 7.61: Document topic table

7. Create a biplot visualization:

```python
# pyLDAvis 3.4+ renamed this module; use pyLDAvis.lda_model.prepare there
lda_plot = pyLDAvis.sklearn.prepare(lda, clean_vec1, vectorizer1, R=10)
pyLDAvis.display(lda_plot)
```

Figure 7.62: A histogram and biplot for the LDA model trained on health tweets

8. Keep the notebook open for future modeling.

### Activity 17: Non-Negative Matrix Factorization

Solution:

1. Create the appropriate bag-of-words model and output the feature names as another variable:

```python
vectorizer2 = sklearn.feature_extraction.text.TfidfVectorizer(
    analyzer="word",
    max_df=0.5,
    min_df=20,
    max_features=number_features,
    smooth_idf=False
)
clean_vec2 = vectorizer2.fit_transform(clean_sentences)
print(clean_vec2[0])

# get_feature_names() was removed in scikit-learn 1.2;
# on releases older than 1.0, call vectorizer2.get_feature_names() instead
feature_names_vec2 = vectorizer2.get_feature_names_out()
```
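
With `smooth_idf=False`, scikit-learn computes the inverse document frequency as idf(t) = ln(n / df(t)) + 1, multiplies it by the raw term count, and then scales each document vector to unit Euclidean length (the default `norm="l2"`). A plain-Python sketch of that weighting for a made-up corpus follows. It assumes whitespace tokens and ignores the `min_df`, `max_df`, and `max_features` filters; `tfidf_sketch` is a hypothetical name.

```python
import math
from collections import Counter

def tfidf_sketch(docs):
    tokenized = [doc.split() for doc in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # unsmoothed idf: ln(n / df) + 1
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        weights = {t: c * idf[t] for t, c in Counter(tokens).items()}
        # scale the document vector to unit Euclidean (L2) length
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return idf, vectors

# made-up corpus
docs = ["sleep study", "heart study", "sleep study heart risk"]
idf, vectors = tfidf_sketch(docs)
print(idf)
print(vectors[2])
```

A term that appears in every document ("study") gets the minimum idf of 1, while a term unique to one document ("risk") gets the largest weight.
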
2. Define and fit the NMF algorithm using the number of topics (n_components) value from Activity 16:

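```python
nmf = sklearn.decomposition.NMF(
    n_components=optimal_num_topics,
    init="nndsvda",
    solver="mu",
    beta_loss="frobenius",
    random_state=0,
    # `alpha` was removed in scikit-learn 1.2; newer releases use
    # alpha_W and alpha_H, which are scaled differently
    alpha=0.1,
    l1_ratio=0.5
)
nmf.fit(clean_vec2)
```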

The output is as follows:

Figure 7.63: Defining the NMF model
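
The `solver="mu"` and `beta_loss="frobenius"` settings select multiplicative updates that reduce the squared reconstruction error between the document-term matrix and the product of two non-negative factors. The sketch below implements the classic Lee and Seung form of these updates in plain Python on a tiny made-up matrix. It omits the regularization terms (`alpha`, `l1_ratio`) and the NNDSVD initialization, so it is a simplified illustration rather than scikit-learn's solver; `a` and `b` are hypothetical names for the document-topic and topic-word factors.

```python
import random

def matmul(x, y):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*y)] for row in x]

def transpose(x):
    return [list(col) for col in zip(*x)]

def frobenius_error(x, a, b):
    approx = matmul(a, b)
    return sum((xv - av) ** 2 for xr, ar in zip(x, approx) for xv, av in zip(xr, ar))

def nmf_mu_sketch(x, n_components, n_iter=200, seed=0, eps=1e-10):
    rng = random.Random(seed)
    n_rows, n_cols = len(x), len(x[0])
    # random positive starting factors
    a = [[rng.random() + 0.1 for _ in range(n_components)] for _ in range(n_rows)]
    b = [[rng.random() + 0.1 for _ in range(n_cols)] for _ in range(n_components)]
    errors = [frobenius_error(x, a, b)]
    for _ in range(n_iter):
        # b <- b * (a^T x) / (a^T a b), element-wise
        at = transpose(a)
        num, den = matmul(at, x), matmul(matmul(at, a), b)
        b = [[bv * n / (d + eps) for bv, n, d in zip(br, nr, dr)]
             for br, nr, dr in zip(b, num, den)]
        # a <- a * (x b^T) / (a b b^T), element-wise
        bt = transpose(b)
        num, den = matmul(x, bt), matmul(a, matmul(b, bt))
        a = [[av * n / (d + eps) for av, n, d in zip(ar, nr, dr)]
             for ar, nr, dr in zip(a, num, den)]
        errors.append(frobenius_error(x, a, b))
    return a, b, errors

# made-up 4-document, 4-term matrix with two obvious groups of terms
x = [
    [2.0, 1.0, 0.0, 0.0],
    [4.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 3.0],
    [0.0, 0.0, 2.0, 6.0],
]
a, b, errors = nmf_mu_sketch(x, n_components=2)
print(errors[0], errors[-1])
```

Because every update multiplies a non-negative entry by a non-negative ratio, the factors stay non-negative without any explicit constraint handling.
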

3. Get the topic-document and word-topic result tables. Take a few minutes to explore the word groupings and try to define the abstract topics:

```python
W_df, H_df = get_topics(
    mod=nmf,
    vec=clean_vec2,
    names=feature_names_vec2,
    docs=raw,
    ndocs=number_docs,
    nwords=number_words
)

print(W_df)
```

Figure 7.64: The word-topic table with probabilities

4. Adjust the model parameters and rerun step 2 and step 3.