Applied Unsupervised Learning with Python

By: Benjamin Johnston, Aaron Jones, Christopher Kruger

Overview of this book

Unsupervised learning is a useful and practical solution in situations where labeled data is not available. Applied Unsupervised Learning with Python guides you in learning the best practices for using unsupervised learning techniques in tandem with Python libraries and extracting meaningful information from unstructured data. The book begins by explaining how basic clustering works to find similar data points in a set. Once you are well versed in the k-means algorithm and how it operates, you’ll learn what dimensionality reduction is and where to apply it. As you progress, you’ll learn various neural network techniques and how they can improve your model. While studying the applications of unsupervised learning, you will also learn how to mine topics that are trending on Twitter and Facebook and build a news recommendation engine for users. Finally, you will be able to put your knowledge to work through interesting activities such as performing a Market Basket Analysis and identifying relationships between different products. By the end of this book, you will have the skills you need to confidently build your own models using Python.

Chapter 7: Topic Modeling


Activity 15: Loading and Cleaning Twitter Data

Solution:

  1. Import the necessary libraries:

    import langdetect
    import matplotlib.pyplot
    import nltk
    import numpy
    import pandas
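    import pyLDAvis
    import pyLDAvis.sklearn  # recent pyLDAvis releases rename this module to pyLDAvis.lda_model
    import regex
    import sklearn
    # import the submodules used later explicitly; "import sklearn" alone does not load them
    import sklearn.decomposition
    import sklearn.feature_extraction.text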
  2. Load the LA Times health Twitter data (latimeshealth.txt) from https://github.com/TrainingByPackt/Applied-Unsupervised-Learning-with-Python/tree/master/Lesson07/Activity15-Activity17:

    Note

    Pay close attention to the delimiter (it is neither a comma nor a tab) and double-check the header status.

    path = '<Path>/latimeshealth.txt'
    df = pandas.read_csv(path, sep="|", header=None)
    df.columns = ["id", "datetime", "tweettext"]
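
To see why the delimiter and header settings matter, the following minimal sketch reads two made-up, pipe-delimited rows from an in-memory string rather than from latimeshealth.txt. The sample rows and the df_sample name are illustrative assumptions, not part of the dataset:

```python
import io
import pandas

# Two made-up rows in an id|datetime|text layout (not real data).
sample = (
    "1|Mon Jan 05 10:00:00 +0000 2015|Flu season peaks early\n"
    "2|Tue Jan 06 11:30:00 +0000 2015|New study on sleep, diet and health\n"
)

# header=None stops pandas from treating the first tweet as column names.
df_sample = pandas.read_csv(io.StringIO(sample), sep="|", header=None)
df_sample.columns = ["id", "datetime", "tweettext"]

print(df_sample.shape)
print(df_sample.loc[1, "tweettext"])
```

Because the text column can contain commas, splitting on a comma would break the second row into extra fields, which is one reason to check the delimiter before loading.
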
  3. Run a quick exploratory analysis to ascertain the data size and structure:

    def dataframe_quick_look(df, nrows):
        # print the dimensions, column names, and first rows of the data frame
        print("SHAPE:\n{shape}\n".format(shape=df.shape))
        print("COLUMN NAMES:\n{names}\n".format(names=df.columns))
        print("HEAD:\n{head}\n".format(head=df.head(nrows)))

    dataframe_quick_look(df, nrows=2)

    The output is as follows:

    Figure 7.54: Shape, column names, and head of data

  4. Extract the tweet text and convert it to a list object:

    raw = df['tweettext'].tolist()
    print("HEADLINES:\n{lines}\n".format(lines=raw[:5]))
    print("LENGTH:\n{length}\n".format(length=len(raw)))

    The output is as follows:

    Figure 7.55: Headlines and their length

  5. Write a function that detects the language of each tweet, tokenizes it on whitespace, and replaces screen names and URLs with SCREENNAME and URL, respectively. The function should also remove punctuation, numbers, and the SCREENNAME and URL placeholders, and convert everything except SCREENNAME and URL to lowercase. Finally, it should remove all stop words, perform lemmatization, and keep only words with five or more letters:

    Note

    Screen names start with the @ symbol.

    def do_language_identifying(txt):
        # langdetect raises an exception on text it cannot classify
        try:
            the_language = langdetect.detect(txt)
        except Exception:
            the_language = 'none'
        return the_language

    def do_lemmatizing(wrd):
        # fall back to the original word when WordNet has no base form
        out = nltk.corpus.wordnet.morphy(wrd)
        return (wrd if out is None else out)

    def do_tweet_cleaning(txt):
        # identify language of tweet
        # return None if language is not English
        lg = do_language_identifying(txt)
        if lg != 'en':
            return None
        # split the string on whitespace
        out = txt.split(' ')
        # identify screen names
        # replace with SCREENNAME
        out = ['SCREENNAME' if i.startswith('@') else i for i in out]
        # identify urls
        # replace with URL
        out = ['URL' if bool(regex.search('http[s]?://', i)) else i for i in out]
        # remove all punctuation
        out = [regex.sub('[^\\w\\s]|\n', '', i) for i in out]
        # make all non-keywords lowercase
        keys = ['SCREENNAME', 'URL']
        out = [i.lower() if i not in keys else i for i in out]
        # remove keywords
        out = [i for i in out if i not in keys]
        # remove stopwords
        list_stop_words = nltk.corpus.stopwords.words('english')
        list_stop_words = [regex.sub('[^\\w\\s]', '', i) for i in list_stop_words]
        out = [i for i in out if i not in list_stop_words]
        # lemmatizing
        out = [do_lemmatizing(i) for i in out]
        # keep words 5 or more characters long
        out = [i for i in out if len(i) >= 5]
        return out
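
The token-level steps of the cleaning function can be traced on a single string with the following minimal sketch. It uses only the standard library and skips language detection and lemmatization. The clean_tokens name, the tiny stop-word set, and the example tweet are assumptions made for illustration; the activity itself relies on langdetect and the NLTK stop-word list:

```python
import re

# Hypothetical, tiny stop-word set used only for this illustration.
STOP_WORDS = {"the", "a", "of", "to", "and", "in", "is", "for", "about"}

def clean_tokens(text, min_len=5):
    # split on single spaces, as the activity code does
    tokens = text.split(" ")
    # mask screen names and URLs with placeholder keywords
    tokens = ["SCREENNAME" if t.startswith("@") else t for t in tokens]
    tokens = ["URL" if re.search(r"https?://", t) else t for t in tokens]
    # strip punctuation and newlines inside each token
    tokens = [re.sub(r"[^\w\s]|\n", "", t) for t in tokens]
    # drop the placeholders and lowercase everything else
    keys = {"SCREENNAME", "URL"}
    tokens = [t.lower() for t in tokens if t not in keys]
    # remove stop words, then keep words of min_len or more characters
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [t for t in tokens if len(t) >= min_len]

example = "@latimes Five myths about the flu vaccine: http://lat.ms/abc #health"
print(clean_tokens(example))  # ['myths', 'vaccine', 'health']
```

Short words such as "five" and "flu" are dropped by the length filter, so the threshold has a direct effect on which terms reach the topic model.
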
  6. Apply the function defined in step 5 to every tweet:

    clean = list(map(do_tweet_cleaning, raw))
  7. Remove the elements of the output list that are equal to None:

    clean = [i for i in clean if i is not None]
    print("HEADLINES:\n{lines}\n".format(lines=clean[:5]))
    print("LENGTH:\n{length}\n".format(length=len(clean)))

    The output is as follows:

    Figure 7.56: Headline and length after removing None

  8. Turn the elements of each tweet back into a string. Concatenate using white space:

    clean_sentences = [" ".join(i) for i in clean]
    print(clean_sentences[0:10])

    The first 10 elements of the output list should resemble the following:

    Figure 7.57: Tweets cleaned for modeling

  9. Keep the notebook open for future modeling.

Activity 16: Latent Dirichlet Allocation and Health Tweets

Solution:

  1. Specify the number_words, number_docs, and number_features variables:

    number_words = 10
    number_docs = 10
    number_features = 1000
  2. Create a bag-of-words model and assign the feature names to another variable for use later on:

    vectorizer1 = sklearn.feature_extraction.text.CountVectorizer(
        analyzer="word",
        max_df=0.95, 
        min_df=10, 
        max_features=number_features
    )
    clean_vec1 = vectorizer1.fit_transform(clean_sentences)
    print(clean_vec1[0])
    
    # scikit-learn 1.2+ removed get_feature_names(); use get_feature_names_out() there
    feature_names_vec1 = vectorizer1.get_feature_names()

    The output is as follows:

    (0, 320)    1
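
The printed entry reads as (document index, feature index) followed by a count: document 0 contains feature 320 once. The following minimal sketch builds the same kind of sparse representation by hand, assuming that feature indices follow the alphabetical order of the vocabulary. The two documents and the docs, vocab, column, and rows names are made up for illustration:

```python
from collections import Counter

docs = ["vaccine study vaccine", "cancer study"]  # made-up cleaned tweets

# Assign each word a column index in alphabetical order.
vocab = sorted({word for doc in docs for word in doc.split()})
column = {word: i for i, word in enumerate(vocab)}

# Store only the nonzero counts of each document, keyed by column index.
rows = []
for doc in docs:
    counts = Counter(doc.split())
    rows.append({column[word]: n for word, n in counts.items()})

for doc_id, row in enumerate(rows):
    for col in sorted(row):
        print((doc_id, col), row[col])
```
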
  3. Identify the optimal number of topics:

    def perplexity_by_ntopic(data, ntopics):
        output_dict = {
            "Number Of Topics": [], 
            "Perplexity Score": []
        }
        for t in ntopics:
            lda = sklearn.decomposition.LatentDirichletAllocation(
                n_components=t,
                learning_method="online",
                random_state=0
            )
            lda.fit(data)
            output_dict["Number Of Topics"].append(t)
            output_dict["Perplexity Score"].append(lda.perplexity(data))
        output_df = pandas.DataFrame(output_dict)
        index_min_perplexity = output_df["Perplexity Score"].idxmin()
        output_num_topics = output_df.loc[
            index_min_perplexity,  # index
            "Number Of Topics"  # column
        ]
        return (output_df, output_num_topics)
    df_perplexity, optimal_num_topics = perplexity_by_ntopic(
        clean_vec1, 
        ntopics=[i for i in range(1, 21) if i % 2 == 0]
    )
    print(df_perplexity)

    The output is as follows:

    Figure 7.58: Number of topics versus perplexity score data frame
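
Perplexity is generally defined as the exponential of the negative average log-likelihood per word, so lower values indicate a better fit. The following minimal sketch applies that definition to a list of per-token probabilities. The perplexity function here is a hypothetical helper for illustration; scikit-learn computes its score from an approximate bound on the likelihood rather than from exact token probabilities:

```python
import math

def perplexity(token_probs):
    # exp of the negative mean log-probability over all tokens
    log_lik = sum(math.log(p) for p in token_probs)
    return math.exp(-log_lik / len(token_probs))

# A model that spreads its probability evenly over 4 words is as
# "confused" as a fair 4-way choice, so the result is about 4.
uniform = perplexity([0.25] * 8)

# A model that assigns higher probability to the observed tokens scores lower.
confident = perplexity([0.5] * 8)

print(uniform, confident)
```
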

  4. Fit the LDA model using the optimal number of topics:

    lda = sklearn.decomposition.LatentDirichletAllocation(
        n_components=optimal_num_topics,
        learning_method="online",
        random_state=0
    )
    lda.fit(clean_vec1)

    The output is as follows:

    Figure 7.59: LDA model

  5. Create and print the word-topic table:

    def get_topics(mod, vec, names, docs, ndocs, nwords):
        # word to topic matrix
        W = mod.components_
        W_norm = W / W.sum(axis=1)[:, numpy.newaxis]
        # topic to document matrix
        H = mod.transform(vec)
        W_dict = {}
        H_dict = {}
        for tpc_idx, tpc_val in enumerate(W_norm):
            topic = "Topic{}".format(tpc_idx)
            # formatting w
            W_indices = tpc_val.argsort()[::-1][:nwords]
            W_names_values = [
                (round(tpc_val[j], 4), names[j]) 
                for j in W_indices
            ]
            W_dict[topic] = W_names_values
            # formatting h
            H_indices = H[:, tpc_idx].argsort()[::-1][:ndocs]
            H_names_values = [
                (round(H[:, tpc_idx][j], 4), docs[j]) 
                for j in H_indices
            ]
            H_dict[topic] = H_names_values
        W_df = pandas.DataFrame(
            W_dict, 
            index=["Word" + str(i) for i in range(nwords)]
        )
        H_df = pandas.DataFrame(
            H_dict,
            index=["Doc" + str(i) for i in range(ndocs)]
        )
        return (W_df, H_df)
    
    W_df, H_df = get_topics(
        mod=lda,
        vec=clean_vec1,
        names=feature_names_vec1,
        docs=raw,
        ndocs=number_docs, 
        nwords=number_words
    )
    print(W_df)

    The output is as follows:

    Figure 7.60: Word-topic table for the health tweet data
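
The two numerical steps inside get_topics, row normalization and selection of the highest-weighted words, can be checked on a small matrix. The following sketch uses a made-up vocabulary and made-up weights; only the normalization and argsort expressions come from the solution code:

```python
import numpy

names = ["cancer", "study", "vaccine"]   # hypothetical vocabulary
W = numpy.array([[2.0, 1.0, 5.0],
                 [4.0, 3.0, 1.0]])       # made-up topic-by-word weights

# Divide each row by its sum so every topic's weights add up to 1.
W_norm = W / W.sum(axis=1)[:, numpy.newaxis]

nwords = 2
top_words = []
for tpc_val in W_norm:
    # argsort is ascending, so reverse it and keep the first nwords indices
    idx = tpc_val.argsort()[::-1][:nwords]
    top_words.append([(round(float(tpc_val[j]), 4), names[j]) for j in idx])

print(top_words)
```

After normalization each row can be read as a distribution over words, which is what makes the values comparable across topics.
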

  6. Print the document-topic table:

    print(H_df)

    The output is as follows:

    Figure 7.61: Document topic table

  7. Create a biplot visualization:

    lda_plot = pyLDAvis.sklearn.prepare(lda, clean_vec1, vectorizer1, R=10)
    pyLDAvis.display(lda_plot)

    Figure 7.62: A histogram and biplot for the LDA model trained on health tweets

  8. Keep the notebook open for future modeling.

Activity 17: Non-Negative Matrix Factorization

Solution:

  1. Create the appropriate bag-of-words model and output the feature names as another variable:

    vectorizer2 = sklearn.feature_extraction.text.TfidfVectorizer(
        analyzer="word",
        max_df=0.5, 
        min_df=20, 
        max_features=number_features,
        smooth_idf=False
    )
    clean_vec2 = vectorizer2.fit_transform(clean_sentences)
    print(clean_vec2[0])
    
    # scikit-learn 1.2+ removed get_feature_names(); use get_feature_names_out() there
    feature_names_vec2 = vectorizer2.get_feature_names()
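
The smooth_idf=False setting changes how the inverse document frequency is computed. The following minimal sketch compares the two formulas given in the scikit-learn documentation; the function names are hypothetical helpers written for this comparison, and the row normalization that TfidfVectorizer applies afterward is left out:

```python
import math

def idf_unsmoothed(n_docs, doc_freq):
    # smooth_idf=False: idf = ln(n / df) + 1
    return math.log(n_docs / doc_freq) + 1.0

def idf_smoothed(n_docs, doc_freq):
    # smooth_idf=True: idf = ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1.0

# With 4 documents, compare a word found in every document
# with a word found in only one.
for df in (4, 1):
    print(df, idf_unsmoothed(4, df), idf_smoothed(4, df))
```

A word that appears in every document gets a weight of 1 under both formulas, while a rare word is weighted more heavily without smoothing.
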
  2. Define and fit the NMF algorithm using the number of topics (n_components) value from Activity 16:

    nmf = sklearn.decomposition.NMF(
        n_components=optimal_num_topics,
        init="nndsvda",
        solver="mu",
        beta_loss="frobenius",
        random_state=0, 
        alpha=0.1,  # scikit-learn 1.2+ uses alpha_W / alpha_H instead of alpha
        l1_ratio=0.5
    )
    nmf.fit(clean_vec2)

    The output is as follows:

    Figure 7.63: Defining the NMF model
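
The solver="mu" option selects multiplicative updates. The following minimal sketch shows the classic multiplicative update rules for the Frobenius loss on a small made-up matrix. It leaves out the regularization controlled by alpha and l1_ratio and uses a random start instead of the nndsvda initialization, so it illustrates the idea rather than the library's implementation. The nmf_multiplicative, doc_topic, and topic_word names are assumptions:

```python
import numpy

def nmf_multiplicative(V, n_components, n_iter=200, seed=0, eps=1e-10):
    # Factor V (documents x words) into two non-negative matrices.
    rng = numpy.random.default_rng(seed)
    n_docs, n_words = V.shape
    doc_topic = rng.random((n_docs, n_components)) + eps
    topic_word = rng.random((n_components, n_words)) + eps
    errors = []
    for _ in range(n_iter):
        # Each update multiplies by a non-negative ratio, so no entry
        # can ever become negative.
        topic_word *= (doc_topic.T @ V) / (doc_topic.T @ doc_topic @ topic_word + eps)
        doc_topic *= (V @ topic_word.T) / (doc_topic @ topic_word @ topic_word.T + eps)
        errors.append(numpy.linalg.norm(V - doc_topic @ topic_word))
    return doc_topic, topic_word, errors

# Made-up matrix with two groups of documents that share vocabulary.
V = numpy.array([[3.0, 2.0, 0.0, 0.0],
                 [2.0, 3.0, 0.0, 0.0],
                 [0.0, 0.0, 4.0, 1.0],
                 [0.0, 0.0, 1.0, 4.0]])

doc_topic, topic_word, errors = nmf_multiplicative(V, n_components=2)
print(errors[0], errors[-1])
```

The reconstruction error should shrink over the iterations while both factors stay non-negative, which is what lets the rows of the second factor be read as word weights per topic.
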

  3. Get the topic-document and word-topic result tables. Take a few minutes to explore the word groupings and try to define the abstract topics:

    W_df, H_df = get_topics(
        mod=nmf,
        vec=clean_vec2,
        names=feature_names_vec2,
        docs=raw,
        ndocs=number_docs, 
        nwords=number_words
    )
    
    print(W_df)

    Figure 7.64: The word-topic table with probabilities

  4. Adjust the model parameters and rerun step 2 and step 3.