A stop word is a very common word in the English language that is often removed in NLP workflows because it carries little meaning on its own and can distract from more informative terms. Common stop words include words such as "the" and "and".
This section requires importing the following libraries:
from pyspark.ml.feature import StopWordsRemover
from pyspark.ml import Pipeline
This section walks through the steps to remove stop words.
- Execute the following script to split each value in the chat column into an array of word strings, stored in a new words column:
df = df.withColumn('words',F.split(F.col('chat'),' '))
- Assign a list of common words that will be treated as stop words to a variable, stop_words, using the following script:
stop_words = ['i','me','my','myself','we','our','ours','ourselves', 'you','your','yours','yourself','yourselves','he','him', 'his','himself','she','her','hers','herself','it','its', 'itself','they','them','their','theirs','themselves', 'what','which','who','whom','this','that','these','those...
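To make the effect of these two steps concrete, the following is a minimal plain-Python sketch (no Spark involved) of what splitting on a space and filtering against a stop word list does to a single message. The sample message and the short sample_stop_words list are made up for illustration and are not taken from the recipe's data or its full list.

```python
# Plain-Python illustration only; the recipe itself performs these steps on a Spark DataFrame.
message = "i think this is the best answer"    # hypothetical sample chat message
sample_stop_words = ["i", "this", "is", "the"]  # hypothetical, much shorter than stop_words

# Step 1: split the message on single spaces, similar in spirit to F.split(F.col('chat'), ' ')
words = message.split(" ")

# Step 2: keep only the words that are not in the stop word list
stop_set = set(sample_stop_words)  # a set gives fast membership checks
filtered = [w for w in words if w.lower() not in stop_set]

print(words)
print(filtered)
```

Only the words outside the list survive the filter, which is the behavior the stop word list is meant to drive once it is applied to the words column.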