It is always important to analyze a dataset before applying models to it. This section profiles the text data in our dataframe, df, and requires importing functions from pyspark.sql:

    import pyspark.sql.functions as F

The following steps walk through profiling the text data.
- Execute the following script to group by the label column and generate a count distribution:

    df.groupBy("label") \
        .count() \
        .orderBy("count", ascending=False) \
        .show()
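The distribution this script produces can be illustrated in plain Python on toy data (the sample label values below are hypothetical, not from the actual dataset):

```python
from collections import Counter

# Hypothetical sample of label values; the real values live in the Spark dataframe
labels = ["no", "no", "yes", "no", "yes"]

# Counter mirrors what groupBy("label").count() computes
distribution = Counter(labels)

# most_common() returns labels in descending count order, like orderBy("count", ascending=False)
for label, count in distribution.most_common():
    print(label, count)
```
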
- Add a new column, word_count, to the dataframe, df, using the following script:

    import pyspark.sql.functions as F

    df = df.withColumn('word_count', F.size(F.split(F.col('response_text'), ' ')))
- Aggregate the average word count, avg_word_count, by label using the following script:

    df.groupBy('label') \
        .agg(F.avg('word_count').alias('avg_word_count')) \
        .orderBy('avg_word_count', ascending=False) \
        .show()
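The word_count and avg_word_count logic from the last two steps can be sketched in plain Python on toy rows (the sample rows below are hypothetical; only the column names response_text and label come from the actual dataframe):

```python
# Hypothetical rows standing in for the Spark dataframe
rows = [
    {"label": "yes", "response_text": "I really enjoyed this"},
    {"label": "yes", "response_text": "great fun"},
    {"label": "no",  "response_text": "bad"},
]

# F.size(F.split(F.col('response_text'), ' ')) counts space-separated tokens
for row in rows:
    row["word_count"] = len(row["response_text"].split(" "))

# Equivalent of groupBy('label').agg(F.avg('word_count')):
# accumulate (sum, count) per label, then divide
totals = {}
for row in rows:
    s, n = totals.get(row["label"], (0, 0))
    totals[row["label"]] = (s + row["word_count"], n + 1)

avg_word_count = {label: s / n for label, (s, n) in totals.items()}
```

Note that splitting on a single space, as the Spark script does, counts runs of multiple spaces as extra tokens; it is a quick profiling heuristic rather than a rigorous tokenizer.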