Once crawling is finished, all the data is stored in the MongoDB database. We can now query the database and load all the posts into a pandas DataFrame:
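import pandas as pd
from pymongo import MongoClient

# Connect to MongoDB (replace HOST:PORT with the address of your server)
client = MongoClient('HOST:PORT')
db = client.teamspeed
collection = db.forum_teamspeed

# Collect every stored post, then build the DataFrame
dataset = []
for element in collection.find():
    dataset.append(element)
df = pd.DataFrame(dataset)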
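To see what this step produces without a running MongoDB server, the following minimal sketch builds a DataFrame from a hand-written list of dictionaries shaped like the documents that collection.find() returns. The sample records, the names sample_docs and sample_df, and the decision to drop the _id column are assumptions made for illustration; only the subject and post field names come from the text.

```python
import pandas as pd

# Hypothetical documents, shaped like what collection.find() yields.
# Real documents also carry MongoDB's automatically generated "_id" field.
sample_docs = [
    {"_id": 1, "subject": "Best tires for track days", "post": "I switched last month."},
    {"_id": 2, "subject": "Oil change interval", "post": "Every 5,000 miles works for me."},
]

# Each dictionary becomes a row; each key becomes a column.
sample_df = pd.DataFrame(sample_docs)

# "_id" is rarely useful for text analysis, so drop it if it is present.
sample_df = sample_df.drop(columns=["_id"], errors="ignore")

print(sample_df.shape)
print(list(sample_df.columns))
```

Whether to keep _id depends on the analysis; it can be handy later for tracing a row back to its source document.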
At this stage, we will also create a new column called full_verbatim, where we concatenate the subject (thread title) and post content:
df['full_verbatim'] = df.apply(lambda x: x['subject'] + " " + x['post'], axis=1)
The thread title and the post are directly linked, so taken together they are more likely to reflect a single thought of the forum user. Combining them helps us capture the broader, contextual meaning of the ideas expressed in forum posts.
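The row-wise concatenation fails with a TypeError if a subject or post is missing, because a missing value cannot be added to a string. As a hedged alternative, the sketch below joins the two columns with vectorized string operations and treats missing values as empty strings. The sample data and the name demo_df are hypothetical, and the source does not say whether the crawled data actually contains missing fields.

```python
import pandas as pd

# Hypothetical posts; the second one has no thread title.
demo_df = pd.DataFrame({
    "subject": ["Brake upgrade", None],
    "post": ["Which pads do you recommend?", "Thanks for the tip!"],
})

# Replace missing values with empty strings, join the columns with a space,
# then trim the stray space left behind when one side was empty.
demo_df["full_verbatim"] = (
    demo_df["subject"].fillna("") + " " + demo_df["post"].fillna("")
).str.strip()

print(demo_df["full_verbatim"].tolist())
```

On a large collection, the column-wise form also avoids calling a Python function once per row.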