In classification, we want to find the corresponding classes, sometimes also called labels, for given data instances. To be able to achieve this, we need to answer two questions:
How should we represent the data instances?
Which model or structure should our classifier possess?
In its simplest form, in our case, the data instance is the text of the answer and the label would be a binary value indicating whether the asker accepted this text as an answer or not. Raw text, however, is a very inconvenient representation to process for most machine learning algorithms. They want numbers. And it will be our task to extract useful features from the raw text, which the machine learning algorithm can then use to learn the right label for it.