This section covers the data preparation and text preprocessing steps needed before the text can be fed into the model as input. How we prepare the data depends on how we intend to model it, which in turn depends on how we intend to use the model.
The language model will be statistical: given an input sequence of text, it predicts the probability of each word being the next word. The predicted word is then fed back in as part of the input so that the model can generate the word after it, and so on.
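The predict-and-feed-back loop can be illustrated with a toy example. The sketch below is not the model built in this chapter; it is a stand-in that assumes a simple count-based model conditioned only on the previous word, and the function names (`train_bigram_counts`, `next_word_probabilities`, `generate`) are hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_counts(tokens):
    """Count how often each word follows each other word (toy stand-in model)."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def next_word_probabilities(counts, context):
    """Return P(word | last word of context) as a dict; empty if the word is unseen."""
    followers = counts.get(context[-1], Counter())
    total = sum(followers.values())
    return {word: count / total for word, count in followers.items()} if total else {}


def generate(counts, seed, n_words):
    """Repeatedly pick the most probable next word and feed it back in as input."""
    output = list(seed)
    for _ in range(n_words):
        probs = next_word_probabilities(counts, output)
        if not probs:
            break  # no known continuation for the current context
        output.append(max(probs, key=probs.get))
    return output


tokens = "the cat sat on the mat and the cat slept".split()
counts = train_bigram_counts(tokens)
generated = generate(counts, ["the"], 3)
print(" ".join(generated))
```

A real language model would condition on a much longer context than one word, but the loop has the same shape: predict a distribution, choose a word, extend the input, repeat.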
A key decision is how long the input sequences should be. They need to be long enough to give the model sufficient context for the words it has to predict. This input length also determines the length of the seed text used to generate new sequences when we use the model.
For simplicity, we will somewhat arbitrarily use input sequences of 50 words.
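As a rough sketch of what a fixed input length means in practice, the snippet below slides a window over a list of tokens and pairs each run of 50 input words with the single word that follows it. The pairing of 50 inputs with one target word is an assumption based on the next-word prediction setup described above, and `make_sequences` is a hypothetical helper name, not something defined elsewhere in the chapter.

```python
def make_sequences(tokens, input_length=50):
    """Split tokens into (input_words, target_word) pairs using a sliding window.

    Each pair holds `input_length` consecutive words plus the one word that
    follows them (assumed here to be the prediction target).
    """
    window = input_length + 1  # input words plus one target word
    sequences = []
    for end in range(window, len(tokens) + 1):
        chunk = tokens[end - window:end]
        sequences.append((chunk[:-1], chunk[-1]))
    return sequences


# Placeholder tokens standing in for a cleaned, tokenized text.
tokens = [f"w{i}" for i in range(60)]
pairs = make_sequences(tokens)
print(len(pairs), len(pairs[0][0]), pairs[0][1])
```

With this scheme a text of N tokens yields N - 50 training pairs, and any seed text used for generation would need to supply the same 50 words of context.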