Storing the training data
Several AWS services, such as EMR, Redshift, and Glue, can be used to prepare data for machine learning. After preprocessing the training data, store it in S3 in the format expected by the algorithm you are using. The following table lists the acceptable data formats per algorithm:
As we can see, many algorithms accept the text/csv format. Keep in mind that you should follow these rules if you want to use that format:
- Your CSV file can't have a header record.
- For supervised learning, the target variable must be in the first column.
- While configuring the training pipeline, set the content_type of the input data channel to text/csv.
- For unsupervised learning, set the label_size within the content_type to 'content_type=text/csv;label_size=0'.
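The first two rules can be illustrated with a short sketch. It is a minimal example that uses only the Python standard library; the function name to_training_csv and the sample columns (age, income, churned) are hypothetical and chosen only for illustration. It writes the rows without a header record and moves the target variable to the first column.

```python
import csv
import io


def to_training_csv(rows, fieldnames, target):
    """Serialize dict rows as headerless CSV with the target in the first column."""
    # Put the target first, then keep the remaining columns in their original order.
    ordered = [target] + [name for name in fieldnames if name != target]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # No header row is written, only the data records.
    for row in rows:
        writer.writerow([row[name] for name in ordered])
    return buffer.getvalue()


# Hypothetical sample data: "churned" is the target variable.
rows = [
    {"age": 34, "income": 52000, "churned": 0},
    {"age": 51, "income": 87000, "churned": 1},
]
csv_text = to_training_csv(rows, ["age", "income", "churned"], target="churned")
print(csv_text)
```

The resulting text is what you would upload to S3 as the training file. Uploading the file and configuring the input data channel are outside the scope of this sketch.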