Given some descriptors of a sequence of packets, flowing to/from a host connected to the Internet, the goal of this problem is to detect whether that sequence signals a malicious attack or not. If it does, we should also classify the type of attack. That's a multiclass classification problem, since the possible labels are multiple ones.
For each observation, 42 features are revealed: please note that some of them are categorical, whereas others are numerical. The dataset is composed of almost 5 million observations (but in this exercise we're using just the first million, to avoid memory constraints), and the number of possible labels is 23. One of them represents a non-malicious situation (normal); all the others represent 22 different network attacks. Some attention should be paid to the fact that the frequencies of response classes are imbalanced: for some attacks there are multiple observations, for others just a few.
No instruction is...