Regular expressions are strings that describe a pattern using a specialized syntax of characters, and throughout this book, we will be learning about these different characters and codes that are used to match and manipulate different pieces of data in a vague sort of manner. Now, before we can attempt to create a regular expression, we need to be able to spot and describe these patterns (in English). Let's take a look at a few different and common examples and later on in the book, when we have a stronger grasp on the syntax, we will see how to represent these patterns in code.
We can describe this pattern as being three digits, a dash, then another three numbers, followed by a second dash, and finally four more numbers. It is pretty simple to do; we look at a string and describe how it is made up, and the preceding description will work perfectly if all your numbers follow the given pattern. Now, let's say, we add the following three phone numbers to this set:
123-123-1234 (123)-123-1234 1231231234
These are all valid phone numbers, and in your application, you probably want to be able to match all of them, giving the user the flexibility to write in whichever manner they feel most comfortable. So, let's have another go at our pattern. Now, I would say we have three numbers, optionally inside brackets, then an optional dash, another three numbers, followed by another optional dash, and finally four more digits. In this example, the only parts that are mandatory are the ten digits: the placing of dashes and brackets would completely be up to the user.
Notice also that we haven't put any constraints on the actual digits, and as a matter of fact, we don't even know what they will be, but we do know that they have to be numbers (as opposed to letters, for instance), so we've only placed this constraint:
Sometimes, we might have a more specific constraint than just a digit or a letter; in other cases, we may want a specific word or at least a word from a specific group. In these cases (and mostly with all patterns), the more specific you can be, the better. Let's take the following example:
[info] – App Started [warning] – Job Queue Full [info] – Client Connected [error] – Error Parsing Input [info] – Application Exited Successfully
This is an example of some sort of log, of course, and we can simply say that each line is a single log message. However, this doesn't help us if we want to manipulate or extract the data more specifically. Another option would be to say that we have some kind of word in brackets, which refers to the log level, and then a message after the dash, which will consist of any number of words. Again, this isn't too specific, and our application may only know how to handle the three preceding log levels, so, you may want to ignore everything else or raise an error.
To best describe the preceding pattern, we would say that you have a word, which can either be info, a warning, or an error inside a pair of square brackets, followed by a dash and then some sort of sentence, which makes up the log message. This will allow us to capture the information from the log more accurately and make sure our system is ready to handle the data before we send it:
<title>Demo</title> <size>45MB</size> <date>24 Dec, 2013</date>
We could just say that the pattern consists of a tag, some text, and a closing tag. This isn't really specific enough for it to be a valid XML, since the closing tag has to match the opening one. So, if we define the pattern again, we would say that it contains some text wrapped by an opening tag on the left-hand side and a matching closing tag on the right-hand side:
The last three examples were just used to get us into the Regex train of thought; these are just a few of the common types of patterns and constraints, which you can use in your own applications.