Let's search the content field of our documents for the e-mail address <[email protected]>:
{ "query" : { "match" : { "content" : "[email protected]" } } }
Unexpectedly, both Document 1 and Document 2 matched our query, rather than just Document 1.
Let's see why this happened:
By default, Elasticsearch uses the standard analyzer.
The standard analyzer breaks <[email protected]> into malhotra and gmail.com.
The standard analyzer also breaks the e-mail ID <[email protected]> into buygroceries and gmail.com.
This means that when we search for the e-mail ID <[email protected]>, the query text is split the same way, and because the match query combines terms with OR by default, either malhotra or gmail.com needs to match for a document to qualify as a result.
Hence, because both documents contain the token gmail.com, both Document 1 and Document 2 matched our query rather than just Document 1.
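The reasoning above can be sketched in a few lines of Python. This is only a rough simulation, not Elasticsearch itself: `standard_like_tokens` is a hypothetical helper that approximates the standard analyzer by lowercasing and splitting on whitespace and `@`, `match_any` imitates a match query with OR semantics, and the addresses are made-up placeholders rather than the ones used in the text.

```python
import re


def standard_like_tokens(text):
    """Rough stand-in for the standard analyzer (an assumption, not the real one):
    lowercase the text, then split on whitespace and '@'.
    Dots between letters are kept, so 'example.com' stays one token."""
    return [t for t in re.split(r"[\s@]+", text.lower()) if t]


def match_any(query, document):
    """Imitate a match query with OR semantics: the document qualifies
    if it shares at least one token with the analyzed query."""
    return bool(set(standard_like_tokens(query)) & set(standard_like_tokens(document)))


# Hypothetical documents; the addresses are placeholders for illustration only.
docs = {
    "Document 1": "alice@example.com",
    "Document 2": "bob@example.com",
}

query = "alice@example.com"
print(standard_like_tokens(query))  # ['alice', 'example.com']

# Both documents share the token 'example.com' with the query, so both match.
matches = [name for name, content in docs.items() if match_any(query, content)]
print(matches)  # ['Document 1', 'Document 2']
```

The shared domain token is enough to pull in the second document, which is the same effect seen with gmail.com in the query above.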
The solution to this problem is to use the UAX URL Email tokenizer (uax_url_email) rather than the default tokenizer. This tokenizer preserves...