Detecting the document language during indexation
Imagine that you have users from different countries and you would like to let them see only the indexed content written in their native language. There is one problem, however: your documents don't have their language identified, so we need to do this ourselves. Let's see how we can identify the language of the documents at indexing time and store this information along with the documents in the index for later use.
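Before we get to the Solr configuration, the idea itself can be sketched in a few lines of Python. This is only a toy illustration and not what Solr does internally: the stopword-based `detect_language` function is a stand-in for a real language identification library, and the `language` field name is a hypothetical choice made for this example. The `id`, `name`, and `description` fields match the index structure used in this recipe.

```python
# Toy sketch: tag each document with a language at "index time" and
# filter on that tag later. The detector is a naive stopword counter,
# used here only as a stand-in for a real language identifier.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in"},
    "de": {"der", "die", "und", "ist", "das", "nicht"},
    "fr": {"le", "la", "et", "est", "les", "des"},
}


def detect_language(text, fallback="unknown"):
    """Return the language whose stopwords occur most often in text."""
    tokens = text.lower().split()
    scores = {
        lang: sum(1 for token in tokens if token in words)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    # No stopword matched at all: report the fallback value instead.
    return best if scores[best] > 0 else fallback


def enrich(doc):
    """Return a copy of doc with a hypothetical 'language' field added."""
    text = " ".join([doc.get("name", ""), doc.get("description", "")])
    return {**doc, "language": detect_language(text)}


def filter_by_language(docs, lang):
    """Keep only the documents tagged with the given language."""
    return [doc for doc in docs if doc["language"] == lang]
```

The point of the sketch is the order of operations: the language is detected once, when the document is indexed, and stored with it, so that later queries can filter on a plain field value instead of analyzing the text again.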
How to do it...
For language identification, we will use one of the Solr contribution modules, but let's start from the beginning:
For the purpose of the recipe, I assume that we will use the following index structure (we just need to add the following to the schema.xml file):
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="name" type="text_general" indexed="true" stored="true"/>
<field name="description" type="text_general...