Recipe 5 – extracting named entities
Reconciliation works well for fields in your dataset that contain single terms, such as names of people, countries, or works of art. However, if a column contains running text, reconciliation cannot help, since it can only look up single terms in the datasets it uses. Fortunately, another technique, called named-entity extraction, can help us. An extraction algorithm searches texts for named entities, which are text elements such as names of persons, locations, values, organizations, and other widely known things. In addition to extracting the terms, most algorithms also try to perform disambiguation. For instance, if the algorithm finds Washington in a text, it will try to determine whether the city or the person is meant. This saves us from having to perform reconciliation on the extracted terms.
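To make the idea of extraction with disambiguation more concrete, here is a minimal sketch in Python. It is not how the extension or any real extraction service works. The `GAZETTEER` table, its cue words, and the `extract_entities` function are all hypothetical and made up for illustration. The sketch only shows the principle: find known terms in running text, then use the surrounding words to guess which meaning is intended.

```python
import re

# Hypothetical toy gazetteer (an assumption for illustration only).
# Each term maps to candidate entities, and each candidate lists
# context words that hint at that particular meaning.
GAZETTEER = {
    "Washington": [
        {"id": "Washington, D.C.", "type": "location",
         "cues": {"city", "capital", "visited", "museum"}},
        {"id": "George Washington", "type": "person",
         "cues": {"president", "general", "born", "he"}},
    ],
    "Paris": [
        {"id": "Paris", "type": "location",
         "cues": {"city", "france", "capital"}},
    ],
}


def extract_entities(text):
    """Return (term, entity id, entity type) for each known term in text."""
    # Lowercased words of the text, used as disambiguation context.
    words = set(re.findall(r"[a-z]+", text.lower()))
    results = []
    for term, candidates in GAZETTEER.items():
        # Match the term as a whole word, case-sensitively.
        if re.search(r"\b" + re.escape(term) + r"\b", text):
            # Pick the candidate sharing the most cue words with the text.
            # On a tie, max() keeps the first candidate in the list.
            best = max(candidates, key=lambda c: len(c["cues"] & words))
            results.append((term, best["id"], best["type"]))
    return results


print(extract_entities("Washington was the first president of the United States."))
print(extract_entities("We visited a museum in Washington, the capital."))
```

In the first sentence the word "president" tips the choice toward the person, while in the second the words "visited", "museum", and "capital" point to the city. Real services rely on far larger knowledge bases and statistical models, but the output has the same shape: a term together with the specific entity it refers to.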
OpenRefine does not support named-entity recognition natively, but the Named-Entity Recognition extension adds this for you. Before...