Using Elasticsearch as a caching layer
Our ultimate goal is to train a new classifier on each batch (every 15 minutes). However, the classifier will be trained on more than just the few records downloaded within the current batch; we somehow have to cache the text content over a longer period of time (set to 24 hours) and retrieve it whenever we need to train a new classifier. With Larry Wall's quote in mind, we will try to be as lazy as possible in maintaining data consistency over this online layer. The basic idea is to use a time-to-live (TTL) parameter that seamlessly drops any outdated record. The Cassandra database provides this feature out of the box (as do HBase and Accumulo), but Elasticsearch is already part of our core architecture and can easily be used for this purpose. We will create the following mapping for the gzet/twitter index with the _ttl parameter enabled:
$ curl -XPUT 'http://localhost:9200/gzet'
$ curl -XPUT 'http://localhost:9200/gzet/_mapping/twitter' -d...
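For reference, a mapping body of this kind might look like the sketch below. This is a hypothetical example rather than the book's exact mapping (the original body is truncated above); note that the `_ttl` field is only available in Elasticsearch 1.x/2.x (it was removed in 5.0), and the 24h default mirrors the retention window mentioned earlier:

```json
{
  "twitter": {
    "_ttl": {
      "enabled": true,
      "default": "24h"
    }
  }
}
```

With such a mapping in place, every document indexed into gzet/twitter expires automatically once its TTL elapses, so no explicit cleanup job is needed to keep the 24-hour cache consistent.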