Elasticsearch for Hadoop

By: Vishal Shukla

Write and query configurations


Here are the write and query configurations:

es.query

Specifies the Elasticsearch query that is used when you read data from Elasticsearch. This defaults to none; that is, all the data under the Elasticsearch index and type is returned. The query can take one of three forms:

  • uri: This specifies the query string parameter, prefixed with a question mark, for example, ?q=category:InformationTechnology

  • query dsl: This specifies any Elasticsearch query in the query DSL. For example, consider the following code:

    {
      "query":
      {
        "match": { "category": "InformationTechnology" }
      }
    }
  • external resource: This points to a file that contains the uri or the query DSL, for example, /path/to/query.json
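
The three forms can be illustrated with a small sketch. A plain Python dictionary stands in for the job configuration, and the helper `query_form` is hypothetical: it only mimics how a value might be told apart and is not the connector's own logic. It assumes the uri form is written with a leading `?` and that a query DSL value starts with `{`.

```python
import json

def query_form(es_query):
    """Classify an es.query value into one of the three forms (illustrative only)."""
    value = es_query.strip()
    if value.startswith("?"):
        return "uri"
    if value.startswith("{"):
        json.loads(value)  # raises ValueError if the DSL is not valid JSON
        return "query dsl"
    return "external resource"

# Hypothetical settings; the field name "category" is taken from the uri example.
uri_conf = {"es.query": "?q=category:InformationTechnology"}
dsl_conf = {"es.query": json.dumps(
    {"query": {"match": {"category": "InformationTechnology"}}})}
file_conf = {"es.query": "/path/to/query.json"}

for conf in (uri_conf, dsl_conf, file_conf):
    print(query_form(conf["es.query"]), "->", conf["es.query"])
```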

es.input.json

This defaults to false.

Specifies whether the input is already in JSON format. The JSON should look similar to the following code:

[
  {
    "id": 10178221,
    "caseNumber": "HY366678",
    "eventDate": "08/02/15 23:58",
    "block": "042XX W MADISON ST",
    "iucr": 1811,
    "primaryType": "NARCOTICS",
    "description": "POSS: CANNABIS 30GMS OR LESS",
    "location": "SIDEWALK",
    "arrest": "TRUE",
    "domestic": "FALSE",
    "lat": 41.88076873,
    "lon": -87.73136165
  },
  {
  ..
  ..
}
]
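
As a rough sketch of preparing pre-serialized input, the snippet below turns a record into a JSON string and switches the setting on. It assumes that each record handed to the job is one JSON document string; the dictionary `conf` merely stands in for the job configuration, and the field values are copied from the sample above.

```python
import json

# One record, using a few field values from the sample above.
record = {
    "id": 10178221,
    "caseNumber": "HY366678",
    "primaryType": "NARCOTICS",
    "lat": 41.88076873,
    "lon": -87.73136165,
}

# Serialize each record to its own JSON string before handing it to the job.
json_docs = [json.dumps(r) for r in [record]]

# Declare that the input is already JSON (assumed to be passed through as is).
conf = {"es.input.json": "true"}

print(json_docs[0])
```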

es.write.operation

This defaults to index.

Specifies how a document is written to Elasticsearch, depending on whether the ID of the incoming document already exists in the Elasticsearch index. It can take four different values:

  • index: This specifies that a new document is added and an existing document with the same ID is replaced

  • create: This indicates that a new document is added and throws an exception if a document with the same ID already exists

  • update: This throws an exception if the document doesn't already exist and updates it otherwise

  • upsert: This denotes that a new document is added if it doesn't already exist and is merged into the existing document otherwise
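
The difference between the four values may be easier to see against a toy in-memory store. The function `apply_write` below is a hypothetical illustration, not part of the connector; in particular, treating an update as a field-level merge is an assumption made for this sketch.

```python
def apply_write(store, doc_id, doc, operation="index"):
    """Mimic the four es.write.operation values against an in-memory dict."""
    exists = doc_id in store
    if operation == "index":
        store[doc_id] = dict(doc)                 # add, or replace the whole document
    elif operation == "create":
        if exists:
            raise KeyError(f"document {doc_id} already exists")
        store[doc_id] = dict(doc)
    elif operation == "update":
        if not exists:
            raise KeyError(f"document {doc_id} does not exist")
        store[doc_id].update(doc)                 # merge fields into the existing document
    elif operation == "upsert":
        store.setdefault(doc_id, {}).update(doc)  # insert if missing, merge otherwise
    else:
        raise ValueError(f"unknown operation: {operation}")
    return store

store = {"1": {"primaryType": "NARCOTICS", "arrest": "TRUE"}}
apply_write(store, "1", {"arrest": "FALSE"}, "upsert")        # merges into document 1
apply_write(store, "2", {"primaryType": "THEFT"}, "upsert")   # inserts document 2
apply_write(store, "1", {"primaryType": "BATTERY"}, "index")  # replaces document 1
print(store)
```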

If an update or upsert write operation is used, the following additional configurations can be applied:

es.update.script

This defaults to none.

Specifies the script that needs to be used in order to update the document.

es.update.script.lang

This defaults to none.

Specifies the script language.

es.update.script.params

This defaults to none.

Specifies the script parameters in the paramName:fieldname or paramName:<CONSTANT> format. It may be a comma-separated list.
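
To make the parameter format concrete, here is a minimal sketch that splits such a value into field-backed and constant parameters. The parser `parse_script_params` and the parameter names are hypothetical, and the sketch assumes that the angle brackets around a constant are written literally, as in `increment:<1>`.

```python
def parse_script_params(spec):
    """Split 'name:field,name:<CONSTANT>' into field-backed and constant parameters."""
    fields, constants = {}, {}
    for entry in spec.split(","):
        name, _, value = entry.strip().partition(":")
        if value.startswith("<") and value.endswith(">"):
            constants[name] = value[1:-1]   # constant, kept as a string
        else:
            fields[name] = value            # value is read from this document field
    return fields, constants

# Hypothetical configuration; a dict stands in for the job configuration.
conf = {
    "es.write.operation": "update",
    "es.update.script.params": "newType:primaryType,increment:<1>",
}
fields, constants = parse_script_params(conf["es.update.script.params"])
print(fields)     # {'newType': 'primaryType'}
print(constants)  # {'increment': '1'}
```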

es.update.script.params.json

This defaults to none.

If all parameters are constants, they can be specified in JSON format. Consider the following example:

{
  "param1":1, 
  "param2":2
}
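
When the job is configured from code, one way to produce such a value is to serialize a dictionary rather than write the JSON by hand. This is only a sketch: the dictionary `conf` stands in for the job configuration, and the choice of upsert as the write operation is arbitrary.

```python
import json

# Constant parameters, matching the example above.
params = {"param1": 1, "param2": 2}

conf = {
    "es.write.operation": "upsert",
    "es.update.script.params.json": json.dumps(params),
}
print(conf["es.update.script.params.json"])  # {"param1": 1, "param2": 2}
```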

es.batch.size.bytes

This defaults to 1mb.

Specifies the size in bytes of batch writes that use the Elasticsearch bulk API. The bulk size is allocated per task instance. This means that if you have five tasks running with a 1mb batch size, up to 5mb of data may be indexed in Elasticsearch at the same time.

es.batch.size.entries

This defaults to 1000.

Specifies the maximum number of entries in a batch write when you use the Elasticsearch bulk API. When this is used along with es.batch.size.bytes, the batch update is executed as soon as either limit is reached. Like es.batch.size.bytes, this setting applies to each task.
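
The interplay of the two limits can be sketched as a generator that flushes whenever either one is reached. The function `batches` is a hypothetical illustration of the idea and not the connector's implementation; its defaults mirror the 1mb and 1000 values given above.

```python
def batches(docs, max_bytes=1024 * 1024, max_entries=1000):
    """Yield lists of documents, flushing when either limit is reached."""
    current, current_bytes = [], 0
    for doc in docs:
        current.append(doc)
        current_bytes += len(doc.encode("utf-8"))
        if current_bytes >= max_bytes or len(current) >= max_entries:
            yield current
            current, current_bytes = [], 0
    if current:
        yield current  # flush whatever is left at the end of the task

# Small documents: the entry limit is reached long before the byte limit.
docs = ['{"id": %d}' % i for i in range(2500)]
sizes = [len(b) for b in batches(docs)]
print(sizes)  # [1000, 1000, 500]

# Each task batches independently, so five tasks at 1mb can have 5mb in flight.
tasks, batch_mb = 5, 1
print(tasks * batch_mb)  # 5
```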

es.batch.write.refresh

This defaults to true. Specifies whether a refresh should be executed on completion of a batch write. This can be very useful when you want to analyze the data in real time as it is being indexed.

es.batch.write.retry.count

This defaults to 3.

Specifies the number of retries for a given batch. The retries are made for rejected data only. A negative value indicates infinite retries.

es.batch.write.retry.wait

This defaults to 10s.

Indicates the time to wait between two batch write retries.
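
The two retry settings can be pictured as a simple loop. The sketch below is an assumption-laden illustration rather than the connector's code: `send` is a hypothetical callable that returns the documents Elasticsearch rejected, and the wait is given in seconds.

```python
import time

def write_with_retry(send, batch, retry_count=3, retry_wait=10.0, sleep=time.sleep):
    """Send a batch; resend only rejected documents, up to retry_count times.

    A negative retry_count keeps retrying until nothing is rejected.
    """
    rejected = send(batch)
    attempts = 0
    while rejected and (retry_count < 0 or attempts < retry_count):
        sleep(retry_wait)            # wait between two retries
        rejected = send(rejected)    # retry the rejected documents only
        attempts += 1
    return rejected                  # anything left here was never accepted

# A fake sender that rejects the document "b" on its first two calls.
calls = []
def flaky_send(docs):
    calls.append(list(docs))
    return [d for d in docs if d == "b" and len(calls) <= 2]

left = write_with_retry(flaky_send, ["a", "b", "c"], sleep=lambda s: None)
print(calls)  # [['a', 'b', 'c'], ['b'], ['b']]
print(left)   # []
```

Passing `sleep` as a parameter keeps the example fast to run; a real job would simply wait for the configured interval.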

es.ser.reader.value.class

The default depends on whether MapReduce, Cascading, Hive, Pig, Spark, or Storm is used. Specifies the ValueReader implementation used to convert JSON to objects.

es.ser.writer.value.class

The default depends on whether MapReduce, Cascading, Hive, Pig, Spark, or Storm is used. Specifies the ValueWriter implementation used to convert objects to JSON.

es.update.retry.on.conflict

This defaults to 0. In a concurrent environment, this specifies how many times an update is retried when a conflict is detected.