ElasticSearch: Retrieve string concatenation, or partial array

Question

I have many indexed documents such as this one:

{
   "_index":"myindex",
   "_type":"somedata",
   "_id":"31d3255d-67b4-40e6-b9d4-637383eb72ad",
   "_version":1,
   "_score":1,
   "_source":{
      "otherID":"b4c95332-daed-49ae-99fe-c32482696d1c",
      "data":[
         {
            "data":"d2454d41-a74e-43af-b3b0-0febeaf67a99",
            "iD":"9362f2eb-9bd7-4924-8b0e-77c27bb0aa56"
         },
         {
            "data":"some text",
            "iD":"c554b8ce-c873-4fef-b306-ec65d2f40394"
         },
         {
            "data":"5256983c-ef69-4363-9787-97074297c646",
            "iD":"8c90e2be-6042-4450-b0fd-0732900f8f65"
         },
         {
            "data":"other text",
            "iD":"8d8f8a61-02d6-4d3e-9912-9ebb5d213c15"
         },
         {
            "data":"3",
            "iD":"c880bfdf-eb4b-4c80-9871-fd44e06b2ed2"
         }
      ],
      "iD":"31d3255d-67b4-40e6-b9d4-637383eb72ad"
   }
}

Its type mapping is configured this way:

{
   "somedata":{
      "dynamic_templates":[
         {
            "defaultIDs":{
               "match_pattern":"regex",
               "mapping":{
                  "index":"not_analyzed",
                  "type":"string"
               },
               "match":".*(id|ID|iD)"
            }
         }
      ],
      "properties":{
         "otherID":{
            "index":"not_analyzed",
            "type":"string"
         },
         "data":{
            "properties":{
               "data":{
                  "type":"string"
               },
               "iD":{
                  "index":"not_analyzed",
                  "type":"string"
               }
            }
         },
         "iD":{
            "index":"not_analyzed",
            "type":"string"
         }
      }
   }
}

I wish to be able to retrieve a string concatenation of the data values, selected by their IDs.
For example, given the IDs c554b8ce-c873-4fef-b306-ec65d2f40394 and 8d8f8a61-02d6-4d3e-9912-9ebb5d213c15, I would like to retrieve "some text other text".
These IDs repeat in other documents of the same type with different data.

If this is not possible (which I suspect is the case), I would like to at least retrieve a partial array containing only my requested data.
Those arrays can become large (and so can the number of documents), and I would only need one or two elements from each hit.

If neither of my requests is possible, how would you suggest changing my mappings to meet my needs?

Thanks in advance, Jonathan.

Jony Adamit · Accepted Answer · 2015-05-19 08:44:40Z

3

I have found a way to do exactly what I needed without changing my data structure.
(I actually did end up changing my data structure, but for reasons of space and efficiency).

All you have to do is enjoy the groovy goodness ElasticSearch has to offer:

{
    "query" : { "term" : { "otherID" : "b4c95332-daed-49ae-99fe-c32482696d1c" } },
    "script_fields" : { "requestedFields" : { "script" :  "_source.data.findAll({ it.iD == 'c554b8ce-c873-4fef-b306-ec65d2f40394' || it.iD == '8d8f8a61-02d6-4d3e-9912-9ebb5d213c15'}).collect({ it.data }).join(' ')" } }
}

Just goes to show how strong ElasticSearch really is.
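
For readers less familiar with Groovy, here is a minimal Python sketch of what the script above computes for a single document: it keeps the entries of the data array whose iD is one of the requested IDs and joins their data values with a space, in array order. The function name concat_data and its parameters are hypothetical and only serve the illustration; the real work happens server-side in the script field.

```python
def concat_data(source, wanted_ids, sep=" "):
    """Join the 'data' values of entries whose 'iD' is requested, in array order.

    `source` stands for a document's _source; the function name is illustrative.
    """
    wanted = set(wanted_ids)
    return sep.join(
        entry["data"] for entry in source["data"] if entry["iD"] in wanted
    )


# A trimmed copy of the document from the question.
source = {
    "otherID": "b4c95332-daed-49ae-99fe-c32482696d1c",
    "data": [
        {"data": "d2454d41-a74e-43af-b3b0-0febeaf67a99",
         "iD": "9362f2eb-9bd7-4924-8b0e-77c27bb0aa56"},
        {"data": "some text",
         "iD": "c554b8ce-c873-4fef-b306-ec65d2f40394"},
        {"data": "other text",
         "iD": "8d8f8a61-02d6-4d3e-9912-9ebb5d213c15"},
    ],
}

result = concat_data(source, [
    "c554b8ce-c873-4fef-b306-ec65d2f40394",
    "8d8f8a61-02d6-4d3e-9912-9ebb5d213c15",
])
print(result)  # some text other text
```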


mark · Accepted Answer · 2015-05-17 14:51:36Z

I cannot help you with the field concatenation (it may be possible with scripting, but I'm not experienced enough with it; I would assume a new field would have to be generated, etc.), but I can show how to retrieve only the partial data.

It requires at least ES 1.5 because it uses inner_hits and you need to change the mapping.

I added "type": "nested" and "include_in_parent": true to your data field:

DELETE somedata
PUT somedata
PUT somedata/sometype/_mapping
{
   "sometype":{
      "dynamic_templates":[
         {
            "defaultIDs":{
               "match_pattern":"regex",
               "mapping":{
                  "index":"not_analyzed",
                  "type":"string"
               },
               "match":".*(id|ID|iD)"
            }
         }
      ],
      "properties":{
         "otherID":{
            "index":"not_analyzed",
            "type":"string"
         },
         "data":{
            "type": "nested",
            "include_in_parent": true,
            "properties":{
               "data":{
                  "type":"string"
               },
               "iD":{
                  "index":"not_analyzed",
                  "type":"string"
               }
            }
         },
         "iD":{
            "index":"not_analyzed",
            "type":"string"
         }
      }
   }
}

Now indexing your document:

PUT somedata/sometype/1
{
      "otherID":"b4c95332-daed-49ae-99fe-c32482696d1c",
      "data":[
         {
            "data":"d2454d41-a74e-43af-b3b0-0febeaf67a99",
            "iD":"9362f2eb-9bd7-4924-8b0e-77c27bb0aa56"
         },
         {
            "data":"some text",
            "iD":"c554b8ce-c873-4fef-b306-ec65d2f40394"
         },
         {
            "data":"5256983c-ef69-4363-9787-97074297c646",
            "iD":"8c90e2be-6042-4450-b0fd-0732900f8f65"
         },
         {
            "data":"other text",
            "iD":"8d8f8a61-02d6-4d3e-9912-9ebb5d213c15"
         },
         {
            "data":"3",
            "iD":"c880bfdf-eb4b-4c80-9871-fd44e06b2ed2"
         }
      ],
      "iD":"31d3255d-67b4-40e6-b9d4-637383eb72ad"
   }

And here's how you can match and retrieve with inner_hits:

POST somedata/sometype/_search
{
  "query": {
    "nested": {
      "path": "data",
      "query": {
        "bool": {
          "should": [
            {
            "term": {
              "data.iD": "c554b8ce-c873-4fef-b306-ec65d2f40394"
            }
            },
            {
            "term": {
              "data.iD": "8d8f8a61-02d6-4d3e-9912-9ebb5d213c15"
            }
            }
          ]
        }
      },
      "inner_hits": {}
    }
  }
}
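
If the list of requested IDs varies per request, the same request body can be generated rather than written by hand. The following is a small sketch that only builds the JSON shown above, with one term clause per ID; the function name build_inner_hits_query and its parameters are hypothetical, and sending the body to the cluster is left to whichever HTTP or Elasticsearch client you use.

```python
import json


def build_inner_hits_query(ids, path="data", id_field="data.iD"):
    """Build the nested/inner_hits search body with one `term` clause per ID."""
    return {
        "query": {
            "nested": {
                "path": path,
                "query": {
                    "bool": {
                        "should": [{"term": {id_field: i}} for i in ids]
                    }
                },
                # An empty object asks for inner hits with default settings.
                "inner_hits": {},
            }
        }
    }


body = build_inner_hits_query([
    "c554b8ce-c873-4fef-b306-ec65d2f40394",
    "8d8f8a61-02d6-4d3e-9912-9ebb5d213c15",
])
print(json.dumps(body, indent=2))
```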

In the result, look at the path hits.hits[0].inner_hits.data.hits.hits; it contains only your two requested matches, each with its own _source (for example hits.hits[0].inner_hits.data.hits.hits[0]._source.data):

{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.5986179,
      "hits": [
         {
            "_index": "somedata",
            "_type": "sometype",
            "_id": "1",
            "_score": 0.5986179,
            "_source": {
               "otherID": "b4c95332-daed-49ae-99fe-c32482696d1c",
               "data": [
                  {
                     "data": "d2454d41-a74e-43af-b3b0-0febeaf67a99",
                     "iD": "9362f2eb-9bd7-4924-8b0e-77c27bb0aa56"
                  },
                  {
                     "data": "some text",
                     "iD": "c554b8ce-c873-4fef-b306-ec65d2f40394"
                  },
                  {
                     "data": "5256983c-ef69-4363-9787-97074297c646",
                     "iD": "8c90e2be-6042-4450-b0fd-0732900f8f65"
                  },
                  {
                     "data": "other text",
                     "iD": "8d8f8a61-02d6-4d3e-9912-9ebb5d213c15"
                  },
                  {
                     "data": "3",
                     "iD": "c880bfdf-eb4b-4c80-9871-fd44e06b2ed2"
                  }
               ],
               "iD": "31d3255d-67b4-40e6-b9d4-637383eb72ad"
            },
            "inner_hits": {
               "data": {
                  "hits": {
                     "total": 2,
                     "max_score": 0.5986179,
                     "hits": [
                        {
                           "_index": "somedata",
                           "_type": "sometype",
                           "_id": "1",
                           "_nested": {
                              "field": "data",
                              "offset": 3
                           },
                           "_score": 0.5986179,
                           "_source": {
                              "data": "other text",
                              "iD": "8d8f8a61-02d6-4d3e-9912-9ebb5d213c15"
                           }
                        },
                        {
                           "_index": "somedata",
                           "_type": "sometype",
                           "_id": "1",
                           "_nested": {
                              "field": "data",
                              "offset": 1
                           },
                           "_score": 0.5986179,
                           "_source": {
                              "data": "some text",
                              "iD": "c554b8ce-c873-4fef-b306-ec65d2f40394"
                           }
                        }
                     ]
                  }
               }
            }
         }
      ]
   }
}

Now, inner_hits is fairly new and the documentation also states:

Warning: This functionality is experimental and may be changed or removed completely in a future release.

YMMV.

Another thing to watch out for: the inner_hits are sorted by score. In your original document they're in an array, which is ordered, but that ordering is lost in the actual result. If you need them in the same order in the inner_hits, I think you need to add a separate field for sorting (it could just be the array index...) and sort the inner_hits by it.
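
As a client-side alternative, the response shown above carries a _nested.offset value for each inner hit, which appears to be the element's position in the original array. Assuming that holds, a minimal sketch can restore the array order and do the concatenation after the search returns. The function name concat_inner_hits and its parameters are hypothetical, and the sample hit below is trimmed from the response above.

```python
def concat_inner_hits(hit, path="data", field="data", sep=" "):
    """Join inner-hit values in original array order, using _nested.offset."""
    inner = hit["inner_hits"][path]["hits"]["hits"]
    ordered = sorted(inner, key=lambda h: h["_nested"]["offset"])
    return sep.join(h["_source"][field] for h in ordered)


# One top-level hit, trimmed to the parts the function reads.
hit = {
    "_id": "1",
    "inner_hits": {
        "data": {
            "hits": {
                "total": 2,
                "hits": [
                    {"_nested": {"field": "data", "offset": 3},
                     "_source": {"data": "other text",
                                 "iD": "8d8f8a61-02d6-4d3e-9912-9ebb5d213c15"}},
                    {"_nested": {"field": "data", "offset": 1},
                     "_source": {"data": "some text",
                                 "iD": "c554b8ce-c873-4fef-b306-ec65d2f40394"}},
                ],
            }
        }
    },
}

joined = concat_inner_hits(hit)
print(joined)  # some text other text
```

Keep in mind that inner_hits returns only a limited number of hits per parent by default, so a larger size may be needed inside the inner_hits object when more elements can match.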

Thank you for taking the time to respond, @mark. You certainly pointed me in the right direction. It's a bit discouraging to see that warning, though. And it seems getting that concatenation would be so complex, if it's even possible, that it's not worth it... (sigh). If no other surprising answer comes along in the next few hours, I'll accept yours. Thanks again :-)
Without knowing your full intent with the data structure, maybe it makes sense to store it in ES in a different way, optimized for your case, i.e. primarily indexing the entries of the data[] array as a type and attaching iD and orderID to each, so that you can get away with a query without inner_hits (this still does not solve the sorting and concatenation).
Thanks @mark. I realized my data structure needed a change for many reasons. Nevertheless, I still need that concatenation ability, and it turns out there are many ways to achieve this. I have posted the answer here if you're interested in knowing :)
