
Each of my records in Elasticsearch has an array of objects that looks like this:

{
  "counts_by_year": [
    {
      "year": 2022,
      "works_count": 22523,
      "cited_by_count": 18054
    },
    {
      "year": 2021,
      "works_count": 32059,
      "cited_by_count": 24817
    },
    {
      "year": 2020,
      "works_count": 27210,
      "cited_by_count": 30238
    },
    {
      "year": 2019,
      "works_count": 22592,
      "cited_by_count": 33631
    }
  ]
}

What I want to do is sort my records by the average of works_count over the entries where year is 2022 or 2021. Is this a case where I could use script-based sorting? Or should I copy those values into a separate field and sort on that?

Edit - the mapping is:

{
  "mappings": {
    "_doc": {
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        .
        .
        .
        "counts_by_year": {
          "properties": {
            "cited_by_count": {
              "type": "integer"
            },
            "works_count": {
              "type": "integer"
            },
            "year": {
              "type": "integer"
            }
          }
        },
        .
        .
        .
      }
    }
  }
}

1 Answer

TL;DR

It depends. Most likely yes, but the script has to be written differently if counts_by_year is mapped as nested.

Solution

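Something along these lines should do the trick. Note that it averages works_count over every year in the array: with a plain object mapping, Elasticsearch flattens the array, so the script cannot tell which works_count belongs to which year.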

GET /_search
{
  "sort": {
    "_script": {
      "type": "number",
      "script": {
        "lang": "painless",
        "source": "doc['counts_by_year.works_count'].stream().mapToLong(x -> x).average().orElse(0);"
      }
    }
  }
}
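
To illustrate why the doc-values script above cannot restrict itself to particular years, here is a minimal Python sketch that mimics how a non-nested array of objects ends up indexed. It is a simulation only, not Elasticsearch code; `flatten_object_array` is a hypothetical helper, and the assumption is that each sub-field is exposed as its own sorted list of values.

```python
# Sample array from the question.
counts_by_year = [
    {"year": 2022, "works_count": 22523, "cited_by_count": 18054},
    {"year": 2021, "works_count": 32059, "cited_by_count": 24817},
    {"year": 2020, "works_count": 27210, "cited_by_count": 30238},
    {"year": 2019, "works_count": 22592, "cited_by_count": 33631},
]


def flatten_object_array(objects):
    """Hypothetical helper: mimic a non-nested object array after indexing,
    i.e. one independent, sorted list of values per sub-field."""
    flat = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault(key, []).append(value)
    return {key: sorted(values) for key, values in flat.items()}


doc_values = flatten_object_array(counts_by_year)

# Roughly what doc['counts_by_year.works_count'] exposes: values only,
# with no link back to the year each value came from.
works_counts = doc_values["works_count"]

# The average computed by the script therefore covers all four years.
average_all_years = sum(works_counts) / len(works_counts)
print(works_counts, average_all_years)
```

Because the year and works_count lists are sorted independently, position `i` in one list no longer corresponds to position `i` in the other, which is why filtering by year needs either a nested mapping or access to `_source`.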

Solution (nested fields)

PUT 74404793-2
{
  "mappings": {
      "properties": {
        "counts_by_year": {
          "type": "nested", 
          "properties": {
            "cited_by_count": {
              "type": "long"
            },
            "works_count": {
              "type": "long"
            },
            "year": {
              "type": "long"
            }
          }
        }
      }
    }
}

POST /74404793-2/_doc/
{
  "counts_by_year": [
    {
      "year": 2022,
      "works_count": 22523,
      "cited_by_count": 18054
    },
    {
      "year": 2021,
      "works_count": 32059,
      "cited_by_count": 24817
    },
    {
      "year": 2020,
      "works_count": 27210,
      "cited_by_count": 30238
    },
    {
      "year": 2019,
      "works_count": 22592,
      "cited_by_count": 33631
    }
  ]
}

I am using _source to access the documents. This can severely impact performance if you have big documents.

GET 74404793-2/_search
{
  "sort": {
    "_script": {
      "type": "number",
      "script": {
        "lang": "painless",
        "source": """
        params._source['counts_by_year']
        .stream()
        .filter(x -> x['year'] > 2020)
        .mapToLong(x -> x['works_count'])
        .average().orElse(0);"""
      }
    }
  }
}
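
As a sanity check on the value the script should produce for the sample document, the same filter-then-average logic can be sketched in plain Python. `average_works_count` is a hypothetical reference function, not part of any Elasticsearch client; it selects the years explicitly rather than using `year > 2020`.

```python
def average_works_count(counts_by_year, years=(2021, 2022)):
    """Hypothetical reference function: average works_count over the given
    years, returning 0 when nothing matches (like .orElse(0) in the script)."""
    selected = [
        entry["works_count"]
        for entry in counts_by_year
        if entry["year"] in years
    ]
    return sum(selected) / len(selected) if selected else 0


# Sample array from the question.
sample = [
    {"year": 2022, "works_count": 22523, "cited_by_count": 18054},
    {"year": 2021, "works_count": 32059, "cited_by_count": 24817},
    {"year": 2020, "works_count": 27210, "cited_by_count": 30238},
    {"year": 2019, "works_count": 22592, "cited_by_count": 33631},
]

# (22523 + 32059) / 2
expected_sort_value = average_works_count(sample)
print(expected_sort_value)
```

For this document the expected sort value is 27291.0. Averaging only the first two array entries gives the same number here, but only because those entries happen to be 2022 and 2021; filtering on the year does not depend on array order.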

Comments

This looks great. But I tried it and got the error "No field found for [counts_by_year] in mapping" even though counts_by_year definitely exists as an object type field.
Can you update your question with the mapping? You can also try to debug the script.
Ok, I updated the question with the mapping! I will try debugging as well.
It should be fixed now.
Awesome thank you! Do you know how I could make it use the first two array values? Or only the array values that are from year 2021 and 2022?