1

I have an index that contains documents structured as follows:

{
    "year": 2020,
    "month": 10,
    "day": 05,
    "some_other_data": { ... }
}

the ID of each documents is constructed based on the date and some additional data from some_other_data document, like this: _id: "20201005_some_other_unique_data". There is no explicit _timestamp on the documents.

I can easily get the most recent additions by doing the following query:

{
    "query": {
        "match_all": {}
    },
    "sort": [
        {"_uid": "desc"}
    ]
}

Now, the question is: how do I get documents that have essentially a date between day A and day B, where A is, for instance, 2020-07-12 and B is 2020-09-11. You can assume that the input date can be either integers, strings, or anything really as I can manipulate it beforehand.

edit: As requested, I'm including a sample result from the following query:

{
    "size": 4,
    "query": {
        "match": {
            "month": 7
        }
    },
    "sort": [
        {"_uid": "asc"}
    ]
}

The response:

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1609,
    "max_score": null,
    "hits": [
      {
        "_index": "my_index",
        "_type": "nested",
        "_id": "20200703_andromeda_cryptic",
        "_score": null,
        "_source": {
          "year": 2020,
          "month": 7,
          "day": 3,
          "yara": {
            "strain": "Andromeda",
          },
          "parent_yara": {
            "strain": "CrypticMut",
          },
        },
        "sort": [
          "nested#20200703_andromeda_cryptic"
        ]
      },
      {
        "_index": "my_index",
        "_type": "nested",
        "_id": "20200703_betabot_boaxxe",
        "_score": null,
        "_source": {
          "year": 2020,
          "month": 7,
          "day": 3,
          "yara": {
            "strain": "BetaBot",
          },
          "parent_yara": {
            "strain": "Boaxxe",
          },
        },
        "sort": [
          "nested#20200703_betabot_boaxxe"
        ]
      },
      {
        "_index": "my_index",
        "_type": "nested",
        "_id": "20200703_darkcomet_zorex",
        "_score": null,
        "_source": {
          "year": 2020,
          "month": 7,
          "day": 3,
          "yara": {
            "strain": "DarkComet",
          },
          "parent_yara": {
            "strain": "Zorex",
          },
        },
        "sort": [
          "nested#20200703_darkcomet_zorex"
        ]
      },
      {
        "_index": "my_index",
        "_type": "nested",
        "_id": "20200703_darktrack_fake_template",
        "_score": null,
        "_source": {
          "year": 2020,
          "month": 7,
          "day": 3,
          "yara": {
            "strain": "Darktrack",
          },
          "parent_yara": {
            "strain": "CrypticFakeTempl",
          },
        },
        "sort": [
          "nested#20200703_darktrack_fake_template"
        ]
      }
    ]
  }
}

The above-mentioned query will return all documents that have matched the month. So basically anything that was put there in July of any year. What I want to achieve, if at all possible, is getting all documents inserted after a certain date and before another certain date.

Unfortunately, I cannot migrate the data so that it has a timestamp or otherwise nicely sortable fields. Essentially, I need to figure out a logic that will say: give me all documents inserted after july 1st, and before august 2nd. The problem here, is that there are plenty of edge cases, like how to do it when start date and end date are in different years, different months, and so on.

edit: I have solved it using the painless scripting, as suggested by Briomkez, with small changes to the script itself, as follows:

  getQueryForRange(dateFrom: String, dateTo: String, querySize: Number) {
    let script = `
      DateTimeFormatter formatter = new DateTimeFormatterBuilder().appendPattern("yyyy-MM-dd")
                                      .parseDefaulting(ChronoField.NANO_OF_DAY, 0)
                                      .toFormatter()
                                      .withZone(ZoneId.of("Z"));
      ZonedDateTime l = ZonedDateTime.parse(params.l, formatter);
      ZonedDateTime h = ZonedDateTime.parse(params.h, formatter);
      ZonedDateTime x = ZonedDateTime.of(doc['year'].value.intValue(), doc['month'].value.intValue(), doc['day'].value.intValue(), 0, 0, 0, 0, ZoneId.of('Z'));

      ZonedDateTime first = l.isAfter(h) ? h : l;
      ZonedDateTime last = first.equals(l) ? h : l;
      return (x.isAfter(first) || x.equals(first)) && (x.equals(last) || x.isBefore(last));
    `

    return {
      size: querySize,
      query: {
        bool: {
          filter: {
            script: {
              script: {
                source: script,
                lang: "painless",
                params: {
                  l: dateFrom,
                  h: dateTo,
                },
              },
            },
          },
        },
      },
      sort: [{ _uid: "asc" }],
    }
  }

With these changes, the query works well for my version of Elasticsearch (7.2) and the order of dates in not important.

1
  • You'll need the date formats beforehand -- there are no assumptions when constructing the queries... Please share a few concrete doc examples to narrow the question's scope. Commented Nov 11, 2020 at 20:00

1 Answer 1

2

I see (at least) two alternatives here. Either use script query or simple bool queries.

A. USE SCRIPT QUERIES

Basically, the idea is to build the a timestamp at query time, by exploiting the datetime in painless.

{
  "query": {
    "bool": {
      "filter": {
        "script": {
          "script": {
            "source": "<INSERT-THE-SCRIPT-HERE>",
            "lang": "painless",
            "params": {
              "l": "2020-07-12",
              "h": "2020-09-11" 
            }
          }
        }
      }
    }
  }
}

The script can be the following one:

// Building a ZonedDateTime from params.l
ZonedDateTime l = ZonedDateTime.parse(params.l,DateTimeFormatter.ISO_LOCAL_DATE);
// Building a ZonedDateTime from params.h
ZonedDateTime h = ZonedDateTime.parse(params.h,DateTimeFormatter.ISO_LOCAL_DATE);
// Building a ZonedDateTime from the doc
ZonedDateTime doc_date = ZonedDateTime.of(doc['year'].value, doc['month'].value, doc['day'].value, 0, 0, 0, 0, ZoneId.of('Z'));

return (x.isAfter(l) || x.equals(l)) && (x.equals(h) || x.isBefore(h));

B. ALTERNATIVE: splitting the problem in its building blocks

Let us denote with x the document you are searching and let us denote l and h be our lower date and higher date. Let us denote with x.year, x.month and x.day to access the subfield.

So x is contained in the range (l, h) iff

  1. [Condition-1] l <= x AND
  2. [Condition-2] x <= h

The first condition is met if the disjunction of the following conditions holds:

  1. [Condition-1.1] l.year < x.year
  2. [Condition-1.2] l.year == x.year AND l.month < x.month
  3. [Condition-1.3] l.year == x.year AND l.month == x.month AND l.day <= x.day

Similarly, the second condition can be expressed as the disjunction of the following conditions:

  1. [Condition-2.1] h.year > x.year
  2. [Condition-2.2] h.year == x.year AND h.month > x.month
  3. [Condition-2.3] h.year == x.year AND h.month == x.month AND h.day <= x.day

It remains to express these conditions in Elasticsearch DSL:

B-1. Using script query

Given this idea we can write a simple script query. We should substitute

{
  "query": {
    "bool": {
      "filter": {
        "script": {
          "script": {
            "source": "<INSERT SCRIPT HERE>",
            "lang": "painless",
            "params": {
              "l": {
                "year": 2020,
                "month": 07,
                "day": 01
              },
              "h": {
                "year": 2020,
                "month": 09,
                "day": 01
            }
          }
        }
      }
    }
  }
}

In painless you can express the Condition, considering that:

  • x.year is doc['year'].value, x.month is doc['month'].value, x.day is doc['day'].value
  • h.year is params.h.year, etc.
  • l.year is params.l.year, etc.

B-2. Using boolean query

Now we should transform these conditions into a bool conditions. The pseudo-code is the following:

{
  "query": {
    "bool": {
      // AND of two conditions
      "must": [
       
        {
          // Condition 1
        },
       
        {
          // Condition 2
        }
      ]
    }
  }
}

Each Condition-X block will look like this:

{
  "bool": {
    // OR
    "should": [
     { // Condition-X.1 },
     { // Condition-X.2 },
     { // Condition-X.3 },
    ],
    "minimum_should_match" : 1
  }
}

So, for example, we can express [Condition-2-3] with h = 2020-09-11 we can use this range query:

{
  "bool": {
    "must": [
      {
        "range": {
          "year": {
            "gte": 2020,
            "lte": 2020
          }
        }
      },
      {
        "range": {
          "month": {
            "gte": 9,
            "lte": 9
          }
        }
      },
      {
        "range": {
          "day": {
            "lte": 11
          }
        }
      }
    ]
  }
}

Write the entire query is feasible, but I think it would be very long :)

Sign up to request clarification or add additional context in comments.

1 Comment

Thank you for your suggestion. I really like the script one, as it is way more concise than the bool expressions. I had to modify the script a little bit, for it to work with my elasticsearch and to make the order of dates irrelevant:

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.