4

Please help. What is the best document schema for data like this? I have the number of units of each product available in each city:

  1. product1, [city = city1, available = 0], [city = city2, available = 2], [city = city3, available = 1], ... ...
  2. product100, [city = city1, available = 1], [city = city2, available = 1], [city = city3, available = 1], ...

How should this data be stored for each product, given that there can be 1000 products and 100 cities, so that searching by city and availability works?

2
  • Possible duplicate of How to use an Array mapping in ES? Commented Sep 7, 2017 at 5:17
  • The queries and aggregations you need, and how you plan to display the results in your app, will determine the best way to store it. Can you specify these? Commented Sep 7, 2017 at 6:25

4 Answers

4

It completely depends on how you want to query the data. When data is stored as an array of objects, the correlation between the fields of each object is lost.
So if you store your data like this:

prodId : id,
availability: [
    { city: city1, available: true},
    { city: city2, available: false}
   ]

ES will internally flatten the objects while indexing, and the document will be indexed as:

availability.city= [city1,city2]
availability.available= [true,false]

Now if you search for products that are available in city2, this document will match even though city2 has available: false, because city2 and true each appear somewhere in the flattened arrays.

If you want to maintain the correlation, you should go with nested objects. ES indexes each nested object as a separate hidden document and manages it internally. The joins are also performed internally, so you don't have to worry about them, and you can run aggregations over nested fields. On the downside, nested queries are slower than queries on flat fields, because of the extra documents and the query-time join.
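To make the loss of correlation concrete, here is a small Python simulation of the two matching behaviours. It does not talk to Elasticsearch; the helper names (`flatten`, `matches_flattened`, `matches_nested`) are made up for this sketch and only mimic the semantics described above.

```python
def flatten(doc, field):
    """Mimic default object-array indexing: one multi-valued list per sub-field."""
    flat = {}
    for obj in doc[field]:
        for key, value in obj.items():
            flat.setdefault(f"{field}.{key}", []).append(value)
    return flat


def matches_flattened(doc, field, conditions):
    """Each condition is checked against its own flattened list, independently."""
    flat = flatten(doc, field)
    return all(
        value in flat.get(f"{field}.{key}", [])
        for key, value in conditions.items()
    )


def matches_nested(doc, field, conditions):
    """All conditions must hold inside the same object, as with a nested field."""
    return any(
        all(obj.get(key) == value for key, value in conditions.items())
        for obj in doc[field]
    )


doc = {
    "prodId": "id",
    "availability": [
        {"city": "city1", "available": True},
        {"city": "city2", "available": False},
    ],
}
wanted = {"city": "city2", "available": True}

flat_hit = matches_flattened(doc, "availability", wanted)
nested_hit = matches_nested(doc, "availability", wanted)
print(flat_hit, nested_hit)  # True False
```

The flattened check reports a match because city2 and true each occur somewhere in the document, while the per-object check correctly rejects it.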


1 Comment

Thank you very much for the ideas. For me, this is useful information. I still study a lot of documentation. I will test different solutions and learn.
2

Your dataset (1000 products/100 cities) is very small. If you do not expect it to scale to be much larger, you can probably use a nested data structure (which is the most obvious solution here). Your mapping would look something like this:

{
  "product": {
    "properties": {
      "product": {"type": "keyword"},
      "cities": {
        "type": "nested",
        "properties": {
          "name": {"type": "keyword"},
          "available": {"type": "integer"}
        }
      }
    }
  }
}

Then you would index documents that look like this:

{
  "product": "product1",
  "cities": [
    {
      "name": "city1",
      "available": 0
    },
    {
      "name": "city2",
      "available": 1
    }
  ]
}
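With that mapping, a search for products that have at least one unit in a given city could look roughly like the request body built below. This is a sketch in Python that only constructs and prints the JSON body; the function name `nested_availability_query` is hypothetical, and the field names assume the `cities` mapping shown above.

```python
import json


def nested_availability_query(city, min_available=1):
    """Build a nested query: both conditions must match the same cities entry."""
    return {
        "query": {
            "nested": {
                "path": "cities",
                "query": {
                    "bool": {
                        "filter": [
                            {"term": {"cities.name": city}},
                            {"range": {"cities.available": {"gte": min_available}}},
                        ]
                    }
                },
            }
        }
    }


body = nested_availability_query("city2")
print(json.dumps(body, indent=2))
```

The printed body can be sent to the index's `_search` endpoint with whichever client you already use.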

However, nested queries and aggregations are expensive/slow, so if you expect your dataset to grow substantially, you may want to consider denormalizing your data. In your case, I can see a few possible ideas for this, which will depend on how you want to query your data.

Simple flattening (one doc per city/product combo):

Doc 1:
{
  "product": "product1",
  "city": "city1",
  "available": 0
}
Doc 2:
{
  "product": "product1",
  "city": "city2",
  "available": 1
}

The down side here is that you can't easily search by product (since the products are duplicated). You may be able to resolve that by keeping a separate index of products to query when you need to query in that way.
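As a rough illustration of this flattening, the Python sketch below turns one nested-style product document into one document per city. The names `explode` and `doc_id` are invented for the example, and the `product:city` id format is just one possible convention for keeping re-indexing idempotent.

```python
def explode(product_doc):
    """Return one flat document per (product, city) pair."""
    return [
        {
            "product": product_doc["product"],
            "city": entry["name"],
            "available": entry["available"],
        }
        for entry in product_doc["cities"]
    ]


def doc_id(flat_doc):
    """Hypothetical id convention: re-indexing a pair overwrites the old doc."""
    return f'{flat_doc["product"]}:{flat_doc["city"]}'


product_doc = {
    "product": "product1",
    "cities": [
        {"name": "city1", "available": 0},
        {"name": "city2", "available": 1},
    ],
}

flat_docs = explode(product_doc)
for flat in flat_docs:
    print(doc_id(flat), flat)

# With flat docs, "available in city2" needs no nested query:
city_query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"city": "city2"}},
                {"range": {"available": {"gte": 1}}},
            ]
        }
    }
}
```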

If you never expect to have more than 100 (or 1000) cities, you could have one field per city, like this:

{
  "product": "product1",
  "city1": 0,
  "city2": 1,
  ...
}

Note that if you do this, you don't actually need to have all the cities in each source document; missing keys are fine. The downside is that you need to know in advance the names of the cities you're interested in, in order to query them. This is probably not the right solution for you, but it is useful in some use cases.
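For the one-field-per-city layout, a minimal Python sketch of the conversion and of a matching query might look like this. The function name `to_city_fields` is hypothetical, and the sketch assumes no city is literally named `product`, since that would collide with the product field.

```python
def to_city_fields(product, counts):
    """Build a document with one numeric field per city (missing cities are fine)."""
    if "product" in counts:
        raise ValueError("a city named 'product' would overwrite the product field")
    doc = {"product": product}
    doc.update(counts)
    return doc


doc = to_city_fields("product1", {"city1": 0, "city2": 1})
print(doc)

# The city name has to be known up front, because it is the field name:
query = {"query": {"range": {"city2": {"gte": 1}}}}
```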

If your available numbers are always low, and you expect this to stay the case (for example, you never expect to have more than 10 available), you could do something like this:

{
  "product": "product1",
  "available": {
    "0": ["city1", "city2"],
    "1": ["city2"],
    "2": [],
    ...
  }
}

So if you want to see whether city1 carries the product at all (regardless of whether any units are available), you can query available.0, and if you want to see whether it has at least 1 available, you can query available.1, etc. If you want to list the cities where product1 has at least 1 available, you can do a terms aggregation on available.1. If you use this kind of data structure, you would probably want to add another field containing the exact numbers for each city (not nested, so not very useful for querying, but convenient once you've retrieved the data).
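The bucketed structure can be generated from plain per-city counts. The Python sketch below is one way to do it; `bucket_by_threshold` and the extra `exact_counts` field are names made up for illustration, and the term query assumes the `available.N` fields are mapped as keywords.

```python
def bucket_by_threshold(counts, max_level):
    """Map each level N to the cities that have at least N units available."""
    return {
        str(level): [city for city, n in counts.items() if n >= level]
        for level in range(max_level + 1)
    }


counts = {"city1": 0, "city2": 1}
doc = {
    "product": "product1",
    "available": bucket_by_threshold(counts, max_level=2),
    "exact_counts": counts,  # hypothetical convenience field, not used for search
}
print(doc["available"])

# "Does city2 have at least 1 unit?" becomes a plain term query:
query = {"query": {"term": {"available.1": "city2"}}}
```

For the sample counts this reproduces the document shown above: both cities under "0", only city2 under "1", and an empty list under "2".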


1

I would store them as follows:

{
  "product" : "product1",
  "city-avail" : [
      {
        "city" : "city1",
        "available" : 0
      },
      {
        "city" : "city2",
        "available" : 1
      }
    ]
}
{
  "product" : "product2",
  "city-avail" : [
      {
        "city" : "city3",
        "available" : 1
      },
      {
        "city" : "city2",
        "available" : 0
      }
    ]
}

1 Comment

fair enough but why? 🤔
1

For complex data (like key-value pairs) I would use a nested field type. For simple data, like a list of numbers or strings, I would use a plain array (any field in Elasticsearch can hold multiple values).

So in your case, since you are going to associate each city with its number of available items as "objects", I would use a nested field. Then you can search and aggregate on the nested fields.

