2

I am using apache spark 1.5 dataframe with elasticsearch, I am try to filter id from a column that contains a list(array) of ids.

For example the mapping of elasticsearch column is looks like this:

    {
        "people":{
            "properties":{
                "artist":{
                   "properties":{
                      "id":{
                         "index":"not_analyzed",
                         "type":"string"
                       },
                       "name":{
                          "type":"string",
                          "index":"not_analyzed",
                       }
                   }
               }
          }
    }

The example data format will be like following

{
    "people": {
        "artist": {
            [
                  {
                       "id": "153",
                       "name": "Tom"
                  },
                  {
                       "id": "15389",
                       "name": "Cok"
                  }
            ]
        }
    }
},
{
    "people": {
        "artist": {
            [
                  {
                       "id": "369",
                       "name": "Carl"
                  },
                  {
                       "id": "15389",
                       "name": "Cok"
                  },
                 {
                       "id": "698",
                       "name": "Sol"
                  }
            ]
        }
    }
}

In spark I try this:

val peopleId  = 152
val dataFrame = sqlContext.read
     .format("org.elasticsearch.spark.sql")
     .load("index/type")

dataFrame.filter(dataFrame("people.artist.id").contains(peopleId))
    .select("people_sequence.artist.id")

I got all the id that is contains 152, for example 1523 , 152978 but not only id == 152

Then I tried

dataFrame.filter(dataFrame("people.artist.id").equalTo(peopleId))
    .select("people.artist.id")

I get empty, I understand why, it's because I have array of people.artist.id

Can anyone tell me how to filter when I have list of ids ?

0

1 Answer 1

8

In Spark 1.5+ you can use array_contains function:

df.where(array_contains($"people.artist.id", "153"))

If you use an earlier version you can try an UDF like this:

val containsId = udf(
  (rs: Seq[Row], v: String) => rs.map(_.getAs[String]("id")).exists(_ == v))
df.where(containsId($"people.artist", lit("153")))
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.