
I'm running PySpark against Elasticsearch using the elasticsearch-hadoop connector. I can read from a desired index using:

    from pyspark import SparkConf, SparkContext

    es_read_conf = {
        "es.nodes": "127.0.0.1",
        "es.port": "9200",
        "es.resource": "myIndex_*/myType"
    }
    conf = SparkConf().setAppName("devproj")
    sc = SparkContext(conf=conf)

    es_rdd = sc.newAPIHadoopRDD(
        inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf=es_read_conf
    )

This works fine, and I can wildcard the index.

How do I wildcard the document "type"? Or, how could I match more than one type, or even _all?


1 Answer


For all types, you can omit the type from the resource entirely: "es.resource": "myIndex_*".

To match types by a wildcard-like pattern, you would need a query, e.g. a prefix query on the `_type` field:

    "prefix": {
      "_type": {
        "value": "test"
      }
    }
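As a sketch of how this could be wired together, the prefix query can be serialized to JSON and passed to the connector via its `es.query` setting (the index pattern and the "test" prefix here are carried over from the snippets above; adapt them to your data):

    import json

    # Prefix query matching every document whose _type starts with "test".
    # Wrapping it in a top-level "query" object gives a complete query body.
    query = json.dumps({
        "query": {
            "prefix": {
                "_type": {
                    "value": "test"
                }
            }
        }
    })

    es_read_conf = {
        "es.nodes": "127.0.0.1",
        "es.port": "9200",
        "es.resource": "myIndex_*",  # no type suffix -> all types
        "es.query": query,           # connector pushes this query to ES
    }

This `es_read_conf` dict then replaces the one passed to `sc.newAPIHadoopRDD` in the question.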

1 Comment

Okay, this worked. If I leave out the "type", it selects all types.
