
I'm looking to index a CSV file into Elasticsearch without using Logstash. I am using the elasticsearch-dsl high-level library.

Given a CSV with header for example:

name,address,url
adam,hills 32,http://rockit.com
jane,valleys 23,http://popit.com

What would be the best way to index all the data by these fields? Ultimately I want each row to become a document like this:

{
    "name": "adam",
    "address": "hills 32",
    "url": "http://rockit.com"
}
  • It looks like elasticsearch-dsl depends on the elasticsearch-py library. Check out elasticsearch-py's docs for an example of how to insert a document. Commented Jan 10, 2017 at 17:14
  • The question is not about indexing individual documents, but about a technique for indexing entire .csv files into elasticsearch Commented Jan 10, 2017 at 19:06

2 Answers


This kind of task is easier with the lower-level elasticsearch-py library:

from elasticsearch import helpers, Elasticsearch
import csv

es = Elasticsearch()

with open('/tmp/x.csv') as f:
    reader = csv.DictReader(f)
    helpers.bulk(es, reader, index='my-index', doc_type='my-type')

9 Comments

This is the kind of answer I was looking for; I will try it in a few hours and respond accordingly, thanks!
Exactly the Pythonic and elegant solution I was looking for - Thanks!
What about the mapping? How can we specify the type of each field?
@shinz4u just wrap the reader in something that adds the desired id as an _id key in each dictionary; it will then be picked up by elasticsearch
@seamaner that just means that elasticsearch cannot process the data you are sending fast enough. You can increase the timeout (10s by default) by passing timeout=N to Elasticsearch when instantiating it (where N > 10)
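To make the _id suggestion from the comment above concrete, here is a minimal sketch. The `with_ids` wrapper and the choice of the `name` column as the id are assumptions, not part of the original answer; `helpers.bulk` treats an `_id` key in an action dict as document metadata rather than source data:

```python
import csv
import io

# Hypothetical wrapper (not from the answer): copy one CSV column into an
# "_id" key on each row so helpers.bulk uses it as the document id.
def with_ids(rows, id_field='name'):
    for row in rows:
        row = dict(row)
        row['_id'] = row[id_field]  # assumes this column is unique per row
        yield row

sample = "name,address,url\nadam,hills 32,http://rockit.com\n"
rows = list(with_ids(csv.DictReader(io.StringIO(sample))))

# Then, as in the answer:
# helpers.bulk(es, with_ids(reader), index='my-index', doc_type='my-type')
```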

If you want to build an Elasticsearch index from a .tsv/.csv file with strict types and a model for better filtering, you can do something like this:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
from elasticsearch_dsl import DocType, Text
from elasticsearch_dsl.connections import connections

# elasticsearch-dsl needs a default connection for init()
connections.create_connection(hosts=['localhost'])

class ElementIndex(DocType):
    # one typed field per CSV column (here: the columns from the question)
    name = Text()
    address = Text()

    class Meta:
        index = 'index_name'

def indexing(row):
    # build a bulk action dict for one CSV row
    obj = ElementIndex(
        name=str(row['name']),
        address=str(row['address'])
    )
    return obj.to_dict(include_meta=True)

def bulk_indexing(result):
    # result: an iterable of dicts read from the .csv/.tsv source
    ElementIndex.init()  # create the index and mapping
    es = Elasticsearch()
    bulk(client=es, actions=(indexing(row) for row in result))
    es.indices.refresh()
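The snippet above leaves `result` undefined. A minimal sketch of building it from the question's CSV, using `csv.DictReader` (the inline sample stands in for opening the real file):

```python
import csv
import io

# Build the `result` list that bulk_indexing() iterates over, using the
# CSV layout from the question; in practice, read from the real file.
data = """name,address,url
adam,hills 32,http://rockit.com
jane,valleys 23,http://popit.com
"""
result = list(csv.DictReader(io.StringIO(data)))
# each entry is a plain dict keyed by the header row
```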

Comments
