
I'm looking to index a CSV file into Elasticsearch without using Logstash. I am using the elasticsearch-dsl high-level library.

Given a CSV with header for example:

name,address,url
adam,hills 32,http://rockit.com
jane,valleys 23,http://popit.com

What would be the best way to index all the data by these fields? Ultimately I want each row to become a document like this:

{
    "name": "adam",
    "address": "hills 32",
    "url": "http://rockit.com"
}
  • It looks like elasticsearch-dsl depends on the elasticsearch-py library. Check out elasticsearch-py's docs for an example of how to insert a document. Commented Jan 10, 2017 at 17:14
  • The question is not about indexing individual documents, but about a technique for indexing entire .csv files into elasticsearch Commented Jan 10, 2017 at 19:06

2 Answers


This kind of task is easier with the lower-level elasticsearch-py library:

from elasticsearch import helpers, Elasticsearch
import csv

es = Elasticsearch()

with open('/tmp/x.csv') as f:
    reader = csv.DictReader(f)
    helpers.bulk(es, reader, index='my-index', doc_type='my-type')

9 Comments

This is the kind of answer I was looking for; I will try it in a few hours and respond accordingly, thanks!
Exactly the Pythonic and elegant solution I was looking for - Thanks!
What about the mapping? How can we specify the type of each field?
@shinz4u just wrap the reader in something that adds the desired id as an _id key in each dictionary; it will then be picked up by elasticsearch
@seamaner that just means that elasticsearch cannot process the data you are sending fast enough. You can increase the timeout (10s by default) by passing timeout=N to Elasticsearch when instantiating it (where N > 10)
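To make the _id suggestion from the comment above concrete, here is a minimal sketch. The `with_ids` wrapper and the choice of the `name` column as the id are assumptions, not part of the original answer; `helpers.bulk` treats an `_id` key in an action dict as document metadata rather than source data:

```python
import csv
import io

# Hypothetical wrapper (not from the answer): copy one CSV column into an
# "_id" key on each row so helpers.bulk uses it as the document id.
def with_ids(rows, id_field='name'):
    for row in rows:
        row = dict(row)
        row['_id'] = row[id_field]  # assumes this column is unique per row
        yield row

sample = "name,address,url\nadam,hills 32,http://rockit.com\n"
rows = list(with_ids(csv.DictReader(io.StringIO(sample))))

# Then, as in the answer:
# helpers.bulk(es, with_ids(reader), index='my-index', doc_type='my-type')
```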

If you want to build an Elasticsearch index from a .tsv/.csv file with strict types and a model for better filtering, you can do something like this:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
from elasticsearch_dsl import DocType, Text
from elasticsearch_dsl.connections import connections

# elasticsearch-dsl needs a default connection for init()
connections.create_connection(hosts=['localhost'])

class ElementIndex(DocType):
    # one typed field per CSV column (here: the columns from the question)
    name = Text()
    address = Text()

    class Meta:
        index = 'index_name'

def indexing(row):
    # build a bulk action dict for one CSV row
    obj = ElementIndex(
        name=str(row['name']),
        address=str(row['address'])
    )
    return obj.to_dict(include_meta=True)

def bulk_indexing(result):
    # result: an iterable of dicts read from the .csv/.tsv source
    ElementIndex.init()  # create the index and mapping
    es = Elasticsearch()
    bulk(client=es, actions=(indexing(row) for row in result))
    es.indices.refresh()
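The snippet above leaves `result` undefined. A minimal sketch of building it from the question's CSV, using `csv.DictReader` (the inline sample stands in for opening the real file):

```python
import csv
import io

# Build the `result` list that bulk_indexing() iterates over, using the
# CSV layout from the question; in practice, read from the real file.
data = """name,address,url
adam,hills 32,http://rockit.com
jane,valleys 23,http://popit.com
"""
result = list(csv.DictReader(io.StringIO(data)))
# each entry is a plain dict keyed by the header row
```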

Comments
