
I'm working on a project that requires documents to be OCR'd, with the extracted text stored and made searchable. The biggest obstacle is the performance of full-text search over that extracted text.

My idea is to use a combination of SQL Server for data persistence and Elasticsearch for performant searching. When a document has been OCR'd, its text would be inserted into the database and then, if that insert was successful, indexed by Elasticsearch.
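As a minimal sketch of that dual-write flow (sqlite3 standing in for SQL Server and a plain dict standing in for the Elasticsearch index; the table and field names `documents`, `id`, `body` are illustrative assumptions, not from the question):

```python
import sqlite3

# sqlite3 stands in for SQL Server; a dict stands in for the ES index.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, body TEXT)")

es_index = {}  # stand-in for an Elasticsearch index

def store_and_index(doc_id, ocr_text):
    # 1. Persist to the database first; only index if the insert succeeds.
    with db:  # commits on success, rolls back on error
        db.execute("INSERT INTO documents (id, body) VALUES (?, ?)",
                   (doc_id, ocr_text))
    # 2. Index for search only after the write is durable.
    es_index[doc_id] = {"body": ocr_text}

store_and_index(1, "scanned invoice text")
```

Note that the index step can still fail after the database commit, so a real implementation needs retries or a sync job to keep the two stores consistent.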

Can anyone see any caveats with this setup or offer any insight as to how it could be done better?

2 Comments
  • Your approach seems valid. Once you determine a document has passed OCR, you can either save the text to the RDBMS and then have ES index it, or index it directly in ES. IMO I would skip the RDBMS, since you will not be searching it, but ES. Commented Jun 1, 2017 at 13:10
  • You should also store the document in the RDBMS or other persistent storage so you can easily recreate the ES index with different mapping settings. Commented Jun 1, 2017 at 13:26
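The second comment's point, that keeping the raw text in persistent storage lets you drop and rebuild the search index with new mapping settings, can be sketched like this (sqlite3 and a dict again stand in for SQL Server and Elasticsearch; all names are illustrative):

```python
import sqlite3

# The RDBMS remains the source of truth for the OCR'd text.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, body TEXT)")
db.executemany("INSERT INTO documents VALUES (?, ?)",
               [(1, "first scanned page"), (2, "second scanned page")])

def rebuild_index(transform):
    """Recreate the search index from the source of truth, applying a
    (possibly new) per-document transform -- analogous to reindexing
    into a freshly created ES index with different mappings."""
    new_index = {}
    for doc_id, body in db.execute("SELECT id, body FROM documents"):
        new_index[doc_id] = transform(body)
    return new_index

# e.g. a new "mapping" that also stores a lowercased field
index_v2 = rebuild_index(lambda body: {"body": body,
                                       "body_lower": body.lower()})
```

With real Elasticsearch you would create a new index with the new mappings and bulk-reindex into it, then switch an alias over; without the source text in the RDBMS, a mapping change would require re-running OCR.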

1 Answer


I developed a very similar project, using SQL Server for the complete ERP storage. I wrote a Windows service that continuously syncs the data I want to search into an Elasticsearch cluster. It runs perfectly: on one side the database with all the data, and on the other the ES cluster for fast searching.


2 Comments

Sounds very similar! How did you achieve 'permanent synchronization' between the RDBMS and ES?
I wrote a Windows service using the NEST API to sync my data to ES. The service runs in defined cycles and queries the database for changes (with LINQ), then indexes the changed data in the ES cluster.
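The commenter's actual service is C#/NEST, but the polling-sync cycle it describes can be sketched in a few lines (sqlite3 and a dict stand in for SQL Server and the ES cluster; the `version` column, playing the role of something like SQL Server's rowversion, is an illustrative assumption):

```python
import sqlite3

# Source database with a monotonically increasing change marker per row.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE documents "
           "(id INTEGER PRIMARY KEY, body TEXT, version INTEGER)")
db.executemany("INSERT INTO documents VALUES (?, ?, ?)",
               [(1, "old text", 1), (2, "newer text", 3)])

es_index = {}          # stand-in for the ES cluster
last_synced_version = 0

def sync_cycle():
    """One cycle of the sync service: pull only rows modified since the
    previous cycle, index them, and advance the high-water mark."""
    global last_synced_version
    rows = db.execute(
        "SELECT id, body, version FROM documents WHERE version > ?",
        (last_synced_version,)).fetchall()
    for doc_id, body, version in rows:
        es_index[doc_id] = {"body": body}
        last_synced_version = max(last_synced_version, version)
    return len(rows)

synced = sync_cycle()  # first cycle picks up every row
```

Subsequent cycles then see no rows until something in the table changes, which is what makes the periodic polling cheap.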
