Indexing html with Sphinx without complex scripts

Question

As far as I know, the Sphinx search engine can index HTML, but it doesn't have any built-in drivers for it like it does for SQL data. That means we have to parse and prepare the HTML content ourselves.

Does anyone know of any drivers or third-party add-ons that make Sphinx index HTML automatically?

Can anyone help? Thanks in advance.

barryhunter · Accepted Answer · 2014-05-27 21:03:46Z

Well, if you have a database of the .html filenames, you can use

http://sphinxsearch.com/docs/current.html#conf-sql-file-field

to index them. Sphinx will load each individual file in turn and index its contents.

Combine it with http://sphinxsearch.com/docs/current.html#conf-html-strip so that the HTML markup is stripped before indexing.
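As a rough sketch of how the two directives fit together, a sphinx.conf fragment could look like the one below. The table `pages`, its columns `id` and `path`, the connection settings, and the index name `html_pages` are all hypothetical; only `sql_file_field` and `html_strip` come from the answer above.

```
source html_pages
{
    type           = mysql
    sql_host       = localhost      # assumed connection details
    sql_user       = sphinx
    sql_pass       = secret
    sql_db         = site

    # hypothetical table: pages(id, path), where path holds the .html filename
    sql_query      = SELECT id, path FROM pages

    # treat the path column as a filename and index the contents of that file
    sql_file_field = path
}

index html_pages
{
    source     = html_pages
    path       = /var/lib/sphinx/html_pages   # assumed index location
    html_strip = 1                            # strip HTML markup before indexing
}
```

With something like this in place, running indexer on `html_pages` should read each listed file, so no separate parsing script is needed.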

4 Comments

see613 Over a year ago

Sorry, one more thing. Is there any way to get the content of a certain tag (e.g. "title", "h1") and put it in a separate index field, using sql_file_field or some other Sphinx functionality?

barryhunter Over a year ago

Sphinx has a zones feature to deal with that, sphinxsearch.com/docs/current.html#conf-index-zones, which enables the special ZONE limit operator.
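A minimal sketch of what zone indexing might look like in sphinx.conf, assuming a hypothetical index named `html_pages` with a matching source of the same name. Zones do not create a separate field; they mark tag spans inside the existing field, and as far as I recall they rely on the HTML stripper being enabled.

```
index html_pages
{
    source      = html_pages                  # hypothetical source name
    path        = /var/lib/sphinx/html_pages  # assumed index location
    html_strip  = 1                           # zones are detected by the HTML stripper
    index_zones = title, h1                   # tags whose spans should be recorded as zones
}
```

In extended query syntax, a query such as `ZONE:(title,h1) keyword` would then restrict matching of `keyword` to text inside those tags.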

see613 Over a year ago

One more question. In some cases I get matches for my keywords (joxi.ru/joGIU_3JTJA0VFTn9Hk) and everything works great, but in other cases there are no matches. I do have hits (joxi.ru/I4KIU_3JTJADVOh9kwo), and I'm absolutely sure the text contains these keywords.

barryhunter Over a year ago

The 'words' data is raw: it is the number of matches in the whole index, regardless of ANY other filters. The actual matches (and total_found), by contrast, honour all the particular filters and the whole text query. So I suspect that even though one of the words matches, the whole query doesn't.
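One way to see the difference, assuming searchd is reachable over the SphinxQL (MySQL protocol) listener and the index is the hypothetical `html_pages`, is to run the query and then ask for its metadata:

```sql
-- hypothetical index name and keywords
SELECT id FROM html_pages WHERE MATCH('first second');

-- reports total and total_found for the query above,
-- plus keyword[N], docs[N] and hits[N] for each keyword
SHOW META;
```

If docs[N] and hits[N] are non-zero for a keyword while total_found is 0, each word exists somewhere in the index but no single document satisfies the whole query and its filters.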
