As far as I know, the Sphinx search engine can index HTML, but it has no built-in driver for it the way it does for SQL data. That means we have to parse and prepare the HTML content ourselves.

Does anyone know of any drivers or third-party add-ons that make Sphinx index HTML automatically?

Can anyone help? Thanks in advance.

1 Answer

Well, if you have a database of the .html filenames, you can use

http://sphinxsearch.com/docs/current.html#conf-sql-file-field

to index them. Sphinx will load each individual file in turn and index its contents.

Combine it with http://sphinxsearch.com/docs/current.html#conf-html-strip so that the HTML markup is stripped before indexing.
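
As a rough sketch of how the two directives fit together, assuming a MySQL table named `pages` with an `id` column and a `path` column holding the location of each .html file (the table, column names, credentials, and index path are all made up for illustration), the config could look something like this:

```
# Hypothetical source: one row per HTML file on disk.
source html_pages
{
    type            = mysql
    sql_host        = localhost
    sql_user        = sphinx      # placeholder credentials
    sql_pass        = secret
    sql_db          = site

    # First column is the document id; "path" holds the file location.
    sql_query       = SELECT id, path FROM pages

    # Tell the indexer to load the file named in "path" and index
    # its contents as a full-text field instead of the path string.
    sql_file_field  = path
}

index html_pages
{
    source          = html_pages
    path            = /var/lib/sphinx/html_pages   # placeholder location

    # Strip HTML markup from the loaded file contents before indexing.
    html_strip      = 1
}
```

Note that `sql_file_field` goes in the source section while `html_strip` goes in the index section. Large files may also need the indexer's file-field buffer limits raised; check the documentation for the version you run.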

4 Comments

Sorry. Is there any way to get the content of a certain tag (e.g. "title", "h1") and put it in a separate index field using sql_file_field or some other Sphinx functionality?
It has a zones feature to deal with that: sphinxsearch.com/docs/current.html#conf-index-zones. It enables the special ZONE limit operator.
One more question. In some cases I get matches for the keywords (joxi.ru/joGIU_3JTJA0VFTn9Hk) and everything works great, but in other cases there are no matches. I do have hits (joxi.ru/I4KIU_3JTJADVOh9kwo) and I'm absolutely sure these words contain keywords.
The 'words' data is raw: it is the number of matches in the whole index, regardless of ANY other filters. The actual matches (and total_found), on the other hand, honour all the particular filters and the whole text query. So I suspect that even though one of the words matches, the whole query doesn't.
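
For the zones suggestion in the comments above, a minimal sketch of the index section might look like the following. The index name, source name, and path are placeholders, and zone indexing is assumed here to need the HTML stripper enabled, so both directives are shown together. Note that zones do not create separate fields; they let a query be restricted to text inside the listed tags.

```
index html_pages
{
    source          = html_pages                   # hypothetical source name
    path            = /var/lib/sphinx/html_pages   # placeholder location

    html_strip      = 1

    # Remember where <title> and <h1> spans start and end, so that
    # queries can be limited to text inside those tags.
    index_zones     = title, h1
}

# Example query in extended syntax, limiting the match to those zones
# ("keyword" is a stand-in for the real search term):
#
#   ZONE:(title,h1) keyword
```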
