I've got a table in SQL Server with a full-text index on an NVARCHAR column, and I want my website's users to be able to search through the table for data that matches their search string. I want to use the CONTAINS predicate to improve performance.
I'm aware that I could use SQL Server's LIKE operator to achieve the same thing, but as detailed here on StackOverflow, LIKE can't use full-text indexes in the same way, impacting performance.
Note that Microsoft's Query With Full-Text Search page details asterisks, double quotes, FORMSOF, AND, OR, NOT, and a whole bunch of other control characters and keywords for CONTAINS. I can write a parser to validate/sanitise the input myself, but not only is that difficult and time-consuming, I run the risk that future versions introduce new keywords that my validation misses. Anyway, this really feels like the kind of validation that Microsoft should've written themselves, so that I can reuse it easily.
How do I properly sanitise user input to CONTAINS to avoid all possible special input? Alternatively, how can I validate that the input doesn't contain special input, so that I can return a validation message to the user if it does?
Let me be clear: I don't need to give users fancy functionality, such as the ability to search for rows that contain either 'nymph' or 'jocks'. (Even if I did, I'd want the ability to build that statement manually, and I'd either sanitise both of their inputs to avoid malformed queries, or I'd validate them so that I can reject inputs with special input.) I just want them to be able to type a word and click Search without the risk that a typo sends them to the 500 error page... nor the risk that a script kiddie crashes my website with a carefully crafted string that performs a Denial-of-Service attack against CONTAINS.
(Bonus question: is there a simpler way to build a performant and secure text search?)
Run this in SQL Server 2022 Express:
CREATE TABLE TestData
(
    Id BIGINT NOT NULL IDENTITY (1, 1) CONSTRAINT PK_TestData_Id PRIMARY KEY CLUSTERED,
    DataCol NVARCHAR(200) NOT NULL
)
GO
INSERT INTO TestData (DataCol) VALUES ('Waltz, bad nymph, for quick jigs vex.')
INSERT INTO TestData (DataCol) VALUES ('Sphinx of black quartz, judge my vow.')
INSERT INTO TestData (DataCol) VALUES ('Glib jocks, quiz nymph to vex dwarf.')
INSERT INTO TestData (DataCol) VALUES ('Cwm fjord glyphs vext bank quiz.')
--plus millions of others
GO
--Not necessary if your database already has a default full-text catalog
CREATE FULLTEXT CATALOG TestDataCatalog AS DEFAULT;
GO
CREATE FULLTEXT INDEX ON TestData (DataCol LANGUAGE 1033)
    KEY INDEX PK_TestData_Id
    WITH (CHANGE_TRACKING = OFF, STOPLIST = SYSTEM);
GO
CREATE PROCEDURE SearchTestData
    @Name NVARCHAR(200)
AS
BEGIN
    SELECT DataCol
    FROM TestData
    WHERE CONTAINS(DataCol, @Name)
END
Now let's test the above:
EXEC SearchTestData 'nymph'
We get back both rows that contain the word 'nymph' and no other rows. We can confirm with SSMS's Display Estimated Execution Plan button that the stored procedure is using the full-text index.
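As a complementary check to the execution plan, the catalog view sys.fulltext_indexes can show whether the full-text index on TestData exists, is enabled, and has finished populating. This is only a sketch: it says nothing about whether a particular query uses the index, and the column aliases are my own.

```sql
-- Show the state of the full-text index defined on TestData.
-- has_crawl_completed = 1 means the last population has finished.
SELECT OBJECT_NAME(i.object_id) AS TableName,
       i.is_enabled,
       i.change_tracking_state_desc,
       i.has_crawl_completed
FROM sys.fulltext_indexes AS i
WHERE i.object_id = OBJECT_ID(N'TestData');
```

If no row comes back, the table has no full-text index; if has_crawl_completed is 0, searches may miss rows that have not been indexed yet.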
So far so good, right? Wrong. The user can put CONTAINS-specific control characters and keywords into their input, accidentally or deliberately. Try this line:
EXEC SearchTestData 'nymph,'
Did you think that this would return the single entry that contains 'nymph,', or perhaps both entries with the word 'nymph' and a comma in them? Nope, it fails with an error:
Syntax error near ',' in the full-text search condition 'nymph,'.
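One possible way to turn this failure into a validation message, rather than a 500 page, is to call the procedure inside TRY...CATCH and hand the error details back as an ordinary result set. This is a minimal sketch, assuming the full-text syntax error is raised inside the called procedure and so can be caught by the outer scope. The ErrorNumber and ErrorMessage column names are my own. It only reports malformed input after the fact; it does not address the Denial-of-Service concern.

```sql
BEGIN TRY
    -- The same call that failed above.
    EXEC SearchTestData N'nymph,';
END TRY
BEGIN CATCH
    -- Return the error details as a result set so the caller can show
    -- a validation message instead of an unhandled error.
    SELECT ERROR_NUMBER()  AS ErrorNumber,
           ERROR_MESSAGE() AS ErrorMessage;
END CATCH
```

I believe this particular error is reported as Msg 7630, but treat that number as an assumption to confirm on your own instance before filtering on it.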
Let's try SQL injection:
EXEC SearchTestData 'dsg; DROP TABLE SearchTestData; --'
At least SQL Server's query parameterisation prevents SQL injection attacks from the website input getting into the SearchTestData stored procedure, but we do get the same user input problem as before:
Syntax error near 'DROP' in the full-text search condition 'dsg; DROP TABLE SearchTestData; --'.
Rather than passing user input directly to CONTAINS, have them provide values that you will then build to make an appropriate expression for CONTAINS.
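The idea of building the CONTAINS expression yourself could be sketched as follows: treat whatever the user typed as a single phrase by wrapping it in double quotes, after replacing the two characters that I understand to keep a special meaning inside a quoted phrase (the double quote itself and the asterisk used for prefix matching). The procedure name SearchTestDataPhrase and the variables @Cleaned and @Condition are hypothetical, and I have not tested this against every possible input, so treat it as a starting point rather than a complete sanitiser.

```sql
-- Hypothetical procedure; the name is not part of the original post.
CREATE PROCEDURE SearchTestDataPhrase
    @Name NVARCHAR(200)
AS
BEGIN
    SET NOCOUNT ON;

    -- Replace the double quote (it would end the phrase early) and the
    -- asterisk (prefix matching) with spaces, then trim the result.
    DECLARE @Cleaned NVARCHAR(200) =
        LTRIM(RTRIM(REPLACE(REPLACE(ISNULL(@Name, N''), N'"', N' '), N'*', N' ')));

    -- CONTAINS rejects a null or empty condition, so return no rows instead.
    IF @Cleaned = N''
    BEGIN
        SELECT DataCol FROM TestData WHERE 1 = 0;
        RETURN;
    END

    -- Wrap the remaining text in double quotes so it is parsed as one phrase.
    DECLARE @Condition NVARCHAR(202) = N'"' + @Cleaned + N'"';

    SELECT DataCol
    FROM TestData
    WHERE CONTAINS(DataCol, @Condition);
END
```

With this in place, EXEC SearchTestDataPhrase N'nymph,' should be searched as the phrase "nymph," and not parsed as CONTAINS syntax, because the word breaker discards punctuation inside a phrase. Multi-word input is matched as an exact phrase, which is stricter than a search for each word separately.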