
I need to build a function to replace multiple substrings for all rows in a table.

Performance is not a big concern, as this is a one-time operation, but there are 48 mappings and roughly 30,000 rows. I know looping over the whole database 48 times is quite stupid, but SQL is not my wheelhouse. If this were Java or C++, it'd be cake.

Basically, I need the SQL analog of the following function. If SQL can't short-circuit loops, that's fine. I've seen the SQL replace function, but encapsulating it properly in a user-defined function is my major stumbling block.

I'm using Microsoft SQL Server if that produces any particular quirks.

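mapping[] maps = { {" st ", " Street "}, {" st. ", " Street "}, ...};

for (row r : table) {
    String orig = r.data(colName);
    for (mapping m : maps) {
        // Strings are immutable, so the result of replace() has to be kept.
        String replaced = orig.replace(m.first, m.second);
        if (!replaced.equals(orig)) {        // stop at the first mapping that changes the value
            r.setData(colName, replaced);    // setData is a hypothetical setter
            break;
        }
    }
}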
  • Are the things you'd have in maps constant (i.e. literal), or something dynamic, like from another table? Commented Jul 19, 2017 at 23:22
  • What schema are you trying to update? Can you include the table definitions and some sample data before/after? There's not enough information in here to really answer your question. Commented Jul 19, 2017 at 23:23
  • Why do you need to do this in "better" SQL if you don't care about performance? You have a loop you know does exactly what you want. Just use it. Commented Jul 19, 2017 at 23:25
  • Without knowing more, this can be done by simply chaining calls to REPLACE, e.g. REPLACE(REPLACE(colName, ' st. ', ' Street '), ' st ', ' Street ') Commented Jul 19, 2017 at 23:26
  • @hatchet - I don't think hard-coding it with multiple REPLACE calls is what he was asking for, and you didn't think so either; that is why you put it in a comment and not an answer. Commented Jul 19, 2017 at 23:40

2 Answers

@Hogan has the right idea. This syntax should be closer to working:

WITH map as (
      SELECT v.*
      FROM (VALUES (' st ', ' Street ', 1),
                   (' st. ', ' Street ', 2)
           ) v(str, repstr, n)
     ),
     cte as (
      SELECT replace(t.field, map.str, map.repstr) as field, map.n as n
      FROM t JOIN
           map
           ON map.n = 1
      UNION ALL
      SELECT replace(cte.field, map.str, map.repstr) as field, map.n
      FROM cte JOIN
           map
           ON map.n = cte.n + 1
     )
SELECT field 
FROM (SELECT cte.*, MAX(cte.n) OVER (PARTITION BY cte.field) as maxn
      FROM cte
     ) x
WHERE n = maxn;

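You may want to include more fields from the original table in the CTE. In particular, carrying the table's key through and partitioning by it (rather than by field) keeps the final filter from also returning intermediate results that differ from the final value.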
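As a rough sketch of carrying a key through the query above: this assumes t has a unique key column named id (a hypothetical name, not given in the question) and that field is a non-max varchar column. Every mapping is applied in sequence, with no short-circuit after the first change.

```sql
-- Assumption: t(id, field), where id uniquely identifies a row (hypothetical column name).
WITH map AS (
      SELECT v.*
      FROM (VALUES (' st ', ' Street ', 1),
                   (' st. ', ' Street ', 2)
           ) v(str, repstr, n)
     ),
     cte AS (
      -- Anchor: apply mapping 1 to every row.
      -- The CAST keeps the column type identical in both halves of the recursive CTE.
      SELECT t.id,
             CAST(REPLACE(t.field, map.str, map.repstr) AS varchar(8000)) AS field,
             map.n AS n
      FROM t JOIN
           map
           ON map.n = 1
      UNION ALL
      -- Recursive step: apply mapping n + 1 to the output of mapping n.
      SELECT cte.id,
             CAST(REPLACE(cte.field, map.str, map.repstr) AS varchar(8000)),
             map.n
      FROM cte JOIN
           map
           ON map.n = cte.n + 1
     )
SELECT x.id, x.field
FROM (SELECT cte.*, MAX(cte.n) OVER (PARTITION BY cte.id) AS maxn
      FROM cte
     ) x
WHERE x.n = x.maxn;   -- keep only the row produced by the last mapping
```

Partitioning by id means each source row contributes exactly one output row, the one produced after the last mapping has been applied.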

CREATE FUNCTION [dbo].[StandardizeAddress](@address varchar(123))
RETURNS varchar(250)
WITH SCHEMABINDING
AS
BEGIN
    RETURN
        REPLACE(REPLACE(
                @address + ' '
                , ' st ', ' Street ')
                , ' st. ', ' Street ')
END

Creating a scalar function like this is how we did it. Using the code above to standardize addresses from a table of 171,000 rows took 240 ms. Our actual function, which has more than 80 substitutions and does some other manipulations, takes 5 seconds for 171,000 rows. However, we store the standardized version of addresses, because we do complex person searches and precompute standardized values for performance's sake. The function therefore runs only once, when a row is added or an address is modified, so its speed is not an issue.
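For the one-time cleanup in the question, a function like this could be applied to every row with a single UPDATE. This is only a sketch: the table dbo.Addresses and the columns address and address_std are hypothetical names, and the RTRIM is there because the function above appends a trailing space before matching.

```sql
-- Hypothetical names: dbo.Addresses(address, address_std).
-- Writes the standardized form next to the original value for every row.
UPDATE a
SET    a.address_std = RTRIM(dbo.StandardizeAddress(a.address))
FROM   dbo.Addresses AS a;
```

Writing to a separate column keeps the original data intact, so the mappings can be adjusted and the statement rerun.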

For comparison, Gordon's solution takes 4.5 seconds against the same dataset (vs. 240 ms for chained REPLACE). With 4 replacements instead of two, the CTE solution takes 7.8 seconds, vs. 275 ms for REPLACE.

One caveat: there is a limit to how many function calls can be nested. According to another question on Stack Overflow, the limit is 244, which is a fair amount larger than the default recursion limit for recursive CTEs (100).
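For reference, the recursive CTE limit can be raised per statement with the MAXRECURSION query hint. The counter CTE below is purely illustrative (nums is a made-up name): it recurses 149 times, which fails under the default limit but runs with the hint.

```sql
-- Illustrative only: count from 1 to 150 with a recursive CTE.
WITH nums AS (
      SELECT 1 AS n
      UNION ALL
      SELECT n + 1 FROM nums WHERE n < 150
     )
SELECT MAX(n) AS deepest
FROM nums
OPTION (MAXRECURSION 200);   -- default is 100; 0 removes the limit
```

With 48 mappings, a recursive CTE that applies one mapping per level stays under the default limit, so the hint only matters for larger mapping sets.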

Another option that's a bit slower (about 75% more time) than nested REPLACE functions is something like the following:

select c3.address from (select REPLACE(@address, ' st ', ' Street ') address) c1
        cross apply (select REPLACE(c1.address, ' st. ', ' Street ') address) c2
        cross apply (select REPLACE(c2.address, ' dr ', ' Drive ') address) c3

But I don't see much advantage in that. You could also write a C# CLR function, but I suspect that the calling overhead might make it slower than just using the nested REPLACE calls.

Edit: Since posting this, I posted an answer to a similar question that seems to be in the same speed ballpark as nested REPLACE, but is much cleaner (and more maintainable) code.

6 Comments

Since speed isn't much of an issue, wouldn't a solution with a map table and a recursive cleanup be much easier to maintain?
The version we have is pretty easy to maintain. And we don't have to worry about the recursion limit. Most developers understand REPLACE. Not as many understand recursive Common Table Expressions.
That's right, SQL Server has a recursion limit. You don't need people to look at the CTE, just at a mapping table with replace_target and replace_with columns.
Well, the REPLACE version is an order of magnitude faster with a small number of mappings (and possibly orders of magnitude with a large set of mappings), is easy to maintain, easy to understand, and works. That's good enough for me.
And what about when you change the final select in Gordon's solution to SELECT TOP 1 field FROM cte ORDER BY n DESC?