Using the function replace

Replace(FieldX,'FindString','ReplaceString') where FieldX = 'ABC'

Works fine until there is an additional match inside the string that I don't want to replace.

In my case I have an address field that I got as ALL CAPS. However I want to change 'PR' to 'Prairie' when it occurs as:

  • 'PR %'
  • '% PR'
  • '% PR %'

Yet if I do:

Update TableA
Set Address=Replace(Address,'PR','PRAIRIE')
where Address like '% PR' or Address like 'PR %' or Address like '% PR %'

Then 'PR PRIMO' becomes 'PRAIRIE PRAIRIEIMO'
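
The reason is that REPLACE substitutes every occurrence of the search string inside the value; the WHERE clause only decides which rows are touched. A minimal check that needs no table:

```sql
-- REPLACE() rewrites every 'PR' in the value, including the one inside 'PRIMO'.
SELECT REPLACE('PR PRIMO', 'PR', 'PRAIRIE') AS result;
-- result: PRAIRIE PRAIRIEIMO
```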

I thought that, even though it gets cumbersome given the extent of my changes, I could solve this with three queries:

 Update TableA
 Set Address=Replace(Address,'PR ','PRAIRIE ')
 where Address like 'PR %';

 Update TableA
 Set Address=Replace(Address,' PR',' PRAIRIE')
 where Address like '% PR';

 Update TableA
 Set Address=Replace(Address,' PR ',' PRAIRIE ')
 where Address like '% PR %';

But this will be cumbersome (again, I have far more replacements to do, plus other issues) and seems like it could still generate errors I haven't anticipated. The tables to be updated are also very large, and this triples the processing time.
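
One such error can be reproduced with a literal. The value 'W PRIMO PR' matches the '% PR' filter of the second query, but the replacement of ' PR' also hits the start of ' PRIMO':

```sql
-- The row qualifies because it ends in ' PR', yet ' PR' also occurs before 'IMO'.
SELECT REPLACE('W PRIMO PR', ' PR', ' PRAIRIE') AS result;
-- result: W PRAIRIEIMO PRAIRIE
```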

Has anyone found a less heavy-handed way to solve this? If this were regex I think I could get away with it, but I've found regex adds a huge overhead to this type of replacement and, as I said, the tables are large.
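
For reference, here is what the regex version might look like. This is only a sketch and assumes MySQL 8.0 or later, where a built-in REGEXP_REPLACE exists; it is not available in the 5.x releases, and its cost on large tables is not measured here.

```sql
-- Assumption: MySQL 8.0+ (REGEXP_REPLACE is not built into 5.x).
-- '\\b' reaches the regex engine as \b, a word boundary, so only the
-- standalone PR tokens are intended to match, not the PR inside PRIMO.
SELECT REGEXP_REPLACE('PR PRIMO PR', '\\bPR\\b', 'PRAIRIE') AS result;
```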

4
  • Given the lack of actual regular expressions, have you considered a script that could go over the several combinations you have in mind in a loop? I've used this approach before and it has worked very well when single-query updates are not practical/possible. Is a longer sub-string possible? Commented May 5, 2016 at 23:41
  • MySQL doesn't have a built-in regexp replace function, but you can find some UDFs by googling for it. Commented May 6, 2016 at 0:23
  • @ray When I get to this point I generally use Excel, believe it or not: I set up tables, fields and term/replace term in cells, and then a string function ending in ; in the rightmost cell. Then I simply copy/paste down and voila. I am pretty much resigned to doing this here. I ended up with around 300 shorter queries, which beats beating my head figuring out a workaround. Commented May 6, 2016 at 3:29
  • @Barmar What I am actually looking into now is using Sphinx, which I use for other purposes. Essentially I am trying to have a constant ID between tables with the same customer address, and I am finding standardizing a HUGE pain. If I could somehow have the abbreviations indexed, then instead of crazy standardization scripts I could match internally via the index, whether 'Pr Prairie' or 'PR Prairie'. Commented May 6, 2016 at 3:34

2 Answers

1

You can do this (I think) by padding the string with a space on each side and then replacing the space-delimited term. This takes care of the start-of-string and end-of-string cases (the ^PR and PR$ cases, if you were using a regex) without affecting 'PR' within words, since an embedded 'PR' never has a space both before and after it. Use TRIM as a final step to remove the added spaces:

mysql> SELECT TRIM(REPLACE(' PR PRIMO ', ' PR ', ' PRAIRIE '));
+--------------------------------------------------+
| TRIM(REPLACE(' PR PRIMO ', ' PR ', ' PRAIRIE ')) |
+--------------------------------------------------+
| PRAIRIE PRIMO                                    |
+--------------------------------------------------+
1 row in set (0.00 sec)
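
The same pattern can be checked against the other positions with plain literals. The last two statements illustrate one caveat worth keeping in mind: when the term occurs twice in a row, the two matches share a single space, so one REPLACE pass only catches the first. Nesting a second pass is one possible workaround, shown here as a sketch rather than a recommendation.

```sql
-- End and middle positions reduce to the same ' PR ' pattern once padded.
SELECT TRIM(REPLACE(CONCAT(' ', 'PRIMO PR', ' '), ' PR ', ' PRAIRIE ')) AS at_end;
-- at_end: PRIMO PRAIRIE
SELECT TRIM(REPLACE(CONCAT(' ', 'W PR PRIMO', ' '), ' PR ', ' PRAIRIE ')) AS in_middle;
-- in_middle: W PRAIRIE PRIMO

-- Adjacent matches share one space, so a single pass misses the second one.
SELECT TRIM(REPLACE(' PR PR ', ' PR ', ' PRAIRIE ')) AS one_pass;
-- one_pass: PRAIRIE PR
SELECT TRIM(REPLACE(REPLACE(' PR PR ', ' PR ', ' PRAIRIE '),
                    ' PR ', ' PRAIRIE ')) AS two_passes;
-- two_passes: PRAIRIE PRAIRIE
```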

Note that if using lots of replaces on huge tables, using a table to coordinate the update should save you significant time. Below is an example where the spaces are added and removed via concat in the update allowing you to just add normal values to the replacement table.

Code:

DROP TABLE IF EXISTS hugeTable;
CREATE TABLE hugeTable(address CHAR(32));

DROP TABLE IF EXISTS replacements;
CREATE TABLE replacements(find CHAR(8), `replace` CHAR(8));

INSERT INTO hugeTable VALUES ('PR PRIMO');

INSERT INTO replacements VALUES ('PR', 'PRAIRIE');

SELECT * FROM hugeTable;

UPDATE hugeTable A, replacements B
SET A.address = TRIM(REPLACE(CONCAT(' ', A.address, ' '), CONCAT(' ', B.find, ' '), CONCAT(' ', B.`replace`, ' ')));

SELECT * FROM hugeTable;

Query:

mysql> CREATE TABLE hugeTable(address CHAR(32));
Query OK, 0 rows affected (0.10 sec)

mysql>
mysql> DROP TABLE IF EXISTS replacements;
Query OK, 0 rows affected (0.01 sec)

mysql> CREATE TABLE replacements(find CHAR(8), `replace` CHAR(8));
Query OK, 0 rows affected (0.02 sec)

mysql>
mysql> INSERT INTO hugeTable VALUES ('PR PRIMO');
Query OK, 1 row affected (0.04 sec)

mysql>
mysql> INSERT INTO replacements VALUES ('PR', 'PRAIRIE');
Query OK, 1 row affected (0.01 sec)

mysql>
mysql> SELECT * FROM hugeTable;
+----------+
| address  |
+----------+
| PR PRIMO |
+----------+
1 row in set (0.00 sec)

mysql>
mysql> UPDATE hugeTable A, replacements B
    -> SET A.address = TRIM(REPLACE(CONCAT(' ', A.address, ' '), CONCAT(' ', B.find, ' '), CONCAT(' ', B.`replace`, ' ')));
Query OK, 1 row affected (0.01 sec)
Rows matched: 1  Changed: 1  Warnings: 0

mysql>
mysql> SELECT * FROM hugeTable;
+---------------+
| address       |
+---------------+
| PRAIRIE PRIMO |
+---------------+
1 row in set (0.00 sec)
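
Two things may matter once the replacements table holds more than one row. First, restricting the join to rows that actually contain the term avoids rewriting every address for every replacement. Second, as far as I understand the MySQL documentation for multiple-table UPDATE, each row is updated at most once per statement even if it matches several join rows, so an address containing two different terms would get only one of them replaced per run. The following is a sketch under those assumptions, reusing the tables above:

```sql
-- Join only addresses that contain the find term as a whole word.
UPDATE hugeTable A
JOIN replacements B
  ON INSTR(CONCAT(' ', A.address, ' '), CONCAT(' ', B.find, ' ')) > 0
SET A.address = TRIM(REPLACE(CONCAT(' ', A.address, ' '),
                             CONCAT(' ', B.find, ' '),
                             CONCAT(' ', B.`replace`, ' ')));

-- Rows changed by the statement above; rerun the UPDATE until this is 0.
SELECT ROW_COUNT();
```

The rerun loop only terminates if no replacement text contains a find term as a whole word, so the replacements table is worth checking for that first.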

Regards,

James


9 Comments

That is actually pretty astonishing. I tested it using your MySQL statements, adding 'Primo PR', 'W Primo PR' and 'W PR Primo', i.e. will it find/replace at the beginning, middle and end. It does so swimmingly and fast. My 500-query script from last night is still running, as each time it gets to the "where '% PR'" or "where '% PR %'" type queries it takes forever on these million-record tables. About to stop this (which will take a good full day of processing) and test using your query; will report back/update.
OK, I am simultaneously testing the brute-force MySQL on a 26 million record table and yours on a 2.6 million record table (since I don't want to lose any time). I'll report back. My initial test ran on the first 1000 records, with 19 updates, in about 5 minutes. Doesn't bode well for 2600x that, but hoping it is not linear :)
Are you filling the replacements table with all the replacements? Or running the INSERT and UPDATE for every replacement separately? It should be one statement. Also try putting a WHERE on the update (same INSTR as the SET), as this may speed things up. Will try later; am away from my computer.
@JamesJones Do you think a combination of a "script" and the core query would be preferable to doing a join? The join is between 250 records and 25 million records. I wonder if combining brute force, that is one query for each find/replace but using the concat, would not be more efficient? I will of course test it, but I am currently in the middle of testing the first, so thought perhaps you might have feedback on that. In the first case we have a huge join as overhead to process ALL find/replace; in the second case no join, but we need to run 250 separate update queries.
How long is one no-join replacement update taking on 10^6 records?
1

Though I accepted James Scott's answer, which was a great solution, I did make some modifications and thought I'd include them here, since his solution is an elegant one and, with a few tweaks, made this update in fact possible.

Recalling his core set was:

SET A.address = 
TRIM(REPLACE(CONCAT(' ', A.address, ' '), 
CONCAT(' ', B.find, ' '), 
CONCAT(' ', B.`replace`, ' ')));
  1. I used the core concept he suggested of padding the term and the find/replace strings in the SET clause.

  2. Rather than have the find/replace be a second table requiring a join (which meant joining 300 records to 26 million records), I made a script (using Excel) to generate a single query per find/replace.

  3. I added a WHERE clause to reduce the set of records to be examined, which is critical with 26 million records (yes, I tested the query with and without the WHERE). This was only possible because of the padding (concat): I could now do one single WHERE pass with %findterm% instead of the additional two passes for '% findterm' and 'findterm %', while the padding ensures that findterm is replaced only as a discrete word.

  4. Lastly, because the find terms can be stored as either uppercase (PL) or proper case (Pl), I installed a function I found here for case-insensitive replace (Case Insensitive REPLACE for MySQL) so that I didn't have to run each query twice to accommodate each case.
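
As an alternative to generating the per-term queries in Excel (step 2), the statements could also be generated by MySQL itself. This is only a sketch: it assumes the terms live in the `replacements` table from the other answer, uses `TableA` as a placeholder table name, and emits the built-in case-sensitive REPLACE rather than the installed REPLACE_ci function.

```sql
-- Emits one UPDATE statement per find/replace pair as text.
-- QUOTE() wraps each value in single quotes and escapes embedded quotes.
SELECT CONCAT(
         'UPDATE TableA SET address = TRIM(REPLACE(CONCAT('' '', address, '' ''), ',
         QUOTE(CONCAT(' ', find, ' ')), ', ',
         QUOTE(CONCAT(' ', `replace`, ' ')),
         ')) WHERE address LIKE ', QUOTE(CONCAT('%', find, '%')), ';'
       ) AS stmt
FROM replacements;
-- For the row ('PR', 'PRAIRIE') this produces:
-- UPDATE TableA SET address = TRIM(REPLACE(CONCAT(' ', address, ' '), ' PR ', ' PRAIRIE ')) WHERE address LIKE '%PR%';
```

The resulting rows can be saved to a file and run as a script. Find terms containing % or _ would act as wildcards in the generated LIKE pattern and would need escaping.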

A sample query looked like this (TableA stands in for each table being updated):

UPDATE TableA
SET address = 
TRIM(REPLACE_ci(CONCAT(' ',address, ' '), 
CONCAT(' ', 'PL', ' '), 
CONCAT(' ', 'Place', ' '))) where address like '%PL%';
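
Note that '%PL%' also selects rows where PL sits inside a word (for example 'APPLE ST'); the padding leaves them unchanged, but they still get examined. A possible refinement, which I have not benchmarked, is to pad inside the WHERE as well so only whole-word matches are selected. The sketch below uses the built-in REPLACE instead of REPLACE_ci so that it runs without the extra function, and `TableA` is a placeholder name.

```sql
-- Select only rows that contain PL as a discrete word, then replace it.
UPDATE TableA
SET address = TRIM(REPLACE(CONCAT(' ', address, ' '), ' PL ', ' Place '))
WHERE CONCAT(' ', address, ' ') LIKE '% PL %';
```

Neither form of the LIKE can use an index because of the leading wildcard, so whether this is faster depends on how many false-positive rows the looser filter lets through.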

The stats on the successful update run:

  • 300 find/replace terms/queries
  • 5 Tables
  • Total 42 million records
  • Largest table 26 million records
  • Smallest table 1/2 million records
  • Updated 3.5 million records
  • Ten hours
