MySQL regexp pattern to match repeated words within a single string

Question

Can someone please help me solve a problem that I'm having with MySQL and REGEXP?

I am working on cleaning a MySQL table containing vehicle inventory. The table has several million rows. I am trying to come up with a regex pattern that will find repeated words in each cell and replace all but one of them with a space character. Here is an example of my table. There are many more columns in that table, but I only included a few for demonstration purposes.

If you notice, the 2 columns MAKE and MODEL contain repeated words (i.e. "FORD FORD", "TOYOTA TOYOTA" etc.). This table was loaded from an old Excel file that used to be maintained manually. As you can see, the data is extremely dirty. I'm trying to do as much cleaning as possible to standardize the data. I want to keep only one copy of each repeated word removing the duplicates (i.e. "FORD", "TOYOTA", "NISSAN" etc.).

I was able to solve this problem partially (see code below):

update t_inventory
set make = trim(regexp_replace(make, '(\\([A-Za-z]+\\))', ' '))
where make regexp '^([A-Za-z]+)([^a-zA-Z0-9]+)(\\([A-Za-z]+\\))'
    and mid(make, 1, instr(make, '(') - 2) = 
        mid(make, instr(make, '(') + 1, instr(make, ')') - instr(make, '(') - 1);

The above code solves the problem for values like "FORD (FORD)" or "TOYOTA (TOYOTA)", where the first word is unwrapped, the second word is inside parentheses, and there are no other leading or trailing characters. But when I have a string like "MAKE NISSAN (NISSAN)", the above code won't work. It will replace the word NISSAN with a space, leaving only the word MAKE.

Is there any way to write a single REGEXP pattern to remove all repeated words only keeping one? I don't even care if parentheses are kept. I can easily clean them later.

You'll probably ask why I don't find all possible garbage, create a dictionary, and then write a procedure to filter it out. Yes, that would be ideal if the table had a few hundred to a few thousand rows. But my table has millions of rows. As I mentioned above, this data was migrated from an Excel file that was maintained manually for over 20 years. It's hard to imagine how dirty the data is. What you see in the diagram above is as simple as it gets. I wouldn't have asked for help if it weren't this complex.

I really appreciate your help. Thank you so much in advance!

Since this is a one-time process, it might be easier, faster (see below), and more accurate to do it in a programming language: retrieve each row, fix the data, update the row, and commit the transaction every 1000 rows or so to keep it performant. By "faster" I mean total time. A single SQL query will probably run faster, but getting that query right may take more time than writing and executing the program.
– Schwern, Commented Dec 22, 2022 at 20:53
All that mid/instr stuff can be replaced with REGEXP_SUBSTR(), can't it?
– Barmar, Commented Dec 22, 2022 at 20:55
To answer Schwern: yes, absolutely! But I was hoping that it would be simple to find repeated words by using regex. It got me very curious. I struggled for some time and wasn't able to find a solution.
– idelin, Commented Dec 22, 2022 at 20:59
As for Barmar's question, I'm not sure. I have never used the REGEXP_SUBSTR function, but if it's there, then you're probably right. It still won't solve my main problem, though.
– idelin, Commented Dec 22, 2022 at 21:00

Bill Karwin · Accepted Answer · 2022-12-22 21:00:21Z

1

Dirty data is often too chaotic to fix in a single UPDATE.

Answer: use more than one UPDATE!

UPDATE t_inventory
SET make = TRIM(LEADING 'MAKE ' FROM make);  -- include the space so no leading blank is left behind

UPDATE t_inventory
SET make = REPLACE(make, 'FORD (FORD)', 'FORD');

UPDATE t_inventory
SET make = REPLACE(make, 'NISSAN (NISSAN)', 'NISSAN');

UPDATE t_inventory
SET make = REPLACE(make, 'HONDA (HONDA)', 'HONDA');

...and so on...

Every such edit is very simple to write.

You will probably now ask if you can also fix 'NISSAN (NISSAN' (no closing parenthesis) in the same UPDATE.

You're still thinking about combining the edits! Stop that. Just do multiple edits.

UPDATE t_inventory
SET make = REPLACE(make, 'NISSAN (NISSAN', 'NISSAN');

It does take longer to execute multiple edits. I understand you said your table has millions of rows. But if you compare to the time it takes you to develop a clever way of combining the edits, it's probably a wash. Besides, computers are good at executing the change over the millions of rows. You just have to wait for it to finish.
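
If typing out dozens of near-identical statements gets tedious, a short script can print them for you. This is only a sketch of that idea, not something from the answer above: the function name build_updates and the list of makes are hypothetical, and the simple quote-doubling is only good enough for a trusted, hand-maintained list of names.

```python
def build_updates(makes, table="t_inventory", column="make"):
    """Return one UPDATE per known dirty variant of each make (hypothetical helper)."""
    statements = []
    for name in makes:
        quoted = name.replace("'", "''")  # double single quotes for a SQL string literal
        # Closed-parenthesis variant first, then the variant missing its ')'.
        for dirty in (f"{quoted} ({quoted})", f"{quoted} ({quoted}"):
            statements.append(
                f"UPDATE {table} SET {column} = "
                f"REPLACE({column}, '{dirty}', '{quoted}');"
            )
    return statements


if __name__ == "__main__":
    for sql in build_updates(["FORD", "NISSAN", "HONDA"]):
        print(sql)
```

Review the printed statements before running them; the point is still one simple edit per statement.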


1 Comment

idelin Over a year ago

The cases in your example queries are already handled by my single regexp query. The query I gave as an example takes care of such simple text strings, but it won't handle more complex ones with leading and trailing characters or whole words. I was just hoping that there would be some kind of REGEXP pattern condition or operator that searches for repeated parts of the text containing only letters, ignoring numbers and other non-alpha characters.

Rick James · Accepted Answer · 2022-12-23 03:37:56Z

mysql> SELECT REGEXP_REPLACE("FORD (FORD)", '\\b(\\w+)\\b(.*)\\b(\\1)\\b(.*)$', '$1$2$4');
+-----------------------------------------------------------------------------+
| REGEXP_REPLACE("FORD (FORD)", '\\b(\\w+)\\b(.*)\\b(\\1)\\b(.*)$', '$1$2$4') |
+-----------------------------------------------------------------------------+
| FORD ()                                                                     |
+-----------------------------------------------------------------------------+

That used version 8.0.31; another version may have different syntax.

Note that the replacement rebuilt the string without the second (that is $3) copy of 'FORD'.

Explaining some things:

\\b  -- word boundary (start or end)
(...)  -- capture what the insides matched into $1, $2, etc
\\w+  -- any string of "word" characters (letters, digits, underscore)
.*  -- a string of anything ('greedy' version)
\\1  -- match what the first (1) "(...)" matched
$  -- anchored at the end  (Probably unnecessary)

and on the replacement side:

$n  -- the nth (...) matched
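
If you want to experiment with the pattern outside MySQL, here is a minimal sketch of the same idea using Python's re module. It is a translation under assumptions, not what was run above: Python writes the replacement groups as \1 instead of $1, a raw string needs only single backslashes, and its \w may not behave identically to MySQL's regex engine on non-ASCII text. The function name drop_one_repeat is made up for illustration.

```python
import re

# Same pattern as the MySQL example, written as a Python raw string.
PATTERN = re.compile(r'\b(\w+)\b(.*)\b(\1)\b(.*)$')


def drop_one_repeat(text):
    """Rebuild the string from groups 1, 2 and 4, leaving out the repeated copy (group 3)."""
    return PATTERN.sub(r'\1\2\4', text, count=1)


if __name__ == "__main__":
    print(drop_one_repeat("FORD (FORD)"))  # FORD ()
    print(drop_one_repeat("FORD F150"))    # unchanged: nothing is repeated
```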

The example with 3 NISSANs gets a lot messier. The greedy .* goes sailing past the second copy and stops at the last one, so only the third NISSAN is removed. .*? is "non-greedy"; it stops at the second NISSAN and removes that one instead:

+-----------------------------------------------------------------------------------------------------------------------+
| REGEXP_REPLACE("NISSAN ALTIMA (NISSAN WHITE 2019 ) (NISSAN WHITE 2019",
         '\\b(\\w+)\\b(.*)\\b(\\1)\\b(.*)$', '$1$2$4') |
+-----------------------------------------------------------------------------------------------------------------------+
| NISSAN ALTIMA (NISSAN WHITE 2019 ) ( WHITE 2019                                                                       |
+-----------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

+------------------------------------------------------------------------------------------------------------------------+
| REGEXP_REPLACE("NISSAN ALTIMA (NISSAN WHITE 2019 ) (NISSAN WHITE 2019",
         '\\b(\\w+)\\b(.*?)\\b(\\1)\\b(.*)$', '$1$2$4') |
+------------------------------------------------------------------------------------------------------------------------+
| NISSAN ALTIMA ( WHITE 2019 ) (NISSAN WHITE 2019                                                                        |
+------------------------------------------------------------------------------------------------------------------------+
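
The greedy versus non-greedy difference can also be reproduced outside MySQL. The following is a sketch in Python's re module, assuming its backtracking behaves the same way for these two patterns; the helper name remove_copy is hypothetical.

```python
import re

TEXT = "NISSAN ALTIMA (NISSAN WHITE 2019 ) (NISSAN WHITE 2019"

# Greedy (.*) stretches as far as it can, so group 3 lands on the LAST copy.
GREEDY = re.compile(r'\b(\w+)\b(.*)\b(\1)\b(.*)$')
# Non-greedy (.*?) stops as early as it can, so group 3 lands on the NEXT copy.
LAZY = re.compile(r'\b(\w+)\b(.*?)\b(\1)\b(.*)$')


def remove_copy(pattern, text):
    """Drop whichever repeated copy the pattern captured as group 3."""
    return pattern.sub(r'\1\2\4', text, count=1)


greedy_result = remove_copy(GREEDY, TEXT)
lazy_result = remove_copy(LAZY, TEXT)

if __name__ == "__main__":
    print(greedy_result)  # third NISSAN removed
    print(lazy_result)    # second NISSAN removed
```

Either way, one pass removes exactly one copy, which is why rows with three copies need more than one pass.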

Plan A: One approach is to run two UPDATEs, one to handle 2-copy rows, and another to handle 3-copy rows.

Plan B: Alternatively, some kind of looping mechanism. This could be done with a Stored Procedure and probably some different functions, such as SUBSTRING_INDEX().

Plan C: The third (and arguably best) is to pull the data into your app which probably has better string and regexp handling capabilities.
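
As a rough illustration of Plan C, here is a sketch of what the app-side cleanup might look like in Python. Everything here is an assumption for illustration: the function names dedupe_words and tidy are made up, database reads and writes are left out, and "word" is taken to mean a whole run of ASCII letters, so numbers such as 2019 and mixed tokens such as F150 are left alone. Note that it drops every later repeat of a word, including ones that might be legitimate.

```python
import re


def dedupe_words(text):
    """Keep the first occurrence of each alphabetic word; drop later copies."""
    seen = set()

    def keep_first(match):
        word = match.group(0)
        key = word.upper()  # compare case-insensitively
        if key in seen:
            return ''       # a later copy: remove it
        seen.add(key)
        return word         # first time seen: keep it

    # \b...\b restricts matches to whole words made only of letters.
    return re.sub(r'\b[A-Za-z]+\b', keep_first, text)


def tidy(text):
    """Drop emptied parentheses and collapse leftover whitespace."""
    text = re.sub(r'\(\s*\)', ' ', text)
    return re.sub(r'\s+', ' ', text).strip()


if __name__ == "__main__":
    for raw in ("FORD (FORD)", "MAKE NISSAN (NISSAN)"):
        print(tidy(dedupe_words(raw)))
```

Because the callback remembers every word it has seen, a single pass handles two, three, or more copies, which avoids the multi-pass problem of the pure regex approach.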

This seemed to work in most cases, but there are still scenarios where it does not:
original: NISSAN ALTIMA (NISSAN WHITE 2019 ) (NISSAN WHITE 2019
replaced: NISSAN ALTIMA (NISSAN WHITE 2019 ) (WHITE 2019
Your logic removes the last "NISSAN" occurrence but keeps the second. I could run your logic multiple times to eventually remove all occurrences, but I'm wondering if it can be modified to take care of all of them at once. Also, if you don't mind, could you please explain your pattern logic to me? Thank you so much!
Thank you so much for explaining the pattern logic. This makes total sense. Thank you for the three plans as well; they also make sense. I will think about it, but overall, this already helps me a lot. Thank you!
