I'm trying to figure out how to query a table whose PIN field is stored in multiple formats, when the format of my input may vary as well.

I have a few tables that share the same PIN column (VARCHAR(20)), but the format varies from table to table, as shown below. Typically it's one format per table, but you can see all the different variations I might run into.

PIN               |  ID
---------------------------
01-123.040-111-2  |  5
01-123.04-111     |  6
003.242424242.23  |  7
01.1234.345.22    |  8
1234456789        |  9

I'd like to be able to accept any of the following variations of input below:

> 012304041112
> 01.3456.342.22
> 02-3232323.2331

Some of the input formats may match exactly, and some won't. So here's what I'm thinking:

I'm using PHP, so I can strip out the -'s and .'s or any spaces to just get the raw number, but I don't know how to make a comparison to that number that might be in the column in the table. If there is a way of comparing digits to just digits that would most likely be ideal.

For example:

input of 647382627 would match on 64.738.262-7 in the database
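
The digits-to-digits comparison described above can be sketched as follows. This is a minimal illustration in Python rather than PHP (in PHP, `preg_replace` would play the role of `re.sub`); the function names `digits_only` and `same_pin` are hypothetical, not part of any library.

```python
import re


def digits_only(pin: str) -> str:
    """Return the PIN with every non-digit character removed."""
    return re.sub(r"[^0-9]", "", pin)


def same_pin(user_input: str, stored: str) -> bool:
    """Compare two PINs by their digits alone, ignoring separators."""
    return digits_only(user_input) == digits_only(stored)


# Example from the question: raw digits against a formatted stored value.
print(same_pin("647382627", "64.738.262-7"))  # True
```

Both sides are normalized the same way, so it does not matter which separators the input or the stored value happens to use.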

Another situation might be where there is input like this:

12-25-9-123

Where it should match:

12-25-009-123

[edit] To clarify what I mean here: different counties use different patterns for parcel numbers. A county might use:

XX-XXXX-XXX-XX

for their pattern, but in some documents they might use, say:

10-1234-5-2, which translates to 10-1234-005-02

We'd know what counties this applies to, but the input may be

10123452 or 10-1234-005-02 or 10-1234-5-2

So I don't know how to exactly make that comparison. I guess if you'd strip dashes and zeros from input and the column you could come close, and just return a few matches to pick from if need be.
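
Since the county is known, one possible approach is to pad each segment of the input to the county's pattern before comparing. The sketch below assumes segment widths of (2, 4, 3, 2) for the XX-XXXX-XXX-XX pattern; `pad_to_pattern` and `COUNTY_WIDTHS` are hypothetical names used only for illustration.

```python
import re
from typing import Optional, Sequence


def pad_to_pattern(pin: str, widths: Sequence[int]) -> Optional[str]:
    """Zero-pad each separator-delimited segment to the given widths.

    Returns None when the number of segments does not match the pattern
    or when a segment is longer than its slot.
    """
    segments = [s for s in re.split(r"[^0-9]+", pin) if s]
    if len(segments) != len(widths):
        return None
    if any(len(s) > w for s, w in zip(segments, widths)):
        return None
    return "-".join(s.zfill(w) for s, w in zip(segments, widths))


COUNTY_WIDTHS = (2, 4, 3, 2)  # assumed widths for XX-XXXX-XXX-XX

print(pad_to_pattern("10-1234-5-2", COUNTY_WIDTHS))     # 10-1234-005-02
print(pad_to_pattern("10-1234-005-02", COUNTY_WIDTHS))  # 10-1234-005-02
print(pad_to_pattern("10123452", COUNTY_WIDTHS))        # None
```

Input without separators, such as 10123452, cannot be padded unambiguously because the segment boundaries are unknown. For that case, comparing with zeros stripped and returning a few candidates is probably the most that can be done.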

  • Idea for (1) situation: strip dashes and dots from input and compare with also stripped data coming from your database. Is it fast? No, but then again - you either want to strip everything and compare raw numbers or run every possible combination of your input mixed with dashes and dots against the db (probably won't be faster). Commented Jan 18, 2019 at 0:24
  • @KamilGosciminski Would I have to create a column that contains the stripped data (I'm talking about potentially over 60k rows), or can I do that comparison on the fly? Commented Jan 18, 2019 at 0:25
  • You can do it on the fly, but to be honest, since MySQL before 8.0.13 doesn't support functional indexes, I would really consider accepting a bit of overhead by storing an additional column that contains only digits and creating an index on it. This should work faster. MySQL has generated (computed) columns, so it could take care of the stripping for you when you insert the data. This wouldn't solve situation (2), though, as you need to know the positions of the dots and dashes to pad the segments with zeros and check for a match; the second situation is much more complicated. Commented Jan 18, 2019 at 0:26
  • I've added my answer using generated columns and unique index for the first case. Commented Jan 18, 2019 at 0:57
  • What version of MySQL? Commented Jan 18, 2019 at 2:52

2 Answers

With MySQL 8.0 or later, you can use a regular expression to strip all non-numeric characters from both values before comparing them, like this:

REGEXP_REPLACE(pin, '[^0-9]', '')
= REGEXP_REPLACE(?, '[^0-9]', '')

Here, ? is your search input.

The regular expression '[^0-9]' means: any character other than 0, 1, ..., 9.

This should solve your initial description of the problem. However, it will not handle the last example you gave, where '12-25-9-123' should match '12-25-009-123'. For this, we need to modify the regexp. I suggest this additional rule: any run of 0s immediately preceded by a - is removed, together with the dash.

Here is the modified regex:

REGEXP_REPLACE(pin, '(-0+)|([^0-9])', '')

Explanation:

            EITHER
(-0+)       a dash followed by at least one 0
|           OR
([^0-9])    any single non-numeric character

Here is an example, which you can also find in this db fiddle:

 -- Sample rows: each pin is compared against the value in "compare"
 WITH mytable AS (
     SELECT '64.738.262-7' AS pin, '647382627' AS compare
     UNION SELECT '12-25-9-123', '12-25-009-123'
     UNION SELECT 'abc', '12-25-009-123'
 )
 SELECT
     pin,
     compare,
     -- Normalize both sides with the same regex, then compare
     CASE
         WHEN REGEXP_REPLACE(pin, '(-0+)|([^0-9])', '')
            = REGEXP_REPLACE(compare, '(-0+)|([^0-9])', '')
         THEN 'match'
         ELSE 'no match'
     END AS result
 FROM mytable;

pin          | compare       | result
:----------- | :------------ | :-------
64.738.262-7 | 647382627     | match
12-25-9-123  | 12-25-009-123 | match
abc          | 12-25-009-123 | no match
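
To see what the modified regex does and does not cover, the same pattern can be tried outside the database. The following is a small sketch in Python, assuming `re.sub` behaves like REGEXP_REPLACE for this simple pattern; `normalize` is a hypothetical helper name.

```python
import re

# Same pattern as in the SQL above: remove a dash together with the zeros
# right after it, or any other single non-digit character.
PATTERN = re.compile(r"(-0+)|([^0-9])")


def normalize(pin: str) -> str:
    """Apply the answer's regex, replacing every match with nothing."""
    return PATTERN.sub("", pin)


print(normalize("12-25-009-123"))  # 12259123
print(normalize("12-25-9-123"))    # 12259123
print(normalize("12.25.009.123"))  # 1225009123
```

The last line shows a limitation: zeros are only dropped when they follow a dash, so the dotted form of the same number does not normalize to the same value. If dotted input needs the same treatment, the pattern would have to be extended accordingly.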

2 Comments

I made a clarification to the second situation at the bottom in the original question.
@Photovor : thanks for the clarification, I updated my answer.

Solution idea for situation (1)

Create a generated column in your MySQL table to store only the digits from the pin column:

ALTER TABLE yourtable 
  ADD COLUMN pin_digits VARCHAR(20) 
  GENERATED ALWAYS AS (REGEXP_REPLACE(pin, '[^0-9]', '')) STORED;

Then create a unique index on it to disallow duplicates:

ALTER TABLE yourtable ADD UNIQUE INDEX uq_idx_pin_digits (pin_digits);

When comparing your input against the stored data (by trying to insert it), you can now benefit from this index:

INSERT INTO yourtable (pin) VALUES (REGEXP_REPLACE(?, '[^0-9]', '')); 
-- where ? is your input value passed from PHP (without any changes)
-- this will yield an error on unique constraint if the value already exists

Live DEMO

Click here to see how it works.

1 Comment

I like the idea of this, I can probably satisfy issue 2 by stripping out 0s also. I’ve seen some of these tables have duplicate values in the pin column, so I’d need to find a way to return duplicates before I create the index.
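
One way to find such duplicates before adding the unique index is to group the existing rows by their digits-only value. The sketch below does this in Python on rows that have already been fetched; `find_digit_duplicates` and the sample rows are made up for illustration. In SQL, a GROUP BY on the stripped value with HAVING COUNT(*) > 1 should give the same answer.

```python
import re
from collections import defaultdict


def find_digit_duplicates(rows):
    """Group (id, pin) rows by digits-only PIN.

    Returns only the groups that contain more than one row id.
    """
    groups = defaultdict(list)
    for row_id, pin in rows:
        groups[re.sub(r"[^0-9]", "", pin)].append(row_id)
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Made-up sample rows: ids 1 and 2 hold the same digits in different formats.
rows = [(1, "64.738.262-7"), (2, "647382627"), (3, "01.1234.345.22")]
print(find_digit_duplicates(rows))  # {'647382627': [1, 2]}
```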
