2

I have two columns which contain email info. Column A, Column B. Now, in these fields, emails SHOULD be as such:

 Column A                 Column B
 [email protected]; [email protected]         [email protected]; [email protected]
 [email protected];[email protected]          [email protected];[email protected]

However, in an effort to do some data quality checks and such, it turns out that MANY entries are instead not following this format. I am trying to find all the outliers, and I have identified the outliers to take the form as such:

[email protected] and [email protected]
[email protected], [email protected] (uses comma so it is incorrect)
[email protected] or [email protected]
[email protected] / [email protected]

There could be other wrong characters or words that make the format incorrect. But I hope these examples pinpoint the issue.

What I am trying to do: Create a query that will pinpoint all instances that are NOT in the correct format, so that the problem points can be found and edited later, but that's a different topic :)

Here is a Query I have so far:

SELECT     A_EMAIL, B_EMAIL, NAME, ID
FROM         NAMES

WHERE A_EMAIL LIKE ('and %') OR A_EMAIL LIKE ('or %') 
OR B_EMAIL LIKE ('and %') OR B LIKE ('or %')

This is using LIKE and the % is with a space in between. However, this returns no results, and I know such results definitely do exist. But I would like to build a logic that would bring me back everything that isn't in the proper format instead of trying to use LIKE 'XYZ' because even though I know most of the problem issues, I could still miss some.

However, if such a thing isn't possible via SQL. Then I would still like to get my current logic of using LIKE ('XYZ %') to work instead which even though not an optimal route, should still be able to help me in my goal someway.

3
  • If at all possible you should normalize this. It will save your hours of grief trying to figure out kludges like this one. Commented Dec 16, 2014 at 21:27
  • Meaning, continue to to try and find the solution that will give me all the problem issues instead of going down a LIKE ('XYZ %') type route. Understood and agree. However, if not possible, I would have to resort to the latter route. Commented Dec 16, 2014 at 21:32
  • Not sure what your last comment meant but...what I meant was to split email into a child table so you don't have multiple values in a single intersection in your table. Then you can add some validation to the application that capture this to prevent this kind of thing in the future. Commented Dec 16, 2014 at 22:18

4 Answers 4

3

I would suggest that you remove spaces and then look for non-email like characters:

where replace(cola, '; ', ';') like '%[^a-zA-Z0-9@ ;.]%'

That is, the column contains an invalid character.

Then, I would encourage you to create a junction table so you are not storing lists of emails in semi-colon delimited columns.

Sign up to request clarification or add additional context in comments.

Comments

1

Your query is fine, you just missed one % in it. Instead of this

WHERE A_EMAIL LIKE ('and %') OR A_EMAIL LIKE ('or %') 
OR B_EMAIL LIKE ('and %') OR B LIKE ('or %')

you should use this

WHERE A_EMAIL LIKE ('%and %') OR A_EMAIL LIKE ('%or %') 
OR B_EMAIL LIKE ('%and %') OR B LIKE ('%or %')

Your original query looks for values that start with 'and ', while you are interested in cases where 'and ' appears anywhere inside the column value.

Of course, this is a one-off solution to your immediate problem. The permanent solution is not to store several e-mails in the same column in the first place.

2 Comments

I'm surprised that you accepted this answer because it does not fully answer the question. For instance, it does not correctly identify '[email protected] / [email protected]' or '[email protected], [email protected]' as being bad.
I just fixed the syntax of the original query. Of course, there can be plenty of other search criteria, depending on the real rubbish in the data. Now it is clear what the problem was with the original query, so it is easy to add more clauses OR LIKE ('%/ %') OR LIKE ('%, %') and so on...
0

Try something like this

Create Table #Emails (Email varchar(128))
Go

Insert into #Emails
Values 
  ('[email protected]')
, ('[email protected]')
, (Null)
, ('[email protected]')
, ('[email protected] and [email protected]')
, ('[email protected] and [email protected]')
, ('[email protected], [email protected]')
, ('[email protected] or [email protected]')
, ('[email protected] / [email protected]')
Go

Select 
  Email
, Case 
    When Email Is Not Null 
     And LTrim(RTrim(Email)) Like '%_@__%.__%' 
     And LTrim(RTrim(Email)) Not Like '% %' 
        Then 'Good' 
        Else 'Bad' 
  End Quality
From #Emails

Original [BAD]

Not perfect, but select with a like statement as below. Good chance will get you majority

NOT LIKE '%_@__%.__%

1 Comment

This brings back entries that are either NULL, or 1 letter @ 1 letter.com (strangely, there were a few occurrences like that). Which still helps. But it doesn't bring back the major problems of ones that contain the word 'AND' between emails or something.
0

I believe a simple Not like query will suffice your requirement here as shown below

Select * From EmailTable Where Email NOT LIKE '%;%'

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.