0

I have a column called FirstName, but it contains both the first name and the middle initial.

The other table I'm trying to compare against has only the first name. How do I match the two tables when one of them includes the middle initial?

For example, table A has David as the first name and table B has David L. I'm matching on more than just first name, but I want to compare tablea.firstname = tableb.firstname while ignoring anything after the first name (the space and the middle initial).

3
  • 1
    sqlFiddle please.... Commented Jun 16, 2016 at 0:44
  • 3
    Seems like you really need to normalize your database and just avoid this problem altogether. Commented Jun 16, 2016 at 0:47
  • Edited: if TableA contains all correct first names, then the query should be simple AND able to use Indexes Commented Jun 16, 2016 at 2:47

6 Answers

1

As much as I appreciate the community's help in improving posts, do NOT change the point/crux of the paragraphs. Information lost is opportunity missed.

  • You should do a little more discovery on the structure and consistency of your data.

Ask questions such as 'Do you have surrogate keys that identify the names uniquely?' or 'Does your first table contain trailing or leading spaces?' If the naming standards are consistent, your solution can be elegant and simple. Otherwise, you may need to cleanse your data before comparing the tables.

  • Character string comparisons are slow and usually require a perfect match (unless the LIKE operator is used with a wildcard character, e.g. Col_1 LIKE 'Name%'). Furthermore, if possible, avoid wrapping columns in functions inside your predicates (ON, WHERE, HAVING), as SQL Server will likely not be able to use the indexes on those columns properly and will fall back to costly table/index scans.
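To make the point about functions and indexes concrete, here is a minimal sketch. The table and column names (TableA.FirstName holding 'David', TableB.FirstName holding 'David L') are assumed for illustration, both columns are assumed to be VARCHAR, and whether the optimizer actually performs an index seek depends on your schema, indexes and statistics.

```sql
-- Form 1: TableB's column is wrapped in LEFT(), so an index on
-- TableB.FirstName generally cannot be used for a seek.
SELECT A.FirstName, B.FirstName AS FirstNameWithInitial
FROM   TableA AS A
INNER JOIN TableB AS B
       ON A.FirstName = LEFT(B.FirstName, LEN(A.FirstName));

-- Form 2: TableB's column is left bare and compared against a prefix pattern.
-- It matches either the exact name or the name followed by a space and anything else.
SELECT A.FirstName, B.FirstName AS FirstNameWithInitial
FROM   TableA AS A
INNER JOIN TableB AS B
       ON B.FirstName = A.FirstName
       OR B.FirstName LIKE A.FirstName + ' %';
```

Note that the second form treats %, _ and [ inside TableA.FirstName as wildcards, so it only behaves as intended if the names do not contain those characters. Compare the execution plans of both forms on your own data before settling on one.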

Solution A using unique IDs on the table:

SELECT TableA.CustomerID, TableA.[First Name] AS First_Name
FROM TableA
INNER JOIN TableB ON TableA.CustomerID = TableB.BuyerID
-- compare only as many leading characters as TableA's name has
WHERE TableA.[First Name] = SUBSTRING(TableB.firstname, 1, LEN(TableA.[First Name]))

Easy, no? The join on the ID columns can use indexes, and the WHERE clause compares only the part of the string that matters.

  • Q. But I am not sure my data does not have duplicates or uses Surrogate keys

A. Then apply DISTINCT to your tables. DISTINCT is basically a GROUP BY and eliminates many-to-many comparisons. Your code also stays lean and can still use indexes.

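WITH C AS (SELECT FirstName, Tie_Breaker FROM TableA)
SELECT C.FirstName, B.First_Name
FROM   C
RIGHT OUTER JOIN (SELECT DISTINCT First_Name, Tie_Breaker FROM TableB) B
       ON C.FirstName = SUBSTRING(B.First_Name, 1, LEN(C.FirstName))
-- this filter also drops unmatched rows; move it into the ON clause to keep them as NULLs
WHERE  C.[Tie_Breaker] = B.[Tie_Breaker]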

Note that the RIGHT OUTER JOIN keeps every row from the right table (TableB) and gives you NULLs in the left-hand columns for the rows that have no match. I left it here since you might like to compare what matches and mismatches you have between the tables (for quality assurance). Plus, you can take advantage of simple functions like ISNULL or COALESCE and take pride that you have the leanest code in your department. :)

Special Thanks to @Matt for pointing out the need for tie breakers via the WHERE clause
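As a rough sketch of that quality-assurance idea, the following uses table variables with invented sample data and labels the rows of the second table that have no counterpart in the first. The names @TableA, @TableB and the '(no match)' label are assumptions made for this example only.

```sql
-- Sample data, invented for illustration.
DECLARE @TableA TABLE (FirstName  VARCHAR(50));
DECLARE @TableB TABLE (First_Name VARCHAR(50));

INSERT INTO @TableA (FirstName)  VALUES ('David'), ('Maria');
INSERT INTO @TableB (First_Name) VALUES ('David L'), ('Maria'), ('John Q');

-- Every distinct row of @TableB is kept. Rows with no counterpart in @TableA
-- come back with NULL in A.FirstName, which COALESCE turns into a label.
SELECT B.First_Name,
       COALESCE(A.FirstName, '(no match)') AS MatchedFirstName
FROM   @TableA AS A
RIGHT OUTER JOIN (SELECT DISTINCT First_Name FROM @TableB) AS B
       ON A.FirstName = SUBSTRING(B.First_Name, 1, LEN(A.FirstName));
```

With this sample data, 'David L' pairs with 'David', 'Maria' pairs with 'Maria', and 'John Q' is labelled '(no match)'.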

Sources: MSDN, (n.d.) Predicates (Transact_SQL). Retrieved from MSDN


Comments

0
Declare @Name Varchar(20) = 'David L'

SELECT LEFT(@Name , ISNULL(NULLIF(CHARINDEX(' ' , @name), 0), LEN(@Name)))

The SELECT above returns everything up to the first space if the string contains one; otherwise it returns the whole string. You can use the same logic to compare names stored in columns of two different tables:

tablea.firstname = LEFT(tableb.firstname,
                        ISNULL(NULLIF(CHARINDEX(' ', tableb.firstname), 0),
                               LEN(tableb.firstname)))
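To see what the expression produces row by row, here is a small sketch against a table variable. The @Names table and its sample values are invented for illustration.

```sql
-- Sample data, invented for illustration.
DECLARE @Names TABLE (FirstName VARCHAR(50));
INSERT INTO @Names (FirstName) VALUES ('David L'), ('David'), ('Betty Ann K');

SELECT FirstName,
       -- 0 when the name contains no space
       CHARINDEX(' ', FirstName) AS FirstSpacePos,
       -- everything up to and including the first space, or the whole name
       LEFT(FirstName,
            ISNULL(NULLIF(CHARINDEX(' ', FirstName), 0), LEN(FirstName))) AS FirstWord
FROM   @Names;
```

For 'David L' the result still carries the trailing space ('David '). SQL Server ignores trailing spaces when comparing strings with =, so it still compares equal to 'David'; wrap the expression in RTRIM if you need the clean value for display.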

11 Comments

What about two white spaces? I am not seeing how this is greatly different from the option I presented, except for the NULLIF and ISNULL, and neither of those is needed because CHARINDEX returns 0 when no match is found, so LEN - 0 will still be the entire length of the string.
@Matt Even with two white spaces it will only return the leftmost string before the first white space. Looking at your answer now, it is somewhat similar, a slight variation of the same approach.
True, a very slight variation, but I am willing to bet that Betty Ann is an entire first name in the second database, and to match it correctly you won't want to remove it, so if Betty Ann K is there you will want to retain Betty Ann.
@Matt With names stored in two different places, in different formats, you will never be able to get it 100% working. I know people with two middle names; what do you do then?
@clifton_h You brought up some very valid points. I think we can all agree that the task is never as simple as it looks when it comes to names, and perhaps the most important part is to match on additional criteria if present (e.g. birthdate, phone, email, etc.). The one caution I have with your substring technique is that you will get some less desirable results, e.g. PAUL will match PAULETTEA, PAULENE | ROSS ROSSANA, ROSSILYN, ROSSY, | SHAUN SHAUNA etc. If you are matching on additional criteria that is no problem, but if matching on name alone it could be an issue.
0

One option would be to join on a substring of the name column which includes only the first name:

SELECT t1.firstname
FROM tableA t1 INNER JOIN tableB t2
    ON t1.firstname = 
       SUBSTRING(t2.firstname,
                 1,
                 CASE WHEN CHARINDEX(' ', t2.firstname) > 0
                      THEN CHARINDEX(' ', t2.firstname) - 1
                      ELSE LEN(t2.firstname)
                 END)

The join condition compares the firstname from tableA against the first word from tableB, assuming that this first word is the first name.

5 Comments

This will error out if the person doesn't have a middle name, and I know a lot of people who do not have a middle name :)
@M.Ali I addressed the edge case you mentioned.
@TimBiegeleisen Using LEFT rather than SUBSTRING really saves you from needing a long CASE statement, since LEN - CHARINDEX works even when no match is found. Also, the trick is to find the last space, so CHARINDEX should be applied to REVERSE(firstname), not just firstname.
@Matt I don't know a lot of people with multiple middle names. I don't see anything wrong with my solution unless the OP says so.
@TimBiegeleisen It is valid; I am just making suggestions based on the many names I have come across in databases. That said, if we are really looking for all of those variations, well, I can tell you that is why I still have a job :).
0

You can compare parts of the fields before the space. Easy way is like this:

select t1.firstname, t2.firstname t2_name
from table1 t1
inner join table2 t2 on
-- append a space inside charindex so a space is always found,
-- otherwise charindex returns 0 and left(..., -1) raises an error
left(t1.firstname, charindex(' ', t1.firstname + ' ') - 1)
= left(t2.firstname, charindex(' ', t2.firstname + ' ') - 1)

3 Comments

I try not to put functions in the ON clause, as your query may then not be able to use an index. Thoughts?
Generally it is bad design to make firstname a PK. How would you compare Richard and Dick?
Natural keys are a terrible choice in database design. However, I would assume that primary keys exist for each of the tables, which would not only make it trivial to find the answer but also keep the predicates sargable for indexing.
0

You can use CHARINDEX to check whether TableB.FirstName (John L) starts with TableA.FirstName (John). Hope this helps.

Where CHARINDEX(tablea.firstname, tableb.firstname) = 1
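A plain prefix test like this also matches longer names that merely start with the same letters. The sketch below, using invented sample data in table variables, shows that effect and one possible way to require a word boundary by appending a space to both sides. Whether the comparison is case-sensitive depends on the collation of your columns.

```sql
-- Sample data, invented for illustration.
DECLARE @TableA TABLE (FirstName VARCHAR(50));
DECLARE @TableB TABLE (FirstName VARCHAR(50));

INSERT INTO @TableA (FirstName) VALUES ('Paul');
INSERT INTO @TableB (FirstName) VALUES ('Paul J'), ('Paulette'), ('Paul');

-- Plain prefix test: 'Paul' is found at position 1 in all three names,
-- so 'Paulette' is returned as well.
SELECT A.FirstName, B.FirstName AS FirstNameWithInitial
FROM   @TableA AS A
INNER JOIN @TableB AS B
       ON CHARINDEX(A.FirstName, B.FirstName) = 1;

-- Prefix test with a word boundary: 'Paul ' is found at position 1
-- in 'Paul J ' and 'Paul ', but not in 'Paulette '.
SELECT A.FirstName, B.FirstName AS FirstNameWithInitial
FROM   @TableA AS A
INNER JOIN @TableB AS B
       ON CHARINDEX(A.FirstName + ' ', B.FirstName + ' ') = 1;
```

Both forms wrap the columns in functions, so neither is likely to benefit from an index on the name columns.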

Comments

0

Okay, this topic seems to have started going way down the rabbit hole, because there is never an easy answer to matching on first names in any larger dataset. That's why there is an entire industry built on it. There are a few data cleansing products out there to assist with a task like that, which I have used but don't represent. The Microsoft tool available in Developer edition and some other editions is Data Quality Services https://msdn.microsoft.com/en-us/library/ff877925.aspx It uses several string matching algorithms and multiple fields to determine duplicates within one or more datasets, though it is somewhat cumbersome. SSIS offers a fuzzy matching task which also uses matching algorithms. You can build your own solutions, etc. The general consensus across all of these techniques is to match on data in addition to the name, such as email, address or birth date.

My recommendation in your particular case is to determine the cleanliness of your dataset, try a few of the string manipulation techniques in these answers, and see what gets you closest to your goal. It is very likely that a combination of techniques, or several passes using more than one of them, will be the answer. If other information can also be related to people in both databases, I would recommend including those additional fields in your matching.
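As a minimal sketch of matching on additional fields, the join below combines a first-word comparison with last name and birth date. The LastName and BirthDate columns and the sample rows are hypothetical; substitute whatever extra attributes both of your tables actually share.

```sql
-- Hypothetical schema and sample data, invented for illustration.
DECLARE @TableA TABLE (FirstName VARCHAR(50), LastName VARCHAR(50), BirthDate DATE);
DECLARE @TableB TABLE (FirstName VARCHAR(50), LastName VARCHAR(50), BirthDate DATE);

INSERT INTO @TableA (FirstName, LastName, BirthDate)
VALUES ('David', 'Smith', '19800501');

INSERT INTO @TableB (FirstName, LastName, BirthDate)
VALUES ('David L', 'Smith', '19800501'),   -- same person, with initial
       ('David R', 'Smith', '19750312');   -- same first and last name, different person

-- The exact-match columns narrow the candidates first; the first-word
-- comparison then only has to decide among rows that already agree on them.
SELECT A.FirstName, B.FirstName AS FirstNameWithInitial, A.LastName, A.BirthDate
FROM   @TableA AS A
INNER JOIN @TableB AS B
       ON  A.LastName  = B.LastName
       AND A.BirthDate = B.BirthDate
       AND A.FirstName = LEFT(B.FirstName, CHARINDEX(' ', B.FirstName + ' ') - 1);
```

With this sample data only the 'David L' row is returned, because 'David R' differs on birth date.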

My suggestions for fast string manipulations to try:

As long as the value is always first name, space, initial, you can strip out the initial and join on the result in the ON clause. Reversing the string also lets you find the position of the last space, in case the person has two first names and an initial.

DECLARE @FirstNameWithInitial VARCHAR(100) = 'bobby lee w'
DECLARE @FirstName VARCHAR(100) = 'bobby lee'

SELECT 
    --unknown number of characters after last space
    LEFT(@FirstNameWithInitial,LEN(@FirstNameWithInitial) - CHARINDEX(' ',REVERSE(@FirstNameWithInitial)))

    --always 2 characters
    ,LEFT(@FirstNameWithInitial, LEN(@FirstNameWithInitial) - 2)

    ,IIF(LEFT(@FirstNameWithInitial,LEN(@FirstNameWithInitial) - CHARINDEX(' ',REVERSE(@FirstNameWithInitial))) = @FirstName,'Join','No Match')

SELECT *
FROM
    TableWithoutInitial t
    INNER JOIN TableWithInitial ti
    ON t.Firstname = LEFT(ti.FirstNameWithInitial,LEN(ti.FirstNameWithInitial) - CHARINDEX(' ',REVERSE(ti.FirstNameWithInitial)))

Adding a way to get everything up to the first space, so that I am not over-engineering to reach the last one when more than one exists.

SELECT
    LEFT(@FirstNameWithInitial, CHARINDEX(' ', @FirstNameWithInitial + ' ') - 1)

Note that simply removing the REVERSE() function from the solution above is not enough: LEN minus the position of the first space gives the number of characters after that space, not before it. Appending a space inside CHARINDEX() guarantees a match, so no NULLIF or ISNULL is needed here either. It is a personal preference not to use SUBSTRING, NULLIF or ISNULL. In the last-space version, CHARINDEX() returns 0 if no match is found, so LEN - 0 is the entire length of the string and no error occurs even without NULLIF or ISNULL.
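To compare the two cut points side by side, here is a small sketch over a table variable. The @Names table and its sample values are invented for illustration, and the names are assumed to have no trailing spaces.

```sql
-- Sample data, invented for illustration.
DECLARE @Names TABLE (FirstName VARCHAR(100));
INSERT INTO @Names (FirstName) VALUES ('David L'), ('bobby lee w'), ('Maria');

SELECT FirstName,
       -- cut at the first space (or keep the whole name if there is none)
       LEFT(FirstName, CHARINDEX(' ', FirstName + ' ') - 1) AS UpToFirstSpace,
       -- cut at the last space (or keep the whole name if there is none)
       LEFT(FirstName, LEN(FirstName) - CHARINDEX(' ', REVERSE(FirstName))) AS UpToLastSpace
FROM   @Names;
```

For 'David L' and 'Maria' both columns agree ('David' and 'Maria'). For 'bobby lee w' the first-space cut gives 'bobby' while the last-space cut gives 'bobby lee', which is exactly the ambiguity discussed in the comments.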

2 Comments

The problem with this solution is that it assumes that if the name column has three strings in it, then two of them are first names and one is a middle name (or vice versa). We may not be able to make this distinction in practice.
@TimBiegeleisen And the reverse is true too. That will depend on the dataset, and it is more than likely an additional complication that ends up in a CASE statement to determine whether more than one space exists and then what to do.
