I have a table people with (among others) the fields givenName and gender. I want to update all rows with gender=NULL according to best guesses based on other rows. That is, if there are the following rows

"John", NULL
"Jane", NULL
"Sam", NULL
"Alex", NULL
"Jack", NULL
"John", "male"
"John", "male"
"Jane", "female"
"Sam", "female"
"Sam", "male"
"Alex", "female"

I want to produce the following changes:

"John", "male"
"Jane", "female"
"Sam", NULL
"Alex", "female"
"Jack", NULL
...

So John is correctly identified as male, Jane as female, whereas it is left unclear whether Sam is a Samantha or a Samuel. I am aware of the shortcomings of my approach (namely, Alex might in reality be male, and the well-known male name Jack is not recognized as such), but still I wonder if my goal can be achieved with a single SQL query?

If it weren't for the mixed cases (such as "Sam"), I suppose that UPDATE people A, people B SET A.gender = B.gender WHERE A.givenName=B.givenName AND A.gender IS NULL and B.gender IS NOT NULL should do it ...

3 Comments
  • I am not sure about a single query for that. First you need to group by "givenname" and "gender" (not null). After that comes a second level of grouping by "givenname" only, with COUNT(*)=1 (which means the name is not both male and female). After that you will have only a "map of names to gender without ambiguity" in your table. Commented Feb 11, 2017 at 11:48
  • @laser In other words, it is probably easier (at least for the human reader and maintainer) to CREATE TEMPORARY TABLE with the non-ambiguous names? Commented Feb 11, 2017 at 12:11
  • Yes, I think so. Otherwise it will be rocket science for any new reader =) Commented Feb 11, 2017 at 12:53
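
The two-level grouping and the temporary table discussed in these comments could be sketched roughly as follows. This is only an illustration in MySQL syntax, assuming the question's table people with columns givenName and gender; the table name unambiguous_names and the alias pairs are made up for the example.

```sql
-- Hypothetical helper table: one row per name seen with exactly one gender.
CREATE TEMPORARY TABLE unambiguous_names AS
SELECT givenName, MIN(gender) AS gender
FROM (
    -- First level: one row per distinct (name, gender) pair.
    SELECT givenName, gender
    FROM people
    WHERE gender IS NOT NULL
    GROUP BY givenName, gender
) AS pairs
GROUP BY givenName
-- Second level: keep only names that occur with a single gender.
HAVING COUNT(*) = 1;

-- Fill in the missing genders from the unambiguous map.
UPDATE people p
JOIN unambiguous_names u ON u.givenName = p.givenName
SET p.gender = u.gender
WHERE p.gender IS NULL;
```

With the sample rows from the question, the helper table would hold John, Jane and Alex, while Sam drops out because it appears with both genders.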

2 Answers


You could join against a dynamically generated table: a select over the rows whose gender is not null, grouped by name, having count = 1

  UPDATE  people a
  INNER JOIN  (select name, max(gender) gender
               from people 
               where gender is not null
               group by name
               -- keep only names with exactly one non-null gender row
               having count(gender)=1 ) t   on t.name = a.name
  set a.gender = t.gender 
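
One caveat with the sample data from the question: "John" has two rows with gender "male", so count(gender)=1 would skip that name. A possible variant, offered only as a sketch and not as this answer's original query, counts distinct genders instead and touches only rows whose gender is still NULL. It assumes the column is called name as in the query above (the question calls it givenName).

```sql
UPDATE people a
INNER JOIN (
    SELECT name, MAX(gender) AS gender
    FROM people
    WHERE gender IS NOT NULL
    GROUP BY name
    -- One distinct gender per name, however many rows carry it.
    HAVING COUNT(DISTINCT gender) = 1
) t ON t.name = a.name
SET a.gender = t.gender
-- Leave rows that already have a gender untouched.
WHERE a.gender IS NULL;
```

On the question's sample rows this would set John to male and Jane and Alex to female, and leave Sam and Jack as NULL.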

3 Comments

Your query won't modify rows with gender already set (i.e., "set" them to the already existing value), but would it still be better to add the WHERE-condition AND a.gender IS NULL or does that not matter, performance-wise? (Also I'm confused: Isn't on t.name = a.name and where a.name = t.name redundant?)
It modifies all rows where the name matches, regardless of their gender. The subquery selects only the names that have exactly one gender different from NULL (I removed the redundant WHERE condition). Rows where that one gender is already set don't need the update, but for performance this is irrelevant.
Remark: In my final application, I used a computed name substring_index(substring_index(name,' ',1),'-',1) both in t and in A, so that "John-Boy" and "John Ross" are treated as if they were "John". In order to work properly, this required me to replace the count(gender)=1 condition with max(gender)=min(gender)
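
The variant described in this remark might look roughly like the following. It is a reconstruction from the comment, not the commenter's actual query; the alias firstName is made up for the example, and the restriction to rows with a NULL gender is an added assumption.

```sql
UPDATE people a
INNER JOIN (
    SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(name, ' ', 1), '-', 1) AS firstName,
           MAX(gender) AS gender
    FROM people
    WHERE gender IS NOT NULL
    -- Group "John-Boy" and "John Ross" together with "John".
    GROUP BY SUBSTRING_INDEX(SUBSTRING_INDEX(name, ' ', 1), '-', 1)
    -- All non-null genders in the group agree.
    HAVING MAX(gender) = MIN(gender)
) t ON t.firstName = SUBSTRING_INDEX(SUBSTRING_INDEX(a.name, ' ', 1), '-', 1)
SET a.gender = t.gender
WHERE a.gender IS NULL;
```

Because several rows can now fall into one group, a row count no longer says whether the genders agree, whereas comparing the maximum and minimum does.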

As a slight twist on Scais's offer, I would apply the gender with the higher probability across your entire table. Obviously you are only showing a small sample. I would start by getting every name that is on file, with its corresponding counts as male AND female. The result of that should be applied to the rows with a missing gender. For example, if you had "Jack" in your table 85 times as male and 2 times as female (I actually knew a female who went by Jack, short for Jackie), then "Jack" as male would be applied.

select
      p.name, 
      sum( case when p2.gender = 'male' then 1 else 0 end ) as maleCount,
      sum( case when p2.gender = 'female' then 1 else 0 end ) as femaleCount
   from 
      people p
         join people p2
            on p.name = p2.name
           AND p2.gender IS NOT NULL
   where 
      p.gender is null
   group by 
      p.name

Now, use THAT as the basis for the correlated update, in similar fashion to Scais's. Also, we only want to update rows where the existing gender IS NULL; otherwise we would be updating EVERYONE.

UPDATE  people a
   INNER JOIN  (above query) t
      on t.name = a.name
   set a.gender = case when t.maleCount > t.femaleCount 
                       then 'male' else 'female' end
   where a.gender IS NULL
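
Put together as one statement, the majority-vote update might look like the sketch below. Two things differ from the queries above and are assumptions of this example: the counts are taken directly from the rows with a known gender (without the self-join), and a tie such as the question's "Sam" (one male, one female) is left as NULL rather than falling through to 'female'.

```sql
UPDATE people a
INNER JOIN (
    -- Per-name tallies over the rows whose gender is known.
    SELECT name,
           SUM(CASE WHEN gender = 'male' THEN 1 ELSE 0 END) AS maleCount,
           SUM(CASE WHEN gender = 'female' THEN 1 ELSE 0 END) AS femaleCount
    FROM people
    WHERE gender IS NOT NULL
    GROUP BY name
) t ON t.name = a.name
SET a.gender = CASE
                   WHEN t.maleCount > t.femaleCount THEN 'male'
                   WHEN t.femaleCount > t.maleCount THEN 'female'
                   ELSE NULL  -- tie: leave the gender undecided
               END
WHERE a.gender IS NULL;
```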
