2

I have a movie table with 2 columns. ID(int) and MetaData(XML). The MetaData looks like this:

<movie xmlns="urn:schemas-xxx:yyy:catalog" >
  <credits>
    <credit creditId="15594954" creditType="Actor" >aaa</credit>
    <credit creditId="15573106" creditType="Actor" >bbb</credit>
    <credit creditId="15781056" creditType="Actor" >bbb</credit>
    <credit creditId="15781056" creditType="Actor" >ddd</credit>
    <credit creditId="15606109" creditType="Director" >ddd</credit>
    <credit creditId="16316911" creditType="Art Director" >adadad</credit>
    <credit creditId="18484117" creditType="Choreographer" >ch</credit>
    <credit creditId="15707268" creditType="Cinematographer" >cm</credit>
    <credit creditId="15907445" creditType="Screenwriter">sss</credit>
    <credit creditId="15905546" creditType="Screenwriter" >ggg</credit>
    <credit creditId="16493602" creditType="Editor" >eee</credit>
    <credit creditId="15825749" creditType="Composer" >ccc</credit>
    <credit creditId="18486706" creditType="Composer" >ddd</credit>
  </credits>
</movie>

I want to find records where there are duplicates within a credit type - Here actor "bbb" is a duplicate (but "ddd" is not).

If I have a query like below, it even throws records where the actor is also the director. But I don’t want them to show up.

-- Check for Duplicate Cast and Crew
WITH XMLNAMESPACES (DEFAULT 'urn:schemas-xxx:yyy:catalog')
SELECT Count(*)
FROM Movie
WHERE Metadata.value('count(/movie/credits/credit)', 'int') <> Metadata.value('count(distinct-values(/movie/credits/credit))', 'int')

 If I modify my query like below, it works.

WITH XMLNAMESPACES (DEFAULT 'urn:schemas-xxx:yyy:catalog')
SELECT Count(*)
FROM Movie
WHERE 
 (
    (Metadata.value('count(/movie/credits/credit[@creditType="Actor"])', 'int') <> 
        Metadata.value('count(distinct-values(/movie/credits/credit[@creditType="Actor"]))', 'int')
        )

    OR (Metadata.value('count(/movie/credits/credit[@creditType="Director"])', 'int') <> 
        Metadata.value('count(distinct-values(/movie/credits/credit[@creditType="Director"]))', 'int')
        )
     OR (Metadata.value('count(/movie/credits/credit[@creditType="Producer"])', 'int') <> 
        Metadata.value('count(distinctvalues(/movie/credits/credit[@creditType="Producer"]))', 'int')
        )
)

But there are a lot of credit types like composer, editor etc and I don’t want to do it this way for every credit type. Is there any efficient way to do this?

Update:

I found that the previous query did case-sensitive search. I needed a case-insensitive one, so changed it like the one below:

WITH XMLNAMESPACES (DEFAULT 'urn:xxx:yyy:catalog')
SELECT Count(*) FROM
(
SELECT  ID
FROM Movie
CROSS APPLY
Movie.Metadata.nodes('/movie/credits/credit[@creditType="Actor"]') x(y)
GROUP BY ID
HAVING 
 COUNT(y.value('.', 'varchar(100)')) <> COUNT(Distinct y.value('.', 'varchar(100)'))
) AS temp;

But My original problem still remains.

1 Answer 1

1

You can use a FLOWR and check the count for each distinct value of @creditType. Return a dummy node and check for existence of nodes using exist().

with xmlnamespaces(default 'urn:schemas-xxx:yyy:catalog')
select count(*)
from Movie as M
where M.Metadata.exist('
  for $creditType in distinct-values(/movie/credits/credit/@creditType)
  where count(distinct-values(/movie/credits/credit[@creditType = $creditType]/text())) != count(/movie/credits/credit[@creditType = $creditType]/text())
  return <X/>') = 1;

SQL-Fiddle

Sign up to request clarification or add additional context in comments.

1 Comment

Thanks! This works! Is there a way to get the distinct-values in a xquery in a case insensitive way? now it considers 'bb' and 'BB' as different. For my case, they are duplicates.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.