Seeking appropriate clustering algorithm

Question

I'm analyzing the GDELT dataset and I want to determine thematic clusters. Simplifying considerably, GDELT parses news articles and extracts events. As part of that, it recognizes, let's say, 250 "themes" and tags each "event" it records in a column a semi-colon separated list of all themes identified in the article.

With that preamble, I've extracted, for 2016, a list of approximately 350,000 semi-colon separated theme lists, such as these two:

TAX_FNCACT;TAX_FNCACT_QUEEN;CRISISLEX_T11_UPDATESSYMPATHY;CRISISLEX_CRISISLEXREC;MILITARY;TAX_MILITARY_TITLE;TAX_MILITARY_TITLE_SOLDIER;TAX_FNCACT_SOLDIER;USPEC_POLITICS_GENERAL1;WB_1458_HEALTH_PROMOTION_AND_DISEASE_PREVENTION;WB_1462_WATER_SANITATION_AND_HYGIENE;WB_635_PUBLIC_HEALTH;WB_621_HEALTH_NUTRITION_AND_POPULATION;MARITIME_INCIDENT;MARITIME;MANMADE_DISASTER_IMPLIED;
CRISISLEX_CRISISLEXREC;EDUCATION;SOC_POINTSOFINTEREST;SOC_POINTSOFINTEREST_COLLEGE;TAX_FNCACT;TAX_FNCACT_MAN;TAX_ECON_PRICE;SOC_POINTSOFINTEREST_UNIVERSITY;TAX_FNCACT_JUDGES;TAX_FNCACT_CHILD;LEGISLATION;EPU_POLICY;EPU_POLICY_LAW;TAX_FNCACT_CHILDREN;WB_470_EDUCATION;

As you can see, both of these lists both contain "TAX_FNACT" and "CRISISLEX_CRISISLEXREC". Thus, "TAX_FNACT;CRISISLEX_CRISISLEXREC" is a 2-item cluster. A better understanding of GDELT informs us that it isn't a particularly useful cluster, but it is one nevertheless.

What I'd like to do, ideally, is compose a dictionary of lists. The key for the dictionary is the number of items in the cluster and value is a list of tuples of all theme clusters with that "key" number of elements paired with the number of times that cluster appeared. This ideal algorithm would run until it identified the largest cluster.

Does an algorithm already exist that I can use for this purpose and if so, what is it named? If I had to guess, I would imagine we've created something to extract x-item clusters and then I would just loop from 2->? until I don't get any results.

perhaps data science datascience.stackexchange.com or computer science cs.stackexchange.com is a better place to ask this question. — rsm
– rsm, Commented Feb 22, 2017 at 2:51
Don't repost questions. Flag for a moderator to migrate the question. The cs recommendation is bad, I'd rather suggest stats. — Has QUIT--Anony-Mousse
– Has QUIT--Anony-Mousse, Commented Feb 22, 2017 at 7:52

Has QUIT--Anony-Mousse · Accepted Answer · 2017-02-22 07:54:24Z

1

Clustering won't work well here.

What you describe looks rather like frequent itemset mining. Where the task is to find frequent combinations of 'items' in lists.

answered Feb 22, 2017 at 7:54

Has QUIT--Anony-Mousse

77.8k14 gold badges146 silver badges198 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

argyle Over a year ago

Cool, exactly what I was looking for. Thank you.

Collectives™ on Stack Overflow

Seeking appropriate clustering algorithm

1 Answer 1

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related