0

I'm analyzing the GDELT dataset and I want to determine thematic clusters. Simplifying considerably, GDELT parses news articles and extracts events. As part of that, it recognizes, let's say, 250 "themes" and tags each "event" it records in a column a semi-colon separated list of all themes identified in the article.

With that preamble, I've extracted, for 2016, a list of approximately 350,000 semi-colon separated theme lists, such as these two:

  • TAX_FNCACT;TAX_FNCACT_QUEEN;CRISISLEX_T11_UPDATESSYMPATHY;CRISISLEX_CRISISLEXREC;MILITARY;TAX_MILITARY_TITLE;TAX_MILITARY_TITLE_SOLDIER;TAX_FNCACT_SOLDIER;USPEC_POLITICS_GENERAL1;WB_1458_HEALTH_PROMOTION_AND_DISEASE_PREVENTION;WB_1462_WATER_SANITATION_AND_HYGIENE;WB_635_PUBLIC_HEALTH;WB_621_HEALTH_NUTRITION_AND_POPULATION;MARITIME_INCIDENT;MARITIME;MANMADE_DISASTER_IMPLIED;
  • CRISISLEX_CRISISLEXREC;EDUCATION;SOC_POINTSOFINTEREST;SOC_POINTSOFINTEREST_COLLEGE;TAX_FNCACT;TAX_FNCACT_MAN;TAX_ECON_PRICE;SOC_POINTSOFINTEREST_UNIVERSITY;TAX_FNCACT_JUDGES;TAX_FNCACT_CHILD;LEGISLATION;EPU_POLICY;EPU_POLICY_LAW;TAX_FNCACT_CHILDREN;WB_470_EDUCATION;

As you can see, both of these lists both contain "TAX_FNACT" and "CRISISLEX_CRISISLEXREC". Thus, "TAX_FNACT;CRISISLEX_CRISISLEXREC" is a 2-item cluster. A better understanding of GDELT informs us that it isn't a particularly useful cluster, but it is one nevertheless.

What I'd like to do, ideally, is compose a dictionary of lists. The key for the dictionary is the number of items in the cluster and value is a list of tuples of all theme clusters with that "key" number of elements paired with the number of times that cluster appeared. This ideal algorithm would run until it identified the largest cluster.

Does an algorithm already exist that I can use for this purpose and if so, what is it named? If I had to guess, I would imagine we've created something to extract x-item clusters and then I would just loop from 2->? until I don't get any results.

3
  • perhaps data science datascience.stackexchange.com or computer science cs.stackexchange.com is a better place to ask this question. Commented Feb 22, 2017 at 2:51
  • Thanks - I wasn't sure. I'll ask over in datascience. Commented Feb 22, 2017 at 2:54
  • Don't repost questions. Flag for a moderator to migrate the question. The cs recommendation is bad, I'd rather suggest stats. Commented Feb 22, 2017 at 7:52

1 Answer 1

1

Clustering won't work well here.

What you describe looks rather like frequent itemset mining. Where the task is to find frequent combinations of 'items' in lists.

Sign up to request clarification or add additional context in comments.

1 Comment

Cool, exactly what I was looking for. Thank you.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.