I have a file of JSON-encoded tweets, one per line, that I read into a list in Python like so:

import json

data5 = []
# One JSON object per line; UTF-8 is assumed here, adjust if the file differs
with open('twitFile_5.txt', encoding='utf-8') as file5:
    for line in file5:
        data5.append(json.loads(line))

I can select "text", for example, to get the text of a particular tweet:

data5[1]["text"]

However, I don't know how to:

1) make a list of just the "text" items

2) search that "text" list and count how many times each phrase in a list is mentioned, e.g. ['apple', 'orange fruit', 'bunch of bananas'].

Thanks.

2 Answers

It sounds like map and reduce could solve these:

For example:

# In Python 3, map returns an iterator, so wrap it in list() to get a list
texts = list(map(lambda x: x['text'], data5))

and:

texts = ['apple test', 'test orange fruit']

init = { 'apple': 0, 'orange fruit': 0, 'bunch of bananas': 0 }

def aggregate(agg,x):
  for k in agg:
    if k in x:
      agg[k] += 1
  return agg

from functools import reduce  # needed in Python 3; reduce was a builtin in Python 2

counts = reduce(aggregate, texts, init)

Edit

Per comment:

values = [
    {'text': 'apple test', 'user': 'A'},
    {'text': 'test orange fruit', 'user': 'B'}
  ]

init = { 'apple': [], 'orange fruit': [], 'bunch of bananas': [] }

def aggregate(agg,x):
  for k in agg:
    if k in x['text']:
      agg[k].append(x)
  return agg

counts = reduce(aggregate, values, init)
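As a minimal sketch of how the result above might be read back, assuming the same sample `values` and phrase list: the length of each list is the count for that phrase, and the matched dicts stay available. The names `phrases` and `matches` are introduced here for illustration only.

```python
from functools import reduce  # reduce is not a builtin in Python 3

values = [
    {'text': 'apple test', 'user': 'A'},
    {'text': 'test orange fruit', 'user': 'B'},
]
phrases = ['apple', 'orange fruit', 'bunch of bananas']

def aggregate(agg, x):
    # Append the whole tweet dict under every phrase its text contains.
    for k in agg:
        if k in x['text']:
            agg[k].append(x)
    return agg

# Build a fresh dict of empty lists so repeated runs do not share state.
matches = reduce(aggregate, values, {p: [] for p in phrases})

# The count for each phrase is the length of its list of matches.
counts = {p: len(found) for p, found in matches.items()}
print(counts)
```

Because `aggregate` mutates the dict passed in as the initial value, building that dict inline avoids accidentally accumulating matches across repeated calls.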

7 Comments

Thanks! I should probably just ask a new question - but how about if I want to also return other list details for each of the searched for tweets, such as "user" for example?
@Betty Do you mean like instead of x['text']? You could return any arbitrary object. The simplest would be a tuple so something like (x['text'],x['user']) and that might work well for a handful of fields. A tuple is basically just a fixed size list so you access things by index. For something more robust, you would probably want to define a class and construct and return an instance.
What's there is perfect for now. What I was thinking was for every time "apple" is found returning more details such as the user, but perhaps it would make more sense to use a SQL database for that kind of querying.
@Betty I updated the answer. Rather than an integer counter, you can use a list. The length of the list then would be the "count", but this way you can get the matched values as well.
Excellent, thanks! One more thing as this is great stuff to learn...This returns everything in my list relating to the search term. Is there any way to just return certain fields like "user" or "date_created"?
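One possible way to return only certain fields, sketched under assumptions: the sample data below is hypothetical, the field names "user" and "date_created" are taken from the comment above (real tweet JSON may name or nest them differently), and the helper `pick` is made up for illustration.

```python
# Hypothetical sample data; real tweets would come from data5.
values = [
    {'text': 'apple test', 'user': 'A', 'date_created': '2015-01-01'},
    {'text': 'test orange fruit', 'user': 'B', 'date_created': '2015-01-02'},
    {'text': 'another apple', 'user': 'C', 'date_created': '2015-01-03'},
]
phrases = ['apple', 'orange fruit', 'bunch of bananas']
fields = ('user', 'date_created')  # the fields to keep for each match

def pick(tweet, keys):
    # Keep only the requested keys; .get() yields None if a key is missing.
    return {k: tweet.get(k) for k in keys}

# For each phrase, collect a trimmed-down dict for every matching tweet.
selected = {
    p: [pick(t, fields) for t in values if p in t['text']]
    for p in phrases
}
print(selected['apple'])
```

The count for a phrase is still available as `len(selected[phrase])`.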
1) Use a list comprehension

texts = [d["text"] for d in data5] 

2) List comprehension again

count = len([t for t in texts if 'apple' in t])

I'm interpreting your post to mean you want to count the number of texts that mention "apple." If you want to count the number of times "apple" occurs you can use

count = sum([t.count('apple') for t in texts])
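The two counts above handle a single phrase. A minimal sketch of extending them to the whole phrase list from the question, assuming a small hypothetical `texts` sample in place of the real tweets:

```python
# Hypothetical sample; in practice: texts = [d["text"] for d in data5]
texts = ['apple test', 'test orange fruit', 'apple and more apple']
phrases = ['apple', 'orange fruit', 'bunch of bananas']

# Number of texts that mention each phrase at least once.
texts_mentioning = {p: sum(p in t for t in texts) for p in phrases}

# Total number of non-overlapping occurrences of each phrase.
occurrences = {p: sum(t.count(p) for t in texts) for p in phrases}

print(texts_mentioning)
print(occurrences)
```

Both counts use plain case-sensitive substring matching, so 'apple' would also match inside 'pineapple' and would miss 'Apple'. Lowercasing the texts or matching on word boundaries may be worth considering, depending on the data.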

1 Comment

Thank you so much. How about if I also want to return other details for each of the matched tweets, such as "user", for example?
