I have looked at questions similar to mine, but I am looking for an optimal solution within the constraints of Java's built-in data structures.
I have two plain text files: file1 has a list of usernames, and file2 has Twitter posts from those users and others. The posts are stored as plain text in the file.
For each user, if there exists at least one post, I have to pull all the distinct hashtags used in the post(s) (assume hashtags are integers and each post is confined to one line).
Here is my choice of data structure:
Map<String, LinkedHashSet<Integer>> usernames = new HashMap<>();
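// maps each username to the insertion-ordered set of distinct hashtag IDs found in their posts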
My approach to the problem:
- Read file1 to populate the usernames map's keys, putting the default value as null.
- Read file2 sequentially, something like post = file2.readLine()
- If a username in the post is found among the HashMap keys, add all hashtags discovered in the post to that user's value set (a minimal sketch of this flow follows below).
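To make the question concrete, here is a minimal sketch of the full flow. The file names (file1.txt, file2.txt), the class name HashtagCollector, the #123 hashtag form, and the assumption that a post's author is its first whitespace-delimited token are all mine for illustration; adapt them to the actual input format.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.LinkedHashSet;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class HashtagCollector {

        // Assumed format: hashtags are integers written as "#123".
        private static final Pattern HASHTAG = Pattern.compile("#(\\d+)");

        public static void main(String[] args) throws IOException {
            Map<String, LinkedHashSet<Integer>> usernames = new HashMap<>();

            // Step 1: populate the keys from file1 (one username per line), default value null.
            try (BufferedReader users = Files.newBufferedReader(Paths.get("file1.txt"))) {
                String name;
                while ((name = users.readLine()) != null) {
                    usernames.put(name.trim(), null);
                }
            }

            // Step 2: stream file2 one post (i.e. one line) at a time.
            try (BufferedReader posts = Files.newBufferedReader(Paths.get("file2.txt"))) {
                String post;
                while ((post = posts.readLine()) != null) {
                    // Assumption: the post's author is its first whitespace-delimited token.
                    int space = post.indexOf(' ');
                    String author = (space == -1) ? post : post.substring(0, space);

                    // Step 3: collect hashtags only for users listed in file1.
                    if (usernames.containsKey(author)) {
                        LinkedHashSet<Integer> tags =
                                usernames.computeIfAbsent(author, k -> new LinkedHashSet<>());
                        Matcher m = HASHTAG.matcher(post);
                        while (m.find()) {
                            tags.add(Integer.parseInt(m.group(1)));
                        }
                    }
                }
            }
        }
    }

Note that computeIfAbsent treats a null mapping as absent, so the null defaults from step 1 are lazily replaced with a set only for users who actually posted.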
Do this approach and the chosen data structures sound reasonable for a million users (file1) and, say, 10 million posts (file2)?