I have looked at questions similar to mine, but I am looking for an optimal solution within the constraints of Java's built-in data structures.
I have two plain text files: file1 has a list of usernames, and file2 has Twitter posts from those users and others. The posts are stored as plain text in the file.
For each user, if there exists at least one post, I have to pull all the distinct hashtags used in the post(s) (assume hashtags are integers and each post is confined to one line).
Here is my choice of data structure:
Map<String, LinkedHashSet<Integer>> usernames = new HashMap<>();
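// maps each username to the insertion-ordered set of distinct hashtag IDs found in their posts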
My approach to the problem:
- Read file1 to populate the usernames map's keys, putting the default value as null.
- Read file2 sequentially, something like post = file2.readLine()
- If a username in the post is found among the HashMap keys, add all hashtags discovered in the post to that user's value set (a minimal sketch of this flow follows below).
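To make the question concrete, here is a minimal sketch of the full flow. The file names (file1.txt, file2.txt), the class name HashtagCollector, the #123 hashtag form, and the assumption that a post's author is its first whitespace-delimited token are all mine for illustration; adapt them to the actual input format.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.LinkedHashSet;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class HashtagCollector {

        // Assumed format: hashtags are integers written as "#123".
        private static final Pattern HASHTAG = Pattern.compile("#(\\d+)");

        public static void main(String[] args) throws IOException {
            Map<String, LinkedHashSet<Integer>> usernames = new HashMap<>();

            // Step 1: populate the keys from file1 (one username per line), default value null.
            try (BufferedReader users = Files.newBufferedReader(Paths.get("file1.txt"))) {
                String name;
                while ((name = users.readLine()) != null) {
                    usernames.put(name.trim(), null);
                }
            }

            // Step 2: stream file2 one post (i.e. one line) at a time.
            try (BufferedReader posts = Files.newBufferedReader(Paths.get("file2.txt"))) {
                String post;
                while ((post = posts.readLine()) != null) {
                    // Assumption: the post's author is its first whitespace-delimited token.
                    int space = post.indexOf(' ');
                    String author = (space == -1) ? post : post.substring(0, space);

                    // Step 3: collect hashtags only for users listed in file1.
                    if (usernames.containsKey(author)) {
                        LinkedHashSet<Integer> tags =
                                usernames.computeIfAbsent(author, k -> new LinkedHashSet<>());
                        Matcher m = HASHTAG.matcher(post);
                        while (m.find()) {
                            tags.add(Integer.parseInt(m.group(1)));
                        }
                    }
                }
            }
        }
    }

Note that computeIfAbsent treats a null mapping as absent, so the null defaults from step 1 are lazily replaced with a set only for users who actually posted.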
Do this approach and the chosen data structures sound reasonable for a million users (file1) and, say, 10 million posts (file2)?