
I have a 60 MB text file through which my program searches for a specific ID and extracts some related text, and I have to repeat the process for 200+ IDs. Initially I used a loop to cycle through the lines of the file, look for the ID, and then extract the related text, but that takes way too long (~2 min). So instead, I am now looking at a way to load the entire file into memory and then search for my IDs and the associated text there; I imagine that should be faster than accessing the hard drive 200+ times. So I wrote the following code to load the file into memory:

public String createLocalFile(String path)
{   
    String text = "";
    try
    {
        FileReader fileReader = new FileReader( path );
        BufferedReader reader = new BufferedReader( fileReader );
        String currentLine = "";
        while( (currentLine = reader.readLine() ) != null )
        {
            text += currentLine;
            System.out.println( currentLine );
        }

    }
    catch(IOException ex)
    {
        System.out.println(ex.getMessage());
    }
    return text;
}

Unfortunately, saving the file's text into a String variable takes an extremely long time. How can I load the file faster? Or is there a better way to accomplish the same task? Thanks for any help.

Edit: Here is the link to the file https://github.com/MVZSEQ/denovoTranscriptomeMarkerDevelopment/blob/master/Homo_sapiens.GRCh38.pep.all.fa

Typical line looks like:

>ENSP00000471873 pep:putative chromosome:GRCh38:19:49496434:49499689:1 gene:ENSG00000142534 transcript:ENST00000594493 gene_biotype:protein_coding transcript_biotype:protein_coding
MKMQRTIVIRRDYLHYIRKYNRFEKRHKNMSVHLSPCFRDVQIGDIVTVGECRPLSKTVR
FNVLKVTKAAGTKKQFQKF

Where ENSP00000471873 is the ID and the text I would be extracting is

MKMQRTIVIRRDYLHYIRKYNRFEKRHKNMSVHLSPCFRDVQIGDIVTVGECRPLSKTVR
FNVLKVTKAAGTKKQFQKF
  • You aren't accessing the hard drive 200 times. No sane operating system works that way. Put the file into some kind of sane structure, like perhaps an array of strings. Commented Sep 28, 2015 at 19:04
  • If you are trying to maintain some sort of "database" in a text file, maybe you should use an actual DATABASE. Commented Sep 28, 2015 at 19:07
  • 4
    You could use a StringBuilder instead of string concatenation (may be that the compiler is already converting your code to use it). Commented Sep 28, 2015 at 19:08
  • I think perhaps you should include your old program. Loading into memory is probably not going to be a good idea at that size. Commented Sep 28, 2015 at 19:08
  • Use a StringBuilder for string concatenation; it is much faster. To speed up the search itself, try matching the patterns with parallel threads (see en.wikipedia.org/wiki/Data_parallelism and the sketch after these comments). Commented Sep 28, 2015 at 19:08
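
A minimal, hypothetical sketch of that data-parallel suggestion, assuming the file has already been loaded into a List<String> of lines; the helper name and parameters are illustrative, not part of the question:

import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Scan the already-loaded lines for any of the 200+ target IDs in parallel.
static List<String> findMatchingLines(List<String> lines, Set<String> ids) {
    return lines.parallelStream()                                  // fan the work out across cores
            .filter(line -> ids.stream().anyMatch(line::contains)) // keep lines mentioning any ID
            .collect(Collectors.toList());
}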

6 Answers


You are certainly on the right track in thinking you should read this into memory and access it via some sort of mapping. That will remove a lot of the bottleneck, namely disk I/O and access time (memory is much faster).

I would recommend reading the data into a HashMap with the ID being the key and the Text being the value.

Try something like:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public Map<Integer, String> getIdMap(final String pathToFile) throws IOException {
    // we'll use this later to store our mappings
    final Map<Integer, String> map = new HashMap<Integer, String>();
    // read the file into a String
    final String rawFileContents = new String(Files.readAllBytes(Paths.get(pathToFile)));
    // assumes each line is an ID + value
    final String[] fileLines = rawFileContents.split(System.getProperty("line.separator"));
    // iterate over every line, and create a mapping for the ID to Value
    for (final String line : fileLines) {
        Integer id = null;
        try {
            // assumes the id is part 1 of a 2 part line in CSV "," format
            id = Integer.parseInt(line.split(",")[0]);
        } catch (NumberFormatException e) {
            e.printStackTrace();
            continue; // skip malformed lines instead of mapping a null key
        }
        // assumes the value is part 2 of a 2 part line in CSV "," format
        final String value = line.split(",")[1];
        // put the pair into our map
        map.put(id, value);
    }
    return map;
}

This will read the file into memory (as a String), then cut it up into a Map so that it's easy to retrieve the values, for example:

Map<Integer, String> map = getIdMap("/path/to/file");
final String theText = map.get(theId);
System.out.println(theText);

This sample code is untested and makes some assumptions about your file format, namely that there is one ID and value per line, and that the IDs and values are comma-separated (CSV). Of course, if your data is structured a little differently, just tweak to taste.

UPDATED to match your file description:

public Map<String, String> getIdMap(final String pathToFile) throws IOException {
    // we'll use this later to store our mappings
    final Map<String, String> map = new HashMap<String, String>();
    // read the file into a String
    final String rawFileContents = new String(Files.readAllBytes(Paths.get(pathToFile)));
    // split the contents into individual lines
    final String[] fileLines = rawFileContents.split(System.getProperty("line.separator"));
    // iterate over every line, and create a mapping for the ID to Value
    for (final String line : fileLines) {
        // only header lines (starting with '>') carry an ID; skip the sequence lines
        if (!line.startsWith(">")) {
            continue;
        }
        // get the id and remove the leading '>' symbol
        final String id = line.split(" ")[0].replace(">", "").trim();
        // use the key 'transcript_biotype:' to get the value that follows it
        final String value = line.split("transcript_biotype:")[1].trim();
        // put the pair into our map
        map.put(id, value);
    }
    return map;
}

Note that this maps each ID to the metadata on the same header line; if what you need is the sequence text on the following lines, accumulate the non-header lines for the current ID instead.



Agreeing with most of the other comments: 60 MB is not too large for today's memory. But where the time is being sucked up is almost certainly that "+=" appending each line to an increasingly monstrous single string. Make an array of lines instead.

Better yet, separate out the ID text and the "related text" while reading, to make the later ID searching faster. A hash table would be ideal.
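
A minimal sketch of that idea against the file layout described in the question: build the hash table in one streaming pass, never concatenating one giant String (all names here are illustrative):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// One pass over the file: each '>' header's first token becomes the key,
// and the sequence lines that follow it become the value.
static Map<String, String> indexFile(String path) throws IOException {
    Map<String, String> index = new HashMap<String, String>();
    try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
        String id = null;
        StringBuilder text = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.startsWith(">")) {
                if (id != null) {
                    index.put(id, text.toString()); // finish the previous record
                    text.setLength(0);
                }
                id = line.substring(1).split(" ")[0]; // ID is the first header token
            } else if (id != null) {
                text.append(line); // sequence line belongs to the current record
            }
        }
        if (id != null) {
            index.put(id, text.toString()); // don't drop the final record
        }
    }
    return index;
}

Each of the 200+ lookups is then a constant-time index.get(id).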

1 Comment

This is accurate: the += is a bad idea. That said, the approach should be changed further so that the data gets some structure, rather than just holding the whole file in memory as raw bytes. So I think this answer isn't really helping in the best way.

If the file contains a collection of records, then you can:

1. Create a class that has id and text content attributes.
2. Read each record from the file, create an object from it, and add it to a HashMap.
3. Use the HashMap to retrieve objects by ID, as sketched below.
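
A minimal sketch of those three steps, with hypothetical class and field names (as the comment below notes, the real file isn't organized as simple one-record units):

import java.util.HashMap;
import java.util.Map;

// Step 1: a class with id and text content attributes.
class Record {
    final String id;
    final String text;

    Record(String id, String text) {
        this.id = id;
        this.text = text;
    }
}

class RecordIndex {
    private final Map<String, Record> byId = new HashMap<String, Record>();

    // Step 2: add each record parsed from the file to the map.
    void add(Record record) {
        byId.put(record.id, record);
    }

    // Step 3: retrieve objects by ID.
    Record lookup(String id) {
        return byId.get(id);
    }
}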

1 Comment

Unfortunately, it isn't organized like that.

Supposing that your VM has enough heap assigned to it, you can load the raw file into memory like so:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public byte[] loadFile(File f) throws IOException {
    long size = f.length();
    byte[] bytes;
    int nread;
    int next;

    if (size > Integer.MAX_VALUE) {
        throw new IllegalArgumentException("file too long");
    }
    bytes = new byte[(int) size];

    try (InputStream source = new FileInputStream(f)) {
        for (next = 0; next < bytes.length; next += nread) {
            nread = source.read(bytes, next, bytes.length - next);
            if (nread < 0) {
                // the file shrank while we were reading it
                throw new IOException("file truncated while reading it"); // or a custom exception, whatever ...
            }
        }
        if (source.read() != -1) {
            // the file grew while we were reading it
            throw new IOException("file extended while reading it"); // or whatever ...
        }
    }

    return bytes;
}

You can then process that in-memory copy instead of reading from disk by creating a ByteArrayInputStream around it -- you should be able to plug that in to your existing code with relative ease.
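
For example, a sketch of that wiring, where the loop body stands in for the existing per-line search logic and the path is illustrative:

import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.InputStreamReader;

// Load once from disk, then re-scan the in-memory copy as often as needed.
byte[] data = loadFile(new File("/path/to/file"));
BufferedReader reader = new BufferedReader(
        new InputStreamReader(new ByteArrayInputStream(data)));
String line;
while ((line = reader.readLine()) != null) {
    // same per-line search as before, now without touching the disk
}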

There may be ways to optimize still further. For example, if processing the data necessarily involves decoding it into characters, then you could cache the results of the decoding by using a Reader to read into a char[] instead of an InputStream to read into a byte[], and then proceeding similarly. Do note, however, that storing ASCII data in char form takes twice as much space as storing it in byte form.
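
A sketch of that char-caching variant, assuming ASCII content and reusing the f parameter from loadFile above:

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Decode once while reading, caching chars instead of raw bytes.
Reader source = new InputStreamReader(new FileInputStream(f), StandardCharsets.US_ASCII);
char[] chars = new char[(int) f.length()]; // for ASCII, one char per byte
int next = 0;
int nread;
while ((nread = source.read(chars, next, chars.length - next)) > 0) {
    next += nread; // keep filling until the reader is drained
}
source.close();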

If the data are suitable, then it would probably be useful to perform a full parse into some more sophisticated data structure, such as a Map, which could make the subsequent lookups extremely fast. The price, of course, is even more memory usage.

1 Comment

@bayou.io, MappedByteBuffer is certainly an alternative. It has different advantages and disadvantages. It is likely to be much faster to establish, but that's in part because loading data from file into memory can be amortized over subsequent accesses. It's unclear whether to expect overall data access time to be improved. Also, memory-mapping the file leaves the result sensitive to modifications to the underlying file, which might or might not be desired. If you don't need to be able to modify the data, then I'm inclined to prefer straight-up loading it.
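
For reference, a minimal memory-mapping sketch along the lines of that suggestion (the path is illustrative):

import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Map the file read-only; the OS pages it in lazily on first access.
try (FileChannel channel = FileChannel.open(Paths.get("/path/to/file"), StandardOpenOption.READ)) {
    MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
    CharSequence contents = StandardCharsets.US_ASCII.decode(buffer); // search this in memory
}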

I think your problem comes from appending strings to text with +. You should use a StringBuffer instead. I also advise you to use the Scanner class instead of a FileReader:

public String createLocalFile(String path)
{
    StringBuffer text = new StringBuffer();
    try
    {
        Scanner sc = new Scanner( new File(path) );
        while( sc.hasNextLine() )
        {
            String currentLine = sc.nextLine();
            text.append(currentLine);
            System.out.println( currentLine );
        }
        sc.close();
    }
    catch(IOException ex)
    {
        System.out.println(ex.getMessage());
    }
    return text.toString();
}

That should be much faster.

1 Comment

There's no need to use StringBuffer here unless the OP needs thread safety (and the overhead associated with it and StringBuffer). Instead, StringBuilder will likely do just fine here.

What you are working with is a FASTA file. Give BioPerl a try... there are tons of libraries for parsing and working with these kinds of files. Whatever you are doing, it has most likely been done already.

