Efficient Approach to Decompressed String Substring Problem

Question

The problem involves decompressing a compressed string and outputting the substring in the range of L and R. For example:

A(BB(CC)) -> ABBCCCCBBCCCC

L = 52, R = 60, ((((((((IMTOOSTRONG)))))))) -> IMTOOSTRONGIMTOOSTRONGIMTOOSTRONG...

MTOOSTRON

Here, the characters inside the parentheses are repeated, and the goal is to decompress the string and then output the substring in the given range [L, R].

The challenge is that the range [L, R] can go up to 1 billion, so decompressing the entire string would result in TLE (Time Limit Exceeded). I can’t figure out how to approach this problem efficiently.

How can I efficiently extract a substring from a nested compressed string without fully decompressing it, given that the range [L, R] can be as large as 1 billion?

I attempted to think about strategies like: Using recursion or a stack-based approach to simulate decompression. Counting the length of the decompressed string without fully expanding it, to decide which parts of the string are relevant for the [L, R] range.

Put efficient to the back of your mind. What is the simplest thing that could possibly work acceptably? How can you check it works as specified? Check automatically, at that? What are its resource requirements? R being large is no problem when L is small: the output would be large. What about large L? — greybeard
– greybeard, Commented Nov 20, 2024 at 15:05
Please provide enough code so others can better understand or reproduce the problem. — Community
– Community Bot, Commented Nov 20, 2024 at 18:26
I don't get "the strong example" - 1) does the string start at 1 or at 0? 2) R seems to be interpreted to be inclusive - are the numbers represented in regular decimal? — greybeard
– greybeard, Commented Nov 23, 2024 at 4:30
For the "strong example", I would expect the result to be "ONGIMTOOS", not "MTOOSTRON". The repeated string "IMTOOSTRONG" has 11 characters, so it repeats at indices 11, 22, 33, 44, 55, ... At index 52, we are three characters before the next repetition, so the resulting substring should start with "ONG". I don't understand why you say it should be "MTOOSTRON". Please clarify. — trincot
– trincot, Commented Nov 23, 2024 at 8:07

trincot · Accepted Answer · 2024-11-23 09:31:50Z

I would suggest building a tree where the root represents the whole string, and every child represents either a substring represented by parentheses, or a literal string (without parentheses). The leaves of this tree are the literal strings.

Each node would also store the size of the (sub)string it represents (when decompressed).

To represent the repetition factor, you could add another node property that is either 1 or 2. Or, alternatively, you could define the same child node (reference) as a duplicated child of its parent: this creates two edges between the same node pair, so giving that relation a cardinality of 2. Note that this latter approach would turn the tree into a polytree.

Finally define a substring method on nodes which can then use recursion to retrieve the string representations that are relevant for the given range.

Here is an implementation in a JavaScript snippet:

class Node {
    constructor() {
        this.length = 0; // Length of represented string (in uncompressed form)
        this.children = []; // List of strings and child nodes
    }
    addChild(child) { // child could be a Node or a string
        this.length += child.length;
        this.children.push(child);
    }
    substring(start, stop) { // `stop` is excluded from the range
        if (start >= this.length || stop <= 0 || start >= stop) return "";
        let size = stop - start;
        return this.children.map(child => {
            // If child is a string, the native substring method is called here,
            //    otherwise it is a recursive call:
            const s = child.substring(start, start + size);
            start -= child.length;
            return s;
        }).join("");
    }
}

function getTokenIterator(compressed) {
    // Return an iterator over the substrings that either 
    //    are "(", ")" or a series of other characters
    return compressed.match(/[^()]+|./gs).values();
}

function createGraph(tokens) {
    const node = new Node;
    while (true) {
        const token = tokens.next().value;
        if (!token || token === ")") return node;
        if (token === "(") {
            const child = createGraph(tokens);
            // Add child twice to reflect repetition
            node.addChild(child);
            node.addChild(child);
        } else {
            node.addChild(token);
        }
    }
}

function getSlice(compressed, left, right) {
    const tokens = getTokenIterator(compressed);
    const root = createGraph(tokens);
    // As `right` is inclusive in the range, add 1:
    return root.substring(left, right+1);
}

// Example runs
console.log(getSlice("AA(BB(CC))", 3, 8)); // BCCCCB
console.log(getSlice("((((((((IMTOOSTRONG))))))))", 52, 60));

Complexity analysis

Let 𝑛 be the length of the input string.

The algorithm has these parts:

Tokenizing: the input string is scanned from left to right, which is O(𝑛), creating the token strings, which also totals to O(𝑛).
Tree building: For each token the processing has O(1) time complexity. This consist of either making a recursive call and attaching a node twice to the graph, or making a leaf node. So this totals to O(𝑛).
Getting the substring: this involves traversing the polytree. Here the worst case complexity has two components:
- It is determined by the size of the output string (i.e. 𝑅−𝐿+1 assuming these are valid indices in the uncompressed string). In the worst case an input string consists of a deeply nested string (like your "strong example"), and if 𝐿=0 and 𝑅 is maximised, the whole compressed string might need to be expanded. Take for example 𝑛 as odd, and the first 𝑛/2 characters are opening parentheses, the middle character is "A", and the remaining 𝑛/2 characters are closing parentheses. Then the uncompressed string has 2^𝑛/2 characters. As producing a string needs O(1) per character, the time complexity is O(2^𝑛) for this scenario. If we include 𝐿 and 𝑅 in this expression, it becomes O(min(2^𝑛, 𝑅−𝐿)).
- When 𝑅−𝐿+1 is small, like 1, the complexity is not just O(1). There is the overhead of the polytree traversal. The traversal needs to find a leaf (for 𝐿), which in the worst case may involve visiting all internal nodes twice (because they are repeated), where each time a node's children are visited (only) upon the node's second visit. At some point we reach the leaf that is determined by 𝑅 and unwind out of recursion again. The work to go from 𝐿 and 𝑅 (while producing output) was discussed in the previous paragraph, so I omit it here.
This means that the number of distinct nodes is a determining factor also, which is O(𝑛).

Taking all this together the overall complexity can be expressed as:

O(min(2^𝑛, 𝑅−𝐿) + 𝑛)

Thank you for your opionion! What do you expect the time complexity to be? Then, the tree would be binary tree?
I added a section to my answer with complexity analysis. No, the graph is not a tree (like I explained in my answer), but a polytree. It is not binary: a node can have more than two children. For instance, for input "A(B)C(D)E(F)", the root node has 9 edges to leaf nodes (some of which are duplicate node references, as there are only 6 distinct leaf nodes), each representing one letter.

Collectives™ on Stack Overflow

Efficient Approach to Decompressed String Substring Problem

1 Answer 1

Complexity analysis

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Complexity analysis

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related