30. Substring with Concatenation of All Words

Problem Statement

You are given a string s and an array of strings words. All the strings of words are of the same length.
A concatenated substring in s is a substring that contains all the strings of any permutation of words concatenated.
For example, if words = ["ab","cd","ef"], then "abcdef", "abefcd", "cdabef", "cdefab", "efabcd", and "efcdab" are all concatenated strings. "acdbef" is not a concatenated substring because it is not the concatenation of any permutation of words. Return the starting indices of all the concatenated substrings in s. You can return the answer in any order.
Example 1:
Input: s = "barfoothefoobarman", words = ["foo","bar"]
Output: [0,9]
Explanation: Since words.length == 2 and words[i].length == 3, the concatenated substring has to be of length 6.
The substring starting at 0 is "barfoo". It is the concatenation of ["bar","foo"] which is a permutation of words.
The substring starting at 9 is "foobar". It is the concatenation of ["foo","bar"] which is a permutation of words.
The output order does not matter. Returning [9,0] is fine too.
Example 2:

Input: s = "wordgoodgoodgoodbestword", words = ["word","good","best","word"]
Output: []
Explanation: Since words.length == 4 and words[i].length == 4, the concatenated substring has to be of length 16.
There is no substring of length 16 is s that is equal to the concatenation of any permutation of words.
We return an empty array.
Example 3:

Input: s = "barfoofoobarthefoobarman", words = ["bar","foo","the"]
Output: [6,9,12]
Explanation: Since words.length == 3 and words[i].length == 3, the concatenated substring has to be of length 9.
The substring starting at 6 is "foobarthe". It is the concatenation of ["foo","bar","the"] which is a permutation of words.
The substring starting at 9 is "barthefoo". It is the concatenation of ["bar","the","foo"] which is a permutation of words.
The substring starting at 12 is "thefoobar". It is the concatenation of ["the","foo","bar"] which is a permutation of words.

Constraints:

1 <= s.length <= 10000
1 <= words.length <= 5000
1 <= words[i].length <= 30
s and words[i] consist of lowercase English letters.

My solution

Single iteration sliding window, window of size word length and check if word is in the list, tests like [aa,aa], aaaaaaaaaa failed.

Solution

An important detail in the problem description to notice is that all elements in words have the same length. This gives us valuable information about all valid substrings - we know what length they will be. Each valid substring is the concatenation of words.length words which all have the same length, so each valid substring has a length of words.length * words[0].length.
This makes it easy for us to take a given index and check if a valid substring starting at this index exists. Let's say that the elements of words have a length of 3. Then, for a given starting index, we can just look at the string in groups of 3 characters and check if those characters form a word in words. Because words can have duplicate words, we should use a hash table to maintain a count for each word. As a bonus, a hash table also lets us search for word matches very quickly. We can write a helper function that takes an index and returns if a valid substring starting at this index exists. Then, we can build our answer by running this function for all candidate indices. The logic for this function can be something along the lines of: Iterate from the starting index to the starting index plus the size of a valid substring. Iterate words[0].length characters at a time. At each iteration, we will look at a substring with the same length as the elements in words. If the substring doesn't exist in words, or it does exist but we already found the necessary amount of it, then return false. We can use a hash table to keep an updated count of the words between the starting index and the current index.
Algorithm Initialize some variables: n as the length of s. k as the length of words wordLength as the length of each word in words. substringSize as wordLength * k, which represents the size of each valid substring. wordCount as a hash table that tracks how many times a word occurs in words. Create a function check that takes a starting index i and returns if a valid substring starts at index i: Create a copy of wordCount to make use of for this particular index. Let's call it remaining. Also, initialize an integer wordsUsed which tracks how many matches we have found so far. Iterate starting from i. Iterate until i + substringSize - we know that each valid substring will have this size, so we don't need to go further. At each iteration, we will be checking for a word - and we know each word has a length of wordLength, so increment by wordLength each time. If the variable we are iterating with is j, then at each iteration, check for a word sub = s.substring(j, j + wordLength). If sub is in remaining and has a value greater than 0, then decrease its count by 1 and increase wordsUsed by 1. Otherwise, break out of the loop. At the end of it all, if wordsUsed == k, that means we used up all the words in words and have found a valid substring. Return true if so, false otherwise. Now that we have this function check, we can just check all possible starting indices. Because a valid substring has a length of substringSize, we only need to iterate up to n - substringSize. Build an array with all indices that pass check and return it.

Approach 2: Sliding Window

Intuition
In the previous approach, we made use of the fact that all elements of words share the same length, which allows us to efficiently check for valid substrings. Unfortunately, we repeated a lot of computation - each character of s is iterated over many times. Imagine if we had an input like this:
s = "barfoobarfoo" and words = ["bar", "foo"]
Valid substrings start at index 0, 3, and 6. Notice that the substrings starting at indices 0 and 3 share the same "foo". That means we are iterating over and handling this "foo" twice, which shouldn't be necessary. We do it again with the substrings starting at indices 3 and 6 - they use the same "bar". In this specific example it may not seem too bad, but imagine if we had an input like:
s = "aaaa...aaa", s.length = 10,000 and words = ["a", "a", ..., "a", "a"], words.length = 5000
We would be iterating over the same characters millions of times. How can we avoid repeated computation? Let's make use of a sliding window. We can re-use most of the logic from the previous approach, but this time instead of only checking for one valid substring at a time with each call to check, we will try to find all valid substrings in one pass by sliding our window across s.
So how will the left and right bounds of the window move, and how can we tell if we our window is a valid substring? Let's say we start at index 0 and do the same process as the previous approach - iterate wordLength at a time, so that at each iteration we are focusing on one potential word. Our iteration variable, say right, can be our right bound. We can initialize our left bound at 0, say left = 0.
Now, right will move at each iteration, by wordLength each time. At each iteration, we have a word sub = s.substring(right, right + wordLength). If sub is not in words, we know that we cannot possibly form a valid substring, so we should reset the entire window and try again, starting with the next iteration. If sub is in words, then we need to keep track of it. Like in the previous approach, we can use a hash table to keep count of all the words in our current window.
When our window has reached the maximum size (substringSize), we can check if it is a valid substring. Like in the previous approach, we can use an integer wordsUsed to check if wordsUsed == words.length to see if we made use of all the elements in words, and thus have a valid substring. If we do, then we can add left to our answer.
Whether we have a valid substring or not, if our window has reached maximum size, we need to move the left bound. This means we need to find the word we are removing from the window, and perform the necessary logic to keep our hash table up to date.
Another thing to note: we may encounter excess words. For example, with s = "foofoobar", and words = ["foo", "bar"], the two "foo" should not be matched together to have wordsUsed = 2. Whenever we find that sub is in words, we should check how many times we have seen sub so far in the current window (using our hash table), and if it is greater than the number of times it appears in words (which we can find with a second hash table, wordCount in the first approach), then we know we have an excess word and should not increment wordsUsed.
In fact, so long as we have an excess word, we can never have a valid substring. Therefore, another criterion for moving our left bound should be to remove words from the left until we find the excess word and remove it (which we can accomplish by comparing the hash table values).
Now that we've described the logic needed for the sliding window, how will we apply the window? In the first approach, we tried every candidate index (all indices up until n - substringSize). In this problem, you may notice that starting the process from two indices that are wordLength apart is pointless. For example, if we have words = ["foo", "bar"], then starting from index 3 is pointless since by starting at index 0, we will move over index 3. However, we will still need to try starting from indices 1 and 2, in case the input looks something like s = "xfoobar" or s = "xyfoobar". As such, we will only need to perform the sliding window wordLength amount of times.

Search This Blog

Code and learn