Introduction
LeetCode 187 asks you to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule. The problem asks you to return a list of all such sequences, and you need to solve it efficiently using the given constraints.
Problem Statement
Given a string s
that represents a DNA sequence, you need to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule and return them.
Function Signature:
go
func findRepeatedDnaSequences(s string) []string
Example 1:
go
s := "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
result := findRepeatedDnaSequences(s)
fmt.Println(result) // Output: ["AAAAACCCCC", "CCCCCAAAAA"]
Example 2:
go
s := "AAAAAAAAAAAAA"
result := findRepeatedDnaSequences(s)
fmt.Println(result) // Output: ["AAAAAAAAAA"]
Constraints:
1 <= s.length <= 10^5
s[i]
is one of 'A'
, 'C'
, 'G'
, or 'T'
.
Approach
The problem involves detecting repeated 10-character-long substrings from the input string s
. Here’s how we can efficiently solve this problem:
Steps:
- Sliding Window Approach:
- We need to examine every 10-character-long substring in the given string.
- To do this, we can slide a window of size 10 across the string.
- Use a HashMap:
- A HashMap (or map in Go) can be used to track the frequency of each 10-character-long sequence.
- If a sequence appears more than once, it is added to the result list.
- Edge Case Considerations:
- If the string length is less than 10, return an empty list as there are no 10-character sequences.
- Optimization:
- Since we need to check each substring in a string of length up to 10510^5105, the time complexity must be optimal. Using a sliding window combined with a HashMap ensures that we only make a constant-time operation for each sliding step, giving us a time complexity of O(n), where
n
is the length of the string.
Time Complexity:
- O(n), where
n
is the length of the string s
. We iterate through the string and perform constant-time operations (checking substrings, inserting in map) for each sliding window.
Space Complexity:
- O(n) due to the space used by the HashMap to store substrings.
Go Implementation
go
package main
import "fmt"
func findRepeatedDnaSequences(s string) []string {
// Map to store the frequency of each 10-letter substring
seen := make(map[string]int)
result := []string{}
// Slide through the string, considering every substring of length 10
for i := 0; i <= len(s)-10; i++ {
substr := s[i:i+10]
// If the substring has been seen before, add it to the result
if seen[substr] == 1 {
result = append(result, substr)
}
// Increment the count of the substring in the map
seen[substr]++
}
return result
}
func main() {
// Test case 1
s1 := "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
result1 := findRepeatedDnaSequences(s1)
fmt.Println(result1) // Output: ["AAAAACCCCC", "CCCCCAAAAA"]
// Test case 2
s2 := "AAAAAAAAAAAAA"
result2 := findRepeatedDnaSequences(s2)
fmt.Println(result2) // Output: ["AAAAAAAAAA"]
}
Explanation of the Code:
1. findRepeatedDnaSequences function:
- Step 1: Create a
map
called seen
to store the frequency of each 10-character substring.
- Step 2: Iterate through the string
s
from index 0
to len(s)-10
. For each position, extract a 10-character substring and check if it has already appeared. If it has appeared exactly once before, we add it to the result list.
- Step 3: If the substring is encountered for the first time, we store it in the
seen
map and continue.
The map[string]int
stores substrings as keys and their frequencies as values. When we encounter a substring for the second time, it is added to the result list.
2. Main Function:
- In the
main
function, we test the findRepeatedDnaSequences
function with two examples to check the correctness of the implementation.
Example Walkthrough
Example 1:
Input: "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
- We iterate over the string and extract the following substrings:
- "AAAAACCCCC" appears first time.
- "CCCCCAAAAA" appears first time.
- Repeating substrings "AAAAACCCCC" and "CCCCCAAAAA" are added to the result when encountered again.
Output: ["AAAAACCCCC", "CCCCCAAAAA"]
Example 2:
Input: "AAAAAAAAAAAAA"
- We iterate and find the substring "AAAAAAAAAA" multiple times.
Output: ["AAAAAAAAAA"]
Conclusion
LeetCode 187: Repeated DNA Sequences is an efficient problem that requires detecting 10-letter-long substrings that appear more than once in a string. Using a sliding window approach combined with a map
data structure, we can solve this problem in O(n) time and O(n) space.
This approach is both time and space efficient and works well for large input sizes, as required by the problem constraints.