Repeated DNA Sequences

描述

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

For example,

Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",

Return:

["AAAAACCCCC", "CCCCCAAAAA"].

分析

首先能想到一个简单直接的方法,用一个长度为10的窗口,从左到右扫描,放入 HashMap,并把计数器增一。最后,把 HashMap 中所有计数器大于1的字符串输出来。时间复杂度 O(n), 由于HashMap中存储了所有长度为10的子串,所以空间复杂度O(10n)

由于字符串中只存在 A, C, G, T 四种字符,我们可以把每个字符映射为2个bit:

  1. A -> 00
  2. C -> 01
  3. G -> 10
  4. T -> 11

每个长度为10的字符串,可以映射为 20 bits, 小于32位,因此可以把这个字符串映射到一个整数。这个方法时间复杂度依旧是O(n),但空间复杂度下降到了O(n)

解法1 简单粗暴

解法2 完美哈希

原文: https://soulmachine.gitbooks.io/algorithm-essentials/content/cpp/bitwise-operations/repeated-dna-sequences.html