**Disclaimer:**It is a made up problem. Not to be attempted by light hearted.

**Problem:**

I have a 300 word text. I have a large list of indexed strings (Length of string ~ 20, Number of strings ~ 1M). I need to figure out phrases in the 300 word text that match exactly to one of the strings in the large list of strings I have.

A naive approach:

Taking all 45000 (300 C 2) phrases, search in the large list of strings. Can we do better than this? We need to minimize calls to list of indexed strings!

approach 1: Compute Hash values of all 300 texts, run any string matching algo with that and the list of strings you have such that only if the hash values match, comparison is made.

ReplyDeleteapproach 2: If no memory constraints, then run counting sort on the 300 words, and use binary search. Or create 300X300 matrix and run binary search on the 2D matrix. That will will complexity o(log(300square))

We can make trie using 20 length word window size (20 length window will traverse over the entire 300 word text). Then we can traverse the list and search for strings in the trie.

ReplyDelete