Trie structure using Hash in Ruby
up vote
0
down vote
favorite
This is a follow-up to Boggle board solver in Ruby
class Trie < Hash
def build(string)
string << '.' # Terminator indicating end of a complete word
string.chars.inject(self) do |h, char|
h[char] ||= {}
end
end
end
# Build the trie from all the words
@trie = Trie.new
File.foreach('wordlist.txt') { |w| @trie.build(w.chomp) }
Here I use a simplistic Trie structure based on a Hash in Ruby. It does the trick very nicely, but I am wondering whether there is a better way to implement this in Ruby that will consume less memory.
(It lacks lookup methods which is perfectly fine, because I track through the Trie along with the path through letters on a board - the '.' acts as a marker for a complete word).
When I load the file, which contains 270K+ words and is ~2.9MB in size, the memory consumption is over 188 megabytes.
Is there a way to reduce the memory consumption, perhaps, implementing this differently?
ruby memory-optimization trie
bumped to the homepage by Community♦ yesterday
This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.
add a comment |
up vote
0
down vote
favorite
This is a follow-up to Boggle board solver in Ruby
class Trie < Hash
def build(string)
string << '.' # Terminator indicating end of a complete word
string.chars.inject(self) do |h, char|
h[char] ||= {}
end
end
end
# Build the trie from all the words
@trie = Trie.new
File.foreach('wordlist.txt') { |w| @trie.build(w.chomp) }
Here I use a simplistic Trie structure based on a Hash in Ruby. It does the trick very nicely, but I am wondering whether there is a better way to implement this in Ruby that will consume less memory.
(It lacks lookup methods which is perfectly fine, because I track through the Trie along with the path through letters on a board - the '.' acts as a marker for a complete word).
When I load the file, which contains 270K+ words and is ~2.9MB in size, the memory consumption is over 188 megabytes.
Is there a way to reduce the memory consumption, perhaps, implementing this differently?
ruby memory-optimization trie
bumped to the homepage by Community♦ yesterday
This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.
add a comment |
up vote
0
down vote
favorite
up vote
0
down vote
favorite
This is a follow-up to Boggle board solver in Ruby
class Trie < Hash
def build(string)
string << '.' # Terminator indicating end of a complete word
string.chars.inject(self) do |h, char|
h[char] ||= {}
end
end
end
# Build the trie from all the words
@trie = Trie.new
File.foreach('wordlist.txt') { |w| @trie.build(w.chomp) }
Here I use a simplistic Trie structure based on a Hash in Ruby. It does the trick very nicely, but I am wondering whether there is a better way to implement this in Ruby that will consume less memory.
(It lacks lookup methods which is perfectly fine, because I track through the Trie along with the path through letters on a board - the '.' acts as a marker for a complete word).
When I load the file, which contains 270K+ words and is ~2.9MB in size, the memory consumption is over 188 megabytes.
Is there a way to reduce the memory consumption, perhaps, implementing this differently?
ruby memory-optimization trie
This is a follow-up to Boggle board solver in Ruby
class Trie < Hash
def build(string)
string << '.' # Terminator indicating end of a complete word
string.chars.inject(self) do |h, char|
h[char] ||= {}
end
end
end
# Build the trie from all the words
@trie = Trie.new
File.foreach('wordlist.txt') { |w| @trie.build(w.chomp) }
Here I use a simplistic Trie structure based on a Hash in Ruby. It does the trick very nicely, but I am wondering whether there is a better way to implement this in Ruby that will consume less memory.
(It lacks lookup methods which is perfectly fine, because I track through the Trie along with the path through letters on a board - the '.' acts as a marker for a complete word).
When I load the file, which contains 270K+ words and is ~2.9MB in size, the memory consumption is over 188 megabytes.
Is there a way to reduce the memory consumption, perhaps, implementing this differently?
ruby memory-optimization trie
ruby memory-optimization trie
asked Sep 10 at 5:18
mydoghasworms
1214
1214
bumped to the homepage by Community♦ yesterday
This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.
bumped to the homepage by Community♦ yesterday
This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.
add a comment |
add a comment |
1 Answer
1
active
oldest
votes
up vote
0
down vote
Bloom Filters
I suggest a new data structure, specifically a Bloom Filter. A bloom filter is a boolean array of length m. You will also need k different hashing algorithms (more on m and k in a moment). Each hash should accept complete word and return a number 0 through m.
A bloom filter is a big hashing data structure that returns one of two results when checking if an item is in the hash:
1) The item is definitely not in the data structure
2) The item is maybe in the data structure
To use the bloom filter, you will load it with every word in your list. Specifically, for every word, load the first two letters, then first three, then so on, until the whole word is loaded. I like the idea of using a period to mark the end of the word, so do that too. For example, for the word 'cart', you would load your filter with ['ca', car', 'cart'].
Then, on the boggle array, you can do an exhaustive DFS search fromevery starting position. if the letter combination (with or without period) is in the list, then you keep branching and searching. If the combination isn't in the list, you choose a different path, etc.
Simple concrete example
Here is a simple example of a bloom filter being used. Let's set m=100 (array of 100 booleans, initialized to true) and k=2 (use two different hashing algorithms).
When hashing the word 'cat', hash1 might return 7 and hash2 might return 36, so you set indexes 7 and 36 to false. When hashing the word 'dog', hash1 might return 7 and hash2 might return 70, so those indices are set to false. Next, we check to see if 'fish' is in the array. Hash1 returns 80 and hash2 returns 7. Because all the indices are not false, we know that 'fish' is definitely not in the array. Finally, we check if 'bird' is in the array. Hash1 returns 70 and Hash2 returns 7. 'Bird' might be in the array and we will have to check the actual wordlist to confirm.
Some notes
1) You should be able to search using only the bloom filter until you are checking for a complete word (For example when checking for 'car.', and not checking 'car' as a substring of 'cart.' or 'carriage.'). Otherwise if the bloom filter says it's probably in the list, just assume it's true.
2) Depending on how thorough your edge cases are, it may be possible to occasionally miss a possible word depending on the hash algoirthms chosen, size of filter, etc. In general, I would say that while you are not 100% guaranteed to find the highest scoring word, you will definitely find a high scoring word.
3) The bloom filter here is chosen to reduce the memory space of the search. The running time may very well be longer than the trie structure.
4) When confirming a word exist in your word list, you should, of course, use a binary search.
Choosing m and k
There is some math involved in choosing m and k. I'll spare you the finer details and suggest you look at wiki section about choosing optimal sizes.
add a comment |
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
0
down vote
Bloom Filters
I suggest a new data structure, specifically a Bloom Filter. A bloom filter is a boolean array of length m. You will also need k different hashing algorithms (more on m and k in a moment). Each hash should accept complete word and return a number 0 through m.
A bloom filter is a big hashing data structure that returns one of two results when checking if an item is in the hash:
1) The item is definitely not in the data structure
2) The item is maybe in the data structure
To use the bloom filter, you will load it with every word in your list. Specifically, for every word, load the first two letters, then first three, then so on, until the whole word is loaded. I like the idea of using a period to mark the end of the word, so do that too. For example, for the word 'cart', you would load your filter with ['ca', car', 'cart'].
Then, on the boggle array, you can do an exhaustive DFS search fromevery starting position. if the letter combination (with or without period) is in the list, then you keep branching and searching. If the combination isn't in the list, you choose a different path, etc.
Simple concrete example
Here is a simple example of a bloom filter being used. Let's set m=100 (array of 100 booleans, initialized to true) and k=2 (use two different hashing algorithms).
When hashing the word 'cat', hash1 might return 7 and hash2 might return 36, so you set indexes 7 and 36 to false. When hashing the word 'dog', hash1 might return 7 and hash2 might return 70, so those indices are set to false. Next, we check to see if 'fish' is in the array. Hash1 returns 80 and hash2 returns 7. Because all the indices are not false, we know that 'fish' is definitely not in the array. Finally, we check if 'bird' is in the array. Hash1 returns 70 and Hash2 returns 7. 'Bird' might be in the array and we will have to check the actual wordlist to confirm.
Some notes
1) You should be able to search using only the bloom filter until you are checking for a complete word (For example when checking for 'car.', and not checking 'car' as a substring of 'cart.' or 'carriage.'). Otherwise if the bloom filter says it's probably in the list, just assume it's true.
2) Depending on how thorough your edge cases are, it may be possible to occasionally miss a possible word depending on the hash algoirthms chosen, size of filter, etc. In general, I would say that while you are not 100% guaranteed to find the highest scoring word, you will definitely find a high scoring word.
3) The bloom filter here is chosen to reduce the memory space of the search. The running time may very well be longer than the trie structure.
4) When confirming a word exist in your word list, you should, of course, use a binary search.
Choosing m and k
There is some math involved in choosing m and k. I'll spare you the finer details and suggest you look at wiki section about choosing optimal sizes.
add a comment |
up vote
0
down vote
Bloom Filters
I suggest a new data structure, specifically a Bloom Filter. A bloom filter is a boolean array of length m. You will also need k different hashing algorithms (more on m and k in a moment). Each hash should accept complete word and return a number 0 through m.
A bloom filter is a big hashing data structure that returns one of two results when checking if an item is in the hash:
1) The item is definitely not in the data structure
2) The item is maybe in the data structure
To use the bloom filter, you will load it with every word in your list. Specifically, for every word, load the first two letters, then first three, then so on, until the whole word is loaded. I like the idea of using a period to mark the end of the word, so do that too. For example, for the word 'cart', you would load your filter with ['ca', car', 'cart'].
Then, on the boggle array, you can do an exhaustive DFS search fromevery starting position. if the letter combination (with or without period) is in the list, then you keep branching and searching. If the combination isn't in the list, you choose a different path, etc.
Simple concrete example
Here is a simple example of a bloom filter being used. Let's set m=100 (array of 100 booleans, initialized to true) and k=2 (use two different hashing algorithms).
When hashing the word 'cat', hash1 might return 7 and hash2 might return 36, so you set indexes 7 and 36 to false. When hashing the word 'dog', hash1 might return 7 and hash2 might return 70, so those indices are set to false. Next, we check to see if 'fish' is in the array. Hash1 returns 80 and hash2 returns 7. Because all the indices are not false, we know that 'fish' is definitely not in the array. Finally, we check if 'bird' is in the array. Hash1 returns 70 and Hash2 returns 7. 'Bird' might be in the array and we will have to check the actual wordlist to confirm.
Some notes
1) You should be able to search using only the bloom filter until you are checking for a complete word (For example when checking for 'car.', and not checking 'car' as a substring of 'cart.' or 'carriage.'). Otherwise if the bloom filter says it's probably in the list, just assume it's true.
2) Depending on how thorough your edge cases are, it may be possible to occasionally miss a possible word depending on the hash algoirthms chosen, size of filter, etc. In general, I would say that while you are not 100% guaranteed to find the highest scoring word, you will definitely find a high scoring word.
3) The bloom filter here is chosen to reduce the memory space of the search. The running time may very well be longer than the trie structure.
4) When confirming a word exist in your word list, you should, of course, use a binary search.
Choosing m and k
There is some math involved in choosing m and k. I'll spare you the finer details and suggest you look at wiki section about choosing optimal sizes.
add a comment |
up vote
0
down vote
up vote
0
down vote
Bloom Filters
I suggest a new data structure, specifically a Bloom Filter. A bloom filter is a boolean array of length m. You will also need k different hashing algorithms (more on m and k in a moment). Each hash should accept complete word and return a number 0 through m.
A bloom filter is a big hashing data structure that returns one of two results when checking if an item is in the hash:
1) The item is definitely not in the data structure
2) The item is maybe in the data structure
To use the bloom filter, you will load it with every word in your list. Specifically, for every word, load the first two letters, then first three, then so on, until the whole word is loaded. I like the idea of using a period to mark the end of the word, so do that too. For example, for the word 'cart', you would load your filter with ['ca', car', 'cart'].
Then, on the boggle array, you can do an exhaustive DFS search fromevery starting position. if the letter combination (with or without period) is in the list, then you keep branching and searching. If the combination isn't in the list, you choose a different path, etc.
Simple concrete example
Here is a simple example of a bloom filter being used. Let's set m=100 (array of 100 booleans, initialized to true) and k=2 (use two different hashing algorithms).
When hashing the word 'cat', hash1 might return 7 and hash2 might return 36, so you set indexes 7 and 36 to false. When hashing the word 'dog', hash1 might return 7 and hash2 might return 70, so those indices are set to false. Next, we check to see if 'fish' is in the array. Hash1 returns 80 and hash2 returns 7. Because all the indices are not false, we know that 'fish' is definitely not in the array. Finally, we check if 'bird' is in the array. Hash1 returns 70 and Hash2 returns 7. 'Bird' might be in the array and we will have to check the actual wordlist to confirm.
Some notes
1) You should be able to search using only the bloom filter until you are checking for a complete word (For example when checking for 'car.', and not checking 'car' as a substring of 'cart.' or 'carriage.'). Otherwise if the bloom filter says it's probably in the list, just assume it's true.
2) Depending on how thorough your edge cases are, it may be possible to occasionally miss a possible word depending on the hash algoirthms chosen, size of filter, etc. In general, I would say that while you are not 100% guaranteed to find the highest scoring word, you will definitely find a high scoring word.
3) The bloom filter here is chosen to reduce the memory space of the search. The running time may very well be longer than the trie structure.
4) When confirming a word exist in your word list, you should, of course, use a binary search.
Choosing m and k
There is some math involved in choosing m and k. I'll spare you the finer details and suggest you look at wiki section about choosing optimal sizes.
Bloom Filters
I suggest a new data structure, specifically a Bloom Filter. A bloom filter is a boolean array of length m. You will also need k different hashing algorithms (more on m and k in a moment). Each hash should accept complete word and return a number 0 through m.
A bloom filter is a big hashing data structure that returns one of two results when checking if an item is in the hash:
1) The item is definitely not in the data structure
2) The item is maybe in the data structure
To use the bloom filter, you will load it with every word in your list. Specifically, for every word, load the first two letters, then first three, then so on, until the whole word is loaded. I like the idea of using a period to mark the end of the word, so do that too. For example, for the word 'cart', you would load your filter with ['ca', car', 'cart'].
Then, on the boggle array, you can do an exhaustive DFS search fromevery starting position. if the letter combination (with or without period) is in the list, then you keep branching and searching. If the combination isn't in the list, you choose a different path, etc.
Simple concrete example
Here is a simple example of a bloom filter being used. Let's set m=100 (array of 100 booleans, initialized to true) and k=2 (use two different hashing algorithms).
When hashing the word 'cat', hash1 might return 7 and hash2 might return 36, so you set indexes 7 and 36 to false. When hashing the word 'dog', hash1 might return 7 and hash2 might return 70, so those indices are set to false. Next, we check to see if 'fish' is in the array. Hash1 returns 80 and hash2 returns 7. Because all the indices are not false, we know that 'fish' is definitely not in the array. Finally, we check if 'bird' is in the array. Hash1 returns 70 and Hash2 returns 7. 'Bird' might be in the array and we will have to check the actual wordlist to confirm.
Some notes
1) You should be able to search using only the bloom filter until you are checking for a complete word (For example when checking for 'car.', and not checking 'car' as a substring of 'cart.' or 'carriage.'). Otherwise if the bloom filter says it's probably in the list, just assume it's true.
2) Depending on how thorough your edge cases are, it may be possible to occasionally miss a possible word depending on the hash algoirthms chosen, size of filter, etc. In general, I would say that while you are not 100% guaranteed to find the highest scoring word, you will definitely find a high scoring word.
3) The bloom filter here is chosen to reduce the memory space of the search. The running time may very well be longer than the trie structure.
4) When confirming a word exist in your word list, you should, of course, use a binary search.
Choosing m and k
There is some math involved in choosing m and k. I'll spare you the finer details and suggest you look at wiki section about choosing optimal sizes.
answered Oct 4 at 14:37
Zack
1,736312
1,736312
add a comment |
add a comment |
Thanks for contributing an answer to Code Review Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
Use MathJax to format equations. MathJax reference.
To learn more, see our tips on writing great answers.
Some of your past answers have not been well-received, and you're in danger of being blocked from answering.
Please pay close attention to the following guidance:
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f203441%2ftrie-structure-using-hash-in-ruby%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown