Trie structure using Hash in Ruby











up vote
0
down vote

favorite












This is a follow-up to Boggle board solver in Ruby



class Trie < Hash
def build(string)
string << '.' # Terminator indicating end of a complete word
string.chars.inject(self) do |h, char|
h[char] ||= {}
end
end
end

# Build the trie from all the words
@trie = Trie.new
File.foreach('wordlist.txt') { |w| @trie.build(w.chomp) }


Here I use a simplistic Trie structure based on a Hash in Ruby. It does the trick very nicely, but I am wondering whether there is a better way to implement this in Ruby that will consume less memory.
(It lacks lookup methods which is perfectly fine, because I track through the Trie along with the path through letters on a board - the '.' acts as a marker for a complete word).



When I load the file, which contains 270K+ words and is ~2.9MB in size, the memory consumption is over 188 megabytes.



Is there a way to reduce the memory consumption, perhaps, implementing this differently?










share|improve this question














bumped to the homepage by Community yesterday


This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.



















    up vote
    0
    down vote

    favorite












    This is a follow-up to Boggle board solver in Ruby



    class Trie < Hash
    def build(string)
    string << '.' # Terminator indicating end of a complete word
    string.chars.inject(self) do |h, char|
    h[char] ||= {}
    end
    end
    end

    # Build the trie from all the words
    @trie = Trie.new
    File.foreach('wordlist.txt') { |w| @trie.build(w.chomp) }


    Here I use a simplistic Trie structure based on a Hash in Ruby. It does the trick very nicely, but I am wondering whether there is a better way to implement this in Ruby that will consume less memory.
    (It lacks lookup methods which is perfectly fine, because I track through the Trie along with the path through letters on a board - the '.' acts as a marker for a complete word).



    When I load the file, which contains 270K+ words and is ~2.9MB in size, the memory consumption is over 188 megabytes.



    Is there a way to reduce the memory consumption, perhaps, implementing this differently?










    share|improve this question














    bumped to the homepage by Community yesterday


    This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.

















      up vote
      0
      down vote

      favorite









      up vote
      0
      down vote

      favorite











      This is a follow-up to Boggle board solver in Ruby



      class Trie < Hash
      def build(string)
      string << '.' # Terminator indicating end of a complete word
      string.chars.inject(self) do |h, char|
      h[char] ||= {}
      end
      end
      end

      # Build the trie from all the words
      @trie = Trie.new
      File.foreach('wordlist.txt') { |w| @trie.build(w.chomp) }


      Here I use a simplistic Trie structure based on a Hash in Ruby. It does the trick very nicely, but I am wondering whether there is a better way to implement this in Ruby that will consume less memory.
      (It lacks lookup methods which is perfectly fine, because I track through the Trie along with the path through letters on a board - the '.' acts as a marker for a complete word).



      When I load the file, which contains 270K+ words and is ~2.9MB in size, the memory consumption is over 188 megabytes.



      Is there a way to reduce the memory consumption, perhaps, implementing this differently?










      share|improve this question













      This is a follow-up to Boggle board solver in Ruby



      class Trie < Hash
      def build(string)
      string << '.' # Terminator indicating end of a complete word
      string.chars.inject(self) do |h, char|
      h[char] ||= {}
      end
      end
      end

      # Build the trie from all the words
      @trie = Trie.new
      File.foreach('wordlist.txt') { |w| @trie.build(w.chomp) }


      Here I use a simplistic Trie structure based on a Hash in Ruby. It does the trick very nicely, but I am wondering whether there is a better way to implement this in Ruby that will consume less memory.
      (It lacks lookup methods which is perfectly fine, because I track through the Trie along with the path through letters on a board - the '.' acts as a marker for a complete word).



      When I load the file, which contains 270K+ words and is ~2.9MB in size, the memory consumption is over 188 megabytes.



      Is there a way to reduce the memory consumption, perhaps, implementing this differently?







      ruby memory-optimization trie






      share|improve this question













      share|improve this question











      share|improve this question




      share|improve this question










      asked Sep 10 at 5:18









      mydoghasworms

      1214




      1214





      bumped to the homepage by Community yesterday


      This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.







      bumped to the homepage by Community yesterday


      This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.
























          1 Answer
          1






          active

          oldest

          votes

















          up vote
          0
          down vote













          Bloom Filters



          I suggest a new data structure, specifically a Bloom Filter. A bloom filter is a boolean array of length m. You will also need k different hashing algorithms (more on m and k in a moment). Each hash should accept complete word and return a number 0 through m.



          A bloom filter is a big hashing data structure that returns one of two results when checking if an item is in the hash:
          1) The item is definitely not in the data structure
          2) The item is maybe in the data structure



          To use the bloom filter, you will load it with every word in your list. Specifically, for every word, load the first two letters, then first three, then so on, until the whole word is loaded. I like the idea of using a period to mark the end of the word, so do that too. For example, for the word 'cart', you would load your filter with ['ca', car', 'cart'].



          Then, on the boggle array, you can do an exhaustive DFS search fromevery starting position. if the letter combination (with or without period) is in the list, then you keep branching and searching. If the combination isn't in the list, you choose a different path, etc.



          Simple concrete example



          Here is a simple example of a bloom filter being used. Let's set m=100 (array of 100 booleans, initialized to true) and k=2 (use two different hashing algorithms).



          When hashing the word 'cat', hash1 might return 7 and hash2 might return 36, so you set indexes 7 and 36 to false. When hashing the word 'dog', hash1 might return 7 and hash2 might return 70, so those indices are set to false. Next, we check to see if 'fish' is in the array. Hash1 returns 80 and hash2 returns 7. Because all the indices are not false, we know that 'fish' is definitely not in the array. Finally, we check if 'bird' is in the array. Hash1 returns 70 and Hash2 returns 7. 'Bird' might be in the array and we will have to check the actual wordlist to confirm.



          Some notes



          1) You should be able to search using only the bloom filter until you are checking for a complete word (For example when checking for 'car.', and not checking 'car' as a substring of 'cart.' or 'carriage.'). Otherwise if the bloom filter says it's probably in the list, just assume it's true.

          2) Depending on how thorough your edge cases are, it may be possible to occasionally miss a possible word depending on the hash algoirthms chosen, size of filter, etc. In general, I would say that while you are not 100% guaranteed to find the highest scoring word, you will definitely find a high scoring word.
          3) The bloom filter here is chosen to reduce the memory space of the search. The running time may very well be longer than the trie structure.
          4) When confirming a word exist in your word list, you should, of course, use a binary search.



          Choosing m and k



          There is some math involved in choosing m and k. I'll spare you the finer details and suggest you look at wiki section about choosing optimal sizes.






          share|improve this answer





















            Your Answer





            StackExchange.ifUsing("editor", function () {
            return StackExchange.using("mathjaxEditing", function () {
            StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
            StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
            });
            });
            }, "mathjax-editing");

            StackExchange.ifUsing("editor", function () {
            StackExchange.using("externalEditor", function () {
            StackExchange.using("snippets", function () {
            StackExchange.snippets.init();
            });
            });
            }, "code-snippets");

            StackExchange.ready(function() {
            var channelOptions = {
            tags: "".split(" "),
            id: "196"
            };
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function() {
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled) {
            StackExchange.using("snippets", function() {
            createEditor();
            });
            }
            else {
            createEditor();
            }
            });

            function createEditor() {
            StackExchange.prepareEditor({
            heartbeatType: 'answer',
            convertImagesToLinks: false,
            noModals: true,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: null,
            bindNavPrevention: true,
            postfix: "",
            imageUploader: {
            brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
            contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
            allowUrls: true
            },
            onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            });


            }
            });














            draft saved

            draft discarded


















            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f203441%2ftrie-structure-using-hash-in-ruby%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown

























            1 Answer
            1






            active

            oldest

            votes








            1 Answer
            1






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes








            up vote
            0
            down vote













            Bloom Filters



            I suggest a new data structure, specifically a Bloom Filter. A bloom filter is a boolean array of length m. You will also need k different hashing algorithms (more on m and k in a moment). Each hash should accept complete word and return a number 0 through m.



            A bloom filter is a big hashing data structure that returns one of two results when checking if an item is in the hash:
            1) The item is definitely not in the data structure
            2) The item is maybe in the data structure



            To use the bloom filter, you will load it with every word in your list. Specifically, for every word, load the first two letters, then first three, then so on, until the whole word is loaded. I like the idea of using a period to mark the end of the word, so do that too. For example, for the word 'cart', you would load your filter with ['ca', car', 'cart'].



            Then, on the boggle array, you can do an exhaustive DFS search fromevery starting position. if the letter combination (with or without period) is in the list, then you keep branching and searching. If the combination isn't in the list, you choose a different path, etc.



            Simple concrete example



            Here is a simple example of a bloom filter being used. Let's set m=100 (array of 100 booleans, initialized to true) and k=2 (use two different hashing algorithms).



            When hashing the word 'cat', hash1 might return 7 and hash2 might return 36, so you set indexes 7 and 36 to false. When hashing the word 'dog', hash1 might return 7 and hash2 might return 70, so those indices are set to false. Next, we check to see if 'fish' is in the array. Hash1 returns 80 and hash2 returns 7. Because all the indices are not false, we know that 'fish' is definitely not in the array. Finally, we check if 'bird' is in the array. Hash1 returns 70 and Hash2 returns 7. 'Bird' might be in the array and we will have to check the actual wordlist to confirm.



            Some notes



            1) You should be able to search using only the bloom filter until you are checking for a complete word (For example when checking for 'car.', and not checking 'car' as a substring of 'cart.' or 'carriage.'). Otherwise if the bloom filter says it's probably in the list, just assume it's true.

            2) Depending on how thorough your edge cases are, it may be possible to occasionally miss a possible word depending on the hash algoirthms chosen, size of filter, etc. In general, I would say that while you are not 100% guaranteed to find the highest scoring word, you will definitely find a high scoring word.
            3) The bloom filter here is chosen to reduce the memory space of the search. The running time may very well be longer than the trie structure.
            4) When confirming a word exist in your word list, you should, of course, use a binary search.



            Choosing m and k



            There is some math involved in choosing m and k. I'll spare you the finer details and suggest you look at wiki section about choosing optimal sizes.






            share|improve this answer

























              up vote
              0
              down vote













              Bloom Filters



              I suggest a new data structure, specifically a Bloom Filter. A bloom filter is a boolean array of length m. You will also need k different hashing algorithms (more on m and k in a moment). Each hash should accept complete word and return a number 0 through m.



              A bloom filter is a big hashing data structure that returns one of two results when checking if an item is in the hash:
              1) The item is definitely not in the data structure
              2) The item is maybe in the data structure



              To use the bloom filter, you will load it with every word in your list. Specifically, for every word, load the first two letters, then first three, then so on, until the whole word is loaded. I like the idea of using a period to mark the end of the word, so do that too. For example, for the word 'cart', you would load your filter with ['ca', car', 'cart'].



              Then, on the boggle array, you can do an exhaustive DFS search fromevery starting position. if the letter combination (with or without period) is in the list, then you keep branching and searching. If the combination isn't in the list, you choose a different path, etc.



              Simple concrete example



              Here is a simple example of a bloom filter being used. Let's set m=100 (array of 100 booleans, initialized to true) and k=2 (use two different hashing algorithms).



              When hashing the word 'cat', hash1 might return 7 and hash2 might return 36, so you set indexes 7 and 36 to false. When hashing the word 'dog', hash1 might return 7 and hash2 might return 70, so those indices are set to false. Next, we check to see if 'fish' is in the array. Hash1 returns 80 and hash2 returns 7. Because all the indices are not false, we know that 'fish' is definitely not in the array. Finally, we check if 'bird' is in the array. Hash1 returns 70 and Hash2 returns 7. 'Bird' might be in the array and we will have to check the actual wordlist to confirm.



              Some notes



              1) You should be able to search using only the bloom filter until you are checking for a complete word (For example when checking for 'car.', and not checking 'car' as a substring of 'cart.' or 'carriage.'). Otherwise if the bloom filter says it's probably in the list, just assume it's true.

              2) Depending on how thorough your edge cases are, it may be possible to occasionally miss a possible word depending on the hash algoirthms chosen, size of filter, etc. In general, I would say that while you are not 100% guaranteed to find the highest scoring word, you will definitely find a high scoring word.
              3) The bloom filter here is chosen to reduce the memory space of the search. The running time may very well be longer than the trie structure.
              4) When confirming a word exist in your word list, you should, of course, use a binary search.



              Choosing m and k



              There is some math involved in choosing m and k. I'll spare you the finer details and suggest you look at wiki section about choosing optimal sizes.






              share|improve this answer























                up vote
                0
                down vote










                up vote
                0
                down vote









                Bloom Filters



                I suggest a new data structure, specifically a Bloom Filter. A bloom filter is a boolean array of length m. You will also need k different hashing algorithms (more on m and k in a moment). Each hash should accept complete word and return a number 0 through m.



                A bloom filter is a big hashing data structure that returns one of two results when checking if an item is in the hash:
                1) The item is definitely not in the data structure
                2) The item is maybe in the data structure



                To use the bloom filter, you will load it with every word in your list. Specifically, for every word, load the first two letters, then first three, then so on, until the whole word is loaded. I like the idea of using a period to mark the end of the word, so do that too. For example, for the word 'cart', you would load your filter with ['ca', car', 'cart'].



                Then, on the boggle array, you can do an exhaustive DFS search fromevery starting position. if the letter combination (with or without period) is in the list, then you keep branching and searching. If the combination isn't in the list, you choose a different path, etc.



                Simple concrete example



                Here is a simple example of a bloom filter being used. Let's set m=100 (array of 100 booleans, initialized to true) and k=2 (use two different hashing algorithms).



                When hashing the word 'cat', hash1 might return 7 and hash2 might return 36, so you set indexes 7 and 36 to false. When hashing the word 'dog', hash1 might return 7 and hash2 might return 70, so those indices are set to false. Next, we check to see if 'fish' is in the array. Hash1 returns 80 and hash2 returns 7. Because all the indices are not false, we know that 'fish' is definitely not in the array. Finally, we check if 'bird' is in the array. Hash1 returns 70 and Hash2 returns 7. 'Bird' might be in the array and we will have to check the actual wordlist to confirm.



                Some notes



                1) You should be able to search using only the bloom filter until you are checking for a complete word (For example when checking for 'car.', and not checking 'car' as a substring of 'cart.' or 'carriage.'). Otherwise if the bloom filter says it's probably in the list, just assume it's true.

                2) Depending on how thorough your edge cases are, it may be possible to occasionally miss a possible word depending on the hash algoirthms chosen, size of filter, etc. In general, I would say that while you are not 100% guaranteed to find the highest scoring word, you will definitely find a high scoring word.
                3) The bloom filter here is chosen to reduce the memory space of the search. The running time may very well be longer than the trie structure.
                4) When confirming a word exist in your word list, you should, of course, use a binary search.



                Choosing m and k



                There is some math involved in choosing m and k. I'll spare you the finer details and suggest you look at wiki section about choosing optimal sizes.






                share|improve this answer












                Bloom Filters



                I suggest a new data structure, specifically a Bloom Filter. A bloom filter is a boolean array of length m. You will also need k different hashing algorithms (more on m and k in a moment). Each hash should accept complete word and return a number 0 through m.



                A bloom filter is a big hashing data structure that returns one of two results when checking if an item is in the hash:
                1) The item is definitely not in the data structure
                2) The item is maybe in the data structure



                To use the bloom filter, you will load it with every word in your list. Specifically, for every word, load the first two letters, then first three, then so on, until the whole word is loaded. I like the idea of using a period to mark the end of the word, so do that too. For example, for the word 'cart', you would load your filter with ['ca', car', 'cart'].



                Then, on the boggle array, you can do an exhaustive DFS search fromevery starting position. if the letter combination (with or without period) is in the list, then you keep branching and searching. If the combination isn't in the list, you choose a different path, etc.



                Simple concrete example



                Here is a simple example of a bloom filter being used. Let's set m=100 (array of 100 booleans, initialized to true) and k=2 (use two different hashing algorithms).



                When hashing the word 'cat', hash1 might return 7 and hash2 might return 36, so you set indexes 7 and 36 to false. When hashing the word 'dog', hash1 might return 7 and hash2 might return 70, so those indices are set to false. Next, we check to see if 'fish' is in the array. Hash1 returns 80 and hash2 returns 7. Because all the indices are not false, we know that 'fish' is definitely not in the array. Finally, we check if 'bird' is in the array. Hash1 returns 70 and Hash2 returns 7. 'Bird' might be in the array and we will have to check the actual wordlist to confirm.



                Some notes



                1) You should be able to search using only the bloom filter until you are checking for a complete word (For example when checking for 'car.', and not checking 'car' as a substring of 'cart.' or 'carriage.'). Otherwise if the bloom filter says it's probably in the list, just assume it's true.

                2) Depending on how thorough your edge cases are, it may be possible to occasionally miss a possible word depending on the hash algoirthms chosen, size of filter, etc. In general, I would say that while you are not 100% guaranteed to find the highest scoring word, you will definitely find a high scoring word.
                3) The bloom filter here is chosen to reduce the memory space of the search. The running time may very well be longer than the trie structure.
                4) When confirming a word exist in your word list, you should, of course, use a binary search.



                Choosing m and k



                There is some math involved in choosing m and k. I'll spare you the finer details and suggest you look at wiki section about choosing optimal sizes.







                share|improve this answer












                share|improve this answer



                share|improve this answer










                answered Oct 4 at 14:37









                Zack

                1,736312




                1,736312






























                    draft saved

                    draft discarded




















































                    Thanks for contributing an answer to Code Review Stack Exchange!


                    • Please be sure to answer the question. Provide details and share your research!

                    But avoid



                    • Asking for help, clarification, or responding to other answers.

                    • Making statements based on opinion; back them up with references or personal experience.


                    Use MathJax to format equations. MathJax reference.


                    To learn more, see our tips on writing great answers.





                    Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


                    Please pay close attention to the following guidance:


                    • Please be sure to answer the question. Provide details and share your research!

                    But avoid



                    • Asking for help, clarification, or responding to other answers.

                    • Making statements based on opinion; back them up with references or personal experience.


                    To learn more, see our tips on writing great answers.




                    draft saved


                    draft discarded














                    StackExchange.ready(
                    function () {
                    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f203441%2ftrie-structure-using-hash-in-ruby%23new-answer', 'question_page');
                    }
                    );

                    Post as a guest















                    Required, but never shown





















































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown

































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown







                    Popular posts from this blog

                    Morgemoulin

                    Scott Moir

                    Souastre