Cosine Similarity on Huge Dataset

I have a very large data file full of movie ratings that I am looking at for work. I want to process it in a clean and efficient manner. Each line of the ratings file contains the following columns:

userID, movieID, rating...

I have parsed the files, and I am now trying to compute the cosine similarity over all 100,000 ratings for each movie. To do that, I'm using a HashMap to store the ratings of each movie. For each of the 1,000 or so movies, I need to compute the cosine similarity. This is what I have done so far. What do you think?

import java.util.*;
import java.io.*;

public class MovieRatingParser {
    static HashMap<String, Double> ratings = new HashMap<>();

    public void parseMovieFile() throws FileNotFoundException, IOException {
        //Create an ArrayList to store movies
        ArrayList<Movie> movies = new ArrayList<Movie>();
        try {
            //Create a buffered file reader for FileReader to read in movies.dat
            BufferedReader br = new BufferedReader(new FileReader("movies.dat"));

            String readFile = br.readLine();
            while (readFile != null) {
                //Use String split delimiter to load each movie one by one
                //File delimiter is "\|"
                String tokenDelimiter = readFile.split("\|");
                String movieID = tokenDelimiter[0];
                String movieTitle = tokenDelimiter[1];

                Movie movieToAdd = new Movie(movieID, movieTitle);
                movies.add(movieToAdd);
                readFile = br.readLine();
            }
            br.close();
        } catch (FileNotFoundException e) {
            System.out.println("file was not Found!");
        }
        System.out.println("==============================================");
    }

    public static void parseRatingFile() throws FileNotFoundException, IOException{
        try {
            BufferedReader br = new BufferedReader(new FileReader("ratings.dat"));
            String readFile = br.readLine();
            while (readFile != null) {
                String tokenDelimiter = readFile.split("\|");
                String userID = tokenDelimiter[0];
                String movieID = tokenDelimiter[1];
                double rating = Double.parseDouble(tokenDelimiter[2]);

                ratings.put(movieID, rating);
                readFile = br.readLine();
            }
            br.close();
        } catch (FileNotFoundException e) {
            System.out.println("File was not Found!");
        }
    }

    public static double computeCosineSimilarity(HashMap<String, Double> movieA, HashMap<String, Double> movieB) {
        double dotProduct = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        parseRatingFile();
        for (int j = 0; j < ratings.size(); j++) {
            movieA.put(ratings.get(3), ratings.values());
        }

        for (int i = 0; i < movieA.size(); i++) {
            dotProduct += movieA[i] * movieB[i];
            normA += Math.pow(movieA[i], 2);
            normB += Math.pow(movieB[i], 2);
        }

        return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}

What can I do to improve the code? It looks very sloppy.

edited May 3 '17 at 12:50 by 200_success
asked May 3 '17 at 12:44 by Al-geBra


Math.sqrt(normA) * Math.sqrt(normB) == Math.sqrt(normA * normB)
– coderodde
May 3 '17 at 13:04

How exactly do you plan to use the method computeCosineSimilarity? I mostly find it strange that you would call parseRatingFile() each time but store the result in a static list. Also, my IDE complains that the types are incorrect on the line movieA.put(ratings.get(3),ratings.values()); Are you sure this code works?
– Imus
May 3 '17 at 14:14

Not sure what the purpose is. What does the cosine similarity of all ratings for a movie represent?
– paparazzo
Jun 1 at 12:45

java clustering data-mining


1 Answer

I'm not familiar with the algorithm you've implemented, so I cannot suggest improvements there. But some things in the code can be enhanced.

Use informative error messages. For instance, instead of:

    ...
    } catch (FileNotFoundException e) {
        System.out.println("file was not Found!");
    }
    ...

consider something like:

    ...
    } catch (FileNotFoundException e) {
        String detailedMessage = String.format(
                "File [%s] was not found. Reason was [%s]!", "movies.dat", e.getMessage());
        // BTW "movies.dat" can be extracted into a constant.
        System.out.println(detailedMessage);
    }
    ...

In the latter snippet the error message includes detailed information about what really happened. Note also the square brackets that surround the variable data: such delimiters not only help you spot corner cases in the log (for example, when an empty input file name was specified by mistake) but also make grep (or any other text search) more effective.

Consider try-with-resources. It reduces the amount of boilerplate code when dealing with readers, and it closes the reader even when an exception is thrown.
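As a minimal sketch of what that could look like (the class name LineReaderSketch and the method readAllLines are hypothetical, not part of the original code), the reading loop can be wrapped in a try-with-resources block so that no explicit br.close() is needed. The method takes any Reader, so a StringReader stands in for new FileReader("movies.dat") in the demo:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

final class LineReaderSketch {

    // Hypothetical helper: collects every line from the given source.
    // The BufferedReader (and the wrapped Reader) is closed automatically
    // when the try block exits, whether normally or via an exception.
    static List<String> readAllLines(Reader source) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            while ((line = br.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // In the real program this would be new FileReader("movies.dat").
        List<String> lines = readAllLines(new StringReader("1|Toy Story\n2|GoldenEye\n"));
        System.out.println(lines);
    }
}
```

Passing the Reader in from outside also makes the method easy to test without touching the file system.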

Move parsing logic, e.g.:

    ...
    String tokenDelimiter = readFile.split("\|");
    String userID = tokenDelimiter[0];
    String movieID = tokenDelimiter[1];
    double rating = Double.parseDouble(tokenDelimiter[2]);
    ...

into a separate helper method, as is already done for computeCosineSimilarity().
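One possible shape for such a helper is sketched below. It assumes the pipe-delimited layout userID|movieID|rating implied by the question's code; the names RatingLineParser, Rating and parseRatingLine are hypothetical. Note that split returns a String[] rather than a String, and that the pipe has to be escaped because it is a regex metacharacter:

```java
import java.util.regex.Pattern;

final class RatingLineParser {

    // Hypothetical value type holding one parsed line of ratings.dat.
    static final class Rating {
        final String userId;
        final String movieId;
        final double value;

        Rating(String userId, String movieId, double value) {
            this.userId = userId;
            this.movieId = movieId;
            this.value = value;
        }
    }

    // Compiled once; "\\|" matches a literal pipe character.
    private static final Pattern DELIMITER = Pattern.compile("\\|");

    // Parses a single line, assumed to look like "userID|movieID|rating".
    // Extra trailing columns are ignored; too few columns is an error.
    static Rating parseRatingLine(String line) {
        String[] tokens = DELIMITER.split(line);
        if (tokens.length < 3) {
            throw new IllegalArgumentException(
                    String.format("Malformed rating line [%s]", line));
        }
        return new Rating(tokens[0], tokens[1], Double.parseDouble(tokens[2]));
    }

    public static void main(String[] args) {
        Rating r = parseRatingLine("42|7|3.5");
        System.out.println(r.userId + " rated movie " + r.movieId + ": " + r.value);
    }
}
```

With the parsing isolated like this, the file-reading loop only has to call the helper once per line, and malformed input is reported with the offending line in brackets.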

Once all these "little" improvements are done, you will see the code more clearly. Then you can concentrate on the algorithm (that is, on the pure logic), add checks for corner cases (such as an empty input file), use strict math for floating-point numbers, handle the encoding of input files gracefully, improve overall processing speed for large files, and so on.
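For the algorithm itself, here is a hedged sketch of one common way to compute cosine similarity between two movies. It rests on assumptions the question does not confirm: each movie is represented as a map from userID to that user's rating, a missing rating counts as zero, and a movie with no ratings gets similarity 0.0 by convention. The class and method names are hypothetical. It uses a single square root, following the identity in coderodde's comment:

```java
import java.util.HashMap;
import java.util.Map;

final class CosineSimilaritySketch {

    // Sum of squared ratings, i.e. the squared Euclidean norm of the vector.
    private static double sumOfSquares(Map<String, Double> ratings) {
        double sum = 0.0;
        for (double value : ratings.values()) {
            sum += value * value;
        }
        return sum;
    }

    // Cosine similarity of two sparse rating vectors keyed by userID.
    // Only users who rated both movies contribute to the dot product.
    static double cosineSimilarity(Map<String, Double> movieA, Map<String, Double> movieB) {
        double dotProduct = 0.0;
        for (Map.Entry<String, Double> entry : movieA.entrySet()) {
            Double other = movieB.get(entry.getKey());
            if (other != null) {
                dotProduct += entry.getValue() * other;
            }
        }
        double normA = sumOfSquares(movieA);
        double normB = sumOfSquares(movieB);
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // assumed convention for a movie without ratings
        }
        return dotProduct / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        Map<String, Double> movieA = new HashMap<>();
        movieA.put("user1", 5.0);
        movieA.put("user2", 3.0);
        Map<String, Double> movieB = new HashMap<>();
        movieB.put("user1", 4.0);
        movieB.put("user3", 2.0);
        System.out.println(cosineSimilarity(movieA, movieB));
    }
}
```

The method takes its inputs as parameters and does no file I/O, so the ratings file would be parsed once up front rather than on every call.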

answered Jun 5 '17 at 14:59 by flaz14
