Cosine Similarity on Huge Dataset
I have a very large data file of movie ratings that I am analyzing for work, and I want to process it cleanly and efficiently. Each line of the ratings file contains the following columns:
userID, movieID, rating...
I have parsed the files, and I am now trying to compute the cosine similarity of all 100,000 ratings for each movie. I'm using a HashMap to store the ratings of each movie, as shown below. For each of the 1,000 or so movies, I need to compute the cosine similarity. This is what I have done so far; what do you think?
import java.util.*;
import java.io.*;
public class MovieRatingParser {
    static HashMap<String, Double> ratings = new HashMap<>();
    public void parseMovieFile() throws FileNotFoundException, IOException {
        //Create an ArrayList to store movies
        ArrayList<Movie> movies = new ArrayList<Movie>();
        try {
            //Create a buffered file reader for FileReader to read in movies.dat
            BufferedReader br = new BufferedReader(new FileReader("movies.dat"));
            String readFile = br.readLine();
            while (readFile != null) {
                //Use String split delimiter to load each movie one by one
                //File delimiter is "\|"
                String tokenDelimiter = readFile.split("\|");
                String movieID = tokenDelimiter[0];
                String movieTitle = tokenDelimiter[1];
                Movie movieToAdd = new Movie(movieID, movieTitle);
                movies.add(movieToAdd);
                readFile = br.readLine();
            }
            br.close();
        } catch (FileNotFoundException e) {
            System.out.println("file was not Found!");
        }
        System.out.println("==============================================");
    }
    public static void parseRatingFile() throws FileNotFoundException, IOException{
        try {
            BufferedReader br = new BufferedReader(new FileReader("ratings.dat"));
            String readFile = br.readLine();
            while (readFile != null) {
                String tokenDelimiter = readFile.split("\|");
                String userID = tokenDelimiter[0];
                String movieID = tokenDelimiter[1];
                double rating = Double.parseDouble(tokenDelimiter[2]);
                ratings.put(movieID, rating);
                readFile = br.readLine();
            }
            br.close();
        } catch (FileNotFoundException e) {
            System.out.println("File was not Found!");
        }
    }
    public static double computeCosineSimilarity(HashMap<String, Double> movieA, HashMap<String, Double> movieB) {
        double dotProduct = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        parseRatingFile();
        for (int j = 0; j < ratings.size(); j++) {
            movieA.put(ratings.get(3), ratings.values());
        }
        for (int i = 0; i < movieA.size(); i++) {
            dotProduct += movieA[i] * movieB[i];
            normA += Math.pow(movieA[i], 2);
            normB += Math.pow(movieB[i], 2);
        }
        return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
What can I do to improve the code? It looks very sloppy.
java clustering data-mining
Math.sqrt(normA) * Math.sqrt(normB) == Math.sqrt(normA * normB)
– coderodde
May 3 '17 at 13:04
How exactly do you plan to use the method computeCosineSimilarity? I mostly find it strange that you would call parseRatingFile() each time but store the result in a static list. Also, my IDE complains that the types are incorrect on the line movieA.put(ratings.get(3), ratings.values()); You sure this code works?
– Imus
May 3 '17 at 14:14
Not sure what the purpose is. What does cosine similarity of all ratings for a movie represent?
– paparazzo
Jun 1 at 12:45
edited May 3 '17 at 12:50 by 200_success
asked May 3 '17 at 12:44 by Al-geBra
1 Answer
I'm not familiar with the algorithm you've implemented, so I cannot suggest improvements there. But some things in the code can be improved.
Use informative error messages. For instance, instead of:
...
} catch (FileNotFoundException e) {
    System.out.println("file was not Found!");
}
...
consider something like:
...
} catch (FileNotFoundException e) {
    String detailedMessage =
        String.format("File [%s] was not found. Reason was [%s]!", "movies.dat", e.getMessage());
    // BTW "movies.dat" can be extracted into a constant.
    System.out.println(detailedMessage);
}
...
In the latter snippet, the error message includes detailed information about what actually happened. Note the square brackets that surround the variable data: such delimiters not only make corner cases visible in the log (for example, when an empty input file name was specified by mistake) but also make grep (or any other text search) more effective.
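As a rough illustration of the point about brackets, here is a minimal self-contained sketch. The class name ErrorMessages, the constant MOVIES_FILE and the method fileNotFound are hypothetical names chosen for this example; they are not part of the original code.

```java
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;

final class ErrorMessages {
    // Hypothetical constant; the original code hard-codes the file name inline.
    static final String MOVIES_FILE = "movies.dat";

    // Builds the message. The brackets delimit the variable parts, so an
    // empty file name shows up as "File []" instead of silently vanishing.
    static String fileNotFound(String fileName, String reason) {
        return String.format("File [%s] was not found. Reason was [%s]!", fileName, reason);
    }

    public static void main(String[] args) {
        try (FileReader reader = new FileReader(MOVIES_FILE)) {
            System.out.println("Opened " + MOVIES_FILE);
        } catch (FileNotFoundException e) {
            System.err.println(fileNotFound(MOVIES_FILE, e.getMessage()));
        } catch (IOException e) {
            // Closing the reader can fail with a plain IOException.
            System.err.println("Could not close [" + MOVIES_FILE + "]: " + e.getMessage());
        }
    }
}
```

With an empty file name the message reads "File [] was not found. ...", which is easy to spot and easy to search for.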
Consider try-with-resources. That will reduce the amount of boilerplate code when dealing with readers, and it closes the reader even when an exception is thrown.
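A minimal sketch of what that could look like. The class LineReading and the method readLines are hypothetical names, and a StringReader stands in for the FileReader so the example runs without any data file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

final class LineReading {
    // Reads all lines from the given source. The BufferedReader (and the
    // wrapped source) is closed automatically when the try block exits,
    // whether normally or because of an exception.
    static List<String> readLines(Reader source) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            while ((line = br.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // In the real program this would be new FileReader("ratings.dat").
        List<String> lines = readLines(new StringReader("1|10|4.0\n2|10|3.5\n"));
        System.out.println(lines.size() + " lines read");
    }
}
```

There is no explicit br.close() call anymore, so it cannot be skipped by an early return or an exception.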
Move the parsing logic, e.g.:
...
String tokenDelimiter = readFile.split("\|");
String userID = tokenDelimiter[0];
String movieID = tokenDelimiter[1];
double rating = Double.parseDouble(tokenDelimiter[2]);
...
into a separate helper method, as is already done for computeCosineSimilarity().
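One possible shape for such a helper is sketched below. The names RatingLineParser, Rating and parseRatingLine are hypothetical. Note that the sketch declares the split result as String[] and escapes the delimiter as "\\|", because "\|" is not a valid escape sequence in a Java string literal and String.split returns an array.

```java
final class RatingLineParser {
    // Hypothetical value type holding one parsed line of the ratings file.
    static final class Rating {
        final String userID;
        final String movieID;
        final double rating;

        Rating(String userID, String movieID, double rating) {
            this.userID = userID;
            this.movieID = movieID;
            this.rating = rating;
        }
    }

    // Parses one "userID|movieID|rating" line. The pipe is a regex
    // metacharacter, so it is escaped in the pattern passed to split.
    static Rating parseRatingLine(String line) {
        String[] tokens = line.split("\\|");
        if (tokens.length < 3) {
            throw new IllegalArgumentException(
                String.format("Malformed rating line [%s]", line));
        }
        // Double.parseDouble throws NumberFormatException on a bad number.
        return new Rating(tokens[0], tokens[1], Double.parseDouble(tokens[2]));
    }

    public static void main(String[] args) {
        Rating r = parseRatingLine("7|42|3.5");
        System.out.println(r.userID + " rated " + r.movieID + " with " + r.rating);
    }
}
```

The reading loop then shrinks to one call per line, and the parsing rules can be tested without touching the file system.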
Once all these "little" improvements are done, you will see the code more clearly. Then you can concentrate on the algorithm (i.e. on the pure logic), add checks for corner cases (like an empty input file), use strict math for floating-point numbers, handle the encoding of input files gracefully, improve overall processing speed for large files, etc.
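For the algorithm itself, the following is only a sketch under stated assumptions, not a reviewed implementation. It assumes each movie is represented as a sparse vector (a map from userID to rating), that a user who has not rated a movie counts as 0, and that a movie without ratings has similarity 0.0 to everything. The names CosineSimilaritySketch, addRating and cosine are hypothetical. It also uses the identity from the comments, Math.sqrt(normA * normB).

```java
import java.util.HashMap;
import java.util.Map;

final class CosineSimilaritySketch {
    // Groups ratings as movieID -> (userID -> rating), so that several
    // ratings for the same movie are kept instead of overwriting each other.
    static void addRating(Map<String, Map<String, Double>> byMovie,
                          String userID, String movieID, double rating) {
        byMovie.computeIfAbsent(movieID, k -> new HashMap<>()).put(userID, rating);
    }

    // Cosine similarity of two sparse rating vectors keyed by userID.
    // Only users present in both maps contribute to the dot product.
    static double cosine(Map<String, Double> movieA, Map<String, Double> movieB) {
        double dotProduct = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Double> entry : movieA.entrySet()) {
            double a = entry.getValue();
            normA += a * a;
            Double b = movieB.get(entry.getKey());
            if (b != null) {
                dotProduct += a * b;
            }
        }
        for (double b : movieB.values()) {
            normB += b * b;
        }
        // Assumed convention: a zero vector is similar to nothing.
        // This also avoids a division by zero.
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dotProduct / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        Map<String, Map<String, Double>> byMovie = new HashMap<>();
        addRating(byMovie, "u1", "m1", 3.0);
        addRating(byMovie, "u2", "m1", 4.0);
        addRating(byMovie, "u1", "m2", 4.0);
        addRating(byMovie, "u2", "m2", 3.0);
        System.out.println(cosine(byMovie.get("m1"), byMovie.get("m2")));
    }
}
```

With this layout the ratings file is parsed once, and comparing every pair among roughly 1,000 movies takes about 500,000 calls to cosine.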
answered Jun 5 '17 at 14:59 by flaz14