How to simulate data to be statistically significant?
I am in 10th grade and I am looking to simulate data for a machine learning science fair project. The final model will be used on patient data and will predict the relationship between the time of week and medication adherence within the data of a single patient. Adherence values will be binary (0 means they did not take the medicine, 1 means they did). I am looking to create a machine learning model that can learn this relationship, so I have separated the week into 21 time slots, three for each day (1 is Monday morning, 2 is Monday afternoon, etc.). I am looking to simulate 1,000 patients' worth of data, with 30 weeks of data per patient. I want to insert certain trends associating a time of week with adherence; for example, in one data set I may say that time slot 7 of the week has a statistically significant relationship with adherence. For me to determine whether the relationship is statistically significant, I would perform a two-sample t-test comparing one time slot to each of the others and check that the significance value is less than 0.05.
However, rather than simulating my own data and then checking whether the trends I inserted are significant, I would rather work backwards: use a program that I could ask to assign a certain time slot a significant trend with adherence, and which would return binary data containing that trend, along with binary data for the other time slots that contains some noise but no statistically significant trend.
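For concreteness, here is a rough sketch of the kind of generator I have in mind (the probabilities are made up, and it assumes numpy and scipy are available):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_patients, n_weeks, n_slots = 1000, 30, 21
target_slot = 7     # the slot given a planted relationship with adherence
baseline_p = 0.60   # made-up adherence probability for ordinary slots
target_p = 0.85     # made-up boosted probability for the target slot

# per-slot adherence probabilities, with one slot boosted
p = np.full(n_slots, baseline_p)
p[target_slot - 1] = target_p

# adherence[i, j, k] = 1 if patient i took the dose in week j, time slot k
adherence = rng.binomial(1, p, size=(n_patients, n_weeks, n_slots))

# quick check for one patient: compare the target slot against all other slots
patient = adherence[0]                                   # shape (30, 21)
target = patient[:, target_slot - 1]
others = np.delete(patient, target_slot - 1, axis=1).ravel()
print(stats.ttest_ind(target, others, equal_var=False))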
Is there any program that can help me achieve something like this? Or maybe a python module?
Any help whatsoever (even general comments on my project) will be extremely appreciated!!
machine-learning statistical-significance t-test python simulation
This is a great question. And something like this is what most scientists should be doing before applying for a grant, in the study design phase. I see far too often that people collect their data first and try to figure out how to analyze it afterwards, with the result that the statistician may only be able to tell what the experiment died of, in the words of Ronald Fisher.
– Stephan Kolassa
Dec 19 '18 at 8:53
@StephanKolassa However, it is very hard to assess what data will be available in some experiments with human data, and in other settings one uses the data that is available and cannot collect more...
– llrs
Dec 19 '18 at 11:11
@llrs: That is completely correct. And it should of course inform the simulation exercise. Better to think beforehand about what data are available, rather than finding out after the experiment that crucial data cannot be obtained.
– Stephan Kolassa
Dec 19 '18 at 11:46
(+1) I find the vote to close this question somewhat objectionable
– Robert Long
Dec 20 '18 at 18:24
@RobertLong, why do you say that? I ask simply because I want to make sure I am not missing anything in the response that makes it less credible.
– Neelasha Bhattacharjee
Dec 25 '18 at 3:46
asked Dec 19 '18 at 3:38
Neelasha Bhattacharjee
3 Answers
General Comments
"I am in 10th grade and I am looking to simulate data for a machine learning science fair project." Awesome. I did not care at all about math in 10th grade; I think I took something like Algebra 2 that year...? I can't wait until you put me out of a job in a few years! I give some advice below, but: What are you trying to learn from this simulation? What are you already familiar with in statistics and machine learning? Knowing this would help me (and others) put together some more specific help.
Python is a very useful language, but I'm of the opinion that R is better for simulating data. Most of the books/blogs/studies/classes I've come across on simulating data (also what people call "Monte Carlo methods" to be fancy) are in R. The R language is known as being "by statisticians, for statisticians," and most academics who rely on simulation studies to show their methods work use R. A lot of cool functions are in the base R language (that is, no additional packages necessary), such as rnorm for a normal distribution, runif for the uniform distribution, rbeta for the beta distribution, and so on. In R, typing in ?Distributions will show you a help page on them. There are also many other cool packages, like mvtnorm or simstudy, that are useful. I would recommend DataCamp.com for learning R, if you only know Python; I think they are good for getting gently introduced to things.
It seems like you have a lot going on here: you want data that are over time (longitudinal), within-subject (maybe using a multilevel model), and with a seasonal component (perhaps a time series model), all predicting a dichotomous outcome (something like a logistic regression). I think a lot of people starting out with simulation studies (including myself) want to throw a bunch of stuff in at once, but this can be really daunting and complicated. So what I would recommend doing is starting with something simple (maybe making a function or two for generating data) and then building up from there.
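If you do end up staying in Python, the rough numpy counterparts of those base R generators would be something like this (a sketch, not a full workflow):

import numpy as np

rng = np.random.default_rng(1839)

rng.normal(size=5)             # roughly rnorm(5)
rng.uniform(size=5)            # roughly runif(5)
rng.beta(2, 5, size=5)         # roughly rbeta(5, 2, 5)
rng.binomial(1, 0.6, size=5)   # roughly rbinom(5, 1, 0.6)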
Specific Comments
It looks like your basic hypothesis is: "The time of the day predicts whether or not someone adheres to taking their medication." And you would like to create two simulated data sets: one where there is a relationship and one where there is not.
You also mention simulating data to represent multiple observations from the same person. This means that each person would have their own probability of adherence as well as, perhaps, their own slope for the relationship between time of day and probability of adhering. I would suggest looking into "multilevel" or "hierarchical" regression models for this type of relationship, but I think you could start simpler than this.
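A minimal sketch of that idea (in Python, with made-up Beta parameters, just for illustration) would be to give every simulated patient their own adherence probability and then draw their observations from it:

import numpy as np

rng = np.random.default_rng(42)

n_patients, n_obs = 1000, 30
# each patient gets their own baseline adherence probability; Beta(8, 2) has mean 0.8
patient_p = rng.beta(8, 2, size=n_patients)
# 30 binary observations per patient, drawn with that patient's own probability
adherence = rng.binomial(1, patient_p[:, None], size=(n_patients, n_obs))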
Also, you mention a continuous relationship between time and probability of adhering to the medication regimen, which also makes me think time series modeling—specifically looking at seasonal trends—would be helpful for you. This is also simulate-able, but again, I think we can start simpler.
Let's say we have 1000 people, and we measure whether or not they took their medicine only once. We also know if they were assigned to take it in the morning, afternoon, or evening. Let's say that taking the medicine is 1 and not taking it is 0. We can simulate dichotomous data using rbinom, which draws from a binomial distribution. We can set each person to have 1 observation with a given probability. Let's say people are 80% likely to take it in the morning, 50% in the afternoon, and 65% at night. I paste the code below, with some comments after #:
set.seed(1839) # this makes sure the results are replicable when you do it
n <- 1000 # sample size is 1000
times <- c("morning", "afternoon", "evening") # create a vector of times
time <- sample(times, n, TRUE) # create our time variable
# make adherence probabilities based on time
adhere_prob <- ifelse(
time == "morning", .80,
ifelse(
time == "afternoon", .50, .65
)
)
# simulate observations from binomial distribution with those probabilities
adhere <- rbinom(n, 1, adhere_prob)
# run a logistic regression, predicting adherence from time
model <- glm(adhere ~ time, family = binomial)
summary(model)
This summary shows, in part:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.02882 0.10738 0.268 0.78839
timeevening 0.45350 0.15779 2.874 0.00405 **
timemorning 1.39891 0.17494 7.996 1.28e-15 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The Intercept represents the afternoon, and we can see that both the evening and the morning have a significantly higher probability of adhering. There are a lot of details about logistic regression that I can't explain in this post, but t-tests assume that you have a conditionally normally-distributed dependent variable. Logistic regression models are more appropriate when you have dichotomous (0 vs. 1) outcomes like these. Most introductory statistics books will talk about the t-test, and a lot of introductory machine learning books will talk about logistic regression. I think Introduction to Statistical Learning: With Applications in R is great, and the authors posted the whole thing online: https://www-bcf.usc.edu/~gareth/ISL/ISLR%20First%20Printing.pdf
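If you want to mirror this model in Python later (a sketch on my part, assuming pandas and statsmodels are installed), the formula interface in statsmodels fits the same kind of logistic regression:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1839)
n = 1000
time = rng.choice(["morning", "afternoon", "evening"], size=n)
prob = np.select([time == "morning", time == "afternoon"], [0.80, 0.50], default=0.65)
adhere = rng.binomial(1, prob)

df = pd.DataFrame({"adhere": adhere, "time": time})
model = smf.logit("adhere ~ C(time)", data=df).fit()  # afternoon is the reference level
print(model.summary())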
I'm not as sure about good books for simulation studies; I learned just from messing around, reading what other people did, and from a graduate course I took on statistical computing (professor's materials are here: http://pj.freefaculty.org/guides/).
Lastly, you can also simulate having no effect by setting all of the times to have the same probability:
set.seed(1839)
n <- 1000
times <- c("morning", "afternoon", "evening")
time <- sample(times, n, TRUE)
adhere <- rbinom(n, 1, .6) # same for all times
summary(glm(adhere ~ time, binomial))
Which returns:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.40306 0.10955 3.679 0.000234 ***
timeevening -0.06551 0.15806 -0.414 0.678535
timemorning 0.18472 0.15800 1.169 0.242360
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
This shows no significant differences between the times, as we would expect from the probability being the same across times.
Thank you so much for the book recommendation! Just what I needed for holiday reading!
– MD-Tech
Dec 19 '18 at 12:00
Thank you so much for this! I knew that I needed a logistic regression model for the machine learning aspect of my project, but it seems to have an application in simulating the data as well. However, I was under the impression that logistic regression requires the order of the times to matter, but that is not the case here, as each time is a different category with no relation to the others. I came to this conclusion after discussing with my math teacher, but we both could very well be wrong. Could you please clarify why exactly logistic regression can be used here?
– Neelasha Bhattacharjee
Dec 20 '18 at 3:01
@NeelashaBhattacharjee Simulating the data and fitting the logistic regression model are two separate steps—we could have simulated the same data and analyzed it using a contingency table and chi-square statistic if we wanted to. You're correct that the model I fit doesn't encode any order in the times. However, regression models make assumptions on how the dependent variable is distributed, not the independent variables. We could have ordered predictors, continuous predictors, count predictors, etc., and all of them would be fine for logistic regression.
– Mark White
Dec 20 '18 at 4:23
@NeelashaBhattacharjee Logistic regression can be used here since we are modeling a dichotomous dependent variable—that is, one with two and only two possible outcomes. What a logistic regression does is use the "logistic link function" to make all of the predicted values for the regression equation (e.g., b0 + b1 * x) fit between 0 and 1. And we call these numbers the probability that someone has the dependent variable value of 1.
– Mark White
Dec 20 '18 at 4:24
Thank you so much! However, I was wondering how you were able to look at the p-values for the two simulated data sets and determine that one had a significant trend and the other did not. To me, both sets have p-values that vary enough to be significant.
– Neelasha Bhattacharjee
Dec 20 '18 at 19:40
If you already know some Python, then you will definitely be able to achieve what you need using base Python along with numpy and/or pandas. As Mark White suggests, though, a lot of simulation and stats-related stuff is baked into R, so it is definitely worth a look.
Below is a basic framework for how you might approach this using a Python class. You could use np.random.normal to adjust the baseline_adherence of each subject to insert some noise. This gives you a pseudo-random adherence, to which you can add the targeted reduced adherence on specific days.
import pandas as pd
import numpy as np
from itertools import product

class Patient:

    def __init__(self, number, baseline_adherence=0.95):
        self.number = number
        self.baseline_adherence = baseline_adherence
        self.schedule = self.create_schedule()

    def __repr__(self):
        return "I am patient number {}".format(self.number)

    def create_schedule(self):
        # build the 21 weekly time-slot labels (7 days x 3 slots per day)
        time_slots = []
        for (day, time) in product(range(1, 8), range(1, 4)):
            time_slots.append("Day {}; Slot {}".format(day, time))
        week_labels = ["Week {}".format(x) for x in range(1, 31)]

        # one row per week, one column per time slot
        df = pd.DataFrame(np.random.choice([0, 1],
                                           size=(30, 21),
                                           p=(1 - self.baseline_adherence,
                                              self.baseline_adherence)),
                          index=week_labels,
                          columns=time_slots)
        return df

    def targeted_adherence(self, timeslot, adherence=0.8):
        # overwrite one time slot's column with a (typically lower) adherence rate
        if timeslot in self.schedule.columns:
            ad = np.random.choice([0, 1],
                                  size=self.schedule[timeslot].shape,
                                  p=(1 - adherence, adherence))
            self.schedule[timeslot] = ad

sim_patients = [Patient(x) for x in range(10)]
p = sim_patients[0]
p.targeted_adherence("Day 1; Slot 3")
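Continuing from the class above, one way to implement the np.random.normal noise idea mentioned earlier (the standard deviation here is arbitrary) is to jitter each patient's baseline before constructing them:

# vary each patient's baseline adherence slightly, clipped to stay a valid probability
noisy_patients = [
    Patient(x, baseline_adherence=float(np.clip(np.random.normal(0.95, 0.02), 0.0, 1.0)))
    for x in range(10)
]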
This is a great project. There is a statistical challenge lurking in projects like this, and your method of using simulated data is a great way of assessing it.
Do you have an a priori hypothesis, e.g. "people are more forgetful in the evening"? In that case, a statistical test that compares the frequency of forgetting in the evening with that in the morning will test it. This is a Bernoulli distribution, as previous responders said.
The other approach is to trawl your data to find out which time slot has the highest rate of failure. There is bound to be one, so the question is "is this just a chance result?". The threshold for significance is higher in this case. If you want to read up about this, search for "false discovery rate".
In your case the system is simple enough that you can calculate the threshold with a bit of thought. But the general method could also be used: simulate 1000 datasets with no rate variation, then find the frequency distribution of coincidentally low counts. Compare your real dataset to it. If 1pm is the sparse slot in the real data, but 50/1000 simulated datasets have an equally sparse slot, then the result is not robust.
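A rough Python sketch of that null-simulation idea (the adherence rate and counts here are made up):

import numpy as np

rng = np.random.default_rng(0)

n_weeks, n_slots, p = 30, 21, 0.8   # 30 weeks, 21 slots, 80% adherence everywhere
n_sims = 1000

# adherence counts per slot for one patient (a stand-in for the real data)
observed = rng.binomial(n_weeks, p, size=n_slots)
observed_min = observed.min()

# how often does a dataset with no slot effect produce an equally sparse slot?
null_mins = rng.binomial(n_weeks, p, size=(n_sims, n_slots)).min(axis=1)
print("fraction of null datasets at least this sparse:",
      (null_mins <= observed_min).mean())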
Your Answer
StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
});
});
}, "mathjax-editing");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "65"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f383710%2fhow-to-simulate-data-to-be-statistically-significant%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
3 Answers
3
active
oldest
votes
3 Answers
3
active
oldest
votes
active
oldest
votes
active
oldest
votes
General Comments
"I am in 10th grade and I am looking to simulate data for a machine learning science fair project." Awesome. I did not care at all about math in 10th grade; I think I took something like Algebra 2 that year...? I can't wait until you put me out of a job in a few years! I give some advice below, but: What are you trying to learn from this simulation? What are you already familiar with in statistics and machine learning? Knowing this would help me (and others) put together some more specific help.
Python is a very useful language, but I'm of the opinion that R is better for simulating data. Most of the books/blogs/studies/classes I've come across on simulating data (also what people call "Monte Carlo methods" to be fancy) are in R. The R language is known as being "by statisticians, for statisticians," and most academics—that rely on simulation studies to show their methods work—use R. A lot of cool functions are in the base R language (that is, no additional packages necessary), such as
rnorm
for a normal distribution,runif
for the uniform distribution,rbeta
for the beta distribution, and so on. In R, typing in?Distributions
will show you a help page on them. However, there are many other cool packages likemvtnorm
orsimstudy
that are useful. I would recommend DataCamp.com for learning R, if you only know Python; I think they are good for getting gently introduced to thingsIt seems like you have a lot going on here: You want data that are over time (longitudinal), within-subject (maybe using a multilevel model), and have a seasonal component to them (perhaps a time series model), all predicting a dichotomous outcome (something like a logistic regression). I think a lot of people starting out with simulation studies (including myself) want to throw a bunch of stuff in at once, but this can be really daunting and complicated. So what I would recommend doing is starting with something simple—maybe making a function or two for generating data—and then build up from there.
Specific Comments
It looks like your basic hypothesis is: "The time of the day predicts whether or not someone adheres to taking their medication." And you would like two create two simulated data sets: One where there is a relationship and one where there is not.
You also mention simulating data to represent multiple observations from the same person. This means that each person would have their own probability of adherence as well as, perhaps, their own slope for the relationship between time of day and probability of adhering. I would suggest looking into "multilevel" or "hierarchical" regression models for this type of relationship, but I think you could start simpler than this.
Also, you mention a continuous relationship between time and probability of adhering to the medication regimen, which also makes me think time series modeling—specifically looking at seasonal trends—would be helpful for you. This is also simulate-able, but again, I think we can start simpler.
Let's say we have 1000 people, and we measure whether or not they took their medicine only once. We also know if they were assigned to take it in the morning, afternoon, or evening. Let's say that taking the medicine is 1, not taking it is 0. We can simulate dichotomous data using rbinom
for draws from a binomial distribution. We can set each person to have 1 observation with a given probability. Let's say people are 80% likely to take it in the morning, 50% in afternoon, and 65% at night. I paste the code below, with some comments after #
:
set.seed(1839) # this makes sure the results are replicable when you do it
n <- 1000 # sample size is 1000
times <- c("morning", "afternoon", "evening") # create a vector of times
time <- sample(times, n, TRUE) # create our time variable
# make adherence probabilities based on time
adhere_prob <- ifelse(
time == "morning", .80,
ifelse(
time == "afternoon", .50, .65
)
)
# simulate observations from binomial distribution with those probabilities
adhere <- rbinom(n, 1, adhere_prob)
# run a logistic regression, predicting adherence from time
model <- glm(adhere ~ time, family = binomial)
summary(model)
This summary shows, in part:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.02882 0.10738 0.268 0.78839
timeevening 0.45350 0.15779 2.874 0.00405 **
timemorning 1.39891 0.17494 7.996 1.28e-15 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The Intercept
represents the afternoon, and we can see that both the evening and morning are significantly higher probability of adhering. There are a lot of details about logistic regression that I can't explain in this post, but t-tests assume that you have a conditionally normally-distributed dependent variable. Logistic regression models are more appropriate when you have dichotomous (0 vs. 1) outcomes like these. Most introductory statistics books will talk about the t-test, and a lot of introductory machine learning books will talk about logistic regression. I think Introduction to Statistical Learning: With Applications in R is great, and the authors posted the whole thing online: https://www-bcf.usc.edu/~gareth/ISL/ISLR%20First%20Printing.pdf
I'm not as sure about good books for simulation studies; I learned just from messing around, reading what other people did, and from a graduate course I took on statistical computing (professor's materials are here: http://pj.freefaculty.org/guides/).
Lastly, you can also simulate having no effect by setting all of the times to have the same probability:
set.seed(1839)
n <- 1000
times <- c("morning", "afternoon", "evening")
time <- sample(times, n, TRUE)
adhere <- rbinom(n, 1, .6) # same for all times
summary(glm(adhere ~ time, binomial))
Which returns:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.40306 0.10955 3.679 0.000234 ***
timeevening -0.06551 0.15806 -0.414 0.678535
timemorning 0.18472 0.15800 1.169 0.242360
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
This shows no significant differences between the times, as we would expect from the probability being the same across times.
Thank you so much for the book recommendation! Just what I needed for holiday reading!
– MD-Tech
Dec 19 '18 at 12:00
Thank you so much for this! I knew that I needed a logistic regression model for the machine learning aspect of my project, but it seems to have an application in simulating the data as well. However, I was under the impression that logistic regression requires for the order of the times to matter, but in this case that is not the case as each time is a different category with no relation to the other. I came to this conclusion after discussing with my math teacher, but we both could very well be wrong. Could you please clarify why exactly logistic regression can be used here?
– Neelasha Bhattacharjee
Dec 20 '18 at 3:01
@NeelashaBhattacharjee Simulating the data and fitting the logistic regression model are two separate steps—we could have simulated the same data and analyzed it using a contingency table and chi-square statistic if we wanted to. You're correct that the model I fit doesn't encode any order in the times. However, regression models make assumptions on how the dependent variable is distributed, not the independent variables. We could have ordered predictors, continuous predictors, count predictors, etc., and all of them would be fine for logistic regression.
– Mark White
Dec 20 '18 at 4:23
@NeelashaBhattacharjee Logistic regression can be used here since we are modeling a dichotomous dependent variable—that is, one with two and only two possible outcomes. What a logistic regression does is use the "logistic link function" to make all of the predicted values for the regression equation (e.g., b0 + b1 * x) fit between 0 and 1. And we call these numbers the probability that someone has the dependent variable value of 1.
– Mark White
Dec 20 '18 at 4:24
Thank you so much! However, I was wondering how you were able to look at the p values between the two simulated data sets and determine whether one had a significant trend and the other. To me, both sets have p values which vary enough to be significant.
– Neelasha Bhattacharjee
Dec 20 '18 at 19:40
|
show 2 more comments
General Comments
"I am in 10th grade and I am looking to simulate data for a machine learning science fair project." Awesome. I did not care at all about math in 10th grade; I think I took something like Algebra 2 that year...? I can't wait until you put me out of a job in a few years! I give some advice below, but: What are you trying to learn from this simulation? What are you already familiar with in statistics and machine learning? Knowing this would help me (and others) put together some more specific help.
Python is a very useful language, but I'm of the opinion that R is better for simulating data. Most of the books/blogs/studies/classes I've come across on simulating data (also what people call "Monte Carlo methods" to be fancy) are in R. The R language is known as being "by statisticians, for statisticians," and most academics—that rely on simulation studies to show their methods work—use R. A lot of cool functions are in the base R language (that is, no additional packages necessary), such as
rnorm
for a normal distribution,runif
for the uniform distribution,rbeta
for the beta distribution, and so on. In R, typing in?Distributions
will show you a help page on them. However, there are many other cool packages likemvtnorm
orsimstudy
that are useful. I would recommend DataCamp.com for learning R, if you only know Python; I think they are good for getting gently introduced to thingsIt seems like you have a lot going on here: You want data that are over time (longitudinal), within-subject (maybe using a multilevel model), and have a seasonal component to them (perhaps a time series model), all predicting a dichotomous outcome (something like a logistic regression). I think a lot of people starting out with simulation studies (including myself) want to throw a bunch of stuff in at once, but this can be really daunting and complicated. So what I would recommend doing is starting with something simple—maybe making a function or two for generating data—and then build up from there.
Specific Comments
It looks like your basic hypothesis is: "The time of the day predicts whether or not someone adheres to taking their medication." And you would like two create two simulated data sets: One where there is a relationship and one where there is not.
You also mention simulating data to represent multiple observations from the same person. This means that each person would have their own probability of adherence as well as, perhaps, their own slope for the relationship between time of day and probability of adhering. I would suggest looking into "multilevel" or "hierarchical" regression models for this type of relationship, but I think you could start simpler than this.
Also, you mention a continuous relationship between time and probability of adhering to the medication regimen, which also makes me think time series modeling—specifically looking at seasonal trends—would be helpful for you. This is also simulate-able, but again, I think we can start simpler.
Let's say we have 1000 people, and we measure whether or not they took their medicine only once. We also know if they were assigned to take it in the morning, afternoon, or evening. Let's say that taking the medicine is 1, not taking it is 0. We can simulate dichotomous data using rbinom
for draws from a binomial distribution. We can set each person to have 1 observation with a given probability. Let's say people are 80% likely to take it in the morning, 50% in afternoon, and 65% at night. I paste the code below, with some comments after #
:
set.seed(1839) # this makes sure the results are replicable when you do it
n <- 1000 # sample size is 1000
times <- c("morning", "afternoon", "evening") # create a vector of times
time <- sample(times, n, TRUE) # create our time variable
# make adherence probabilities based on time
adhere_prob <- ifelse(
time == "morning", .80,
ifelse(
time == "afternoon", .50, .65
)
)
# simulate observations from binomial distribution with those probabilities
adhere <- rbinom(n, 1, adhere_prob)
# run a logistic regression, predicting adherence from time
model <- glm(adhere ~ time, family = binomial)
summary(model)
This summary shows, in part:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.02882 0.10738 0.268 0.78839
timeevening 0.45350 0.15779 2.874 0.00405 **
timemorning 1.39891 0.17494 7.996 1.28e-15 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The Intercept
represents the afternoon, and we can see that both the evening and morning are significantly higher probability of adhering. There are a lot of details about logistic regression that I can't explain in this post, but t-tests assume that you have a conditionally normally-distributed dependent variable. Logistic regression models are more appropriate when you have dichotomous (0 vs. 1) outcomes like these. Most introductory statistics books will talk about the t-test, and a lot of introductory machine learning books will talk about logistic regression. I think Introduction to Statistical Learning: With Applications in R is great, and the authors posted the whole thing online: https://www-bcf.usc.edu/~gareth/ISL/ISLR%20First%20Printing.pdf
I'm not as sure about good books for simulation studies; I learned just from messing around, reading what other people did, and from a graduate course I took on statistical computing (professor's materials are here: http://pj.freefaculty.org/guides/).
Lastly, you can also simulate having no effect by setting all of the times to have the same probability:
set.seed(1839)
n <- 1000
times <- c("morning", "afternoon", "evening")
time <- sample(times, n, TRUE)
adhere <- rbinom(n, 1, .6) # same for all times
summary(glm(adhere ~ time, binomial))
Which returns:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.40306 0.10955 3.679 0.000234 ***
timeevening -0.06551 0.15806 -0.414 0.678535
timemorning 0.18472 0.15800 1.169 0.242360
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
This shows no significant differences between the times, as we would expect from the probability being the same across times.
Thank you so much for the book recommendation! Just what I needed for holiday reading!
– MD-Tech
Dec 19 '18 at 12:00
Thank you so much for this! I knew that I needed a logistic regression model for the machine learning aspect of my project, but it seems to have an application in simulating the data as well. However, I was under the impression that logistic regression requires for the order of the times to matter, but in this case that is not the case as each time is a different category with no relation to the other. I came to this conclusion after discussing with my math teacher, but we both could very well be wrong. Could you please clarify why exactly logistic regression can be used here?
– Neelasha Bhattacharjee
Dec 20 '18 at 3:01
@NeelashaBhattacharjee Simulating the data and fitting the logistic regression model are two separate steps—we could have simulated the same data and analyzed it using a contingency table and chi-square statistic if we wanted to. You're correct that the model I fit doesn't encode any order in the times. However, regression models make assumptions on how the dependent variable is distributed, not the independent variables. We could have ordered predictors, continuous predictors, count predictors, etc., and all of them would be fine for logistic regression.
– Mark White
Dec 20 '18 at 4:23
@NeelashaBhattacharjee Logistic regression can be used here since we are modeling a dichotomous dependent variable—that is, one with two and only two possible outcomes. What a logistic regression does is use the "logistic link function" to make all of the predicted values for the regression equation (e.g., b0 + b1 * x) fit between 0 and 1. And we call these numbers the probability that someone has the dependent variable value of 1.
– Mark White
Dec 20 '18 at 4:24
Thank you so much! However, I was wondering how you were able to look at the p values between the two simulated data sets and determine whether one had a significant trend and the other. To me, both sets have p values which vary enough to be significant.
– Neelasha Bhattacharjee
Dec 20 '18 at 19:40
|
show 2 more comments
General Comments
"I am in 10th grade and I am looking to simulate data for a machine learning science fair project." Awesome. I did not care at all about math in 10th grade; I think I took something like Algebra 2 that year...? I can't wait until you put me out of a job in a few years! I give some advice below, but: What are you trying to learn from this simulation? What are you already familiar with in statistics and machine learning? Knowing this would help me (and others) put together some more specific help.
Python is a very useful language, but I'm of the opinion that R is better for simulating data. Most of the books/blogs/studies/classes I've come across on simulating data (also what people call "Monte Carlo methods" to be fancy) are in R. The R language is known as being "by statisticians, for statisticians," and most academics—that rely on simulation studies to show their methods work—use R. A lot of cool functions are in the base R language (that is, no additional packages necessary), such as
rnorm
for a normal distribution,runif
for the uniform distribution,rbeta
for the beta distribution, and so on. In R, typing in?Distributions
will show you a help page on them. However, there are many other cool packages likemvtnorm
orsimstudy
that are useful. I would recommend DataCamp.com for learning R, if you only know Python; I think they are good for getting gently introduced to thingsIt seems like you have a lot going on here: You want data that are over time (longitudinal), within-subject (maybe using a multilevel model), and have a seasonal component to them (perhaps a time series model), all predicting a dichotomous outcome (something like a logistic regression). I think a lot of people starting out with simulation studies (including myself) want to throw a bunch of stuff in at once, but this can be really daunting and complicated. So what I would recommend doing is starting with something simple—maybe making a function or two for generating data—and then build up from there.
Specific Comments
It looks like your basic hypothesis is: "The time of the day predicts whether or not someone adheres to taking their medication." And you would like two create two simulated data sets: One where there is a relationship and one where there is not.
You also mention simulating data to represent multiple observations from the same person. This means that each person would have their own probability of adherence as well as, perhaps, their own slope for the relationship between time of day and probability of adhering. I would suggest looking into "multilevel" or "hierarchical" regression models for this type of relationship, but I think you could start simpler than this.
Also, you mention a continuous relationship between time and probability of adhering to the medication regimen, which also makes me think time series modeling—specifically looking at seasonal trends—would be helpful for you. This is also simulate-able, but again, I think we can start simpler.
Let's say we have 1000 people, and we measure whether or not they took their medicine only once. We also know if they were assigned to take it in the morning, afternoon, or evening. Let's say that taking the medicine is 1, not taking it is 0. We can simulate dichotomous data using rbinom
for draws from a binomial distribution. We can set each person to have 1 observation with a given probability. Let's say people are 80% likely to take it in the morning, 50% in afternoon, and 65% at night. I paste the code below, with some comments after #
:
set.seed(1839) # this makes sure the results are replicable when you do it
n <- 1000 # sample size is 1000
times <- c("morning", "afternoon", "evening") # create a vector of times
time <- sample(times, n, TRUE) # create our time variable
# make adherence probabilities based on time
adhere_prob <- ifelse(
time == "morning", .80,
ifelse(
time == "afternoon", .50, .65
)
)
# simulate observations from binomial distribution with those probabilities
adhere <- rbinom(n, 1, adhere_prob)
# run a logistic regression, predicting adherence from time
model <- glm(adhere ~ time, family = binomial)
summary(model)
This summary shows, in part:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.02882 0.10738 0.268 0.78839
timeevening 0.45350 0.15779 2.874 0.00405 **
timemorning 1.39891 0.17494 7.996 1.28e-15 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The Intercept
represents the afternoon, and we can see that both the evening and morning are significantly higher probability of adhering. There are a lot of details about logistic regression that I can't explain in this post, but t-tests assume that you have a conditionally normally-distributed dependent variable. Logistic regression models are more appropriate when you have dichotomous (0 vs. 1) outcomes like these. Most introductory statistics books will talk about the t-test, and a lot of introductory machine learning books will talk about logistic regression. I think Introduction to Statistical Learning: With Applications in R is great, and the authors posted the whole thing online: https://www-bcf.usc.edu/~gareth/ISL/ISLR%20First%20Printing.pdf
I'm not as sure about good books for simulation studies; I learned just from messing around, reading what other people did, and from a graduate course I took on statistical computing (professor's materials are here: http://pj.freefaculty.org/guides/).
Lastly, you can also simulate having no effect by setting all of the times to have the same probability:
set.seed(1839)
n <- 1000
times <- c("morning", "afternoon", "evening")
time <- sample(times, n, TRUE)
adhere <- rbinom(n, 1, .6) # same for all times
summary(glm(adhere ~ time, binomial))
Which returns:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.40306 0.10955 3.679 0.000234 ***
timeevening -0.06551 0.15806 -0.414 0.678535
timemorning 0.18472 0.15800 1.169 0.242360
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
This shows no significant differences between the times, as we would expect from the probability being the same across times.
General Comments
"I am in 10th grade and I am looking to simulate data for a machine learning science fair project." Awesome. I did not care at all about math in 10th grade; I think I took something like Algebra 2 that year...? I can't wait until you put me out of a job in a few years! I give some advice below, but: What are you trying to learn from this simulation? What are you already familiar with in statistics and machine learning? Knowing this would help me (and others) put together some more specific help.
Python is a very useful language, but I'm of the opinion that R is better for simulating data. Most of the books/blogs/studies/classes I've come across on simulating data (also what people call "Monte Carlo methods" to be fancy) are in R. The R language is known as being "by statisticians, for statisticians," and most academics—that rely on simulation studies to show their methods work—use R. A lot of cool functions are in the base R language (that is, no additional packages necessary), such as
rnorm
for a normal distribution,runif
for the uniform distribution,rbeta
for the beta distribution, and so on. In R, typing in?Distributions
will show you a help page on them. However, there are many other cool packages likemvtnorm
orsimstudy
that are useful. I would recommend DataCamp.com for learning R, if you only know Python; I think they are good for getting gently introduced to thingsIt seems like you have a lot going on here: You want data that are over time (longitudinal), within-subject (maybe using a multilevel model), and have a seasonal component to them (perhaps a time series model), all predicting a dichotomous outcome (something like a logistic regression). I think a lot of people starting out with simulation studies (including myself) want to throw a bunch of stuff in at once, but this can be really daunting and complicated. So what I would recommend doing is starting with something simple—maybe making a function or two for generating data—and then build up from there.
Specific Comments
It looks like your basic hypothesis is: "The time of the day predicts whether or not someone adheres to taking their medication." And you would like two create two simulated data sets: One where there is a relationship and one where there is not.
You also mention simulating data to represent multiple observations from the same person. This means that each person would have their own probability of adherence as well as, perhaps, their own slope for the relationship between time of day and probability of adhering. I would suggest looking into "multilevel" or "hierarchical" regression models for this type of relationship, but I think you could start simpler than this.
Also, you mention a continuous relationship between time and probability of adhering to the medication regimen, which also makes me think time series modeling—specifically looking at seasonal trends—would be helpful for you. This is also simulate-able, but again, I think we can start simpler.
Let's say we have 1000 people, and we measure whether or not they took their medicine only once. We also know if they were assigned to take it in the morning, afternoon, or evening. Let's say that taking the medicine is 1, not taking it is 0. We can simulate dichotomous data using rbinom
for draws from a binomial distribution. We can set each person to have 1 observation with a given probability. Let's say people are 80% likely to take it in the morning, 50% in afternoon, and 65% at night. I paste the code below, with some comments after #
:
set.seed(1839) # this makes sure the results are replicable when you do it
n <- 1000 # sample size is 1000
times <- c("morning", "afternoon", "evening") # create a vector of times
time <- sample(times, n, TRUE) # create our time variable
# make adherence probabilities based on time
adhere_prob <- ifelse(
time == "morning", .80,
ifelse(
time == "afternoon", .50, .65
)
)
# simulate observations from binomial distribution with those probabilities
adhere <- rbinom(n, 1, adhere_prob)
# run a logistic regression, predicting adherence from time
model <- glm(adhere ~ time, family = binomial)
summary(model)
This summary shows, in part:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.02882 0.10738 0.268 0.78839
timeevening 0.45350 0.15779 2.874 0.00405 **
timemorning 1.39891 0.17494 7.996 1.28e-15 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The Intercept
represents the afternoon, and we can see that both the evening and morning are significantly higher probability of adhering. There are a lot of details about logistic regression that I can't explain in this post, but t-tests assume that you have a conditionally normally-distributed dependent variable. Logistic regression models are more appropriate when you have dichotomous (0 vs. 1) outcomes like these. Most introductory statistics books will talk about the t-test, and a lot of introductory machine learning books will talk about logistic regression. I think Introduction to Statistical Learning: With Applications in R is great, and the authors posted the whole thing online: https://www-bcf.usc.edu/~gareth/ISL/ISLR%20First%20Printing.pdf
I'm not as sure about good books for simulation studies; I learned just from messing around, reading what other people did, and from a graduate course I took on statistical computing (professor's materials are here: http://pj.freefaculty.org/guides/).
Lastly, you can also simulate having no effect by setting all of the times to have the same probability:
set.seed(1839)
n <- 1000
times <- c("morning", "afternoon", "evening")
time <- sample(times, n, TRUE)
adhere <- rbinom(n, 1, .6) # same for all times
summary(glm(adhere ~ time, binomial))
Which returns:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.40306 0.10955 3.679 0.000234 ***
timeevening -0.06551 0.15806 -0.414 0.678535
timemorning 0.18472 0.15800 1.169 0.242360
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
This shows no significant differences between the times, as we would expect from the probability being the same across times.
edited Dec 19 '18 at 5:34
answered Dec 19 '18 at 5:19
Mark White
5,65231344
5,65231344
Thank you so much for the book recommendation! Just what I needed for holiday reading!
– MD-Tech
Dec 19 '18 at 12:00
Thank you so much for this! I knew that I needed a logistic regression model for the machine learning aspect of my project, but it seems to have an application in simulating the data as well. However, I was under the impression that logistic regression requires for the order of the times to matter, but in this case that is not the case as each time is a different category with no relation to the other. I came to this conclusion after discussing with my math teacher, but we both could very well be wrong. Could you please clarify why exactly logistic regression can be used here?
– Neelasha Bhattacharjee
Dec 20 '18 at 3:01
@NeelashaBhattacharjee Simulating the data and fitting the logistic regression model are two separate steps—we could have simulated the same data and analyzed it using a contingency table and chi-square statistic if we wanted to. You're correct that the model I fit doesn't encode any order in the times. However, regression models make assumptions on how the dependent variable is distributed, not the independent variables. We could have ordered predictors, continuous predictors, count predictors, etc., and all of them would be fine for logistic regression.
– Mark White
Dec 20 '18 at 4:23
@NeelashaBhattacharjee Logistic regression can be used here since we are modeling a dichotomous dependent variable—that is, one with two and only two possible outcomes. What a logistic regression does is use the "logistic link function" to make all of the predicted values for the regression equation (e.g., b0 + b1 * x) fit between 0 and 1. And we call these numbers the probability that someone has the dependent variable value of 1.
– Mark White
Dec 20 '18 at 4:24
Thank you so much! However, I was wondering how you were able to look at the p-values between the two simulated data sets and determine that one had a significant trend and the other did not. To me, both sets have p-values which vary enough to be significant.
– Neelasha Bhattacharjee
Dec 20 '18 at 19:40
If you already know some Python, then you will definitely be able to achieve what you need using base Python along with numpy and/or pandas. As Mark White suggests though, a lot of simulation and stats-related stuff is baked into R, so it is definitely worth a look.
Below is a basic framework for how you might approach this using a Python class. You could use np.random.normal to adjust the baseline_adherence of each subject to insert some noise. This gives you a pseudo-random adherence, to which you can add the targeted reduced adherence on specific days.
import pandas as pd
import numpy as np
from itertools import product


class Patient:

    def __init__(self, number, baseline_adherence=0.95):
        self.number = number
        self.baseline_adherence = baseline_adherence
        self.schedule = self.create_schedule()

    def __repr__(self):
        return "I am patient number {}".format(self.number)

    def create_schedule(self):
        # 21 time slots: 3 per day across 7 days
        time_slots = []
        for (day, time) in product(range(1, 8), range(1, 4)):
            time_slots.append("Day {}; Slot {}".format(day, time))
        week_labels = ["Week {}".format(x) for x in range(1, 31)]
        # simulate adherence at the baseline probability for every slot and week
        df = pd.DataFrame(
            np.random.choice([0, 1],
                             size=(30, 21),  # 1 row per week, 1 column per time slot
                             p=(1 - self.baseline_adherence, self.baseline_adherence)),
            index=week_labels,
            columns=time_slots,
        )
        return df

    def targeted_adherence(self, timeslot, adherence=0.8):
        # overwrite a single time slot with a different (here, reduced) adherence probability
        if timeslot in self.schedule.columns:
            ad = np.random.choice([0, 1],
                                  size=self.schedule[timeslot].shape,
                                  p=(1 - adherence, adherence))
            self.schedule[timeslot] = ad


sim_patients = [Patient(x) for x in range(10)]
p = sim_patients[0]
p.targeted_adherence("Day 1; Slot 3")
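Following up on the np.random.normal suggestion in the text above, here is a minimal sketch (my own addition, assuming the Patient class defined above; the loc and scale values are assumptions, not taken from the original answer) of jittering each patient's baseline adherence so that subjects differ slightly from one another:
import numpy as np

# give each patient a slightly different baseline adherence, clipped to stay in [0, 1]
noisy_baselines = np.clip(np.random.normal(loc=0.95, scale=0.03, size=10), 0, 1)
sim_patients = [Patient(i, baseline_adherence=b)
                for i, b in enumerate(noisy_baselines)]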
answered Dec 20 '18 at 7:25
Andrew
This is a great project. There is a challenge for projects like this, and your method of using simulated data is a great way of assessing it.
Do you have an a priori hypothesis, e.g. "people are more forgetful in the evening"? In that case, a statistical test that compares the frequency of forgetting in the evening compared to the morning will test it. This is a Bernoulli distribution, as previous responders said.
The other approach is to trawl your data to find out which time slot has the highest rate of failure. There is bound to be one, so the question is "is this just a chance result?". The threshold for significance is higher in this case. If you want to read up about this, search for "false discovery rate".
In your case the system is simple enough that you can calculate the threshold with a bit of thought. But the general method could also be used: simulate 1000 datasets with no rate variation, then find out the frequency distribution of coincidental low numbers. Compare your real dataset to it. If 1pm is the sparse slot in the real data, but 50/1000 simulated datasets have an equally sparse slot, then the result is not robust.
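Here is a hedged Python sketch of that general method (my own illustration, not part of the answer; the adherence probability, dataset shape, and observed count below are assumed values chosen only to show the mechanics):
import numpy as np

rng = np.random.default_rng(0)
n_weeks, n_slots, p_adhere = 30, 21, 0.9  # assumed values for illustration

# null distribution: how sparse does the sparsest slot get purely by chance?
min_counts = []
for _ in range(1000):
    null_data = rng.binomial(1, p_adhere, size=(n_weeks, n_slots))
    min_counts.append(null_data.sum(axis=0).min())

observed_min = 22  # hypothetical sparsest-slot count from a real dataset
p_value = np.mean(np.array(min_counts) <= observed_min)
print(p_value)  # fraction of null datasets with an equally sparse slot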
answered Dec 20 '18 at 7:55
chrishmorris
This is a great question. And something like this is what most scientists should be doing before applying for a grant, in the study design phase. I see far too often that people collect their data first and try to figure out how to analyze it afterwards, with the result that the statistician may only be able to tell what the experiment died of, in the words of Ronald Fisher.
– Stephan Kolassa
Dec 19 '18 at 8:53
@StephanKolassa However, it is very hard to assess what data will be available in some experiments with human data, and in other settings one uses data that is available and cannot collect more...
– llrs
Dec 19 '18 at 11:11
@llrs: That is completely correct. And it should of course inform the simulation exercise. Better to think beforehand about what data are available, rather than finding out after the experiment that crucial data cannot be obtained.
– Stephan Kolassa
Dec 19 '18 at 11:46
(+1) I find the vote to close this question somewhat objectionable
– Robert Long
Dec 20 '18 at 18:24
@RobertLong, why do you say that? I ask simply because I want to make sure I am not missing anything in the response that makes it less credible.
– Neelasha Bhattacharjee
Dec 25 '18 at 3:46