How to simulate data to be statistically significant?












I am in 10th grade and I am looking to simulate data for a machine learning science fair project. The final model will be used on patient data and will predict the relationship between certain times of the week and medication adherence within the data of a single patient. Adherence values will be binary (0 means they did not take the medicine, 1 means they did). I want to create a machine learning model that can learn the relationship between time of week and adherence. I have separated the week into 21 time slots, three for each day (1 is Monday morning, 2 is Monday afternoon, etc.). I am looking to simulate 1,000 patients' worth of data. Each patient will have 30 weeks' worth of data. I want to insert certain trends associated with a time of week and adherence. For example, in one data set I may say that time slot 7 of the week has a statistically significant relationship with adherence. To determine whether the relationship is statistically significant, I would need to perform a two-sample t-test comparing one time slot to each of the others and make sure the p-value is less than 0.05.



However, rather than simulating my own data and checking whether the trends I inserted are significant or not, I would rather work backwards and perhaps use a program that I could ask to assign a certain time slot a significant trend with adherence, and it would return binary data that contains within it the trend I asked for, and also binary data for the other time slots which contains some noise but does not produce a statistically significant trend.



Is there any program that can help me achieve something like this? Or maybe a Python module?



Any help whatsoever (even general comments on my project) will be extremely appreciated!!










      machine-learning statistical-significance t-test python simulation






      asked 3 hours ago by lotuslotion

          1 Answer
          General Comments




          • "I am in 10th grade and I am looking to simulate data for a machine learning science fair project." Awesome. I did not care at all about math in 10th grade; I think I took something like Algebra 2 that year...? I can't wait until you put me out of a job in a few years! I give some advice below, but: What are you trying to learn from this simulation? What are you already familiar with in statistics and machine learning? Knowing this would help me (and others) put together some more specific help.


          • Python is a very useful language, but I'm of the opinion that R is better for simulating data. Most of the books, blogs, studies, and classes I've come across on simulating data (also called "Monte Carlo methods" if you want to be fancy) are in R. The R language is known as being "by statisticians, for statisticians," and many academics who rely on simulation studies to show that their methods work use R. A lot of cool functions are in the base R language (that is, no additional packages necessary), such as rnorm for the normal distribution, runif for the uniform distribution, rbeta for the beta distribution, and so on. In R, typing ?Distributions will show you a help page on them. There are also many other cool packages, like mvtnorm or simstudy, that are useful. If you only know Python, I would recommend DataCamp.com for learning R; I think they are good for gently introducing things.


          • It seems like you have a lot going on here: You want data that are over time (longitudinal), within-subject (maybe using a multilevel model), and have a seasonal component to them (perhaps a time series model), all predicting a dichotomous outcome (something like a logistic regression). I think a lot of people starting out with simulation studies (including myself) want to throw a bunch of stuff in at once, but this can be really daunting and complicated. So I would recommend starting with something simple (maybe writing a function or two for generating data) and then building up from there.
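Since the question asks about Python, here is a minimal sketch of what such a data-generating function could look like, using only the standard library. The function name simulate_patient, the baseline probability of 0.6, and the 0.9 probability given to slot 7 are all illustrative assumptions, not values from the question. It covers one hypothetical patient with 21 slots per week over 30 weeks.

```python
import random


def simulate_patient(n_weeks=30, n_slots=21, base_prob=0.6,
                     slot_probs=None, rng=None):
    """Return (week, slot, adhered) rows for one hypothetical patient.

    slot_probs maps a slot number (1..n_slots) to its adherence
    probability; every slot not listed uses base_prob.
    All default values are assumptions chosen for illustration.
    """
    rng = rng or random.Random()
    slot_probs = slot_probs or {}
    rows = []
    for week in range(1, n_weeks + 1):
        for slot in range(1, n_slots + 1):
            p = slot_probs.get(slot, base_prob)
            # One Bernoulli draw: 1 = took the medicine, 0 = did not
            adhered = 1 if rng.random() < p else 0
            rows.append((week, slot, adhered))
    return rows


# Give slot 7 a higher adherence probability than every other slot
rng = random.Random(1839)
rows = simulate_patient(slot_probs={7: 0.9}, rng=rng)
slot7 = [a for (_, s, a) in rows if s == 7]
print(len(rows), sum(slot7) / len(slot7))
```

The "trend" is built in through the probabilities, not through the p-value: you choose how different slot 7 is, and whether a test later calls that difference significant depends on the size of the difference and on how many observations you have.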



          Specific Comments



          It looks like your basic hypothesis is: "The time of the day predicts whether or not someone adheres to taking their medication." And you would like to create two simulated data sets: one where there is a relationship and one where there is not.



          You also mention simulating data to represent multiple observations from the same person. This means that each person would have their own probability of adherence as well as, perhaps, their own slope for the relationship between time of day and probability of adhering. I would suggest looking into "multilevel" or "hierarchical" regression models for this type of relationship, but I think you could start simpler than this.
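To make the "each person has their own probability" idea concrete, here is a rough Python sketch (standard library only) that draws a separate baseline for every patient on the log-odds scale and then simulates repeated observations from it. The names simulate_cohort and inv_logit, and the values mean_logit=0.4 and sd_logit=1.0, are assumptions made up for this illustration. It only simulates person-specific baselines; it does not include person-specific slopes and it does not fit a multilevel model.

```python
import math
import random


def inv_logit(x):
    """Convert log-odds to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def simulate_cohort(n_patients=1000, n_obs=30, mean_logit=0.4,
                    sd_logit=1.0, seed=1839):
    """Simulate n_obs binary adherence values for each patient.

    Each patient's baseline log-odds is drawn from a normal
    distribution, so patients differ in how adherent they are overall.
    All default values are assumptions chosen for illustration.
    """
    rng = random.Random(seed)
    cohort = []
    for _ in range(n_patients):
        # Person-specific baseline probability of adherence
        p = inv_logit(rng.gauss(mean_logit, sd_logit))
        doses = [1 if rng.random() < p else 0 for _ in range(n_obs)]
        cohort.append((p, doses))
    return cohort


cohort = simulate_cohort()
print(len(cohort), cohort[0][0], sum(cohort[0][1]))
```

Working on the log-odds scale keeps every simulated probability strictly between 0 and 1, which adding noise directly to a probability would not guarantee.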



          Also, you mention a continuous relationship between time and probability of adhering to the medication regimen, which also makes me think time series modeling—specifically looking at seasonal trends—would be helpful for you. This is also simulate-able, but again, I think we can start simpler.



          Let's say we have 1000 people, and we measure whether or not they took their medicine only once. We also know if they were assigned to take it in the morning, afternoon, or evening. Let's say that taking the medicine is 1, not taking it is 0. We can simulate dichotomous data using rbinom for draws from a binomial distribution. We can set each person to have 1 observation with a given probability. Let's say people are 80% likely to take it in the morning, 50% in the afternoon, and 65% in the evening. I paste the code below, with some comments after #:



          set.seed(1839) # this makes sure the results are replicable when you do it
          n <- 1000 # sample size is 1000
          times <- c("morning", "afternoon", "evening") # create a vector of times
          time <- sample(times, n, TRUE) # create our time variable

          # make adherence probabilities based on time
          adhere_prob <- ifelse(
            time == "morning", .80,
            ifelse(
              time == "afternoon", .50, .65
            )
          )

          # simulate observations from binomial distribution with those probabilities
          adhere <- rbinom(n, 1, adhere_prob)

          # run a logistic regression, predicting adherence from time
          model <- glm(adhere ~ time, family = binomial)
          summary(model)


          This summary shows, in part:



          Coefficients:
                      Estimate Std. Error z value Pr(>|z|)
          (Intercept)  0.02882    0.10738   0.268  0.78839
          timeevening  0.45350    0.15779   2.874  0.00405 **
          timemorning  1.39891    0.17494   7.996 1.28e-15 ***
          ---
          Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


          The Intercept represents the afternoon (R uses the alphabetically first level as the reference category), and we can see that both the evening and the morning have a significantly higher probability of adherence than the afternoon. There are a lot of details about logistic regression that I can't explain in this post, but t-tests assume that you have a conditionally normally distributed dependent variable. Logistic regression models are more appropriate when you have dichotomous (0 vs. 1) outcomes like these. Most introductory statistics books will talk about the t-test, and a lot of introductory machine learning books will talk about logistic regression. I think An Introduction to Statistical Learning: With Applications in R is great, and the authors posted the whole thing online: https://www-bcf.usc.edu/~gareth/ISL/ISLR%20First%20Printing.pdf



          I'm not as sure about good books for simulation studies; I learned just from messing around, reading what other people did, and from a graduate course I took on statistical computing (professor's materials are here: http://pj.freefaculty.org/guides/).



          Lastly, you can also simulate having no effect by setting all of the times to have the same probability:



          set.seed(1839)
          n <- 1000
          times <- c("morning", "afternoon", "evening")
          time <- sample(times, n, TRUE)
          adhere <- rbinom(n, 1, .6) # same for all times
          summary(glm(adhere ~ time, binomial))


          Which returns:



          Coefficients:
                      Estimate Std. Error z value Pr(>|z|)
          (Intercept)  0.40306    0.10955   3.679 0.000234 ***
          timeevening -0.06551    0.15806  -0.414 0.678535
          timemorning  0.18472    0.15800   1.169 0.242360
          ---
          Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


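          This shows no significant differences between the times, as we would expect from the probability being the same across times. (The significant intercept only indicates that adherence in the afternoon, the reference category, differs from 50%; it says nothing about differences between times.)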
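One caveat on "working backwards": a single simulated data set with a real built-in difference is not guaranteed to come out significant, and one with no difference will still come out significant about 5% of the time at the 0.05 level. A rough way to see this is to repeat the simulation many times and count how often the test rejects. The Python sketch below does that with a pooled two-proportion z-test, which I am swapping in for the logistic regression above only so that the example needs nothing beyond the standard library. The function names, the probabilities 0.9 and 0.6, and the 30 observations per group (one slot over 30 weeks) are all assumptions for illustration.

```python
import random
from statistics import NormalDist


def two_prop_p_value(x1, n1, x2, n2):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (x1 + x2) / (n1 + n2)
    se = (pooled * (1 - pooled) * (1 / n1 + 1 / n2)) ** 0.5
    if se == 0:
        # All outcomes identical in both groups: no evidence of a difference
        return 1.0
    z = (x1 / n1 - x2 / n2) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def rejection_rate(p1, p2, n_per_group=30, n_sims=2000,
                   alpha=0.05, seed=1839):
    """Fraction of simulated data sets in which the test gives p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # Number of 1s among n_per_group Bernoulli draws in each group
        x1 = sum(rng.random() < p1 for _ in range(n_per_group))
        x2 = sum(rng.random() < p2 for _ in range(n_per_group))
        if two_prop_p_value(x1, n_per_group, x2, n_per_group) < alpha:
            hits += 1
    return hits / n_sims


power = rejection_rate(0.9, 0.6)      # a real difference is built in
false_pos = rejection_rate(0.6, 0.6)  # no difference is built in
print(power, false_pos)
```

If the first number is lower than you would like, the knobs to turn are the size of the built-in difference and the number of observations per group, not the p-value itself.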






            1 Answer
            1






            active

            oldest

            votes








            1 Answer
            1






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes








            up vote
            4
            down vote













            General Comments




            • "I am in 10th grade and I am looking to simulate data for a machine learning science fair project." Awesome. I did not care at all about math in 10th grade; I think I took something like Algebra 2 that year...? I can't wait until you put me out of a job in a few years! I give some advice below, but: What are you trying to learn from this simulation? What are you already familiar with in statistics and machine learning? Knowing this would help me (and others) put together some more specific help.


            • Python is a very useful language, but I'm of the opinion that R is better for simulating data. Most of the books/blogs/studies/classes I've come across on simulating data (also what people call "Monte Carlo methods" to be fancy) are in R. The R language is known as being "by statisticians, for statisticians," and most academics—that rely on simulation studies to show their methods work—use R. A lot of cool functions are in the base R language (that is, no additional packages necessary), such as rnorm for a normal distribution, runif for the uniform distribution, rbeta for the beta distribution, and so on. In R, typing in ?Distributions will show you a help page on them. However, there are many other cool packages like mvtnorm or simstudy that are useful. I would recommend DataCamp.com for learning R, if you only know Python; I think they are good for getting gently introduced to things


            • It seems like you have a lot going on here: You want data that are over time (longitudinal), within-subject (maybe using a multilevel model), and have a seasonal component to them (perhaps a time series model), all predicting a dichotomous outcome (something like a logistic regression). I think a lot of people starting out with simulation studies (including myself) want to throw a bunch of stuff in at once, but this can be really daunting and complicated. So what I would recommend doing is starting with something simple—maybe making a function or two for generating data—and then build up from there.



            Specific Comments



            It looks like your basic hypothesis is: "The time of the day predicts whether or not someone adheres to taking their medication." And you would like two create two simulated data sets: One where there is a relationship and one where there is not.



            You also mention simulating data to represent multiple observations from the same person. This means that each person would have their own probability of adherence as well as, perhaps, their own slope for the relationship between time of day and probability of adhering. I would suggest looking into "multilevel" or "hierarchical" regression models for this type of relationship, but I think you could start simpler than this.



            Also, you mention a continuous relationship between time and probability of adhering to the medication regimen, which also makes me think time series modeling—specifically looking at seasonal trends—would be helpful for you. This is also simulate-able, but again, I think we can start simpler.



            Let's say we have 1000 people, and we measure whether or not they took their medicine only once. We also know if they were assigned to take it in the morning, afternoon, or evening. Let's say that taking the medicine is 1, not taking it is 0. We can simulate dichotomous data using rbinom for draws from a binomial distribution. We can set each person to have 1 observation with a given probability. Let's say people are 80% likely to take it in the morning, 50% in afternoon, and 65% at night. I paste the code below, with some comments after #:



            set.seed(1839) # this makes sure the results are replicable when you do it
            n <- 1000 # sample size is 1000
            times <- c("morning", "afternoon", "evening") # create a vector of times
            time <- sample(times, n, TRUE) # create our time variable

            # make adherence probabilities based on time
            adhere_prob <- ifelse(
            time == "morning", .80,
            ifelse(
            time == "afternoon", .50, .65
            )
            )

            # simulate observations from binomial distribution with those probabilities
            adhere <- rbinom(n, 1, adhere_prob)

            # run a logistic regression, predicting adherence from time
            model <- glm(adhere ~ time, family = binomial)
            summary(model)


            This summary shows, in part:



            Coefficients:
            Estimate Std. Error z value Pr(>|z|)
            (Intercept) 0.02882 0.10738 0.268 0.78839
            timeevening 0.45350 0.15779 2.874 0.00405 **
            timemorning 1.39891 0.17494 7.996 1.28e-15 ***
            ---
            Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


            The Intercept represents the afternoon, and we can see that both the evening and morning are significantly higher probability of adhering. There are a lot of details about logistic regression that I can't explain in this post, but t-tests assume that you have a conditionally normally-distributed dependent variable. Logistic regression models are more appropriate when you have dichotomous (0 vs. 1) outcomes like these. Most introductory statistics books will talk about the t-test, and a lot of introductory machine learning books will talk about logistic regression. I think Introduction to Statistical Learning: With Applications in R is great, and the authors posted the whole thing online: https://www-bcf.usc.edu/~gareth/ISL/ISLR%20First%20Printing.pdf



            I'm not as sure about good books for simulation studies; I learned just from messing around, reading what other people did, and from a graduate course I took on statistical computing (professor's materials are here: http://pj.freefaculty.org/guides/).



            Lastly, you can also simulate having no effect by setting all of the times to have the same probability:



            set.seed(1839)
            n <- 1000
            times <- c("morning", "afternoon", "evening")
            time <- sample(times, n, TRUE)
            adhere <- rbinom(n, 1, .6) # same for all times
            summary(glm(adhere ~ time, binomial))


            Which returns:



            Coefficients:
            Estimate Std. Error z value Pr(>|z|)
            (Intercept) 0.40306 0.10955 3.679 0.000234 ***
            timeevening -0.06551 0.15806 -0.414 0.678535
            timemorning 0.18472 0.15800 1.169 0.242360
            ---
            Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


            This shows no significant differences between the times, as we would expect from the probability being the same across times.






            share|cite|improve this answer



























              up vote
              4
              down vote













              General Comments




              • "I am in 10th grade and I am looking to simulate data for a machine learning science fair project." Awesome. I did not care at all about math in 10th grade; I think I took something like Algebra 2 that year...? I can't wait until you put me out of a job in a few years! I give some advice below, but: What are you trying to learn from this simulation? What are you already familiar with in statistics and machine learning? Knowing this would help me (and others) put together some more specific help.


              • Python is a very useful language, but I'm of the opinion that R is better for simulating data. Most of the books/blogs/studies/classes I've come across on simulating data (also what people call "Monte Carlo methods" to be fancy) are in R. The R language is known as being "by statisticians, for statisticians," and most academics—that rely on simulation studies to show their methods work—use R. A lot of cool functions are in the base R language (that is, no additional packages necessary), such as rnorm for a normal distribution, runif for the uniform distribution, rbeta for the beta distribution, and so on. In R, typing in ?Distributions will show you a help page on them. However, there are many other cool packages like mvtnorm or simstudy that are useful. I would recommend DataCamp.com for learning R, if you only know Python; I think they are good for getting gently introduced to things


              • It seems like you have a lot going on here: You want data that are over time (longitudinal), within-subject (maybe using a multilevel model), and have a seasonal component to them (perhaps a time series model), all predicting a dichotomous outcome (something like a logistic regression). I think a lot of people starting out with simulation studies (including myself) want to throw a bunch of stuff in at once, but this can be really daunting and complicated. So what I would recommend doing is starting with something simple—maybe making a function or two for generating data—and then build up from there.



              Specific Comments



              It looks like your basic hypothesis is: "The time of the day predicts whether or not someone adheres to taking their medication." And you would like two create two simulated data sets: One where there is a relationship and one where there is not.



              You also mention simulating data to represent multiple observations from the same person. This means that each person would have their own probability of adherence as well as, perhaps, their own slope for the relationship between time of day and probability of adhering. I would suggest looking into "multilevel" or "hierarchical" regression models for this type of relationship, but I think you could start simpler than this.



              Also, you mention a continuous relationship between time and probability of adhering to the medication regimen, which also makes me think time series modeling—specifically looking at seasonal trends—would be helpful for you. This is also simulate-able, but again, I think we can start simpler.



              Let's say we have 1000 people, and we measure whether or not they took their medicine only once. We also know if they were assigned to take it in the morning, afternoon, or evening. Let's say that taking the medicine is 1, not taking it is 0. We can simulate dichotomous data using rbinom for draws from a binomial distribution. We can set each person to have 1 observation with a given probability. Let's say people are 80% likely to take it in the morning, 50% in afternoon, and 65% at night. I paste the code below, with some comments after #:



              set.seed(1839) # this makes sure the results are replicable when you do it
              n <- 1000 # sample size is 1000
              times <- c("morning", "afternoon", "evening") # create a vector of times
              time <- sample(times, n, TRUE) # create our time variable

              # make adherence probabilities based on time
              adhere_prob <- ifelse(
              time == "morning", .80,
              ifelse(
              time == "afternoon", .50, .65
              )
              )

              # simulate observations from binomial distribution with those probabilities
              adhere <- rbinom(n, 1, adhere_prob)

              # run a logistic regression, predicting adherence from time
              model <- glm(adhere ~ time, family = binomial)
              summary(model)


              This summary shows, in part:



              Coefficients:
              Estimate Std. Error z value Pr(>|z|)
              (Intercept) 0.02882 0.10738 0.268 0.78839
              timeevening 0.45350 0.15779 2.874 0.00405 **
              timemorning 1.39891 0.17494 7.996 1.28e-15 ***
              ---
              Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


              The Intercept represents the afternoon, and we can see that both the evening and morning are significantly higher probability of adhering. There are a lot of details about logistic regression that I can't explain in this post, but t-tests assume that you have a conditionally normally-distributed dependent variable. Logistic regression models are more appropriate when you have dichotomous (0 vs. 1) outcomes like these. Most introductory statistics books will talk about the t-test, and a lot of introductory machine learning books will talk about logistic regression. I think Introduction to Statistical Learning: With Applications in R is great, and the authors posted the whole thing online: https://www-bcf.usc.edu/~gareth/ISL/ISLR%20First%20Printing.pdf



              I'm not as sure about good books for simulation studies; I learned just from messing around, reading what other people did, and from a graduate course I took on statistical computing (professor's materials are here: http://pj.freefaculty.org/guides/).



              Lastly, you can also simulate having no effect by setting all of the times to have the same probability:



              set.seed(1839)
              n <- 1000
              times <- c("morning", "afternoon", "evening")
              time <- sample(times, n, TRUE)
              adhere <- rbinom(n, 1, .6) # same for all times
              summary(glm(adhere ~ time, binomial))


              Which returns:



              Coefficients:
                          Estimate Std. Error z value Pr(>|z|)
              (Intercept)  0.40306    0.10955   3.679 0.000234 ***
              timeevening -0.06551    0.15806  -0.414 0.678535
              timemorning  0.18472    0.15800   1.169 0.242360
              ---
              Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


              This shows no significant differences between the times, as we would expect from the probability being the same across times.
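
              One caveat worth sketching: a single simulated data set is only one random draw, so whether it comes out "significant" is itself a matter of chance. A minimal way to see this, assuming the same three time slots and probabilities as above, is to repeat the simulation many times and count how often p < .05. The helper sim_p is a hypothetical name, and it uses one overall likelihood-ratio test from anova() rather than the individual coefficient tests shown above.

```r
# Hypothetical helper: simulate one data set and return the p-value of an
# overall test of whether time predicts adherence
sim_p <- function(n, probs) {
  time <- sample(names(probs), n, TRUE)      # random time slot per person
  adhere <- rbinom(n, 1, probs[time])        # 0/1 adherence for each person
  model <- glm(adhere ~ time, family = binomial)
  # likelihood-ratio test against an intercept-only model
  anova(model, test = "Chisq")[2, "Pr(>Chi)"]
}

set.seed(1839)
effect    <- c(morning = .80, afternoon = .50, evening = .65)
no_effect <- c(morning = .60, afternoon = .60, evening = .60)

# share of 200 simulated data sets with p < .05 in each scenario
power     <- mean(replicate(200, sim_p(1000, effect)) < .05)
false_pos <- mean(replicate(200, sim_p(1000, no_effect)) < .05)
c(power = power, false_pos = false_pos)
```

              With an effect this large and n = 1000, the first share should be close to 1. With no effect, the second should hover around .05, meaning that roughly 1 in 20 "no effect" data sets will still look significant by chance.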






                edited 1 hour ago

























                answered 1 hour ago









                Mark White
