This course is now part of two independent MITx MicroMasters programs. For both MicroMasters programs, learners will need to first enroll in and pass this course. However, each program will then require different final assessments for a course certificate toward the full MicroMasters credential: 1. MicroMasters in Data, Economics, and Development Policy (DEDP). To pursue the DEDP MicroMasters credential, pass this course, create a MicroMasters in DEDP profile, and pass an additional in-person proctored exam.

To learn more about the DEDP program and how it integrates with MIT’s new blended Master’s degree, please visit https://micromasters.mit.edu/dedp/.

Complete all 4 courses and the capstone exam in the SDS program to accelerate your path towards graduate studies at MIT or other universities. To learn more, please visit https://micromasters.mit.edu/ds/.

This statistics and data analysis course will introduce you to the essential notions of probability and statistics. We will cover techniques in modern data analysis: estimation, regression and econometrics, prediction, experimental design, randomized control trials (and A/B testing), machine learning, and data visualization. We will illustrate these concepts with applications drawn from real world examples and frontier research. Finally, we will provide instruction for how to use the statistical package R and opportunities for students to perform self-directed empirical analyses.

This course is designed for anyone who wants to learn how to work with data and communicate data-driven findings effectively.

Course Previews:

Our course previews are meant to give prospective learners the opportunity to get a taste of the content and exercises that will be covered in each course. If you are new to these subjects, or eager to refresh your memory, each course preview also includes some available resources. These resources may also be useful to refer to over the course of the semester.

A score of 60% or above in the course previews indicates that you are ready to take the course, while a score below 60% indicates that you should further review the concepts covered before beginning the course.

Understanding randomization in the context of experimentation

Introduction to nonparametric regression techniques

MODULE 9: SINGLE AND MULTIVARIATE LINEAR MODELS

In-depth discussion of the linear model and the multivariate linear model

MODULE 10: PRACTICAL ISSUES IN RUNNING REGRESSIONS, AND OMITTED VARIABLE BIAS

Covariates, fixed effects, and other functional forms

Introduction to regression discontinuity design

MODULE 11: INTRO TO MACHINE LEARNING AND DATA VISUALIZATION

Use of machine learning for prediction, including tuning and training

Principles of data visualization

MODULE 12: ENDOGENEITY, INSTRUMENTAL VARIABLES, AND EXPERIMENTAL DESIGN

Understanding endogeneity problems and an introduction to instrumental variables and two stage least squares, and assessing the validity of an instrument

Designing an effective experiment with a case study from Indonesia

James completed this course, spending 12 hours a week on it and found the course difficulty to be medium.

Writing a review for this course is hard. The content of the course is ambitious and the promise is considerable. I am grateful that the Professors and MIT have made this course available online. That being said, I find it hard to recommend this course.

As an overview, each week contains 2-3 lectures, mostly probability mixed with some stats, with 'finger exercises (FEs)' at the end of each lecture segment to test knowledge. At the end of each week there is a more in-depth set of questions covering all the material and some more practical aspects with R. Here is a quick summary of the different course aspects:

Lectures - Some students felt that the MOOC course was just a video of the in-person lectures, without the MOOC in mind, but I felt this was one of the benefits i.e. identical coverage

FE - good tests of comprehension and linked to the lectures well. Some of the wording was a little confusing; at times it felt as if the questions were set up to trick you

Homework - this and particularly the sections using R, were more enjoyable and felt a little better thought out, at least when the grader worked correctly

Staff Support - both the staff and community TAs were great, one student commented ' I cannot believe the response time of the staff (responses to comments are so quick!). Thanks!'

Discussion - the discussion on the boards was initially lively and the discussion on each individual item, e.g. part of a lecture, was useful both as an aid to understanding and also to identify any human error or typos in the lecture, of which there were some. Over time, however, it seemed like students were either leaving the course or getting frustrated, so their use of the boards diminished

The course got off to a pretty good start over the first couple of weeks. The lectures are well structured and linked to the FEs. The introductory lecture by Professor Duflo was incredible and really whetted the appetite for the rest of the course. The use of R was introduced with a course-specific set of walkthroughs (modules) in the R package SWIRL. Whilst I'd heard of SWIRL before, I hadn't actually used it. Personally I found it fine, although a number of people on the course seemed to feel SWIRL was a little dated and that something like an interactive Jupyter or R Markdown document would have been better. There was an issue with the grader for some of the R code for the first homework, meaning 3 of the 20 questions were ungraded. Staff pinned an item on the discussion group as soon as this was identified, and students were also emailed. Overall, an interesting introduction, if a little frustrating due to the grader issues.

The following week, covering the Fundamentals of Probability, was good, albeit I reviewed some of the lecture sections more than once and tried to dust down some of my probability from the dark recesses of memory. The lecture and FEs were well structured and flowed logically, and the questions by students in the actual class and by those taking the online version were helpful. Some students felt that the content suffered from not using the same terminology as the MIT class 6.431x (Probability - The Science of Uncertainty and Data) and that the content of this course was more akin to a review of probability for those already familiar with it. Given this is supposed to be taken after that course, this may be a fair assessment. Some of the lecture parts, for instance the cumulative probability function, felt like overly complicated explanations of things which were intuitively quite straightforward. At one point the lecturer even noted 'intuitively they make complete sense', so it wasn't just me! There were multiple errors in the content, both in some of the lecture explanations and in the homework.

Next was an R lecture on Gathering and Collecting data. I thought this was a good introductory lecture and gave enough of an understanding about how, for instance, to web scrape in R using rvest. However, I suspect those already familiar with web scraping may find this a little too introductory. Those coming more from a programming background may find it useful to cover some of the other aspects the lectures covered.

This included things like the process of data collection, piloting questionnaires, having a data management plan and seeking ethical approval for the intended research. However, these concepts were really just touched upon rather than covered in any great depth. They did give pause for consideration.

In week three more problems started to emerge. Firstly, there was an error in a histogram on an FE question: four large bars which, when combined, appear to exceed 100%. The staff TA noted "Yes, you are right. This is embarrassing". The homework had multiple issues. In one example, the data used teenage fertility rates from the 1960s from the online World Bank data catalog. On review it appeared the underlying data had changed, so the grader was based on older and therefore different data, resulting in erroneous marking. A data file for the course content was missing and was provided by a Community TA from their own Google Drive, from a previous iteration of the course. There were multiple calls on the forum for a GitHub repository to be created for the course. Examples in R were often provided in the lectures, but the underlying R code was not available to students. One of the TAs had requested the R code from the lecturers, however it didn't arrive. On some occasions, the code was provided by other students.

By week four the course started to really feel like a slog. The lectures seemed to cover the material at too high a level, but at the same time required a lot of prior understanding of concepts. The first lecture was on Functions of Random Variables (RVs). It was demonstrated how RVs were calculated, but it felt a little too mathematical with little connection to the real world. At one point, the lecturer presents information on Probability Integral Transformation, then looks to the audience and notes "I'm seeing some puzzled looks". After a period of silence, a student asks the question possibly everyone is wondering: "could you do some examples of when this would be used?" I certainly felt this needed more relation to the real world. A further couple of questions from the lecturer are met with silence, until one student hazards a rather uncertain sounding response. After further explanation the lecturer rhetorically asks 'what's the application' and answers the question themselves, stating 'I happen to think it's pretty cool'. The discussions after the lectures were still there, but with many fewer students involved. There was much less engagement from students in the actual class. The TAs (staff and community) were still providing a good response and supporting students, however one could question whether such support should be required if the lectures stand up on their own.

One student noted, directing their comment at staff "I am afraid that there are just too many questions that are unclear, confusing, badly worded, etc. in this course. It wastes time and creates uncertainty in my mind when I'm looking at all questions now. I keep on having to ask myself if there could be a problem with the question itself. This shouldn't be happening on an MIT course!"

It was at the end of week four I had to decide not to invest any more time in the course. How could this course be improved?

1 - Be clearer about the pre-requisites - single and multivariable calculus plus an introductory level of probability and statistics is required

2 - Consistency with terminology - both with common practices and MIT's other courses

3 - Correct the errors - there were too many errors both in lectures and questions, this makes students doubt their own knowledge

4 - Provide R code - the course is supposed to cover both theory and R and providing the R code would make the course more tangible and applicable

5 - Provide more examples - the material was often too abstract for me, or the examples given were too simplistic and not related to the course objectives, namely providing skills for social scientists

I would like to come back and complete the course at a later date, but currently it feels a bit too ambitious and requires some polishing before being ready. It may be necessary to split the course over two semesters or two parts, if more examples and code are provided, to better achieve the course objectives.

I did not enjoy this course at all. Here are the main reasons why:

1) Lecture videos. The lecturers themselves might be masters in their respective fields, but the lecture videos are not suitable for an online course. The videos are literally from the actual MIT course. The most annoying thing about the videos is that the lecturers sometimes make reference to something shown on the board or the screen, but the video just shows one thing at a time: either the view of the lecturer or what is being projected on the screen. It feels weird and I don't think they really thought how students who are online would feel. Also, because the videos are recordings of a live class, there are students who ask questions, but most of the time you can't really hear what they are saying (the transcripts often just say "inaudible"). This is quite frustrating, as questions are asked often.

2) Quality of the material. I think they didn't explain many concepts properly. The lecturers would introduce concepts in a way that assumed we already knew what they were talking about. Notes that are shown on screen are also quite vague. I would like more detailed explanations and worked examples. Speaking of examples, the lecturers would do a few examples, but I think the choice of examples was sometimes too confusing. I would have liked to see more examples worked out properly. Also, R programming was only briefly covered in the lecture videos.

3) Quizzes and homework. The quizzes are literally just questions based directly on whatever the lecturers mentioned. While this tests if you actually watched the video, I feel like they could have added more questions requiring us to work out/calculate more things. It felt like these questions were made just for the sake of it. Many of the answer explanations are not very clear.

There are also a few other issues and problems I had with this course. I really hope they improve this course on the next iteration.

Dileep completed this course, spending 12 hours a week on it and found the course difficulty to be medium.

tl;dr - poorly put together MOOC trying to cram too many things into one course. Doesn't leave you with a lot of confidence that you can analyse data independently on big projects. Confusing approach without concrete examples and demos of how to run full tests in practice. I found it a frustrating experience.

This course tries to cover a lot of ground in very little time, so, the treatment is very superficial. In the end, what you're left with is a hodge-podge of techniques and methods with no real intuitive understanding of what they mean and without a solid understanding of how they are to be applied. It would be a much more useful course if it went into more depth and focused on a narrower range of topics. Many of the questions in the homework are vaguely worded (the answer keys have several issues as well) and/or there's a significant jump in difficulty compared to what's covered in the lectures. I often find myself struggling to extrapolate and apply what I've learned in class to homework problems and other courses/projects. Contrary to what the course advertises, it assumes some prior knowledge of probability theory and core ideas of statistics. Despite the impressive range of topics on paper, it feels like an unorganized and poorly thought out MOOC (I don't know about the on-campus version of the class), in that the goals for the class were not clearly laid out and executed. Think of this class as a 'survey' course of data analysis techniques relevant to the social sciences designed for people who already know some probability and statistics. That being said, I found the included R course (and supplemental resources) helpful as I hadn't used it earlier.

Paul is taking this course right now, spending 16 hours a week on it and found the course difficulty to be hard.

To say this class is thorough is an understatement. The lectures are extremely detailed, sometimes with additional detailed references(!), and it occasionally warrants going back and replaying one or two of the lectures before moving on. There is a good deal of statistics and probability review and training prior to getting to the "methods" of this class (around week 8). I recommend this course as I cannot imagine a better, more thorough treatment for the topic, taught by some of the "best" there are out there today in Economics and Statistics.

I was very excited about this course - its scope and the fact that it did not require any knowledge in statistics. That is not true: you should know some probability and statistics, otherwise you will not be able to keep up with the workload (or the classes, to be honest) and will drop out - like I did.

Will try again later, when I have gained some statistics knowledge.

Antonello is taking this course right now, spending 15 hours a week on it and found the course difficulty to be easy.

It is a strange "mixed beast" course.

In the end, I don't know what this course is good for. Too many different things (probability theory, programming in R, statistics...) are approached superficially and, most importantly, without even giving the "intuitions" behind what they are used for.

Not much added value, and honestly I don't know why it has been added to the list of required courses for the new MIT MicroMasters program in data science.

Ayse is taking this course right now, spending 15 hours a week on it and found the course difficulty to be very hard.

There is just too much theory in the course. I only wanted to learn some Data Analysis and possibly Machine Learning. But no, it just doesn't happen. I was excited about this course, but there is little practical value compared to the effort you need to spend on the coursework. Unless you are already good at multivariable calculus and probability theory at the level of this course (https://www.edx.org/course/probability-the-science-of-uncertainty-and-data), you will probably feel the same frustration as me.