The Introduction to Data Science class will survey the foundational topics in data science, namely:
* Data Manipulation
* Data Analysis with Statistics and Machine Learning
* Data Communication with Information Visualization
* Data at Scale -- Working with Big Data
The class will focus on breadth and present the topics briefly instead of focusing on a single topic in depth. This will give you the opportunity to sample and apply the basic techniques of data science.
Why Take This Course?
You will have an opportunity to work through a data science project end to end, from analyzing a dataset to visualizing and communicating your analysis.
Through working on the class project, you will be exposed to, and come to understand, the skills needed to become a data scientist yourself.
### Lesson 1: Introduction to Data Science
- Introduction to Data Science
- What is a Data Scientist
- Pi-Chuan (Data Scientist @ Google): What is Data Science?
- Gabor (Data Scientist @ Twitter): What is Data Science?
- Problems Solved by Data Science
- Create a New Dataframe
### Lesson 2: Data Wrangling
- What is Data Wrangling?
- Acquiring Data
- Common Data Formats
- What are Relational Databases?
- Aadhaar Data
- Aadhaar Data and Relational Databases
- Introduction to Database Schemas
- Data in JSON Format
- How to Access an API Efficiently
- Missing Values
- Easy Imputation
- Impute using Linear Regression
- Tip of the Imputation Iceberg
### Lesson 3: Data Analysis
- Statistical Rigor
- Kurt (Data Scientist @ Twitter) - Why is Stats Useful?
- Introduction to Normal Distribution
- T Test
- Welch T Test
- Non-Parametric Tests
- Non-Normal Data
- Stats vs. Machine Learning
- Different Types of Machine Learning
- Prediction with Regression
- Cost Function
- How to Minimize Cost Function
- Coefficients of Determination
### Lesson 4: Data Visualization
- Effective Information Visualization
- Napoleon's March on Russia
- Don (Principal Data Scientist @ AT&T): Communicating Findings
- Rishiraj (Principal Data Scientist @ AT&T): Communicating Findings Well
- Visual Encodings
- Perception of Visual Cues
- Plotting in Python
- Data Scales
- Visualizing Time Series Data
### Lesson 5: MapReduce
- Big Data and MapReduce
- Basics of MapReduce
- MapReduce with Aadhaar Data
- MapReduce with Subway Data
MOOC stands for Massive Open Online Course. These are free online courses from universities around the world (e.g. Stanford, Harvard, MIT) offered to anyone with an internet connection.
How do I register?
To register for a course, click the "Go to Class" button on the course page. This will take you to the provider's website, where you can register for the course.
How do these MOOCs or free online courses work?
MOOCs are designed for an online audience and teach primarily through short (5-20 min.) prerecorded video lectures, which you watch on a weekly schedule whenever it is convenient for you. They also have student discussion forums, homework assignments, and online quizzes or exams.
Intro to Data Science is an intermediate-level course that assumes basic Python programming skills and knowledge of statistics. The course focuses on gathering, manipulating, analyzing and visualizing data using Python and various Python packages such as numpy, scipy and pandas. One of the best parts of this course is getting exposure to some Python packages in the scipy stack, although I wish more time had been devoted to explaining what the various modules in the scipy stack do, how to set them up at home and when to use them.
The first lesson was a fairly gentle introduction with an interesting homework project dealing with data from the Titanic disaster. Lesson 2 goes into more detail about gathering and cleaning data using Pandas and an additional module that lets you make SQL queries to extract data from Pandas data frames. Lesson 3 jumps into data analysis with a T test and linear regression using gradient descent. Going from basic data manipulation into these topics was a bit jarring in terms of difficulty, and more time could have been spent explaining how the functions worked. I left without a great appreciation of what gradient descent is really doing. Lesson 4 is focused on making visualizations using a module that attempts to port the functionality of the R language's ggplot2 plotting package. Finally, lesson 5 introduces the concept of big data and MapReduce as a solution for dealing with large data sets. Each homework assignment after the first has students working with New York subway turnstile data, which lets them build familiarity with the data throughout the course. This was a very good decision, since it lets students focus on learning new concepts rather than spending time familiarizing themselves with new data sets over and over again.
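For readers left with the same question about gradient descent, a minimal sketch may help. This is not the course's own code: the function name, the parameters `alpha` and `num_iterations`, and the toy data are all assumptions chosen for illustration. The idea is to start with all weights at zero and repeatedly nudge them in the direction that lowers the squared-error cost.

```python
import numpy as np

def gradient_descent(features, values, alpha=0.1, num_iterations=1000):
    """Fit linear-regression weights by batch gradient descent (illustrative)."""
    m = len(values)
    theta = np.zeros(features.shape[1])  # start with all weights at zero
    cost_history = []
    for _ in range(num_iterations):
        errors = features @ theta - values              # prediction minus target
        cost_history.append((errors ** 2).sum() / (2 * m))
        # Step against the gradient of the cost with respect to theta.
        theta -= (alpha / m) * (features.T @ errors)
    return theta, cost_history

# Toy data (hypothetical): y = 1 + 2x with no noise.
x = np.linspace(0, 1, 50)
features = np.column_stack([np.ones_like(x), x])  # column of ones for the intercept
values = 1.0 + 2.0 * x

theta, cost_history = gradient_descent(features, values, alpha=0.5, num_iterations=5000)
print(theta)  # approaches the true intercept 1 and slope 2
```

Plotting `cost_history` against the iteration number is one way to see what the algorithm is doing: with a suitably small `alpha` the cost falls on every step, and with too large a value it diverges.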
Shahrukh Ahmed partially completed this course, spending 5 hours a week on it, and found the course difficulty to be easy.
Though the course uses interesting examples to teach data science concepts, its overreliance on the online grader for practice often makes the learning feel redundant. A big part of learning programming is experimentation, which the grader does not allow for.