Coursera: Introduction to Apache Spark and AWS

with Dr Sorrel Harriet and Christophe Rhodes
Learn to analyze big data using Apache Spark's distributed computing framework.

In a series of focused, practical tasks, you will start by launching a Spark cluster on Amazon's EC2 cloud computing platform. As you progress to working with real data, you will gain exposure to a variety of useful tools, including RDFLib and SPARQL.

The practical tasks on this course make use of data from the Gutenberg Project, the world's largest open collection of ebooks. This offers no end of opportunity for highly engaging and novel analyses.

As the taught material and example code are given in Python, it is strongly recommended that all students have previous Python programming experience. Furthermore, launching and interacting with a cluster on EC2 requires basic knowledge of the Unix command line, and some experience with a command-line editor such as vim or nano would also be advantageous.

With these minimal prerequisites, this course is designed to get you up and running in Spark as quickly and painlessly as possible, so that by the end, you will be comfortable and competent enough to start engineering your own big data solutions.


Getting Started in Spark on EC2
This week, you'll gain essential background knowledge along with the practical skills needed to run applications in Apache Spark. You'll also take the steps necessary to launch a Spark cluster on the Amazon EC2 cloud computing platform.

Reading and Writing Data
This week you'll learn how to read and write data in Spark. The techniques you'll be shown can be used with data stored locally, or in conjunction with the Amazon S3 cloud storage service. To help get you started, we'll also show you how to upload a subset of the Gutenberg Project dataset to Amazon S3.

Tools for Working with Data
This week you'll get to grips with some useful tools in preparation for working with the Gutenberg Project dataset. In this week's assessment, you will exercise your data wrangling skills to produce a catalogue index file from the Gutenberg Project metadata, a resource that should prove useful in your final assessment.

Programming in Spark
This week you'll learn Spark programming in more detail, in preparation for working with the Gutenberg collection of ebooks. The topics covered should help you write more efficient Spark applications.

Cost: Free Online Course (Audit)
Pace: Upcoming
Subject: Big Data
Provider: Coursera
Language: English
Certificates: Paid Certificate Available
Hours: 3-6 hours a week
Duration: 4 weeks long