Channel: CodeSection,代码区,Python开发技术文章_教程 - CodeSec

Data Mining Coursera for Popular Courses with Python


In just a few years, massive open online courses, commonly known as MOOCs, have exploded in popularity. These online courses provide high-quality content taught by leading professors in their respective fields. With more than 20 million students, Coursera is one of the leaders of the MOOC movement. It provides a platform that connects leading universities with students worldwide. I took my first Coursera course back in 2012, and since then I have tried to take one course a month.

As of 2015, Coursera offers around 1000 courses from 100+ universities in more than 10 languages. With so many courses, it has become difficult to decide which one to take.

In this tutorial, I will combine Coursera data with social media data to assess the popularity of courses. To do this, I will use the Coursera API to retrieve the course catalogue, the sharecount.com API to get social media metrics for each course, and Python's pandas library to query and order the courses by popularity.



The source code for this tutorial can be found in this GitHub repository.

1. Getting courses data

Coursera provides an API for accessing the courses, universities, categories, instructors and sessions data. For this tutorial, we will be using the courses, universities and categories data.

In a Python shell, we start by importing the following three libraries: urllib2, json, and pandas. Both urllib2 and json are part of the Python standard library and don't need to be installed separately.

In[1]:

import urllib2
import json
import pandas as pd

Next, we access the coursera API and download the course catalogue. For each course, we are interested in 3 fields:

shortName: The short name associated with the course.
name: The course name or title.
language: The language code for the course (e.g. 'en' means English).

We also include the universities and categories parameters in the query. This returns the ids that link each course to its corresponding universities and categories. Below are the Python commands to do so.

In[2]:

courses_response = urllib2.urlopen('https://api.coursera.org/api/catalog.v1/courses?fields=shortName,name,language&includes=universities,categories')
courses_data = json.load(courses_response)
courses_data = courses_data['elements']
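The commands above rely on urllib2, which exists only in Python 2. As a rough sketch of the same request under Python 3 (an assumption, since the tutorial itself targets Python 2), the URL can be assembled from its parts and fetched with urllib.request. The helper names build_catalog_url and fetch_elements are hypothetical, and the sketch assumes the endpoint responds in the format shown in this tutorial.

```python
import json
from urllib.request import urlopen

# Base URL taken from the commands in this tutorial.
API_ROOT = 'https://api.coursera.org/api/catalog.v1'


def build_catalog_url(resource, fields=(), includes=()):
    """Assemble a catalog URL (hypothetical helper, not part of any library)."""
    params = []
    if fields:
        params.append('fields=' + ','.join(fields))
    if includes:
        params.append('includes=' + ','.join(includes))
    url = API_ROOT + '/' + resource
    if params:
        url += '?' + '&'.join(params)
    return url


def fetch_elements(url):
    """Download a catalog URL and return its 'elements' list.

    Assumes the response is a JSON object with an 'elements' key,
    as in the outputs shown in this tutorial.
    """
    with urlopen(url) as response:
        return json.loads(response.read().decode('utf-8'))['elements']


# Same query as In[2]: three fields plus the linked university/category ids.
courses_url = build_catalog_url(
    'courses',
    fields=['shortName', 'name', 'language'],
    includes=['universities', 'categories'],
)
```

Calling fetch_elements(courses_url) would then play the role of the three commands in In[2], provided the service is reachable.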

To get the data about the first course in the courses_data list, we simply execute the command below.

In[3]:

courses_data[0]

Out[3]:

{u'id': 2163,
 u'language': u'en',
 u'links': {u'categories': [8, 10, 19, 20], u'universities': [65]},
 u'name': u'The Land Ethic Reclaimed: Perceptive Hunting, Aldo Leopold, and Conservation',
 u'shortName': u'perceptivehunting'}

The first course in the courses_data list is 'The Land Ethic Reclaimed: Perceptive Hunting, Aldo Leopold, and Conservation', offered in English by the university with id=65, under categories 8, 10, 19 and 20.
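To make the record structure concrete, here is a minimal Python 3 sketch that reads the linked ids out of a single course record. The record is copied from Out[3]; the helper linked_ids is a hypothetical name introduced for illustration, and it falls back to an empty list when a record carries no links of the requested kind.

```python
# Sample record copied from Out[3] in this tutorial.
course = {
    'id': 2163,
    'language': 'en',
    'links': {'categories': [8, 10, 19, 20], 'universities': [65]},
    'name': 'The Land Ethic Reclaimed: Perceptive Hunting, '
            'Aldo Leopold, and Conservation',
    'shortName': 'perceptivehunting',
}


def linked_ids(record, kind):
    """Return the ids under record['links'][kind], or [] if absent.

    Hypothetical helper, not part of the Coursera API.
    """
    return record.get('links', {}).get(kind, [])


university_ids = linked_ids(course, 'universities')
category_ids = linked_ids(course, 'categories')
# A record with an empty 'links' dictionary yields an empty list.
missing_ids = linked_ids({'id': 0, 'links': {}}, 'universities')
```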

Next, we retrieve the universities and categories data from the Coursera API. For the universities data, we are interested in the university name and its location.

In[4]:

universities_response = urllib2.urlopen('https://api.coursera.org/api/catalog.v1/universities?fields=name,locationCountry')
universities_data = json.load(universities_response)
universities_data = universities_data['elements']

We can get the data about the first university from universities_data by executing the following command:

In[5]:

universities_data[0]

Out[5]:

{u'id': 234,
 u'links': {},
 u'locationCountry': u'CN',
 u'name': u"Xi'an Jiaotong University",
 u'shortName': u'xjtu'}

Similarly, we can get the course category data by executing the following commands.

In[6]:

categories_response = urllib2.urlopen('https://api.coursera.org/api/catalog.v1/categories')
categories_data = json.load(categories_response)
categories_data = categories_data['elements']

We can get the data about the first category from categories_data by executing the following command:

In[7]:

categories_data[0]

Out[7]:

{u'id': 5,
 u'links': {},
 u'name': u'Mathematics',
 u'shortName': u'math'}

2. Structuring the data

In this section, we will structure courses_data, universities_data and categories_data into pandas DataFrames, and we will map the university and category ids to the corresponding names. By the end of this section, we will have one pandas DataFrame called courses_df that holds all the necessary data in a well-structured format.
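One possible way to picture the id-to-name mapping is a pair of plain dictionaries keyed by id. The Python 3 sketch below is an illustration under that assumption and is not necessarily the approach taken later in the tutorial. The sample records are copied from Out[5] and Out[7], and names_by_id and ids_to_names are hypothetical helper names.

```python
# Sample records copied from Out[5] and Out[7] in this tutorial.
universities_data = [
    {'id': 234, 'links': {}, 'locationCountry': 'CN',
     'name': "Xi'an Jiaotong University", 'shortName': 'xjtu'},
]
categories_data = [
    {'id': 5, 'links': {}, 'name': 'Mathematics', 'shortName': 'math'},
]


def names_by_id(records):
    """Build an {id: name} lookup from a list of API records."""
    return {record['id']: record['name'] for record in records}


def ids_to_names(ids, lookup):
    """Translate ids to names, skipping ids missing from the lookup."""
    return [lookup[i] for i in ids if i in lookup]


university_names = names_by_id(universities_data)
category_names = names_by_id(categories_data)
```

With the full catalogue loaded, ids_to_names([65], university_names) would return the names of the universities linked to the first course.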

2.1. Putting the data into Pandas DataFrames

First, we start by creating a pandas DataFrame for the courses data.

In[8]:

courses_df = pd.DataFrame()

Next, we add the course_name, course_language, course_short_name, categories and universities columns to the courses_df DataFrame.

In[9]:

courses_df['course_name'] = map(lambda course_data: course_data['name'], courses_data)
courses_df['course_language'] = map(lambda course_data: course_data['language'], courses_data)
courses_df['course_short_name'] = map(lambda course_data: course_data['shortName'], courses_data)
courses_df['categories'] = map(
    lambda course_data: course_data['links']['categories']
    if 'categories' in course_data['links'] else [],
    courses_data)
courses_df['universities'] = map(
    lambda course_data: course_data['links']['universities']
    if 'universities' in course_data['links'] else [],
    courses_data)
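In Python 3, map returns an iterator rather than a list, so building explicit lists is the safer input for a DataFrame column. As a hedged alternative to the commands above, the sketch below collects the same five values per course into plain dictionaries. The function name course_rows is hypothetical, and the sample record is copied from Out[3].

```python
def course_rows(courses_data):
    """Flatten API course records into one dictionary per course.

    Hypothetical helper: keys mirror the column names used for courses_df.
    """
    rows = []
    for course in courses_data:
        # Fall back to an empty dict/list when link data is missing.
        links = course.get('links', {})
        rows.append({
            'course_name': course['name'],
            'course_language': course['language'],
            'course_short_name': course['shortName'],
            'categories': links.get('categories', []),
            'universities': links.get('universities', []),
        })
    return rows


# Sample record copied from Out[3] in this tutorial.
sample_courses = [{
    'id': 2163,
    'language': 'en',
    'links': {'categories': [8, 10, 19, 20], 'universities': [65]},
    'name': 'The Land Ethic Reclaimed: Perceptive Hunting, '
            'Aldo Leopold, and Conservation',
    'shortName': 'perceptivehunting',
}]

rows = course_rows(sample_courses)
```

Passing the result to pd.DataFrame(rows) should then give a DataFrame with the same five columns as courses_df.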

We can print the first 5 rows from the courses_df DataFrame by executing the command below.

In[10]:

courses_df.head()

Out[10]:

   course_name                                        course_language  course_short_name  categories       universities
0  The Land Ethic Reclaimed: Perceptive Hunting, ...  en               perceptivehunting  [8, 10, 19, 20]  [65]
1  Contraception: Choices, Culture and Consequences   en               contraception      [3, 8]           [10]
2
