Quantcast
Channel: CodeSection,代码区,Python开发技术文章_教程 - CodeSec
Viewing all articles
Browse latest Browse all 9596

NumPy Tutorial: Data analysis with Python

$
0
0

NumPy is a commonly used python data analysis package. By using NumPy, you can speed up your workflow, and interface with other packages in the Python ecosystem, like scikit-learn , that use NumPy under the hood. NumPy was originally developed in the mid 2000s, and arose from an even older package called Numeric. This longevity means that almost every data analysis or machine learning package for Python leverages NumPy in some way.

In this tutorial, we’ll walk through using NumPy to analyze data on wine quality. The data contains information on various attributes of wines, such as pH and fixed acidity , along with a quality score between 0 and 10 for each wine. The quality score is the average of at least 3 human taste testers. As we learn how to work with NumPy, we’ll try to figure out more about the perceived quality of wine.


NumPy Tutorial: Data analysis with Python

The wines we'll be analyzing are from the Minho region of Portugal.

The data was downloaded from the UCI Machine Learning Repository , and is available here . Here are the first few rows of the winequality-red.csv file, which we’ll be using throughout this tutorial:

"fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality" 7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5 7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5

The data is in what I’m going to call ssv (semicolon separated values) format each record is separated by a semicolon ( ; ), and rows are separated by a new line. There are 1600 rows in the file, including a header row, and 12 columns.

Before we get started, a quick version note we’ll be using Python 3.5 . Our code examples will be done using Jupyter notebook .

If you want to jump right into a specific area, here are the topics:

Lists Of Lists for CSV Data

Before using NumPy, we’ll first try to work with the data using Python and the csv package. We can read in the file using the csv.reader object, which will allow us to read in and split up all the content from the ssv file.

In the below code, we:

Import the csv library. Open the winequality-red.csv file. With the file open, create a new csv.reader object. Pass in the keyword argument delimiter=";" to make sure that the records are split up on the semicolon character instead of the default comma character. Call the list type to get all the rows from the file. Assign the result to wines . In[1]:

import csv with open("winequality-red.csv", 'r') as f: wines = list(csv.reader(f, delimiter=";"))

Once we’ve read in the data, we can print out the first 3 rows:

In[3]: print(wines[:3]) [['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol', 'quality'], ['7.4', '0.7', '0', '1.9', '0.076', '11', '34', '0.9978', '3.51', '0.56', '9.4', '5'], ['7.8', '0.88', '0', '2.6', '0.098', '25', '67', '0.9968', '3.2', '0.68', '9.8', '5']]

The data has been read into a list of lists. Each inner list is a row from the ssv file. As you may have noticed, each item in the entire list of lists is represented as a string, which will make it harder to do computations.

We’ll format the data into a table to make it easier to view:

fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality 7.4 0.70 0 1.9 0.076 11 34 0.9978 3.51 0.56 9.4 5 7.8 0.88 0 2.6 0.098 25 67 0.9968 3.20 0.68 9.8 5

As you can see from the table above, we’ve read in three rows, the first of which contains column headers. Each row after the header row represents a wine. The first element of each row is the fixed acidity , the second is the volatile acidity , and so on. We can find the average quality of the wines. The below code will:

Extract the last element from each row after the header row. Convert each extracted element to a float. Assign all the extracted elements to the list qualities . Divide the sum of all the elements in qualities by the total number of elements in qualities to the get the mean. In[4]: qualities = [float(item[-1]) for item in wines[1:]] sum(qualities) / len(qualities) Out[4]:

5.6360225140712945

Although we were able to do the calculation we wanted, the code is fairly complex, and it won’t be fun to have to do something similar every time we want to compute a quantity. Luckily, we can use NumPy to make it easier to work with our data.

Numpy 2-Dimensional Arrays With NumPy, we work with multidimensional arrays. We’ll dive into all of the possible types of multidimensional arrays later on, but for now, we’ll focus on 2-dimensional arrays. A 2-dimensional array is also known as a matr

Viewing all articles
Browse latest Browse all 9596

Latest Images

Trending Articles