
Basic Machine Learning in Python with Scikit-learn


Machine learning has become a hot topic in the last few years, and for good reason: it provides data analysts with efficient ways of extracting information from data, allowing it to be used for analysis and modeling purposes.

The Scikit-learn Python library has implementations of dozens of learning algorithms and is freely available for academic and commercial use under the terms of the BSD licence. Some of these algorithms can be extremely useful for our job as water systems analysts, so given the overwhelming number of algorithms implemented in Scikit-learn, I thought I would mention a few I find particularly useful for my research. For each method below I included links with examples from Scikit-learn's website. Installation and use instructions can be found on their website.
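(For quick reference, installing from PyPI is typically just a matter of running pip install scikit-learn, though the exact command may depend on how your Python environment is set up.)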

CART Trees

CART trees can be used for regression or classification. Like any tree, CART trees are considered poor (generally high-variance) classifiers unless bootstrapped or boosted (see supervised learning), but the resulting rules are easily interpretable.

CART Trees: http://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart
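As a minimal sketch of fitting and inspecting a CART classification tree (using the Iris data set shipped with Scikit-learn rather than water systems data, and parameters chosen purely for illustration):

# Minimal CART classification tree sketch using Scikit-learn's built-in Iris data
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Load example data and hold out a test set
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Limiting the depth keeps the resulting rules short and easy to interpret
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)

print("Test accuracy:", tree.score(X_test, y_test))
# Print the learned if/else rules, which are the main appeal of CART trees
print(export_text(tree, feature_names=load_iris().feature_names))

The printed rules are what make trees attractive when interpretability matters more than raw predictive accuracy.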

Dimensionality reduction

Principal Component Analysis (PCA) is perhaps the most widely used dimensionality reduction technique. It works by finding the basis that maximizes the data's variance, allowing for the elimination of axes that have low variances. Among its uses are noise reduction, data visualization (as it preserves the distances between data points), and improvement of the computational efficiency of other algorithms by getting rid of redundant information. PCA can be used in its pure form, or it can be kernelized to handle data sets whose variance is maximum in a non-linear direction. Manifold learning is another way of performing dimensionality reduction, by unwinding the lower-dimensional manifold where the information lies. A short sketch follows the links below.

PCA: http://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_3d.html#sphx-glr-auto-examples-decomposition-plot-pca-3d-py

Kernel PCA: http://scikit-learn.org/stable/auto_examples/decomposition/plot_kernel_pca.html#sphx-glr-auto-examples-decomposition-plot-kernel-pca-py

Manifold learning: http://scikit-learn.org/stable/auto_examples/manifold/plot_compare_methods.html#sphx-glr-auto-examples-manifold-plot-compare-methods-py
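As a rough illustration of that workflow (again on the Iris data rather than a water systems data set, with illustrative parameters):

# Sketch: reduce a data set to its two leading principal components
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Standardizing first is usually advisable so no single variable dominates the variance
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)

# Fraction of the total variance captured by the two retained components
print("Explained variance ratio:", pca.explained_variance_ratio_)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()

Swapping PCA for sklearn.decomposition.KernelPCA (e.g. with kernel='rbf') gives the kernelized variant mentioned above.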

Clustering

Clustering is used to group similar points in a data set. One example is the problem of finding customer niches based on the products each customer buys. The most famous clustering algorithm is k-means, which, as any other machine learning algorithm, works well on some data sets but not on others. There are several alternative algorithms, all of which are exemplified in the following two links:

Clustering algorithmscomparison: http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html#sphx-glr-auto-examples-cluster-plot-cluster-comparison-py

Gaussian Mixture Models (finds similar results to k-means but also provides variances): http://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_covariances.html#sphx-glr-auto-examples-mixture-plot-gmm-covariances-py

Reducing the dimensionality of a data set with PCA or kernel PCA may speed up clustering algorithms, as sketched below.
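A small sketch combining those ideas, running k-means on PCA-reduced data and comparing it with a Gaussian mixture model (synthetic data for illustration, not a real customer or water systems data set):

# Sketch: PCA followed by k-means and a Gaussian mixture model on synthetic blobs
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Synthetic data with 4 clusters in 10 dimensions
X, _ = make_blobs(n_samples=500, n_features=10, centers=4, random_state=0)

# Reduce dimensionality first, which can speed up the clustering step
X_reduced = PCA(n_components=3).fit_transform(X)

kmeans_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_reduced)

# The Gaussian mixture gives soft assignments plus per-cluster covariances
gmm = GaussianMixture(n_components=4, covariance_type='full', random_state=0).fit(X_reduced)
gmm_labels = gmm.predict(X_reduced)

print("k-means labels:", kmeans_labels[:10])
print("GMM labels:    ", gmm_labels[:10])
print("GMM covariance array shape:", gmm.covariances_.shape)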

Supervised learning

Supervised learning algorithms can be used for regression or classification problems (e.g. classifying a point as pass/fail) based on labeled data sets. The most "trendy" one nowadays is neural networks, but support vector machines, boosted and bagged trees, and others are also options that should be considered and tested on your data set (a small comparison sketch follows the links below). Below are links to some of the supervised learning algorithms implemented in Scikit-learn:

Comparison between supervised learning algorithms : http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html#sphx-glr-auto-examples-classification-plot-classifier-comparison-py

Neural networks: http://scikit-learn.org/stable/modules/neural_networks_supervised.html
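As a hedged sketch of that kind of comparison (synthetic two-class data; the classifiers and parameters are just illustrative defaults, not a recommendation):

# Sketch: compare a few supervised classifiers on the same synthetic data set
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "SVM (RBF kernel)": SVC(),
    "Random forest (bagged trees)": RandomForestClassifier(random_state=0),
    "AdaBoost (boosted trees)": AdaBoostClassifier(random_state=0),
    "Neural network (MLP)": MLPClassifier(max_iter=1000, random_state=0),
}

# Fit each model and report held-out accuracy; no single method wins on every data set
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: {model.score(X_test, y_test):.3f}")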

Gaussian processes are also a supervised learning method (for regression) and can also be used for Bayesian optimization:

Gaussian processes: http://scikit-learn.org/stable/auto_examples/gaussian_process/plot_gpr_noisy_targets.html#sphx-glr-auto-examples-gaussian-process-plot-gpr-noisy-targets-py
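A minimal Gaussian process regression sketch on a noisy 1-D toy function (the kernel choice here is just an illustrative default):

# Sketch: Gaussian process regression on a noisy 1-D toy function
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(30, 1))
y = np.sin(X).ravel() + 0.1 * rng.randn(30)

# RBF kernel plus a white-noise term to account for the noisy observations
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, random_state=0).fit(X, y)

# Predictive mean and standard deviation; the uncertainty estimate is
# what Bayesian optimization exploits when choosing where to sample next
X_new = np.linspace(0, 10, 5).reshape(-1, 1)
mean, std = gp.predict(X_new, return_std=True)
print(np.c_[X_new.ravel(), mean, std])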

