Impatient Data Analysis Series: Introduction

Author: flying dutchman

Introduction

This blog was created to help impatient people, or people with impatient bosses or supervisors, learn to perform statistical analysis in a quick and dirty way. The blog covers:

- A bare-minimum introduction to mathematics and statistics
- An equally bare-minimum introduction to programming languages and concepts
- (Most important) Implementations of commonly used statistical models (machine learning or data mining models, if you prefer to call them that), including their mathematical details and their implementation in C(!)

Besides, I will also share miscellaneous material, including beautiful proofs, some newly developed (bad) marketing models, commentary, compiler construction, etc.

Why the haste

In my opinion, it is very important to learn both mathematics and programming systematically in order to be a good data analyst. However, few people, if any, have the luxury of completing a master's degree in both mathematics and computer science. In fact, I know many colleagues who were forced to learn data analysis from scratch within several months, if not weeks. Consequently, they had to implement everything in a quick and dirty way. While the resulting models and algorithms might be extremely immature, this does suggest another way to learn data analysis: mainly by studying and modifying existing implementations. It is certainly not the most desirable way to enter a field, but it does agree with my opinion that mathematics and programming are skills rather than pure accumulated knowledge, and skills can only be sharpened by practice.

Why C (and not R/Python)

R and Python are great languages for data analysis. However, their greatness lies mainly in the abundance of available packages. In this sense, learning to call routines written in R/Python is probably THE fastest way to present some data analysis results.

Unfortunately, the difficulty of statistical analysis usually lies in understanding the nature of the models, not in knowing how to call certain routines in R/Python. Because the internals of R/Python routines are usually completely hidden, calling them does not help much in understanding the models.
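To make the point concrete, here is a minimal sketch of what a one-line fitting call hides: a simple least-squares fit of y = a + b*x written out in C. The function name `fit_line` and its signature are assumptions made for this illustration and do not come from any library. In keeping with the quick-and-dirty spirit, the only check is an assertion on the sample size, and x is assumed not to be constant.

```c
#include <assert.h>
#include <stddef.h>

/* Fit y = intercept + slope * x by ordinary least squares.
   Hypothetical helper for illustration only. */
void fit_line(const double *x, const double *y, size_t n,
              double *intercept, double *slope)
{
    assert(n >= 2);

    /* Sample means of x and y. */
    double mean_x = 0.0, mean_y = 0.0;
    for (size_t i = 0; i < n; i++) {
        mean_x += x[i];
        mean_y += y[i];
    }
    mean_x /= (double)n;
    mean_y /= (double)n;

    /* Centered cross-product and sum of squares. */
    double sxy = 0.0, sxx = 0.0;
    for (size_t i = 0; i < n; i++) {
        double dx = x[i] - mean_x;
        sxy += dx * (y[i] - mean_y);
        sxx += dx * dx;
    }

    *slope = sxy / sxx;                     /* assumes x is not constant */
    *intercept = mean_y - *slope * mean_x;  /* line passes through the means */
}
```

Calling `fit_line` on the points (0, 1), (1, 3), (2, 5), (3, 7) gives a slope of 2 and an intercept of 1, since those points lie exactly on y = 1 + 2x.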

There are also other reasons for using C, including:

- C is fast.
- Implementing algorithms in C is not actually much more difficult than in R/Python (although C is a little verbose).
- C routines can easily be incorporated into other languages, such as Java (e.g. via SWIG) and Python (via Cython).

The writing style

All entries in this blog will be as short as possible, except this introduction. This also means the blog will be less entertaining than others. In a way, it is meant to be read as a technical manual.

Models that will be implemented

- Non-probability-based models: trees, SVM, neural networks, boosting, model averaging
- Probability-based models: discrete choice models, count regression, survival analysis, mixed and random coefficient models, hidden Markov models (HMM). Both frequentist estimation methods (MLE) and Bayesian methods (mainly MCMC) will be discussed.
- Linear models: linear regression, GMM models, structural equation models (LISREL)
- Models for high-dimensional data sets: "LASSO"-type models ($L_1$ penalties and others). Both the coordinate descent algorithm and the path algorithm will be discussed. Bayesian shrinkage models will also be discussed.
- Non-parametric (semi-parametric) models: density estimation using kernel methods, regression with splines

Please note that this is not an exhaustive list. New models will be added where they fit. Furthermore, many of the above methods can be combined to yield more powerful models.
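As a taste of what such an implementation can look like, here is a minimal sketch of cyclic coordinate descent for an $L_1$-penalized least-squares problem. It is an illustration under stated assumptions, not the implementation used in later entries: the names `soft_threshold` and `lasso_cd` are hypothetical, the objective is taken to be (1/(2n)) * sum_i (y_i - sum_j x_ij b_j)^2 + lambda * sum_j |b_j|, there is no intercept or standardization, and the loop runs a fixed number of sweeps instead of testing for convergence.

```c
#include <assert.h>
#include <stddef.h>

/* Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0), for t >= 0. */
double soft_threshold(double z, double t)
{
    if (z > t)  return z - t;
    if (z < -t) return z + t;
    return 0.0;
}

/* Cyclic coordinate descent for the L1-penalized objective described above.
   X is n-by-p in row-major order; beta has length p (output);
   resid is caller-provided scratch space of length n.
   Hypothetical helper for illustration only. */
void lasso_cd(const double *X, const double *y, size_t n, size_t p,
              double lambda, int sweeps, double *beta, double *resid)
{
    assert(n > 0 && p > 0);

    /* Start from beta = 0, so the residual equals y. */
    for (size_t j = 0; j < p; j++) beta[j] = 0.0;
    for (size_t i = 0; i < n; i++) resid[i] = y[i];

    for (int s = 0; s < sweeps; s++) {
        for (size_t j = 0; j < p; j++) {
            double rho = 0.0, z = 0.0;
            for (size_t i = 0; i < n; i++) {
                double xij = X[i * p + j];
                /* Correlate column j with the residual that excludes column j. */
                rho += xij * (resid[i] + xij * beta[j]);
                z += xij * xij;
            }
            rho /= (double)n;
            z /= (double)n;

            /* One-dimensional minimizer; an all-zero column keeps beta_j = 0. */
            double new_b = (z > 0.0) ? soft_threshold(rho, lambda) / z : 0.0;

            /* Keep the residual in sync with the updated coefficient. */
            double delta = new_b - beta[j];
            for (size_t i = 0; i < n; i++)
                resid[i] -= X[i * p + j] * delta;
            beta[j] = new_b;
        }
    }
}
```

With the 2-by-2 identity matrix as X, y = (3, 0.5) and lambda = 0.5, the first coefficient is shrunk to 2 and the second is set exactly to 0, which shows how the $L_1$ penalty produces sparse solutions.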

Notes

- Since the rationale is "fast and dirty", no robust checking will be performed.
- Readers are encouraged to try to derive and implement the models themselves before looking at the "solutions".
- In practice, one often encounters situations where no existing model is appropriate. This is also why I put the focus on deriving the models instead of presenting them as facts.

Why not Chinese

Related material is mostly in English. Since this blog is not meant to be exhaustive, readers will need to consult further literature, and searching for the relevant English literature is much harder after reading Chinese material. Readers should not be frightened away: the difficulty lies mostly in understanding the mathematics and the programs, whatever the language of the explanation. However, if time allows, some entries might be translated into Chinese.

