Impatient Data Analysis Series: Introduction

Author: flying dutchman

Introduction

This blog is meant to help impatient people, or people working under impatient bosses or supervisors, learn to perform statistical analysis in a quick and dirty way. Its content includes:

- A bare-minimum introduction to mathematics and statistics
- An equally bare-minimum introduction to programming languages and concepts
- (Most important) Implementations of commonly used statistical models (or machine learning or data mining models, if you prefer to call them that), including their mathematical details and their implementation in C(!)

Besides, I will also share miscellaneous stuff, including beautiful proofs, some newly developed (bad) marketing models, comments, compiler construction, etc.

Why the haste

It is, in my opinion, very important to learn both mathematics and programming systematically in order to be a good data analyst. However, hardly anyone has the luxury of pursuing a master's degree in mathematics and computer science. In fact, I know many colleagues who were forced to learn data analysis from scratch within several months, if not weeks. Consequently, they had to implement everything in a quick and dirty way. While the resulting models and algorithms might be very immature, this does suggest another possible way to learn data analysis: mainly by studying and modifying existing implementations. While this is certainly not the most desirable way to enter a field, it does agree with my opinion that mathematics and programming are skills rather than pure accumulations of knowledge, and skills can only be sharpened by practice.

Why C (and not R/Python)

R and Python are great languages for data analysis. However, their greatness lies mainly in the abundance of available packages. In this sense, learning to call routines written in R/Python is probably THE fastest way to present some data analysis results.

Unfortunately, the difficulty of statistical analysis usually lies in understanding the nature of the models, not in knowing how to call certain routines in R/Python. Since the details of R/Python routines are usually completely hidden, calling them does not help much in understanding the models.
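As a small illustration of what sits behind a one-line library call, here is a minimal sketch of simple least-squares regression (one predictor plus an intercept) written out in C. The function name `fit_simple_ols` and its signature are assumptions made up for this illustration, not code from a later entry.

```c
#include <stddef.h>

/* Hypothetical helper (name and signature are illustrative assumptions):
 * fit y = a + b * x by ordinary least squares.
 * Returns 0 on success, -1 if n < 2 or if all x values are identical. */
int fit_simple_ols(const double *x, const double *y, size_t n,
                   double *a, double *b)
{
    double mean_x = 0.0, mean_y = 0.0;
    double sxx = 0.0, sxy = 0.0;
    size_t i;

    if (n < 2)
        return -1;

    /* First pass: sample means of x and y. */
    for (i = 0; i < n; i++) {
        mean_x += x[i];
        mean_y += y[i];
    }
    mean_x /= (double)n;
    mean_y /= (double)n;

    /* Second pass: centered sum of squares and cross products. */
    for (i = 0; i < n; i++) {
        double dx = x[i] - mean_x;
        sxx += dx * dx;
        sxy += dx * (y[i] - mean_y);
    }

    if (sxx == 0.0)
        return -1; /* slope is undefined when x does not vary */

    *b = sxy / sxx;               /* slope */
    *a = mean_y - (*b) * mean_x;  /* intercept */
    return 0;
}
```

Calling it with two arrays of equal length `n` fills in the intercept `*a` and the slope `*b`. Every step of the estimator is visible in about thirty lines, which is the point of writing it out by hand.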

There are also other reasons for using C, including:

- C is fast
- It is actually not much more difficult to implement algorithms in C than in R/Python (although C is a little verbose)
- C routines can be easily incorporated into other languages, such as Java (via e.g. SWIG) and Python (via Cython)

The writing style

All the entries of this blog will be as short as possible, except this introduction. This also means this blog will be less entertaining than others. In a way, this blog is meant to be read as a technical manual.

Models that will be implemented

- Non-probability-based models: trees, SVM, neural networks, boosting, model averaging
- Probability-based models: discrete choice models, count regression, survival analysis, mixed and random coefficient models, HMM models. Both frequentist estimation methods (MLE) and Bayesian methods (mainly MCMC) will be discussed
- Linear models: linear regression, GMM models, structural equation models (LISREL)
- Models for high-dimensional data sets: “LASSO”-type models ($L_1$ penalties and others); both the coordinate descent algorithm and the path algorithm will be discussed. Bayesian shrinkage models will also be discussed
- Non-parametric (semi-parametric) models: density estimation using kernel methods, regression with splines

Please note this is not an exhaustive list. New models will be added as appropriate. Furthermore, many of the above methods can be combined to yield more powerful models.

Notes

- Since the rationale is “fast and dirty”, no robustness checking will be performed.
- Readers are encouraged to try to derive and implement the models themselves before looking at the “solutions”.
- In practice, one often encounters situations where no existing model is appropriate. This is also why I put the focus on deriving the models instead of presenting them as facts.

Why not Chinese

Related material is mostly in English. Since this blog is not meant to be exhaustive, it would actually be much more difficult for readers to search for the relevant English literature after reading Chinese entries. Readers should not be frightened away, as the difficulty lies mostly in understanding the mathematics and the programs, regardless of the language of the explanation. However, if time allows, some entries might be translated into Chinese.
