Linear Algebra
determinant
basic

determinant properties
online calculator
inverse/adjoint(adjugate) matrix
Only non-singular matrices have inverses. (det(A) != 0)
- minor matrix
- cofactor matrix
- adjoint/adjugate matrix
- inverse matrix
- conjugate matrix
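A numpy sketch of the chain above (minor -> cofactor -> adjugate -> inverse) on a small non-singular matrix; numpy has no built-in adjugate, so it is assembled from cofactors here:

import numpy as np

A = np.array([[4.0, 7.0], [2.0, 6.0]])
det = np.linalg.det(A)                    # must be non-zero (non-singular)

n = A.shape[0]
C = np.zeros_like(A)                      # cofactor matrix
for i in range(n):
    for j in range(n):
        minor = np.delete(np.delete(A, i, axis=0), j, axis=1)
        C[i, j] = (-1) ** (i + j) * np.linalg.det(minor)

adjugate = C.T                            # adjugate = transpose of the cofactor matrix
A_inv = adjugate / det                    # inverse = adjugate / determinant
assert np.allclose(A_inv, np.linalg.inv(A))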
eigenvalue/eigenvector
A: n×n square matrix
graph demo
eigenvector
eigenvector steps
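A quick numpy check of the defining relation A·v = λ·v (toy 2×2 matrix):

import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)   # eigenvectors are the COLUMNS
v = eigenvectors[:, 0]
assert np.allclose(A @ v, eigenvalues[0] * v)  # A v = lambda v
print(eigenvalues)                             # 3 and 1 here (order not guaranteed)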
svd
singular value decomposition
A: m×n matrix; SVD applies even when m != n (no square matrix required)
svd
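A numpy sketch; np.linalg.svd factors any m×n matrix into U·diag(s)·V^T with singular values sorted in descending order:

import numpy as np

A = np.random.rand(4, 3)                          # m = 4, n = 3
U, s, Vt = np.linalg.svd(A, full_matrices=False)  # thin SVD
assert np.allclose(A, U @ np.diag(s) @ Vt)        # A = U diag(s) V^T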
Probability Theory
random variable: discrete/continuous
- probability mass function: pmf (Poisson, binomial distributions) for discrete random variables
- probability density function: pdf (normal, uniform) for continuous random variables
- cumulative distribution function: cdf for both discrete and continuous random variables
see pmf-cdf-pdf
distribution-function-terminology-pdf-cdf-pmf-etc
binomial: n independent Bernoulli trials, P(X=k) = C(n,k) * p^k * (1-p)^(n-k)
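A direct check of the formula in Python, where math.comb is the C(n,k) above:

from math import comb

n, p, k = 10, 0.5, 3
prob = comb(n, k) * p**k * (1 - p)**(n - k)   # P(X = 3) for Binomial(10, 0.5)
print(prob)                                   # 0.1171875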
- marginal probability
- joint probability
- conditional probability
- bayes theorem
see here
Marginal probability: the probability of an event occurring (p(A)), it may be thought of as an unconditional probability. It is not conditioned on another event. Example: the probability that a card drawn is red (p(red) = 0.5). Another example: the probability that a card drawn is a 4 (p(four)=1/13).
Joint probability: p(A and B). The probability of event A and event B occurring. It is the probability of the intersection of two or more events. The probability of the intersection of A and B may be written p(A ∩ B). Example: the probability that a card is a four and red =p(four and red) = 2/52=1/26. (There are two red fours in a deck of 52, the 4 of hearts and the 4 of diamonds).
Conditional probability: p(A|B) is the probability of event A occurring, given that event B occurs. Example: given that you drew a red card, what’s the probability that it’s a four (p(four|red))=2/26=1/13. So out of the 26 red cards (given a red card), there are two fours so 2/26=1/13.
bayes theorem: p(cancer)=0.01, p(positive test|cancer)=0.9, p(positive test|no cancer)=0.08
p(cancer|positive test)?
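Worked through in plain Python; the denominator p(positive test) comes from the law of total probability:

p_cancer = 0.01
p_pos_given_cancer = 0.9
p_pos_given_no_cancer = 0.08

p_pos = p_pos_given_cancer * p_cancer + p_pos_given_no_cancer * (1 - p_cancer)
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
print(round(p_cancer_given_pos, 4))   # 0.102: a positive test still means only ~10% chance of cancer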
basic-prob
Statistics
2 types of statistics
- descriptive statistics
- inferential statistics
descriptive statistics
basic
- n, sum, min, max, range = max - min
- mean, median, mode
- variance, standard deviation
- skewness, kurtosis
from mean,median,mode
mean/median/mode
- mean: regular meaning of “average”
- median: middle value
- mode: most often
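A quick pandas pass over the basic list above (toy series; pandas uses the sample definitions with ddof=1, and kurt() reports excess kurtosis):

import pandas as pd

s = pd.Series([2, 3, 3, 5, 7, 10])
print(s.count(), s.sum(), s.min(), s.max(), s.max() - s.min())  # n, sum, min, max, range
print(s.mean(), s.median(), s.mode().tolist())                  # mean, median, mode
print(s.var(), s.std())                                         # variance, standard deviation
print(s.skew(), s.kurt())                                       # skewness, excess kurtosis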
2 types of data sets: here
- population: μ, σ^2, σ -> parameters
- sample: x̄, s^2, s -> statistics
The population is the whole group and its data is fixed, so μ denotes the true population mean.
Population data is usually hard to obtain, so we must infer it from samples; each draw differs, so x̄ denotes the mean of one particular sample.
Sample means x̄ vary from sample to sample, but the population mean μ never changes.
population: σ^2 = Σ(x_i - μ)^2 / N
sample: s^2 = Σ(x_i - x̄)^2 / (n - 1)
see example
skewness vs kurtosis
- skewness: the degree of asymmetry
- kurtosis: the degree of peakedness/flatness


formula see skewness kurtosis formula
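The same two measures with scipy (Fisher definition, so a normal distribution has kurtosis 0):

from scipy.stats import kurtosis, skew

data = [2, 3, 3, 5, 7, 10, 25]
print(skew(data))       # > 0: right-skewed, long right tail
print(kurtosis(data))   # excess kurtosis, relative to the normal distribution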
inferential statistics


Each hypothesis test pairs a null hypothesis with an alternative hypothesis.
- H0: μ1 = μ2 = μ3 = … = μn. It indicates that the group means are NOT very different from each other at the chosen statistical significance level.
- Ha: there exist at least two group means that are statistically significantly different from each other.
significance tests
- H0: there is NO real difference
- Ha: there is a difference
Reject H0 at the 5% significance level if p-value < 5%: statistically significant
Reject H0 at the 1% significance level if p-value < 1%: highly significant
one-tailed tests vs two-tailed tests
one-way ANOVA test:
- if p-value <= 5%, the result is statistically significant; we reject the null hypothesis in favor of the alternative hypothesis (Ha was correct)
- otherwise, the result is not statistically significant; we fail to reject the null hypothesis (H0 stands)
demo

If F-stat > 4.737 (the critical value) or p-value < 0.05, then reject H0
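A one-way ANOVA sketch with scipy (three toy groups):

from scipy.stats import f_oneway

group1 = [85, 86, 88, 75, 78]
group2 = [91, 92, 93, 85, 87]
group3 = [79, 78, 88, 94, 92]

f_stat, p_value = f_oneway(group1, group2, group3)
print(f_stat, p_value)   # reject H0 at the 5% level only if p_value < 0.05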

parametric tests vs nonparametric tests
Data Mining
- KDD: knowledge discovery in databases
- CRISP-DM: cross-industry standard process for data mining

Machine Learning methods
with/without labels
- supervised learning
  - classification
  - regression
- unsupervised learning
  - clustering
  - dimensionality reduction
  - anomaly detection
  - association rule mining/market basket analysis
- semi-supervised learning
- reinforcement learning
online/offline
- batch learning/offline learning
- online learning
instance/model
- instance based learning
- model based learning
EDA
statistics
- descriptive statistics
- inferential statistics
analysis
3 types
- univariate analysis: n=1
- bivariate analysis: n=2
- multivariate analysis: n>=3

- use histograms to visualize distributions
- correlation matrix/heatmap
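A minimal sketch of those two visualizations with pandas/seaborn (toy two-column frame; names are made up):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2, 4, 5, 4, 6]})
df["x"].hist()                      # univariate: histogram of one column
plt.show()
sns.heatmap(df.corr(), annot=True)  # multivariate: correlation matrix as a heatmap
plt.show()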
Model Evaluation
Classification
confusion matrix
- accuracy
- precision
- recall
- F1-score: harmonic mean of precision and recall
value range (0-1); the bigger, the better.
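All of these derive from the confusion matrix; a minimal sklearn sketch on toy labels:

from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))   # rows = true class, columns = predicted class
print(accuracy_score(y_true, y_pred))     # (TP + TN) / total
print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(recall_score(y_true, y_pred))       # TP / (TP + FN)
print(f1_score(y_true, y_pred))           # harmonic mean of precision and recall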

precision vs recall curve

another curve
- ROC: receiver operating characteristic; the TPR vs FPR curve
- AUC: area under the ROC curve. value range (0-1); the bigger, the better.
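A sketch with sklearn, where y_score holds predicted probabilities for the positive class (toy values):

from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points of the TPR vs FPR curve
print(roc_auc_score(y_true, y_score))               # 0.75; 0.5 = random, 1.0 = perfect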




all in one

multi-class classification for ROC
- micro-averaging: pool all classes' decisions and treat them as one binary problem
- macro-averaging: compute the metric per class, then average with equal weight

Clustering
types
- partition/centroid-based clustering: k-means, k-medoids
- hierarchical clustering: AgglomerativeClustering, affinity propagation
  - ward/single linkage
  - average linkage
  - complete linkage
- distribution-based clustering: Gaussian mixture models
- density-based clustering: DBSCAN, OPTICS
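A minimal k-means example with sklearn (toy 2-D points forming two obvious groups):

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(kmeans.labels_)           # cluster assignment per point
print(kmeans.cluster_centers_)  # centroid of each cluster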





external validation
with labels
- homogeneity
- completeness
- v-measure: harmonic mean of homogeneity and completeness
value range (0-1); the bigger, the better.
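All three live in sklearn.metrics and require ground-truth labels (toy labelings):

from sklearn.metrics import completeness_score, homogeneity_score, v_measure_score

labels_true = [0, 0, 1, 1, 2, 2]
labels_pred = [0, 0, 1, 2, 2, 2]

print(homogeneity_score(labels_true, labels_pred))   # each cluster contains only one class
print(completeness_score(labels_true, labels_pred))  # each class lands in one cluster
print(v_measure_score(labels_true, labels_pred))     # harmonic mean of the two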


internal validation
no labels
2 most important traits:
- compact groups
- well separated groups
metrics
- silhouette coefficient (SC): value range (-1, 1); the bigger, the better.
- Calinski-Harabasz index (CHI): value range > 0; the bigger, the better.
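Both internal metrics in sklearn, on synthetic blobs clustered with k-means:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, silhouette_score

X, _ = make_blobs(n_samples=200, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
print(silhouette_score(X, labels))          # (-1, 1); the bigger, the better
print(calinski_harabasz_score(X, labels))   # > 0; the bigger, the better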




Regression
metric:
- mean squared error: MSE
- root mean squared error: RMSE
- coefficient of determination (R^2)
- coefficient of correlation (r): value range (-1, 1)
R^2: value range (0, 1); the bigger, the better.
for simple linear regression, R^2 = r^2
formula:
R^2 = 1 - SS_res/SS_tot = 1 - Σ(y_i - ŷ_i)^2 / Σ(y_i - ȳ)^2
correlation coefficient:
r = Σ(x_i - x̄)(y_i - ȳ) / sqrt(Σ(x_i - x̄)^2 · Σ(y_i - ȳ)^2)
r2 demo


images from bing search.
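A quick check of MSE, RMSE, and R^2 with sklearn (toy values):

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mse = mean_squared_error(y_true, y_pred)
print(mse, np.sqrt(mse))         # MSE and RMSE
print(r2_score(y_true, y_pred))  # coefficient of determination R^2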
regression analysis
types
- simple linear regression
- multiple linear regression
- nonlinear regression
assumptions
- training dataset(sample) is representative of the population being modeled
- x1, x2, …, xn are linearly independent: no multicollinearity
- homoscedasticity of errors: residuals are random and show no patterns
multicollinearity: check with a correlation matrix
variance inflation factor (VIF): VIF_i = 1/(1 - R_i^2). The larger the VIF, the more severe the collinearity. Rule of thumb: 0 < VIF < 10: no multicollinearity; 10 <= VIF < 100: strong multicollinearity; VIF >= 100: severe multicollinearity.
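statsmodels computes VIF per column; in the toy frame below x2 is built to be nearly collinear with x1, and a constant column is added because VIF assumes an intercept:

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

df = pd.DataFrame({"x1": [1, 2, 3, 4, 5],
                   "x2": [2, 4, 6, 8, 11],   # nearly collinear with x1
                   "x3": [5, 3, 8, 1, 4]})
X = add_constant(df)
for i, col in enumerate(X.columns):
    print(col, variance_inflation_factor(X.values, i))   # large VIF = strong collinearity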
homoscedastic vs heteroscedastic: check with a residual plot
homogeneous vs heterogeneous




evaluation analysis
- residual analysis
- normality tests (Q-Q plot)
- R^2

linear regression
y = kx + b; fit with OLS (ordinary least squares)
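A minimal OLS fit with sklearn (toy data generated from y = 2x + 1):

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4]])
y = np.array([3, 5, 7, 9])
model = LinearRegression().fit(X, y)   # ordinary least squares
print(model.coef_, model.intercept_)   # k ~ 2, b ~ 1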
decision tree based regression
linear vs non-linear regression:
- linear regression
- decision tree based regression (non-linear)
decision trees can be used for both classification and regression (CART)
node splitting
for regression:
- MSE: mean squared error
- RMSE: root mean squared error
- MAE: mean absolute error
- MAPE: mean absolute percentage error
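A minimal CART regressor in sklearn; note that recent sklearn versions spell the MSE criterion "squared_error" (toy one-feature data):

from sklearn.tree import DecisionTreeRegressor

X = [[1], [2], [3], [10], [11], [12]]
y = [1.1, 1.3, 1.2, 8.8, 9.1, 9.0]
tree = DecisionTreeRegressor(criterion="squared_error", max_depth=2).fit(X, y)
print(tree.predict([[2.5], [10.5]]))   # piecewise-constant predictions per leaf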


for classification
- information gain (entropy)
- Gini impurity/index
- misclassification error


stopping criteria
- max depth
- min samples to split internal nodes
- max leaf nodes
use grid search to find optimal hyperparameters
decision tree algorithms
ensemble learning
3 major families:
- bagging: bootstrap aggregating, uses bootstrap sampling. e.g. RandomForest
- boosting: e.g. Gradient Boosting Machine (GBM), AdaBoost
  - GBM variants: LightGBM, Extreme Gradient Boosting (XGBoost)
- stacking
others
- binning
- blending
- averaging
- voting
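A bagging-vs-boosting sketch with sklearn on synthetic data (hyperparameters mostly left at defaults):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)   # bagging
gbm = GradientBoostingClassifier(random_state=42).fit(X_tr, y_tr)                # boosting
print(rf.score(X_te, y_te), gbm.score(X_te, y_te))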
see What is the difference between Bagging and Boosting?
see 集成学习-Boosting,Bagging与Stacking
bootstrap aggregating/bagging

boosting


model stacking


Model Tuning
decision trees
- information gain: IG
- gini impurity: GI
bias-variance tradeoff
The main causes of error in learning are noise, bias, and variance.
extreme cases of bias-variance
- underfitting: high bias, low variance
- overfitting: low bias, high variance
bias-variance tradeoff


see learnopencv
cross validation
train/validation/test
cross validation strategies:
- leave-one-out CV: n-1 samples for training, 1 sample for validation
- k-fold CV: split the data into k equal subsets; k-1 subsets for training, 1 subset for validation
5-fold or 10-fold in practice
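A 5-fold CV sketch with sklearn (iris data; logistic regression as a placeholder model):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())   # one accuracy per fold, then the average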
hyperparameter tuning strategies
- grid search: manually specify the grid; parallelizable
- randomized search: samples parameter settings at random
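A grid search sketch with sklearn, reusing the decision-tree stopping criteria from earlier (iris data):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
param_grid = {"max_depth": [2, 3, 5], "min_samples_split": [2, 5, 10]}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)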
Model Interpretation
tools
global vs local interpretation
- global interpretation: based on the whole dataset (feature_importance, partial_dependence plots)
- local interpretation: based on a single prediction
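As a sketch of the global side, impurity-based feature_importances_ in sklearn (iris data):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
model = RandomForestClassifier(random_state=42).fit(data.data, data.target)
for name, imp in zip(data.feature_names, model.feature_importances_):
    print(name, round(imp, 3))   # dataset-level importance per feature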
global interpretation



local interpretation

model decision surface/hypersurface

Model Deployment
- REST API
- microservice
- model deployment as a service; anything as a service (XaaS)
Real-world case studies
customer segmentation
clustering problem
factors
- geographic
- demographic
- psychographic
- behavioural

RFM Model for customer value
- recency
- frequency
- monetary value
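A minimal RFM table with pandas; the transactions frame and its columns (customer_id, order_date, amount) are made up for illustration:

import pandas as pd

tx = pd.DataFrame({"customer_id": [1, 1, 2, 2, 2],
                   "order_date": pd.to_datetime(["2024-01-05", "2024-03-01", "2024-02-10",
                                                 "2024-02-20", "2024-03-03"]),
                   "amount": [50.0, 30.0, 20.0, 25.0, 40.0]})
now = tx["order_date"].max()
rfm = tx.groupby("customer_id").agg(
    recency=("order_date", lambda d: (now - d.max()).days),  # days since last purchase
    frequency=("order_date", "count"),                       # number of purchases
    monetary=("amount", "sum"))                              # total spend
print(rfm)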

association-rule mining
association rule mining/market basket analysis
basics
- association rule: {item1, item2, item3} -> {itemK}
- itemset: {milk,bread} {beer,diaper}
- frequent itemset: {milk,bread}
metrics
- support = frq(X,Y)/N
- confidence = support(X,Y)/support(X) = frq(X,Y)/frq(X)
- lift = support(X,Y)/(support(X)*support(Y)) = N*frq(X,Y)/(frq(X)*frq(Y))
good rules: large support, large confidence, and lift > 1
lift(X->Y) = 0 means X and Y never occur at the same time
lift(X->Y) = 1 means X and Y are independent of each other.
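A hand-rolled check of these metrics on toy transactions, for the rule {milk} -> {bread}:

transactions = [{"milk", "bread"}, {"milk"}, {"bread", "beer"},
                {"milk", "bread", "beer"}, {"bread"}]
N = len(transactions)

def frq(*items):
    # number of transactions containing all given items
    return sum(1 for t in transactions if set(items) <= t)

support = frq("milk", "bread") / N                              # 2/5
confidence = frq("milk", "bread") / frq("milk")                 # 2/3
lift = N * frq("milk", "bread") / (frq("milk") * frq("bread"))  # 10/12 < 1
print(support, confidence, lift)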


algorithms
- Apriori algorithm: generates all 2^k itemsets; TOO EXPENSIVE
- FP-growth: no need to generate all 2^k itemsets; uses a special FP-tree structure and a divide-and-conquer strategy
k unique products yield 2^k itemsets.
recommender system
recommender systems/ recommendation engines
big data with pandas
how to process big data with pandas?
import pandas as pd

for chunk in pd.read_csv(<filepath>, chunksize=<your_chunksize_here>):
    do_processing(chunk)   # each chunk is a DataFrame; process it independently
train_algorithm()
read by chunk
see opening-a-20gb-file-for-analysis-with-pandas
other tools
other refs
types of recommendation engines
3 types
- user-based recommendation engines
- content-based recommendation engines
- hybrid/collaborative filtering recommendation engines
based on similarity
different cases
- popularity-based: most liked songs across all users
- similarity-based: similar songs for a given user
- matrix factorization based: use SVD to get a low-rank approximation of the utility matrix
similarity
- Jaccard Index/Jaccard similarity coefficient, (0-1)
- cosine similarity
Jaccard Distance = 1 - Jaccard Index
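Both similarities in a few lines (plain Python sets for Jaccard, numpy for cosine):

import numpy as np

a, b = {"milk", "bread", "beer"}, {"milk", "bread", "diaper"}
jaccard = len(a & b) / len(a | b)   # 2/4 = 0.5
print(jaccard, 1 - jaccard)         # Jaccard index, Jaccard distance

u, v = np.array([1.0, 2.0, 0.0]), np.array([2.0, 4.0, 1.0])
print(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))   # cosine similarity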



matrix factorization
use matrix factorization to discover latent features between two different kinds of entities

sparse matrix

use SVD: matrix factorization, PCA
implicit feedback: e.g. song play count -> likeness
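A sketch of the SVD route on a toy user × song play-count matrix; truncating to k latent features yields the low-rank approximation of the utility matrix:

import numpy as np

# rows = users, columns = songs, values = play counts (0 = not played)
R = np.array([[5.0, 3.0, 0.0, 1.0],
              [4.0, 0.0, 0.0, 1.0],
              [1.0, 1.0, 0.0, 5.0],
              [0.0, 1.0, 5.0, 4.0]])
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2                                          # keep the top-2 latent features
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # rank-k approximation
print(np.round(R_hat, 2))                      # predicted affinity, including unplayed cells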
recommendation engine libraries
- scikit-surprise (Simple Python Recommendation System Engine)
- lightfm
- crab
- rec_sys
time series forecasting
basics
predictive modeling
time series analysis/forecasting:
- traditional approaches
  - Moving Average: MA
  - Exponential Smoothing: EWMA
  - Holt-Winters EWMA
  - Box-Jenkins methodologies: AR, MA, ARIMA, SARIMA
- deep learning approaches: RNNs, e.g. LSTM
  - regression modeling (x1,x2,…,x6 -> x7): many-to-one
  - sequence modeling: sequence -> sequence
two domains
- frequency domain: spectral and wavelet analysis
- time domain: auto- and cross-correlation analysis
where to get data ?
tools to fetch data:
- quandl: register for an API key first
- pandas-datareader
time series components
3 major components:
- seasonality
- trend
- residual
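statsmodels can split a series into these components; a sketch on a synthetic monthly series (additive model assumed):

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

idx = pd.date_range("2020-01-01", periods=48, freq="MS")
y = pd.Series(np.arange(48) + 10 * np.sin(2 * np.pi * np.arange(48) / 12), index=idx)
result = seasonal_decompose(y, model="additive")   # y = trend + seasonality + residual
print(result.trend.dropna().head())
print(result.seasonal.head(12))                    # one full seasonal cycle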

smoothing techniques
- Moving Average: MA
- Exponential Smoothing: EWMA
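Both smoothers are one-liners in pandas (toy series):

import pandas as pd

s = pd.Series([3, 5, 4, 6, 8, 7, 9, 10])
print(s.rolling(window=3).mean())   # simple moving average
print(s.ewm(span=3).mean())         # exponentially weighted moving average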
ARIMA
AR vs MA
- autoregressive
- moving average
ARIMA: autoregressive integrated moving average
key concepts
- Stationarity: one of the key assumptions behind ARIMA models. Stationarity refers to the property that a time series' mean, variance, and autocorrelation are time-invariant; in other words, they do not change with time.
- Differencing: widely used to stabilize the mean of a time series. We can then apply different tests to confirm whether the resulting series is stationary.
- Unit Root Tests: statistical tests that help us understand whether a given series is stationary.
ad_fuller_test: the Augmented Dickey-Fuller test begins with a null hypothesis of the series being non-stationary,
kpss_test: while the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test has a null hypothesis that the series is stationary.
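Both tests live in statsmodels; a sketch on a random walk, which is non-stationary by construction:

import numpy as np
from statsmodels.tsa.stattools import adfuller, kpss

series = np.cumsum(np.random.randn(200))   # random walk: non-stationary

adf_stat, adf_p = adfuller(series)[:2]
print("ADF p-value:", adf_p)               # large p: cannot reject H0 (non-stationary)

kpss_stat, kpss_p = kpss(series, nlags="auto")[:2]
print("KPSS p-value:", kpss_p)             # small p: reject H0 of stationarity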
ad_fuller_test

not statistically significant, accept H0: non-stationary


statistically significant, reject H0 and accept Ha: stationary

ARIMA(p,d,q)
model
where,
- p is the order of autoregression (AR)
- d is the order of differencing
- q is the order of the moving average (MA)
how to choose p and q?
- ACF or Auto Correlation Function plot —> q = 1
- PACF or the Partial Auto Correlation Function plot —> p = 1

use grid search to choose p and q based on AIC
AIC, the Akaike Information Criterion, measures goodness of fit and parsimony.
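A minimal fit with statsmodels; order=(1, 1, 1) is just an assumed starting point, to be tuned via ACF/PACF or the AIC grid search above:

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

y = pd.Series(np.cumsum(np.random.randn(100)))   # toy non-stationary series
result = ARIMA(y, order=(1, 1, 1)).fit()         # (p, d, q)
print(result.aic)                                # compare across candidate (p, d, q)
print(result.forecast(steps=5))                  # 5-step-ahead forecast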

LSTM
Efficient Market Hypothesis: says that it is almost impossible to beat the market consistently; others disagree with it.
modeling
- regression modeling
- sequence modeling

(N, W, F) format as input:
- N: number of sequences
- W: window, the length of each sequence
- F: features per timestep
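A shape-only sketch of this input convention, assuming TensorFlow/Keras and hypothetical sizes:

import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dense

N, W, F = 32, 10, 1                    # 32 sequences, window of 10 steps, 1 feature per step
X = np.random.rand(N, W, F)
y = np.random.rand(N, 1)               # many-to-one: one target value per sequence

model = Sequential([LSTM(16, input_shape=(W, F)), Dense(1)])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=2, verbose=0)
print(model.predict(X[:1]).shape)      # (1, 1)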
for regression

for sequence

we need to pad test sequences to match the input shape.
other time series tools
New Concepts
- Linear Discriminant Analysis (LDA)
- Quadratic Discriminant Analysis (QDA)
sklearn code
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
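Extending that import into a runnable sketch (iris data; training accuracy only, for illustration):

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis().fit(X, y)
qda = QuadraticDiscriminantAnalysis().fit(X, y)
print(lda.score(X, y), qda.score(X, y))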
Reference
History