The Practical Guide to the Random Forests - Random Forests Model Part II
(Using Kaggle’s Titanic Data with R code)
- Big Data Exploration Part I
- Big Data Exploration Part II
- Decision Trees Model Part I
- Decision Trees Model Part II
- Random Forests Model Part I
Random Forests
In the tree-based modeling workflow, a bagged-trees model built with the bagging algorithm introduces a random component into model construction, which substantially reduces the variance of the fitted model and thereby improves predictive performance. However, the individual trees in a bagged-trees model are not completely independent of one another, because at every split each tree considers the full set of original predictors. If the relationship between the predictors and the response in the training data can be captured by a tree model, then the trees grown on the different bootstrap samples will have very similar structures (especially in the top nodes of each tree). This property is known as "tree correlation". Continue reading...
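The bagged-trees idea described above can be sketched in a few lines of R. This is a minimal illustration, not the series' own code: it grows B trees on bootstrap samples and aggregates their predictions by majority vote, using rpart's built-in kyphosis data so the example is self-contained.

```r
library(rpart)

# Bagging sketch: grow B trees, each on a bootstrap sample, then vote.
set.seed(42)
B <- 25
n <- nrow(kyphosis)
trees <- lapply(seq_len(B), function(b) {
  idx <- sample(n, n, replace = TRUE)   # bootstrap sample of the rows
  rpart(Kyphosis ~ Age + Number + Start,
        data = kyphosis[idx, ], method = "class")
})

# Aggregate: majority vote across the B trees
votes <- sapply(trees, function(tr)
  predict(tr, kyphosis, type = "class") == "present")
bagged_pred <- ifelse(rowMeans(votes) > 0.5, "present", "absent")
```

Because every tree here still searches over all predictors at every split, the B trees tend to pick similar top splits; that is exactly the tree correlation that Random Forests later address by restricting the candidate predictors at each split.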
7/24/2016 | Tags: Kaggle, Machine Learning, R | 0 Comments
The Practical Guide to the Random Forests - Random Forests Model Part I
(Using Kaggle’s Titanic Data with R code)
Photo source: http://www.greenpeace.org/canada/Global/canada/image/2010/4/teaser/boreal/BOREAL%20FOREST%204.jpg
Previously in this series:
This post is the last in the series, and we can finally enter the territory of Random Forests, a powerful ensemble technique. In the Trevor Stephens series that the example code follows, there is one more post on feature engineering before the Random Forest model is introduced. Since this series focuses on explaining model architecture, I have decided to cover feature engineering only briefly in bullet-point form, even though it is a critical step in the data science workflow and strongly affects model performance (feature engineering has been described as "the most important factor" in determining the success or failure of a predictive model). If you are interested, please refer to Trevor Stephens' original post (http://trevorstephens.com/post/73461351896/titanic-getting-started-with-r-part-4-feature). Feature engineering will be discussed in separate posts of its own. Continue reading...
7/24/2016 | Tags: Kaggle, Machine Learning, R |
The Practical Guide to the Random Forests - Decision Trees Model Part II
(Using Kaggle’s Titanic Data with R code)
Previously in this series:
Let us return to the Titanic data set and apply the CART model introduced in the previous part to the prediction task. The example code again follows Trevor Stephens' article (http://trevorstephens.com/post/72923766261/titanic-getting-started-with-r-part-3-decision) and uses the rpart package to build the model. rpart stands for "Recursive Partitioning and Regression Trees" and implements the CART algorithm described above.
library(rpart)  # Recursive Partitioning and Regression Trees

# Fit a classification tree (method="class") of Survived on seven predictors
fit <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked,
             data = train, method = "class")
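In the Kaggle workflow, the fitted tree is then used to predict on the test set and to build a submission file. Below is a self-contained sketch of that step using synthetic stand-in data; the real `train` and `test` frames come from the Kaggle CSVs loaded in earlier parts of the series, and the `PassengerId` column here is only illustrative.

```r
library(rpart)

# Synthetic stand-in data (NOT the real Titanic frames)
set.seed(1)
make_df <- function(n) data.frame(
  Pclass = sample(1:3, n, replace = TRUE),
  Sex    = factor(sample(c("male", "female"), n, replace = TRUE)),
  Age    = runif(n, 1, 80)
)
train <- make_df(300)
train$Survived <- factor(ifelse(train$Sex == "female", 1, rbinom(300, 1, 0.2)))
test <- make_df(100)

fit  <- rpart(Survived ~ Pclass + Sex + Age, data = train, method = "class")
pred <- predict(fit, test, type = "class")   # type="class" returns labels

# Kaggle-style submission frame
submission <- data.frame(PassengerId = seq_len(nrow(test)), Survived = pred)
# write.csv(submission, "submission.csv", row.names = FALSE)
```

Note that `predict.rpart` returns class probabilities by default; `type = "class"` is what gives the 0/1 labels a Kaggle submission expects.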
In addition to the four variables mentioned in the previous post (Pclass, Sex, Age, and Fare), Trevor Stephens added three more variables for model training. The three new variables are:
- sibsp: Number of Siblings/Spouses Aboard
- parch: Number of Parents/Children Aboard
- embarked: Port of Embarkation

Continue reading...
7/08/2016 | Tags: Kaggle, Machine Learning, R | 0 Comments