The Practical Guide to the Random Forests - Random Forests Model Part II
(Using Kaggle’s Titanic Data with R code)
- Big Data Exploration Part I
- Big Data Exploration Part II
- Decision Trees Model Part I
- Decision Trees Model Part II
- Random Forests Model Part I
Random Forests
In the tree-based modeling workflow, a bagged-trees model built with the bagging algorithm introduces a random component into model construction, which substantially reduces the variance of the fitted model and thereby improves predictive performance. However, the individual trees in a bagged-trees model are not completely independent of one another, because at every split each tree considers the full set of original predictors. If the relationship between the predictors and the response in the training data can be captured by a tree model, then the trees grown on the different bootstrap samples will have very similar structures (especially in the top nodes of each tree). This property is known as "tree correlation". Continue reading...
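The bagged-trees idea described above can be sketched in a few lines of R. This is a minimal illustration, not the series' own code: it grows B trees on bootstrap samples and aggregates their predictions by majority vote, using rpart's built-in kyphosis data so the example is self-contained.

```r
library(rpart)

# Bagging sketch: grow B trees, each on a bootstrap sample, then vote.
set.seed(42)
B <- 25
n <- nrow(kyphosis)
trees <- lapply(seq_len(B), function(b) {
  idx <- sample(n, n, replace = TRUE)   # bootstrap sample of the rows
  rpart(Kyphosis ~ Age + Number + Start,
        data = kyphosis[idx, ], method = "class")
})

# Aggregate: majority vote across the B trees
votes <- sapply(trees, function(tr)
  predict(tr, kyphosis, type = "class") == "present")
bagged_pred <- ifelse(rowMeans(votes) > 0.5, "present", "absent")
```

Because every tree here still searches over all predictors at every split, the B trees tend to pick similar top splits; that is exactly the tree correlation that Random Forests later address by restricting the candidate predictors at each split.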
7/24/2016 | Tags: Kaggle, Machine Learning, R | 0 Comments
The Practical Guide to the Random Forests - Random Forests Model Part I
(Using Kaggle’s Titanic Data with R code)
Photo source: http://www.greenpeace.org/canada/Global/canada/image/2010/4/teaser/boreal/BOREAL%20FOREST%204.jpg
Previously in this series:
This post is the last in the series, and we can finally enter the territory of Random Forests, a powerful ensemble technique. In the Trevor Stephens series that the example code follows, there is one more post on feature engineering before the Random Forest model is introduced. Since this series focuses on explaining model architecture, I have decided to cover feature engineering only briefly in bullet-point form, even though it is a critical step in the data science workflow and strongly affects model performance (feature engineering has been described as "the most important factor" in determining the success or failure of a predictive model). If you are interested, please refer to Trevor Stephens' original post (http://trevorstephens.com/post/73461351896/titanic-getting-started-with-r-part-4-feature). Feature engineering will be discussed in separate posts of its own. Continue reading...
7/24/2016 | Tags: Kaggle, Machine Learning, R |
The Practical Guide to the Random Forests - Decision Trees Model Part II
(Using Kaggle’s Titanic Data with R code)
Previously in this series:
Let us return to the Titanic data set and apply the CART model introduced in the previous part to the prediction task. The example code again follows Trevor Stephens' article (http://trevorstephens.com/post/72923766261/titanic-getting-started-with-r-part-3-decision) and uses the rpart package to build the model. rpart stands for "Recursive Partitioning and Regression Trees" and implements the CART algorithm described above.
library(rpart)  # Recursive Partitioning and Regression Trees

# Fit a classification tree (method="class") of Survived on seven predictors
fit <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked,
             data = train, method = "class")
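In the Kaggle workflow, the fitted tree is then used to predict on the test set and to build a submission file. Below is a self-contained sketch of that step using synthetic stand-in data; the real `train` and `test` frames come from the Kaggle CSVs loaded in earlier parts of the series, and the `PassengerId` column here is only illustrative.

```r
library(rpart)

# Synthetic stand-in data (NOT the real Titanic frames)
set.seed(1)
make_df <- function(n) data.frame(
  Pclass = sample(1:3, n, replace = TRUE),
  Sex    = factor(sample(c("male", "female"), n, replace = TRUE)),
  Age    = runif(n, 1, 80)
)
train <- make_df(300)
train$Survived <- factor(ifelse(train$Sex == "female", 1, rbinom(300, 1, 0.2)))
test <- make_df(100)

fit  <- rpart(Survived ~ Pclass + Sex + Age, data = train, method = "class")
pred <- predict(fit, test, type = "class")   # type="class" returns labels

# Kaggle-style submission frame
submission <- data.frame(PassengerId = seq_len(nrow(test)), Survived = pred)
# write.csv(submission, "submission.csv", row.names = FALSE)
```

Note that `predict.rpart` returns class probabilities by default; `type = "class"` is what gives the 0/1 labels a Kaggle submission expects.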
In addition to the four variables mentioned in the previous post (Pclass, Sex, Age, and Fare), Trevor Stephens added three more variables for model training. The three new variables are:
- sibsp: Number of Siblings/Spouses Aboard
- parch: Number of Parents/Children Aboard
- embarked: Port of Embarkation

Continue reading...
7/08/2016 | Tags: Kaggle, Machine Learning, R | 0 Comments