Subject Code :- DTSC12/71-200
Title :- Data Science
Assessment Type :- Assignment 1
Assignment description :-
You are hired as a consultant to work with data for a bank. The bank has a dataset that classifies potential customers into two classes: those earning more than 50K a year and those earning less than that. The bank wants to approach those who earn more to offer them its exclusive services (including loans).
DTSC12/71-200 Data Science Assignment 1
The dataset can be found at UCI:
The dataset has 48842 instances already split into training and test sets.
The datasets are in CSV format (we need to rename the files to add the .csv extension).
There are 14 attributes, and some rows have missing values in some columns. The missing values can either be treated as a category of their own within that column, or the rows with missing values can be deleted from the dataset.
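The two options for missing values can be sketched as follows. This is only an illustration of the logic, written in Python with the standard library; the reports themselves must still be produced from rmd files. It assumes that missing entries are encoded as a question mark, and the three sample rows, the "Missing" label and the helper names are hypothetical.

```python
import csv
import io

# Tiny inline sample in a comma-separated layout; the values are illustrative only.
SAMPLE = """39, State-gov, Bachelors, <=50K
54, ?, HS-grad, <=50K
31, Private, Masters, >50K
"""

MISSING = "?"  # assumption: missing entries are encoded as a question mark


def read_rows(text):
    """Parse comma-separated text into lists of stripped strings."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [[cell.strip() for cell in row] for row in reader if row]


def missing_as_category(rows, label="Missing"):
    """Option 1: keep every row and give missing cells their own category."""
    return [[label if cell == MISSING else cell for cell in row] for row in rows]


def drop_missing(rows):
    """Option 2: delete every row that has at least one missing cell."""
    return [row for row in rows if MISSING not in row]


rows = read_rows(SAMPLE)
kept = missing_as_category(rows)
dropped = drop_missing(rows)
print(len(kept), "rows kept with option 1;", len(dropped), "rows kept with option 2")
```

Whichever option is chosen, it should be applied in the same way to the training set and the test set so that the two remain comparable.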
See the explanation and the accuracy obtained by the author in the file adult.name(.txt).
The first 14 columns are the attributes (features) and the last column is the class label.
The dataset is already split into train-test (2/3, 1/3 random). The 48842 instances are split into train=32561 and test=16281. If the unknown values are removed, then 45222 instances are left, split into train=30162, test=15060.
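As a quick sanity check, the counts quoted above can be verified with a few lines of arithmetic. The sketch below (Python, illustrative only) uses just the numbers stated in this brief; the variable names are hypothetical.

```python
# Counts quoted in the brief.
total, train, test = 48842, 32561, 16281
clean_total, clean_train, clean_test = 45222, 30162, 15060

# The train and test counts should add up to the totals.
assert train + test == total
assert clean_train + clean_test == clean_total

train_share = train / total       # expected to be close to 2/3
removed = total - clean_total     # rows that contain at least one unknown value

print(f"train share: {train_share:.3f}")
print(f"rows removed: {removed} ({removed / total:.1%} of all rows)")
```

Running the same check on the files after loading them is an easy way to confirm that nothing was lost or duplicated when the files were renamed and read.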
Your task in this assignment is to use Decision Trees to model the problem, and to come up with the best possible classification for the test set. The decision trees need to classify each instance into either >50K or <=50K for the predicted income. Remember that during training, only the training data can be used to build the decision trees. The final accuracy should be based on the test set and used to compare different models.
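The train-then-test protocol described above can be sketched with a very small decision tree. This is not the method of any workshop package; it is a minimal hand-written tree (Gini impurity, splits of the form "feature equals value") in Python, shown only to make the protocol concrete. The toy rows, the feature choice and all function names are hypothetical, and the tree is built from the training rows only.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def majority(labels):
    """Most frequent class label."""
    return Counter(labels).most_common(1)[0][0]


def build_tree(rows, labels, max_depth):
    """Grow a tree on categorical features; each split tests 'feature == value'."""
    if max_depth == 0 or len(set(labels)) == 1:
        return majority(labels)
    best = None  # (weighted impurity, feature index, value)
    for f in range(len(rows[0])):
        for v in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] == v]
            right = [y for r, y in zip(rows, labels) if r[f] != v]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, v)
    if best is None or best[0] >= gini(labels):
        return majority(labels)  # no split improves purity
    _, f, v = best
    yes = [(r, y) for r, y in zip(rows, labels) if r[f] == v]
    no = [(r, y) for r, y in zip(rows, labels) if r[f] != v]
    return {
        "feature": f,
        "value": v,
        "yes": build_tree([r for r, _ in yes], [y for _, y in yes], max_depth - 1),
        "no": build_tree([r for r, _ in no], [y for _, y in no], max_depth - 1),
    }


def predict(tree, row):
    """Follow the splits until a leaf (a class label) is reached."""
    while isinstance(tree, dict):
        tree = tree["yes"] if row[tree["feature"]] == tree["value"] else tree["no"]
    return tree


def accuracy(tree, rows, labels):
    """Share of rows whose predicted class matches the true class."""
    return sum(predict(tree, r) == y for r, y in zip(rows, labels)) / len(labels)


# Hypothetical toy data: (education, occupation type) -> income class.
train_X = [("Bachelors", "Exec"), ("HS-grad", "Service"), ("Masters", "Exec"),
           ("HS-grad", "Exec"), ("Bachelors", "Service"), ("Masters", "Service")]
train_y = [">50K", "<=50K", ">50K", "<=50K", "<=50K", ">50K"]
test_X = [("Masters", "Exec"), ("HS-grad", "Service"), ("Bachelors", "Exec")]
test_y = [">50K", "<=50K", ">50K"]

tree = build_tree(train_X, train_y, max_depth=3)   # training data only
test_accuracy = accuracy(tree, test_X, test_y)     # reported on the test set
print(f"test accuracy: {test_accuracy:.1%}")
```

The point of the sketch is the separation of roles: `build_tree` never sees the test rows, and the reported figure comes from the test rows alone.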
Deliverables : 2 pdf reports and the corresponding rmd files
The submission should include 2 pdf files, both produced with rmd files (e.g., with different chunk options). The 2 pdf files are described below:
Technical Report :-
All the code and the results, including partial results, should be visible. This report will be used by the data analytics team at the bank to compare your approach with other methods they may use.
Management Report :-
A partial report with only the items needed to help managers understand how the model might work for them. The management report should be no more than 3 pages long.
Remember to ensure that you use rmd files to produce both reports, using different chunk options.
Reports not created directly from the rmd will be penalised.
Tips :-
You should try to build more than one decision tree, either using different parameters for the same package or using different packages (different algorithms) that we explored in the workshops.
1. Build more than one decision tree.
2. Compare the performance of the trees and state in the report which one worked best (the adopted model).
3. In the reports, explain why you chose a particular model.
4. For the management report, try to use simple words (avoid jargon) and graphs/tables (visualisations).
5. You are allowed to reuse code from the workshops. You can also look for help on the Internet but remember to follow the “Academic Integrity Guidelines in Coding” (pdf file in the assessment section).
6. For best results, read the rubric for this assignment carefully and follow it.
7. For DTSC71-200 build at least 3 models.
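Tips 2 and 3 amount to comparing the candidate trees on the same test set and justifying the choice. A minimal sketch of such a comparison is given below in Python, for illustration only. The model names, the test labels and the predictions are all hypothetical stand-ins, not results from the real dataset.

```python
def accuracy(predicted, actual):
    """Share of predictions that match the true class."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def confusion(predicted, actual, positive=">50K"):
    """Counts of true/false positives and negatives for the positive class."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for p, a in zip(predicted, actual):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return counts


# Hypothetical test-set labels and predictions from three candidate trees.
actual = [">50K", "<=50K", "<=50K", ">50K", "<=50K", "<=50K", ">50K", "<=50K"]
predictions = {
    "shallow tree": ["<=50K"] * 8,
    "deep tree": [">50K", ">50K", "<=50K", ">50K", "<=50K", ">50K", ">50K", "<=50K"],
    "pruned tree": [">50K", "<=50K", "<=50K", ">50K", "<=50K", "<=50K", "<=50K", "<=50K"],
}

results = {name: accuracy(p, actual) for name, p in predictions.items()}
best = max(results, key=results.get)

# One line per model, best first, with its confusion counts.
for name, acc in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{name:<14}{acc:6.1%}  {confusion(predictions[name], actual)}")
print("adopted model:", best)
```

Accuracy alone can hide the kind of error a model makes, so the confusion counts are printed alongside it. For the bank, a false positive (approaching someone who earns less) and a false negative (missing someone who earns more) may not cost the same, which can be part of the reasoning for the adopted model.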