• Competence in using KDD software tools in medium to large databases.
• Competence in applying relevant techniques at each stage of the KDD process.
• Ability to evaluate the suitability of software tools in the context of different data
• Competence in combining data manipulation and analysis approaches to improve
the quality of input data.
• Understanding and identification of problems in input data, such as outliers,
missing data, unreliable data, and differences in granularity, and selection of
an adequate strategy to deal with the problem data.
• Presentation of knowledge induced in a format suitable for the target audience
and for the particular application.
• To obtain an overall view of the complex process of Knowledge Discovery in
Databases and understand the need for a methodical approach to KDD.
• To explore the tools and algorithms available for each stage of the KDD process.
• To gain experience of using KDD software tools on a medium-sized database.
• To learn to combine data manipulation and analysis approaches to improve the
quality of input data.
• To produce a suitable report describing the methods applied and discussing
the findings.
To be reassessed on this coursework, please reuse the Lending Club Loan Dataset
uploaded on Blackboard as ‘LendingClubLoans2018-2020.xlsx’, which contains data on
the loan applications the company received from 2018 to 2020.
Your task is to accurately classify the loan status (Current, Fully Paid, Charged Off,
Late, etc.) from the given fields/features and then help Lending Club predict
applications that may default/be late in paying or can be identified as potential bad
As this is the same dataset, your submission should be an improved version of any
previous submission of your report or a new submission if you have not submitted
The components of your reassessment task are as follows:
1. Introduce the dataset for the reader.
2. Undertake any cleansing or pre-processing you think is necessary on the
dataset. In your report, explain clearly what you have done and why you have done
it. Possible cleaning steps include removing any feature/column with 60% missing values,
detecting/removing outliers, and removing duplicate and highly correlated information.
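
As an illustration only, the cleaning steps suggested in task 2 could be sketched in Python with pandas as shown here. The helper names (clean_loans, iqr_inlier_mask), the toy table and its column names, the 0.9 correlation cut-off, the 1.5 x IQR outlier rule, and the reading of "60% missing" as "60% or more" are all assumptions made for this sketch, not requirements of the brief. You should justify your own choices in your report.

```python
import numpy as np
import pandas as pd


def clean_loans(df, missing_threshold=0.6, corr_threshold=0.9):
    """Drop sparse columns, duplicate rows and highly correlated numeric columns."""
    # Drop columns whose fraction of missing values is at or above the threshold
    # (assumption: "60% missing" is read as "60% or more").
    missing_fraction = df.isna().mean()
    df = df.loc[:, missing_fraction < missing_threshold]

    # Drop exact duplicate rows.
    df = df.drop_duplicates()

    # Look at each pair of numeric columns once (upper triangle of the
    # absolute correlation matrix) and drop the later column of any pair
    # whose correlation exceeds the cut-off.
    corr = df.select_dtypes(include="number").corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > corr_threshold).any()]
    return df.drop(columns=to_drop)


def iqr_inlier_mask(series, k=1.5):
    """Return True for values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.between(q1 - k * iqr, q3 + k * iqr)


# Toy stand-in for the real spreadsheet; all column names are hypothetical.
toy = pd.DataFrame({
    "loan_amnt": [1000, 2000, 3000, 4000, 4000],
    "funded_amnt": [1000, 2000, 3000, 4000, 4000],        # perfectly correlated
    "int_rate": [0.10, 0.07, 0.15, 0.09, 0.09],
    "mostly_empty": [np.nan, np.nan, np.nan, np.nan, 1.0],  # 80% missing
    "loan_status": ["Fully Paid", "Current", "Charged Off", "Current", "Current"],
})

cleaned = clean_loans(toy)
print(cleaned)
```

On the real data you would first load the spreadsheet (for example with pandas.read_excel) and apply the outlier mask column by column, only where removing extreme values can be justified.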
3. Split the data into a training set and a test set once cleansing is done. Use a suitable
toolkit and libraries (Python, Orange, Weka, or R, whichever platform you are
comfortable with) to train models (e.g. Decision Tree, Random Forest, or SVM) on
the training set to build the Loan Status Classifier.
Make sure to explicitly discuss methods applied to deal with class imbalance,
feature selection and other tuning or validation methods you used to improve the
quality of your developed models.
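
One possible way to approach task 3 in Python is sketched here with scikit-learn. The data is a synthetic, imbalanced stand-in generated with make_classification, not the Lending Club dataset. The 80/20 split, the class_weight="balanced" option, the 5-fold stratified cross-validation and the macro F1 score are assumptions chosen for illustration; resampling methods or other models may suit your data better.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic stand-in for the cleaned loan table: roughly 10% minority class.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)

# Stratify so the training and test sets keep the same class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# class_weight="balanced" reweights classes inversely to their frequency,
# which is one simple way to address class imbalance.
model = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=0
)

# Validate on the training set only, so the test set stays untouched.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="f1_macro")
print("Macro F1 per fold:", cv_scores.round(3))

# Fit on the full training set; the importances can inform feature selection.
model.fit(X_train, y_train)
print("Feature importances:", model.feature_importances_.round(3))
```

Keeping the test set out of cross-validation and tuning means the final scores you report are measured on data the model has not seen.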
4. As part of your final report, describe and justify the decisions you have made as a
data scientist while processing this dataset, present the results, and discuss the model’s
effectiveness in terms of the classification report, confusion matrices, and precision and
recall. Present a comparative analysis if you have trained multiple
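
To make the evaluation terms in task 4 concrete, the following small sketch computes a confusion matrix, precision, recall and a classification report with scikit-learn. The ten true and predicted labels are invented for illustration and are not results from the Lending Club data; treating "Charged Off" as the positive class is likewise an assumption.

```python
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Invented labels: 6 loans that were fully paid and 4 that were charged off.
y_true = ["Fully Paid"] * 6 + ["Charged Off"] * 4
y_pred = (
    ["Fully Paid"] * 5 + ["Charged Off"]      # one good loan flagged as bad
    + ["Charged Off"] * 3 + ["Fully Paid"]    # one bad loan missed
)

labels = ["Fully Paid", "Charged Off"]

# Rows are true classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Precision: of the loans predicted "Charged Off", how many really were.
precision = precision_score(y_true, y_pred, pos_label="Charged Off")
# Recall: of the loans that really were "Charged Off", how many were caught.
recall = recall_score(y_true, y_pred, pos_label="Charged Off")
print(f"precision={precision:.2f} recall={recall:.2f}")

report = classification_report(y_true, y_pred, labels=labels)
print(report)
```

Here 3 of the 4 loans predicted as "Charged Off" are correct and 3 of the 4 actual "Charged Off" loans are found, so precision and recall are both 0.75. With imbalanced classes these per-class figures are usually more informative than overall accuracy.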