Your final document should be either an .ipynb file or an HTML file generated from a Jupyter notebook. Upload the file to this assignment on the course website.
In answering each of the following questions please include a) the question as a markdown header in your Jupyter notebook, b) the raw code that you used to generate any results, tables, or figures along with the results themselves, and c) the top ten or fewer rows of the dataframe wherever relevant (do not include more than ten rows for any table in your report).
Include any plots or figures generated from your code as well.
Part 1: Regression on California Test Scores
- Find the url for the California Test Score Data Set from the following website:
Read through the “DOC” file to understand the variables in the dataset, then use the following url to import the data.
The target data (i.e. the dependent variable) is named “testscr”. You can use all variables in the data except for “readscr” and “mathscr” in the following analysis. (These two variables were used to generate the dependent variable).
1.1 Visualize the univariate distribution of the target feature and each of the three continuous explanatory variables that you think are likely to have a relationship with the target feature.
1.2 Visualize the dependency of the target on each feature from 1.1.
1.3 Split the data into training and test sets. Build models that evaluate the relationship between all available X variables in the California test dataset and the target variable. Evaluate KNN for regression, Linear Regression (OLS), Ridge, and Lasso using cross-validation with the default parameters. Does scaling the data with the StandardScaler help?
1.4 Tune the parameters of the models where possible using GridSearchCV. Do the results improve?
1.5 Compare the coefficients of your two best linear models (not KNN). Do they agree on which features are important?
1.6 Discuss which final model you would choose to predict new data.
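As a rough illustration of the workflow in 1.3, the sketch below compares the four regressors with and without StandardScaler using 5-fold cross-validation. It runs on synthetic data from make_regression, not on the California test score data, and the names models and scores, the fold count, and the random seeds are assumptions chosen only for this example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): replace with the real X and "testscr".
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# All four models use their default parameters, as the question asks.
models = {
    "KNN": KNeighborsRegressor(),
    "OLS": LinearRegression(),
    "Ridge": Ridge(),
    "Lasso": Lasso(),
}

scores = {}
for name, model in models.items():
    # Mean R^2 over 5 folds on the unscaled training data.
    raw = cross_val_score(model, X_train, y_train, cv=5).mean()
    # Same model inside a pipeline, so the scaler is fit on each training fold only.
    scaled = cross_val_score(
        make_pipeline(StandardScaler(), model), X_train, y_train, cv=5
    ).mean()
    scores[name] = (raw, scaled)
    print(f"{name}: raw R^2 = {raw:.3f}, scaled R^2 = {scaled:.3f}")
```

Putting the scaler inside the pipeline, instead of scaling the whole training set up front, keeps information from the validation folds out of the scaling step.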
Part 2: Classification on red and white wine characteristics
First, import the red and the white wine csv files into separate pandas dataframes from the following website:
(Note: you need to adjust the argument for read_csv() from sep=',' to sep=';')
Add a new column to each data frame called “winetype”. For the white wine dataset label the values in this column with a 0, indicating white wine. For the red wine dataset, label values with a 1, indicating red wine. Combine both datasets into a single dataframe.
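The labeling and combining step can be sketched as follows. Because the file locations are not reproduced here, the example reads two tiny made-up semicolon-separated tables from memory; the column names and values in red_csv and white_csv are placeholders, not the real wine data.

```python
import io
import pandas as pd

# Made-up stand-ins for the two CSV files (assumption): swap in the real paths.
red_csv = io.StringIO("fixed acidity;pH;quality\n7.4;3.51;5\n7.8;3.20;5\n")
white_csv = io.StringIO("fixed acidity;pH;quality\n7.0;3.00;6\n6.3;3.30;6\n8.1;3.26;6\n")

# The files are semicolon-separated, hence sep=";".
red = pd.read_csv(red_csv, sep=";")
white = pd.read_csv(white_csv, sep=";")

# 0 marks white wine, 1 marks red wine.
white["winetype"] = 0
red["winetype"] = 1

# Stack the two frames and rebuild the index so row labels are unique.
wine = pd.concat([white, red], ignore_index=True)
print(wine.head(10))
```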
The target data (i.e. the dependent variable) is “winetype”.
2.1 Visualize the univariate distribution of the target feature and each of the three explanatory variables that you think are likely to have a relationship with the target feature.
2.2 Split the data into training and test sets. Build models that evaluate the relationship between all available X variables in the dataset and the target variable. Evaluate Logistic Regression, Penalized Logistic Regression, and KNN for classification using cross-validation. How different are the results? How does scaling the data with StandardScaler influence the results?
2.3 Tune the parameters where possible using GridSearchCV. Do the results improve?
2.4 Change the cross-validation strategy in GridSearchCV from ‘stratified k-fold’ to ‘kfold’ with shuffling. Do the parameters for models that can be tuned change? Do they change if you change the random seed of the shuffling? Or if you change the random state of the split into training and test data?
2.5 Lastly, compare the coefficients for Logistic Regression and Penalized Logistic Regression and discuss which final model you would choose to predict new data.
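For question 2.4, one possible way to swap the cross-validation strategy is to pass a splitter object to the cv argument of GridSearchCV. The sketch below uses synthetic data from make_classification in place of the wine data, and the grid of C values, the fold count, and the names strategies and best are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    GridSearchCV, KFold, StratifiedKFold, train_test_split
)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): replace with the wine features and "winetype".
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
# Hypothetical grid over the inverse regularization strength C.
param_grid = {"logisticregression__C": [0.01, 0.1, 1, 10, 100]}

strategies = {
    "stratified k-fold": StratifiedKFold(n_splits=5),
    "k-fold, shuffle seed 0": KFold(n_splits=5, shuffle=True, random_state=0),
    "k-fold, shuffle seed 1": KFold(n_splits=5, shuffle=True, random_state=1),
}

best = {}
for name, cv in strategies.items():
    grid = GridSearchCV(pipe, param_grid, cv=cv)
    grid.fit(X_train, y_train)
    best[name] = grid.best_params_["logisticregression__C"]
    print(f"{name}: best C = {best[name]}, CV accuracy = {grid.best_score_:.3f}")
```

Changing random_state in train_test_split and rerunning the loop addresses the last part of the question in the same way.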