# Programming Assignment Help | ST2195 Programming for Data Science

## A data science programming assignment from the UK

Section A

1. Consider the following objects in Python: i=(2,3), j=[2,3] and k={2,3}. State whether each of the following operations is possible (in Python). Justify your answers in one sentence. There is at least one correct statement, and negative marks apply for wrong choices.

(a) i=4

(b) print(j+1)

(c) k=1

(d) j=4

• Marks: 6

(a) Not possible. A tuple cannot be changed.

(b) Not possible. The + operator on a list only concatenates another list, so adding the integer 1 raises a TypeError.

(c) Not possible. A set is not indexed.

(d) Possible. A list can be changed.
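The model answers above read most naturally if operations (a), (c) and (d) are item assignments rather than plain re-bindings of the names. The sketch below assumes that reading and uses index 0 purely for illustration; the index and the helper names `attempt` and `assign_first` are not part of the original question.

```python
# Objects from the question.
i = (2, 3)   # tuple: immutable
j = [2, 3]   # list: mutable
k = {2, 3}   # set: mutable, but has no positions to index


def attempt(action):
    """Run `action`; return "ok" or the name of the TypeError it raised."""
    try:
        action()
        return "ok"
    except TypeError as exc:
        return type(exc).__name__


def assign_first(container, value):
    # Item assignment at index 0 (the index is an assumption for illustration).
    container[0] = value


results = {
    "a": attempt(lambda: assign_first(i, 4)),  # tuple item assignment
    "b": attempt(lambda: j + 1),               # list + int
    "c": attempt(lambda: assign_first(k, 1)),  # set item assignment
    "d": attempt(lambda: assign_first(j, 4)),  # list item assignment
}
print(results)
print(j)
```

Only (d) succeeds, and it changes the list in place. A plain re-binding such as `i = 4` would always succeed, because it replaces the object the name refers to instead of modifying the tuple.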

1. In which of the circumstances below do ridgeline plots provide the most appropriate choice? Provide justification for your answer in no more than two sentences.

(a) When we want to study the empirical density of a variable.

(b) When we want to compare frequencies of one variable across different categories of another variable.

(c) When we want to monitor changes in the distribution of a variable across different categories of another variable.

(d) When we want to explore the association between two continuous variables.

• Marks: 6

(c) Appropriate. A ridgeline plot shows the empirical density of the continuous variable for each category of the other variable, so shifts in the distribution across categories are easy to compare.

1. Which of the statements below is correct? Provide justification for your answer.

(a) When training a machine learning pipeline the main aim is to achieve a high training error.

(b) When training a machine learning pipeline the main aim is to achieve a moderate training error.

(c) When training a machine learning pipeline the main aim is to achieve a low training error.

(d) When training a machine learning pipeline achieving a low training error may not be the primary aim.

• Marks: 6

(d) The primary aim is to achieve a low test error, ideally (but not necessarily) with a low training error as well.
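A minimal sketch of why a low training error alone is not the goal: a 1-nearest-neighbour regressor memorises its training data and so reaches zero training error, yet it still makes errors on unseen data. The data-generating rule (y = 2x plus noise), the sample sizes and all function names are assumptions made for illustration.

```python
import random

random.seed(0)


def make_data(n):
    """Hypothetical data: y = 2x + Gaussian noise."""
    xs = [random.uniform(0, 10) for _ in range(n)]
    return [(x, 2 * x + random.gauss(0, 1)) for x in xs]


def predict_1nn(train, x):
    """Predict with the y value of the closest training point."""
    return min(train, key=lambda point: abs(point[0] - x))[1]


def mse(train, data):
    """Mean squared error of the 1-NN predictor on `data`."""
    return sum((predict_1nn(train, x) - y) ** 2 for x, y in data) / len(data)


train, test = make_data(50), make_data(50)
train_mse = mse(train, train)  # each point is its own nearest neighbour
test_mse = mse(train, test)    # unseen points expose the noise that was memorised
print(train_mse, test_mse)
```

The training error is exactly zero here, while the test error is strictly positive, which is why model choice should rest on held-out performance.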

1. Which of the following statements are correct? There is at least one correct statement, and negative marks apply for wrong choices.

(a) An IDE is an alternative operating system to Microsoft Windows.

(b) An IDE typically provides a source-code editor.

(c) Some source-code editors provide auto-completion of code and syntax highlighting.

(d) There are only 4 source-code editors for R and 3 for Python.

(e) A source-code editor for Python cannot be used for R.

(f) An IDE is necessary for writing code.

• Marks: 6

(b), (c)

1. Which of the following statements are correct? There is at least one correct statement, and negative marks apply for wrong choices.

(a) Jupyter notebooks cannot handle Python code.

(b) R Markdown is an authoring framework that combines Markdown with R.

(c) R Markdown files cannot be opened without installing R first.

(d) Jupyter Notebooks are open-source web-browser based applications.

(e) Jupyter notebooks were named after the first names of their creators, Julia and Peter.

(f) R Markdown files can be converted into a variety of formats including HTML, PDF, and Microsoft Word documents.

• Marks: 6

(b), (d), (f)

1. State which language (R or Python) each of the following code chunks comes from:

C1. vec = c(1, 4, 7)

C2. paste("Hello", "world")

C3. import numpy

C4. library("mlr")

C5. phrase = "Hello world"; print(phrase.lower())

C6. vec = (1, 4, 7)

C7. list("a", 5, 1:3)

C8. ["a", 5, (1, 2, 3)]

C9.

if (mark >= 50)
  print("pass")

C10.

if mark >= 50:
    print("pass")

C11. plot(1:5, 2:6)

C12. df = pandas.read_csv(fdate + '.csv')

C16. plt.subplots()

C17. ggplot(df, aes(x = x)) + geom_histogram()

C18. write.table(df, file = "df.csv")

C19. apply(df, 2, sum)

C20. df[-c(1, 3, 4), ]

• Marks: 10

R: C1, C2, C4, C7, C9, C11, C13, C14, C17, C18, C19, C20

Python: C3, C5, C6, C8, C10, C12, C15, C16
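As a quick check of the Python side of this classification, the sketch below runs chunks C5, C6, C8 and C10 as ordinary Python. The value of `mark` and the variable names `lowered`, `mixed` and `outcome` are assumptions added so the chunks can be executed and inspected.

```python
# C5: strings have a .lower() method in Python.
phrase = "Hello world"
lowered = phrase.lower()

# C6: parentheses with commas create a tuple (R would need c(1, 4, 7)).
vec = (1, 4, 7)

# C8: square brackets create a list that may mix types.
mixed = ["a", 5, (1, 2, 3)]

# C10: a colon and an indented block mark the body of the if statement.
mark = 62  # hypothetical mark, not from the question
if mark >= 50:
    outcome = "pass"
else:
    outcome = "fail"

print(lowered, type(vec).__name__, type(mixed).__name__, outcome)
```

The R chunks rely on syntax Python rejects, such as `1:3` as a standalone range expression or `c(...)` for vectors, which is what makes the two groups distinguishable.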

Section B

1. For each of the following statements about R, state whether it is always correct. Provide justification for your answer of no more than two sentences.

(a) A list is also a data frame.

(b) A data frame is also a list.

(c) A data frame can contain data objects of different types such as vectors and matrices.

(d) Data frames can contain lists.

(e) We can select the elements of both lists and data frames using their names.

• Marks: 10

(a) Not necessarily. A list can contain vectors, matrices, data frames and even other lists, which is not the case for data frames.

(b) Correct. A data frame can be viewed as a list of equal-length vectors of potentially different types.

(c) Incorrect. It can only contain vectors of different types.

(d) Incorrect. In fact, lists can contain data frames.

(e) Correct. This can be done using the \$ sign.

1. For each of the following statements, state whether it is always correct. Provide justification for your answer of no more than two sentences.

(a) The rows of a table in a relational database represent records.

(b) An attribute in a relational database is a tuple of rows.

(c) SQLite uses a separate server process to operate.

(d) The SQL query

SELECT employee_id, salary, department

FROM employee

WHERE employee_id >= 102 AND salary >= 100

ORDER BY salary

returns all available records and attributes from the table employee that have employee_id greater than or equal to 102 and salary greater than or equal to 100, ordered by increasing salary.

(e) The following R code chunk

inner_join(employee, company, by = "sector") %>%
  filter(department == "HR")

finds all records in tables employee and company that have matching values of sector, and returns only those records where department is "HR".

• Marks: 10

(a) Correct.

(b) Incorrect. An attribute in a relational database is a column of a table.

(c) Incorrect. SQLite, unlike most other RDBMSs, does not require a separate server process to operate.

(d) Incorrect. The query returns only the attributes employee_id, salary and department, not all attributes, for the records that satisfy the stated conditions.

(e) Correct.
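Points (c) and (d) can be checked with Python's built-in `sqlite3` module, which runs SQLite inside the calling process. This is a sketch under assumptions: the `name` column and the four rows are made up for illustration, and only the query itself comes from the question.

```python
import sqlite3

# An in-memory SQLite database: no server process is started.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee "
    "(employee_id INTEGER, name TEXT, salary REAL, department TEXT)"
)
# Hypothetical records, including an extra attribute (name).
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [
        (101, "Ann", 150.0, "HR"),
        (102, "Bob", 90.0, "IT"),
        (103, "Cy", 300.0, "HR"),
        (104, "Dee", 120.0, "IT"),
    ],
)

cursor = conn.execute(
    """
    SELECT employee_id, salary, department
    FROM employee
    WHERE employee_id >= 102 AND salary >= 100
    ORDER BY salary
    """
)
columns = [description[0] for description in cursor.description]
rows = cursor.fetchall()
conn.close()

print(columns)
print(rows)
```

The result has three columns rather than four, so `name` is not returned, and the two matching rows come back in increasing order of salary.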

1. Explain in no more than 2 sentences, why the following statements are wrong.

(a) Git is a repository hosting service for GitHub.

(b) A Git repository cannot be accessed without an internet connection.

(c) The command git add is used to add another user in the repository.

(d) Structured data are stored in a local hard drive, while unstructured data are in the cloud.

(e) CSV files are special instances of XML files.

(f) y x returns y modulo x.

(g) A dictionary in Python is a collection that is ordered, unchangeable and indexed.

(h) The command matrix(1:12, nrow = 3) in R will create a matrix with 3 rows and 1 column with elements 1, 2, 3.

(i) A data frame in R can only hold factors and numeric variables.

(j) Mutable objects in Python are objects whose value changes depending on the operations performed on them.

(k) ggplot2 is an R system for data wrangling.

• Marks: 10

(a) GitHub is a repository hosting service for Git.

(b) A Git repository can be set up and accessed on a local computer, with no internet connection.

(c) git add adds file contents to the index (the staging area); it does not manage users.

(d) Structured data are organized according to a predetermined set of rules, whereas unstructured data are data for which it is difficult to define such rules in advance; the distinction has nothing to do with where the data are stored.

(e) A CSV (comma-separated values) file is not a special instance of an XML file. CSV represents structured data as a two-dimensional array; XML is a markup language that defines a set of rules for encoding information in objects.

(f) y x returns y raised to the power of x.

(g) A dictionary is a collection that is changeable and indexed by keys; since Python 3.7 it also preserves insertion order.

(h) matrix(1:12, nrow = 3) creates a matrix with 3 rows and 4 columns.

(i) A data frame in R can hold multiple variable types.

(j) Mutable objects are objects whose values can be changed in place after creation (e.g. lists, sets and dictionaries).

(k) ggplot2 is an R package for graphics.

1. Match the commands C1-C4 with the output in O1-O4.

C1. git status

C2. print(type(2.3))

C3. paste("type", "'float'")

C4. git checkout master

O1. "type 'float'"

O2. <type 'float'>

O3. Switched to branch 'master'

O4. On branch master

• Marks: 10

C1 – O4, C2 – O2, C3 – O1, C4 – O3
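One caveat on the C2 and O2 pairing: the output `<type 'float'>` is what Python 2 prints. Under Python 3 the same command reports a class instead, as this short check shows (the variable name `text` is an assumption for illustration).

```python
# str(type(...)) is the text that print(type(2.3)) would display.
text = str(type(2.3))
print(text)  # <class 'float'> on Python 3
```

The matching itself is unaffected, since O2 remains the only option that describes a float type.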

1. Consider a data set consisting of the following variables on several customers of a bank:
• balance: credit card outstanding balance
• cleared: whether this balance was cleared in time
• student: whether the person is a university student
• income: the income of the person

(a) Describe what graphs you would produce to demonstrate how the variables balance and income affect the likelihood of the person clearing the credit card outstanding balance in time.

(b) Suppose that when you look at frequencies, students tend to clear their outstanding credit card balances in time less often than the rest of the population. But if you focus on people with high outstanding credit card balances, students are more likely to pay in time than the rest of the population. Describe why this could be the case and what graphics you would use to depict it.

• Marks: 10

(a) A scatter plot of balance against income can be used, with the points labelled (for example by colour) according to cleared.

Side-by-side boxplots or violin plots of balance and of income, drawn separately for each category of cleared, can also be used.

(b) This could imply that the reason students tend to miss payment is because they tend to have higher balances, and not because they are students per se. When compared with other members of the population with high balances they are actually more reliable. To depict that, we can use a grouped barplot of the cleared proportions separately for cases with balance being below and above its median.
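The reversal described in (b) is an instance of Simpson's paradox, and it can be reproduced with made-up counts. Every number below is hypothetical and chosen only so that students are concentrated in the high-balance group; the function name `rate` is likewise an assumption.

```python
# (group, balance level) -> (number who cleared in time, group size)
counts = {
    ("student", "high"): (40, 80),
    ("student", "low"): (19, 20),
    ("other", "high"): (8, 20),
    ("other", "low"): (72, 80),
}


def rate(group, level=None):
    """Proportion who cleared in time, overall or within one balance level."""
    selected = [
        value
        for (g, lvl), value in counts.items()
        if g == group and (level is None or lvl == level)
    ]
    cleared = sum(c for c, _ in selected)
    total = sum(n for _, n in selected)
    return cleared / total


for level in (None, "high", "low"):
    print(level or "overall", rate("student", level), rate("other", level))
```

Overall, students clear less often (0.59 against 0.80), yet within each balance level they clear more often (0.50 against 0.40 for high balances, 0.95 against 0.90 for low). These per-level proportions are the quantities a grouped barplot split by balance level would display.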

1. Suppose we are interested in predicting a continuous variable y based on several features X, and we have a dataset with several missing values on the X features. We are comparing two machine learning models, namely ridge regression and random forests. Provide brief answers to the following:

(a) Provide the type of the machine learning models.

(b) Discuss the process of training the machine learning models indicating how the missing values will be handled.

(c) Discuss what criteria you will use to choose between the trained machine learning models.

• Marks: 10

(a) supervised learning, regression.

(b) The data can be split into a training set and a test set. The test set is left aside and not used at all during the training process. The missing values can be handled by one of the relevant methods, e.g. by filling them with the sample average of that variable computed from the cases without missing values. It is important that this is done through a machine learning pipeline, so that the sample average is calculated from points in the training dataset only. To tune the parameters of the machine learning pipelines with ridge regression and random forests, cross-validation can be used.

(c) The trained and tuned machine learning pipelines will be used to obtain predictions to the test set (that was left aside in the process of part (b)). The best model will be the one with better predictive performance according to the mean squared error criterion.
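The imputation step in part (b) can be sketched in plain Python to show what "calculated from points in the training dataset only" means in practice. The toy feature rows, the use of `None` for missing values and the function names are assumptions; in a real project a library pipeline would typically take care of this step.

```python
from statistics import mean


def fit_mean_imputer(rows):
    """Learn one mean per column from the non-missing values in `rows`."""
    n_cols = len(rows[0])
    return [
        mean(row[col] for row in rows if row[col] is not None)
        for col in range(n_cols)
    ]


def impute(rows, means):
    """Replace each missing value (None) with the stored column mean."""
    return [
        [means[col] if value is None else value for col, value in enumerate(row)]
        for row in rows
    ]


# Hypothetical feature rows; None marks a missing value.
train_X = [[1.0, 10.0], [None, 20.0], [3.0, None]]
test_X = [[None, 40.0], [100.0, None]]

means = fit_mean_imputer(train_X)      # learned from the training set only
train_filled = impute(train_X, means)
test_filled = impute(test_X, means)    # test rows reuse the training means

print(means)
print(test_filled)
```

The test set never contributes to the means, so the extreme test value 100.0 has no influence on how missing entries are filled. Computing the means on all the data before splitting would leak test information into training.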