COMP20008 Elements of Data Processing Assignment 1
Due date March 3, 2021
The assignment is worth 20 marks (20% of the subject grade) and is due at 8:00am on Thursday 1st April 2021, Australia/Melbourne time.
Background Learning outcomes
The learning objectives of this assignment are to:
- Gain practical experience in written communication skills for documenting data science projects.
- Practice a selection of processing and exploratory analysis techniques through visualisation.
- Practice text processing techniques using Python.
- Practice widely used Python libraries and gain experience in consultation of additional
documentation from Web resources.
There are three parts in this assignment: Part A, Part B, and Part C. Part A and Part B are worth 9 marks each, and Part C is worth 2 marks.
Part A (Total 9 marks)
For Part A, download the complete “Our World in Data COVID-19 dataset” (“owid-covid-
Part A Task 1 Data pre-processing (3 marks)
Write a Python program that produces a dataframe by
- (2 marks) aggregating the values of the following four variables:
total cases, new cases, total deaths, new deaths
by month and location in the year 2020.
The dataframe should contain the following columns after completion of this sub-task:
total cases, new cases, total deaths, new deaths
Note: if there are no entries for certain combinations of locations and months, there should be no entry for those combinations in the dataframe.
- (1 mark) adding a new variable, case fatality rate, to the dataframe produced from sub-task 1. The variable, case fatality rate, is defined as the number of deaths per confirmed case in a given period. Do not impute missing values.
The final dataframe should contain the columns in the following order:
case fatality rate, total cases
and the rows are to be sorted by location and month in ascending order.
COMP20008 2021 SM1
Print the first 5 rows of the final dataframe to the standard output.
Save the new dataframe to a CSV file named “owid-covid-data-2020-monthly.csv” in the same directory as the Python program. Your program should be called from the command line as follows:
python parta1.py owid-covid-data-2020-monthly.csv
Hint: You will need to use appropriate functions for the aggregation based on your understanding of the variables.
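As an illustration of the hint, the aggregation could be sketched with pandas roughly as follows. This is only one possible reading of the task, not a model answer. It assumes the CSV uses the column names `location`, `date`, `total_cases`, `new_cases`, `total_deaths` and `new_deaths`, that daily "new" counts are summed while cumulative "total" counts take the monthly maximum, and that case fatality rate is computed as monthly new deaths divided by monthly new cases. The function name `monthly_summary` is hypothetical.

```python
import pandas as pd


def monthly_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate daily records to one row per location and month of 2020."""
    # Assumed column names: location, date, total_cases, new_cases,
    # total_deaths, new_deaths.
    df = df.assign(date=pd.to_datetime(df["date"]))
    df = df[df["date"].dt.year == 2020].assign(month=lambda d: d["date"].dt.month)

    # Sum that stays missing when every value in the group is missing,
    # so nothing is imputed.
    def sum_keep_missing(s):
        return s.sum(min_count=1)

    monthly = df.groupby(["location", "month"], as_index=False).agg(
        total_cases=("total_cases", "max"),        # cumulative: take the peak
        new_cases=("new_cases", sum_keep_missing),  # daily: add up
        total_deaths=("total_deaths", "max"),
        new_deaths=("new_deaths", sum_keep_missing),
    )

    # Assumed definition: deaths per confirmed case within the month.
    monthly["case_fatality_rate"] = monthly["new_deaths"] / monthly["new_cases"]

    # Assumed column order, with the new variable placed before the counts.
    columns = ["location", "month", "case_fatality_rate",
               "total_cases", "new_cases", "total_deaths", "new_deaths"]
    return (monthly[columns]
            .sort_values(["location", "month"])
            .reset_index(drop=True))
```

A script built on this sketch might read the raw CSV with `pd.read_csv`, call `monthly_summary`, print `result.head(5)`, and write the result with `result.to_csv(sys.argv[1], index=False)`.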
Part A Task 2 Visualisation (2 marks)
Write a Python program that produces two scatter plots:
- (1 mark) a scatter plot of case fatality rate (on the y-axis) and confirmed new cases (on the x-axis) by location in the year 2020.
Output the plot to scatter-a.png in the same directory as the python program.
- (1 mark) a second scatter plot of the same data with only one change: the x-axis is changed to a log-scale.
Output the plot to scatter-b.png in the same directory as the python program. For this plot, apply preprocessing if necessary.
Your program should be called from the command line as follows:
python parta2.py scatter-a.png scatter-b.png
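The two plots could be produced along the following lines with matplotlib. This is a minimal sketch rather than a prescribed solution: the function name `save_scatter` is hypothetical, and dropping non-positive x values before applying the log scale is just one possible preprocessing choice (a log axis cannot display zero or negative values).

```python
import matplotlib

matplotlib.use("Agg")  # write image files without needing a display
import matplotlib.pyplot as plt


def save_scatter(new_cases, fatality_rates, out_path, log_x=False):
    """Save a scatter plot of case fatality rate against new cases.

    Returns the number of points actually plotted.
    """
    points = list(zip(new_cases, fatality_rates))
    if log_x:
        # Assumed preprocessing: keep only strictly positive x values.
        points = [(x, y) for x, y in points if x > 0]

    fig, ax = plt.subplots()
    ax.scatter([x for x, _ in points], [y for _, y in points], s=10)
    if log_x:
        ax.set_xscale("log")
    ax.set_xlabel("Confirmed new cases" + (" (log scale)" if log_x else ""))
    ax.set_ylabel("Case fatality rate")
    fig.savefig(out_path)
    plt.close(fig)
    return len(points)
```

With this helper, `save_scatter(x, y, "scatter-a.png")` and `save_scatter(x, y, "scatter-b.png", log_x=True)` would write the two files, with the output paths taken from the command-line arguments.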
Part A Task 3 Discussion and visual analysis (4 marks)
Write a short report on your visual analysis of the two plots produced in Task 2.
It is expected that the visual analysis would include:
- (1.5 marks) a brief introduction/description of the raw data, including the source, any limitations you observe in the data and all preprocessing steps taken on the raw data to produce the visualisations,
- (1.5 marks) explanation of the plots and patterns observed, and
- (1 mark) a discussion contrasting the two scatter plots.
The report is to be 500 to 600 words (maximum) excluding figures, about 1 page, in PDF format, and must include the two plots, scatter-a.png and scatter-b.png, produced in Part A Task 2.
The filename of the report must be “owid-covid-2020-visual-analysis.pdf”.
Part B Task 1: Regular Expressions (1 mark)
Each article contains a document ID which uniquely identifies the document. This document ID consists of four letters followed by a hyphen, then three digits, optionally ending in a letter. For example, each of the following is a valid document ID:
ABCD-123V, XKCD-999A, COMP-200
The document IDs are not located in a consistent place in each article. Use a regular expression to identify the document ID for each document in the dataset. Write a Python program in partb1.py that produces a CSV file called partb1.csv containing the filenames and document IDs for each document in the dataset. Your CSV file should contain the following columns in the order below:
Your program should be called from the command line along with the name of the CSV file:
python partb1.py partb1.csv
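The document ID format described above could be matched with a pattern like the one below. This is a sketch under one assumption: the letters are taken to be uppercase, as in the examples, which the task text does not state explicitly. The names `DOC_ID_PATTERN` and `find_document_id` are hypothetical.

```python
import re

# Four letters, a hyphen, three digits, then an optional trailing letter.
# Uppercase-only letters are an assumption based on the examples given.
DOC_ID_PATTERN = re.compile(r"[A-Z]{4}-\d{3}[A-Z]?")


def find_document_id(text):
    """Return the first document ID found in text, or None if absent."""
    match = DOC_ID_PATTERN.search(text)
    return match.group(0) if match else None
```

The pattern does no boundary checking, so an ID embedded in a longer run of letters or digits would still match in part. Whether that matters depends on the articles themselves.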
Part B Task 2: Preprocessing (1 mark)
We now wish to perform the following preprocessing on each article in the cricket folder in order to make them easier to search:
- Remove all non-alphabetic characters (for example, numbers and punctuation characters), except for spacing characters such as whitespaces, tabs and newlines.
- Convert all spacing characters such as tabs and newlines to whitespace and ensure that only one whitespace character exists between each word.
- Change all uppercase characters to lowercase.
Create a Python program in partb2.py that performs this preprocessing.
Your program should be called from the command line along with the filename of a document. For example:
python partb2.py cricket001.txt
Your program should then load the specified file, perform the preprocessing steps above and print the results to standard output.
Hint: You may wish to create a function for this preprocessing, as you will need to perform it as part of each task in Part B.
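Following the hint, the three preprocessing steps could be wrapped in a single reusable function, sketched below. It assumes "alphabetic" means the ASCII letters A to Z and a to z. The function name `preprocess` is hypothetical.

```python
import re


def preprocess(text):
    """Apply the three preprocessing steps to a document's text."""
    # Step 1: drop everything that is neither a letter nor a spacing character.
    # Treating only ASCII letters as alphabetic is an assumption.
    letters_only = re.sub(r"[^A-Za-z\s]", "", text)
    # Step 2: turn each run of tabs, newlines and spaces into a single space.
    single_spaced = re.sub(r"\s+", " ", letters_only)
    # Step 3: convert to lowercase.
    return single_spaced.lower()
```

For example, `preprocess("Hello,\tWorld!\n2021 Ashes")` returns `"hello world ashes"`. The sketch leaves any leading or trailing space in place, since the task text does not say how to treat it.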