The key objective of this assignment is to learn how to process messy text data using Python. Data rarely starts in the form you want, so you must massage it into the form you need. Python is a perfect tool for processing large volumes of raw data.
Australians love sports, so it seems fitting to begin our data adventure by processing a bit of statistical data from Australian Rules Football. If you are unfamiliar with the sport, you might want to watch this video https://youtu.be/Dtmu-1kMFZw and/or read a bit about the rules on Wikipedia https://en.wikipedia.org/wiki/Australian_Football_League.
The following template files are provided:
- A1.ipynb : a skeleton Jupyter notebook file you can use to interactively work through your initial testing to implement the functions as defined below.
- create_afl_tsv.py : a skeleton main file for Challenge 1. Once you have got your code working in the Jupyter notebook, you should move the code you need here so that it will “pass” a set of automated tests that we will run to verify your code can produce valid outputs for each of the function challenges you have been given.
- validate_afl_tsv.py : a skeleton main file for Challenge 2. Same as above.
- find_top_wins.py : a skeleton main file for Challenge 3. Same as above.
- aflutil/helper.py : several helper utility functions that are provided to read/write files in a specific format. You should look at all of these functions carefully, as they will help you solve the challenges.
- aflutil/AFLGame.py : a helper class that is used only in the third challenge.
- data directory : a set of markdown data files containing the input data to be used for this project.
- requirements.txt : use this file to ensure that you work in a clone of the Anaconda environment. It should not be modified. It is critical that you use the environment as defined, or you will fail the automated tests we will be running to validate your project code. See below for further instructions.
We realise that many of you are just getting used to Python, and there is much to learn, so we have tried to provide a well-defined harness to guide you towards a working prototype. Once you have the data, you will apply some basic data analysis techniques to find useful information in it.
Creating Anaconda Environment
The first task is to create the correct Anaconda working environment. The exact steps may differ a little depending on your OS, but you should be able to find the right invocation for your platform of choice. The command-line invocation is shown here.
conda create -n PDSA1 python=3.8
conda activate PDSA1
pip install -r requirements.txt
That’s it. This will create a new environment you can start in Anaconda using “conda activate PDSA1” and exit from using “conda deactivate”. For the first assignment you should not need anything except the Python core library and the packages related to Jupyter notebooks.
The data you will be processing has been crawled from the web to produce a set of markdown files containing raw statistical information for Australian Rules Football. You should study several of the input files in a text editor to get a feel for what you will have to parse. You will quickly start to recognise clear patterns in the line structure that you can exploit to process thousands of lines of raw data. The main files contain team-based statistics. The data is not clean, but it is well-formed, which makes it very amenable to Python data wrangling: extracting and aggregating all of the data into a more usable form – a dataframe. pandas is a popular Python package used regularly in data science, and you will find that dataframes are a very useful way to organise and process columns of data, similar to a database table – without all the overhead of a full RDBMS. However, as many of you are still learning Python, we will not use a dataframe in this project. It would be relatively easy to convert the data array used in this project into a pandas dataframe, as you will have done all the hard work of cleaning the data, but we will save that challenge for another day.
When your program is combined with the supplied datasets, it should produce the output as specified in the code template. The primary output format is a TSV file, which is similar to a CSV file but uses tabs instead of commas to separate fields. When working with long strings of text, tabs make collisions easier to avoid: tabs can be removed from a text field by replacing them with spaces, but commas cannot. Once you have TSV files (or even CSV files), it is easy to serialize them out to a file for storage and reload them when you need the data again. You can also easily load some, rather than all, of the data if there is a lot to process. Do not change the output functions provided, or you will fail the automated harness tests. All you need to do is implement the functions in the skeleton code. Once you get each of these functions to work, it will output the answers automatically. Each function is worth a subset of the 30 possible points you can get on the project.
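To see what a TSV round trip looks like, here is a minimal sketch using the standard csv module. This is illustrative only: the real reading and writing is handled for you by the functions in aflutil/helper.py, and the file name here is made up.

```python
import csv
import os
import tempfile

# Two made-up rows standing in for the cleaned data array.
rows = [
    ["Richmond", "2021", "R1", "H", "Carlton"],
    ["Richmond", "2021", "R2", "A", "Hawthorn"],
]

# Illustrative path; the helper functions decide the real output location.
path = os.path.join(tempfile.gettempdir(), "games_example.tsv")

# Serialize: same csv module as for CSV, just with a tab delimiter.
with open(path, "w", newline="") as f:
    csv.writer(f, delimiter="\t").writerows(rows)

# Reload: every cell comes back as a string.
with open(path, newline="") as f:
    reloaded = list(csv.reader(f, delimiter="\t"))

assert reloaded == rows  # every field survives the round trip
```

Because fields never contain tabs, no quoting or escaping is needed, which is exactly the collision-avoidance benefit described above.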
Processing Raw Team Statistics (15/30 marks)
The first challenge is the core of our first assignment. The basic idea is to process several semi-structured text files which contain outcomes for all AFL games for each team. For example, the start of the richmond.md file contains the following information:
| — |
| Richmond |
| — |
| 2021 |
| — |
| Rnd | T | Opponent | Scoring | F | Scoring | A | R | M | W-D-L | Venue | Crowd | Date |
| R1 | H | Carlton | 3.3 8.5 10.8 15.15 | 105 | 3.2 6.6 8.12 11.14 | 80 | W | 25 | 1-0-0 | M.C.G. | 49218 | Thu 18-Mar-2021 7:25 PM | …
There are only two kinds of fielded lines: the ones containing header information, like the first four, and the core statistics lines with 14 fields of data. Our goal is simply to walk each of these files and create a single two-dimensional array (a list of lists in Python). Each row of the array will contain the following 15 cells:
- Team – the team name, found on the second line of every file.
- Year – A header field that will come before all of the round statistic lines.
- Round – Round or Playoff Round
- Where – T above; an H (home), A (away), or F (final/playoff).
- Opponent – the name of the opposing team.
- For Scoring – quarterly scoring like “1.3 5.4 8.6 13.12”. Here, 1.3 represents 1 goal and 3 behinds scored in the first quarter. These are cumulative, and will be used to verify scores.
- For Total – total score by this team (F above).
- Against Scoring – this is the same as For scoring but for the opponent.
- Against Total – opponent total (A above)
- Result – Final outcome, which is a W (win), L (loss), or D (draw).
- Margin – difference between the two total scores.
- WDL – Current record of this team (not the opponent).
- Venue – stadium for game.
- Crowd – size of crowd.
- Date – date and time of game.
So, we can see that the columns we really want – with the exception of team name and year, which you can extract easily from the files – are already the way we want them to be; we just need to split the lines into columns. We will talk in the lectorials and practicals about how to process a text file line by line, and how to locate and split the lines we want using line.split('|'). This may seem daunting at first, but once you get the hang of it, you will find that parsing data files is surprisingly easy in Python. You just have to know what you need to extract and set up guards to ensure you ignore the lines you do not care about. You will want to ignore the header lines, which repeat for each year, and the two lines of cumulative statistics at the end of each year. These lines will look like this:
| Totals | 188.162 | 1290 | 187.177 | 1299 | P:16 W:7 D:0 L:9 | | 577583 | |
| Averages | 12.10 | 81 | 12.11 | 81 | | | 36099 | |
So if the first column contains Rnd, Totals, or Averages, you want to skip over it. You will skip the separator lines like | — | too. Everything else you will want to carefully capture as you walk each file, in order to build complete rows in the final array.
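The walk-and-guard loop described above can be sketched as follows. The function and variable names are illustrative, not the skeleton's exact names, and the separator check assumes the "| — |" lines seen in the sample file.

```python
def parse_team_file(path):
    """Walk one team file and return a list of 15-cell rows."""
    rows = []
    team, year = None, None
    with open(path, encoding="utf-8") as f:
        for line in f:
            cells = [c.strip() for c in line.split('|')]
            # split('|') leaves empty strings around the outer bars; drop them.
            fields = [c for c in cells if c]
            if not fields:
                continue
            if fields[0] in ('—', 'Rnd', 'Totals', 'Averages'):
                continue  # separator, column header, or end-of-year summary
            if len(fields) == 1:
                # A lone field is a header: the team name (second line of
                # the file) or a year like "2021".
                if fields[0].isdigit():
                    year = fields[0]
                else:
                    team = fields[0]
                continue
            # Everything else is a statistics line: prepend team and year.
            rows.append([team, year] + fields)
    return rows
```

Note how the guards come first, so only genuine statistics lines reach the append; the team and year captured from earlier header lines are carried forward into every row.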
Challenge 2: Validate total scores, margins, and outcomes (10/30 marks)
This task will challenge you a little more. Several rows of data are not guaranteed to be correct in the raw data. However, there is always enough information in a row to validate and correct the errors you encounter. More specifically, the total scores for home and away teams may be incorrect, which means the margin and final outcome may be wrong too. However, the quarterly scores for both home and away teams are always correct, so your goal is to parse these two fields and separate off the last recorded quarter score. Since these are cumulative, that is all you need to get the final score for the team. For example, given a game scoring of “1.0 1.4 4.5 5.8”, you would separate off “5.8”, split it into 5 and 8, ensure they are integers and not strings, multiply 5 by 6 (each goal is worth six points), and add 8 (each behind is worth one point).
In summary, you need to check the For Total, Against Total, Margin, and Result columns in every row to ensure that the scores, margins, and final outcomes are all correct. All changes should be applied to the current array, which is then written out one last time.
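The score check above can be sketched like this. The column indices assume one possible ordering of the 15-cell row from Challenge 1; they are an assumption, so adjust them to match your own array layout.

```python
# Assumed column positions in the 15-cell row (adjust to your layout).
FOR_SCORING, FOR_TOTAL = 5, 6
AGAINST_SCORING, AGAINST_TOTAL = 7, 8
RESULT, MARGIN = 9, 10

def final_score(scoring):
    """'1.0 1.4 4.5 5.8' -> 5 * 6 + 8 = 38 (the last quarter is cumulative)."""
    goals, behinds = scoring.split()[-1].split('.')
    return int(goals) * 6 + int(behinds)

def validate_row(row):
    """Recompute totals, margin, and result from the trusted quarterly scores."""
    for_total = final_score(row[FOR_SCORING])
    against_total = final_score(row[AGAINST_SCORING])
    row[FOR_TOTAL] = for_total
    row[AGAINST_TOTAL] = against_total
    row[MARGIN] = abs(for_total - against_total)
    if for_total > against_total:
        row[RESULT] = 'W'
    elif for_total < against_total:
        row[RESULT] = 'L'
    else:
        row[RESULT] = 'D'
    return row
```

Because the quarterly scores are always correct, recomputing the derived columns from them is simpler than trying to detect which of the stored values are wrong.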
Challenge 3: Find top five wins home and away (5/30 marks)
This raises the bar one more time and will require you to use a non-trivial data structure/function to find the biggest wins of all time – home and away. This challenge is definitely easier if you know you have the correct margins for every game. If you do, you just need to find the rows with the largest margins for home wins and away wins in the data set.
Hint: There are definitely multiple ways to achieve this goal, but we will cover an example in the lectorials that shows a problem analogous to this one. So pay attention in lectorials and make sure you understand the heapq.nlargest call when we cover it.
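As a taste of what heapq.nlargest does, here is a sketch on made-up data. The tuples stand in for rows of the cleaned array, with the margin and home/away flag pulled out for clarity; the real rows and column positions will come from your own array.

```python
import heapq

# Made-up games: (margin, where, description).
games = [
    (25, 'H', 'Richmond v Carlton'),
    (81, 'A', 'Geelong v Essendon'),
    (3,  'H', 'Sydney v Collingwood'),
    (54, 'A', 'Hawthorn v Melbourne'),
    (67, 'H', 'Adelaide v Fremantle'),
]

# Keep only home wins, then take up to five with the biggest margins.
home_wins = [g for g in games if g[1] == 'H']
top_home = heapq.nlargest(5, home_wins, key=lambda g: g[0])
print(top_home)  # biggest home-win margins first
```

heapq.nlargest returns the results already sorted from largest to smallest, and avoids sorting the whole list when you only need the top few.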