This is the first of 3 “milestones” for the final project. Please read this entire document
This project can be done individually or in pairs (no groups larger than 2 are allowed). The
amount of effort for a pair project will be roughly double an individual project. You are on your
own to set up a group if you so desire (feel free to use the Piazza forum, etc.). Once you have
turned in this first assignment as a group, you must keep that group throughout the project.
In general, the focus of the project is to show you can acquire, model, store and process
multiple sources of data, and build reliable pipelines to do so. For this course, actual “analysis”
of the data is secondary; you’ll be expected to say something about the data, but your
conclusions are not the focus of the project.
Your project will be scored on a number of factors, including (but not limited to!) the
complexity and size of your datasets, the quality of your pipeline and modeling code, and the
writeup of your research statement and conclusions. It’s a sizable amount of work, but it’s your
chance to actually do something substantial with data, so I hope you have fun with it!
The project contains three “milestones”, outlined below. Each one will be turned in separately,
with the final submission being your final project. Note that the first two milestones are
submitted via our course website. The final submission is to be done via GitHub.
Briefly, the milestones are:
1). Data set and problem selection
2). Data acquisition and modeling infrastructure
3). Research conclusions and writeup
You are welcome – and encouraged! – to move faster than the milestone schedule. Especially
in the beginning, it makes sense to write access code (i.e. scrapers and API crawlers) for your
data as soon as possible!
Homework 4/Project Milestone 1 (50 points)
Due Wednesday, November 13th at 11:59pm
Find three (3) data sets on the web from a topic area that’s interesting to you. (If there isn’t
any data that’s interesting to you, that’s a problem!):
1. One must be “scrape-able” (i.e. not available via an external API)
a. “Data set” is not just a simple web page
b. Needs to be something that requires automation to obtain
c. If you could just cut-and-paste the data, it’s not a data set
d. Data should be somewhat structured (maybe). If what you want to scrape are
large text blurbs or images, that’s fine, as long as there is an algorithmic way of
obtaining a large number of these items, while attaching meta-data to them.
e. Again, if you could obtain the data manually vs. writing a script, it doesn’t count!
2. One must be available via an external public API
a. You have to be able to access it without a ton of trouble
b. You can use OAuth if needed, but if you can’t get at the API, it doesn’t count
c. If you decide to use an API that requires OAuth or some other authentication
mechanism, you must test that you can access this beforehand
d. The API must allow a “reasonable” number of free accesses per
day/hour/lifetime, etc. Enough for you to be able to test and deploy
assignments and projects based on it
e. There is a list (by no means comprehensive!) of public APIs here:
i. Note which require authentication, and which do not!
3. The third can be either (API or scraped)
If you’re a Ph.D. student, something related to your research would be great! If you’re an MS
student, here’s your chance to work on data of your choosing! I don’t mind “double-dipping”,
and using this data for another course project, etc.
If you are doing the project as a pair, you will also need to do the following:
1. Come up with an additional 2 datasets (either via web scraping or API). One of these
datasets can be a statically downloaded set (i.e. CSVs from GitHub or some other
source). These datasets should all be related to the topic of your project, and connect
in some meaningful way.
The deliverable for Milestone 1 is a text file, submitted to the course website with the
For each data set, you need to include:
1. The URL/API endpoint of the data set. For APIs, include links to the API endpoint and
links to the documentation for the API (if you’re working as a pair, and have a static
data set, include the link to it).
2. A brief description of what the data set is/contains
3. A brief (4-6 sentence) description of how you might combine these datasets (i.e. how do
they relate to each other? What are the commonalities between them? How might you
connect them? How do they enrich each other?). For example, if you scraped census
data that contains a person’s “hometown”, Google Maps API data, and data with
median income per zip code, you might discuss how you would use the Google Maps API
to translate the hometown to a particular zip code, and then combine that with the
4. A brief (4-6 sentence) description of what you might hope to find in the data overall.
Basically, what are you trying to accomplish in this research project?
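To make the hometown example in item 3 concrete, here is a minimal sketch of that kind of join. Everything in it is an assumption for illustration: the records are placeholder values (not real data), and `lookup_zip` is a hypothetical stand-in for a real geocoding request.

```python
# Placeholder records for illustration only; none of these values are real data.
people = [
    {"name": "Person A", "home_town": "Town X"},
    {"name": "Person B", "home_town": "Town Y"},
    {"name": "Person C", "home_town": "Town Z"},
]

# Hypothetical lookup table standing in for a geocoding API response.
TOWN_TO_ZIP = {"Town X": "00001", "Town Y": "00002"}

# Placeholder income table keyed by zip code.
median_income_by_zip = {"00001": 50000, "00002": 65000}


def lookup_zip(home_town):
    """Stand-in for a geocoding request; returns None for unknown towns."""
    return TOWN_TO_ZIP.get(home_town)


def combine(people, income_by_zip):
    """Attach a zip code and median income to each person record."""
    combined = []
    for person in people:
        zip_code = lookup_zip(person["home_town"])
        combined.append({
            **person,
            "zip_code": zip_code,
            # Missing zip codes simply yield None instead of raising.
            "median_income": income_by_zip.get(zip_code),
        })
    return combined
```

The point of the sketch is the shared key: one data set supplies the hometown, a second translates it to a zip code, and the third is indexed by that zip code.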
Be sure to pick datasets that provide a meaningful amount of data, that you can actually access
(in terms of rate limits, etc.) and that you’re interested in working with for the duration of the
Project Milestone 2 (50 points)
Due Friday, November 30th at 11:59pm
In this assignment, you’ll take the data sets you described in the first milestone, and build
software to access, model and store the data.
Taking a closer look at each step:
1). Accessing the data: You’ll need to build scrapers and/or API crawlers for each data set you
described above. These will need to be robust against failure, and will need to respect API rate
limits, authentication, etc.
2). Modeling the data: Build a data model, using whatever method you prefer. This can be SQL
relationships, a Python class/object hierarchy, setting up Pandas dataframes, SQLAlchemy, etc.
There are other options too! You have the freedom to interface with your data however you’d
like, but keep in mind that regardless of how simple you think the data is, your solution will be
graded on how useful, extensible, modular and robust it is. That means that a solution that
works well for your data as it is, but fails if anything about your data changes, is not a
great solution!
3). Storing the data: This should be relatively straightforward based on your modeling decision.
SQL databases and Pandas dataframes have built-in capabilities to store their data on disk, but
there are serialization options for any modeling approach you choose. Basically, you have to be
able to save your data and reload it ☺
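As one possible way to make step 1 “robust against failure”, here is a minimal sketch of a retry wrapper with exponential backoff. The names `fetch_with_retry` and `http_get`, the attempt count and the delays are all assumptions for illustration, not requirements of the assignment; a real crawler would also need to honor whatever rate limits its API documents.

```python
import time
import urllib.request


def http_get(url, timeout=10):
    """Fetch a URL and return the raw response body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_with_retry(fetch, url, max_attempts=4, base_delay=1.0,
                     retry_on=(OSError,), sleep=time.sleep):
    """Call fetch(url), retrying with exponential backoff on failure.

    Network errors raised by urllib (URLError, timeouts) are subclasses of
    OSError, which is why that is the default in retry_on.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller see the error
            # Wait base_delay, then 2x, then 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

A call would look like `fetch_with_retry(http_get, some_url)`. Passing `fetch` and `sleep` in as parameters keeps the retry logic testable without touching the network.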
Your code should be modular in that it allows you to obtain the data from the scraper/API but
also obtain it from local storage. How you implement this (text files, CSV, cached webpages,
SQL files, Feather serialized dataframes, etc.) is up to you. Your main script should have a
--source=remote or --source=local
command line parameter that chooses where to obtain the data from.
When invoked, your Python script should grab the data (either locally or remotely) and stick it
into your data model.
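A minimal sketch of that remote/local switch, using only the standard library, might look like the following. The `Record` fields, the `--cache` option, the JSON file format and the `fetch_remote` placeholder are all assumptions for illustration; your own model and storage format will differ.

```python
import argparse
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class Record:
    """Hypothetical data model: one item from one data source."""
    source: str
    identifier: str
    value: float


def fetch_remote():
    """Placeholder for the scraper/API crawler; should return a list of dicts."""
    raise NotImplementedError("plug the scraper or API client in here")


def save_records(records, path):
    """Serialize records to a JSON file, creating the folder if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps([asdict(r) for r in records], indent=2))


def load_records(path):
    """Rebuild Record objects from a JSON file written by save_records."""
    return [Record(**row) for row in json.loads(Path(path).read_text())]


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Load project data into the model.")
    parser.add_argument("--source", choices=["remote", "local"], required=True,
                        help="where to obtain the data from")
    parser.add_argument("--cache", default="data/records.json",
                        help="local file used to store and reload the data")
    return parser.parse_args(argv)


def main(argv=None, fetch=fetch_remote):
    args = parse_args(argv)
    if args.source == "remote":
        # Pull fresh data, load it into the model, and cache it on disk.
        records = [Record(**row) for row in fetch()]
        save_records(records, args.cache)
    else:
        # Skip the network entirely and reload the cached copy.
        records = load_records(args.cache)
    return records
```

In a real script you would call `main()` under the usual `if __name__ == "__main__":` guard. Either branch ends with the same list of `Record` objects, so the rest of the pipeline does not need to know where the data came from.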
The deliverable for Milestone 2 is a Python script (or collection of Python scripts) submitted to
the course website in the following format:
Name your archive of files: LASTNAME_FIRSTNAME_hw5.zip
The main file to run should be called LASTNAME_FIRSTNAME_hw5.py. Following these naming
conventions is part of the grading process.
In addition, your zip file should include a plain text file named LASTNAME_FIRSTNAME_hw5.txt
that answers the following questions:
1. What are the strengths of your data modeling format?
2. What are the weaknesses? (Does your data model support sorting the information?
Re-ordering it? Only obtaining a certain subset of the information?)
3. How do you store your data on disk?
4. Let’s say you find another data source that relates to all 3 of your data sources (i.e. a
data source that relates to your existing data). How would you extend your model to
include this new data source? How would that change the interface?
5. How would you add a new attribute to your data (i.e. imagine you had a lat/long column
in a database. You might use that to access an API to get a city name. How would you
add city name to your data?)
The Final Project
Due the day of our final, Friday December 13th, at 1pm
[Hint: The most important things in the project are that you had a clear research plan, and you
turned in something that actually works. It’s better to have working code that comes to a
simple conclusion than an amazing research idea that throws an exception. Incrementally
debug your code!]
What conclusions you come up with is up to you, but basically you want to answer whatever
question originally prompted you to do the project in the first place. This could be a statistical
analysis, a visualization, a surprising feature of the data or something else entirely!
For this assignment, you’re to submit all code AND data you used for the project on GitHub. If
you’re working as a pair, each person needs to submit to GitHub, and both people need to
You should submit your project in the following format:
– project.txt (explained below)
– /data: Store the data you downloaded and wrote to disk in Milestone 2 here
– /src: Store your python files here
– environment.yml (optional). If you need packages that are not a standard part of
the Anaconda distribution, please create an environment file. Instructions for that
are here: https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html. You should also note this in #2 below.
The contents of project.txt should be the following:
1. The names of team member(s)
2. How to run your code (what command-line switches there are, what happens when you
invoke the code, etc.)
3. Any major “gotchas” to the code (i.e. things that don’t work, go slowly, could be
4. Anything else you feel is relevant to the grading of your project.
Also, answer some questions about the project itself:
5. What did you set out to study? (i.e. what was the point of your project? This should be
close to your Milestone 1 assignment, but if you switched gears or changed things, note
6. What did you discover, and what were your conclusions? (i.e. what were your findings? Were
your original assumptions confirmed, etc.?)
7. What difficulties did you have in completing the project?
8. What skills did you wish you had while you were doing the project?
9. What would you do “next” to expand or augment the project?
Overall, the grading will be based on what your code looks like (is it well written, robust against
errors, well documented, well commented), the detail of your research plan (did you have
realistic goals? Did you describe your goals clearly? Was your topic properly scoped, i.e. not
too big, not too small, etc.?) and your ability to follow directions.
One final hint: Debug your code incrementally. If your code does not run, it is unlikely you’ll
get a decent grade on the project, or the class in general. Always be working from a stable
source, and add features incrementally (re-watch the week 1 lecture for hints on how to do
Please contact the TA or me with any questions. And START EARLY ☺