Note: requests, bs4, and numpy are not available by default on CAEN. To use them, please install them in
a virtual environment. You can use the following sequence of commands:
python3 -m venv env/
source env/bin/activate
pip3 install bs4 requests numpy
[50 points] Web crawler.
Write a Web crawler that collects the URLs of webpages from the umich domain. Your crawler will
have to perform the following tasks:
a. Start with https://eecs.engin.umich.edu/
b. Perform a Web traversal using a breadth-first strategy. You should only crawl HTML webpages.
c. Keep track of the traversed URLs, making sure:
– they are part of the eecs.engin.umich domain (or the old eecs.umich domain). This will include
any pages under eecs.umich, eecs.engin.umich, ece.engin.umich, cse.engin.umich
– they were not already traversed (i.e., avoid duplicates, avoid cycles)
d. Stop when you reach 2000 URLs.
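Tasks (a) through (d) can be sketched as a queue-based traversal. This is a minimal sketch under stated assumptions, not a reference solution: the names ALLOWED_SUFFIXES, normalize, in_domain, and bfs_crawl are hypothetical, the host list is one reading of item (c), and get_links is a caller-supplied function that returns the links found on a page (and an empty list for pages that are not HTML or cannot be fetched).

```python
from collections import deque
from urllib.parse import urldefrag, urlparse

# Assumed reading of the allowed hosts in item (c); adjust if the scope differs.
ALLOWED_SUFFIXES = ("eecs.umich.edu", "eecs.engin.umich.edu",
                    "ece.engin.umich.edu", "cse.engin.umich.edu")


def normalize(url):
    """Drop the #fragment and any trailing slash so duplicates compare equal."""
    url, _fragment = urldefrag(url)
    return url.rstrip("/")


def in_domain(url):
    """True if the URL is http(s) and its host is an allowed host or a subdomain of one."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    return parts.scheme in ("http", "https") and any(
        host == s or host.endswith("." + s) for s in ALLOWED_SUFFIXES)


def bfs_crawl(seeds, max_urls, get_links):
    """Breadth-first traversal; returns (crawled URLs in order, list of edges)."""
    queue = deque(normalize(s) for s in seeds)
    seen = set(queue)          # everything ever enqueued, to avoid cycles
    crawled, edges = [], []
    while queue and len(crawled) < max_urls:
        url = queue.popleft()  # FIFO order gives breadth-first traversal
        crawled.append(url)
        for link in get_links(url):
            link = normalize(link)
            if not in_domain(link):
                continue
            edges.append((url, link))
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return crawled, edges
```

In this sketch, http and https versions of a page count as different URLs, and some edges may point to pages that were queued but never crawled because the limit was reached first.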
Write a program called crawler.py that implements the Web crawler. The program will receive two
arguments on the command line:
● the first one consists of the name of a file containing all the seed URLs (for the purpose of
this assignment, this file will contain only one seed, namely http://eecs.engin.umich.edu);
● the second one consists of the maximum number of URLs to be crawled (for the purpose of
this assignment, the value of this parameter will be 2000).
Use a try-except block so that network errors do not crash your code.
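The argument handling and the try-except advice can be sketched as follows. This is only an illustration under stated assumptions: it uses the Python standard library (urllib and html.parser) so that it runs without the virtual environment, whereas requests and bs4 would make the same steps shorter. The names LinkParser, extract_links, get_links, and read_args are hypothetical, and the seed file is assumed to hold one URL per line.

```python
import http.client
from html.parser import HTMLParser
from urllib.error import URLError
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v)


def extract_links(html, base_url):
    """Return absolute URLs for all <a href> links in an HTML string."""
    parser = LinkParser()
    parser.feed(html)
    return [urljoin(base_url, h) for h in parser.hrefs]


def get_links(url, opener=urlopen, timeout=5):
    """Fetch url and return its links; return [] on errors or non-HTML content."""
    try:
        with opener(url, timeout=timeout) as resp:
            if "text/html" not in resp.headers.get("Content-Type", ""):
                return []
            html = resp.read().decode("utf-8", errors="replace")
    except (URLError, OSError, ValueError, http.client.HTTPException):
        # Timeouts, DNS failures, HTTP errors, and malformed URLs all land here.
        return []
    return extract_links(html, url)


def read_args(argv):
    """Parse `crawler.py <seed_file> <max_urls>` into (seeds, max_urls)."""
    seed_file, max_urls = argv[1], int(argv[2])
    with open(seed_file) as f:
        seeds = [line.strip() for line in f if line.strip()]
    return seeds, max_urls
```

A script would typically call read_args(sys.argv) once at startup. The opener parameter exists only so the function can be exercised without network access.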
The crawler.py program should be run using a command like this:
% python crawler.py myseedURLs.txt 2000
It should produce two files:
● A file called crawler.output including a list of all the URLs being identified by the crawler, in
the order in which they are crawled. E.g.:
● A file called links.output, including all the links (in the form of (source_URL, URL)) that you
identify in your crawl, which will serve as the directed edges in the graph representation for
the next problem. E.g.,
[50 points] PageRank.
Implement the PageRank algorithm and apply it to determine the PageRank score for each of the
2000 URLs you crawled. The PageRank implementation should assume:
– An initial value of 0.25 for all the URLs
– A value of 0.85 for d
– Convergence when, for each of the URLs, the difference between the scores obtained in two
consecutive iterations of PageRank falls below 0.001.
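The iteration implied by these assumptions can be sketched as below. The handout does not state the update formula, so this sketch assumes the common form PR(u) = (1 - d)/N + d * sum of PR(v)/outdegree(v) over pages v that link to u; if the course uses a different variant (for example, (1 - d) without the division by N), the update line changes accordingly. The function name pagerank and its parameters are hypothetical, and the handling of duplicate links, self-links, and links to uncrawled URLs is also an assumption.

```python
def pagerank(urls, links, d=0.85, threshold=0.001, init=0.25):
    """Iterate PageRank until every score changes by less than threshold."""
    if not urls:
        return {}
    nodes = set(urls)
    # Assumption: keep only edges between crawled URLs, collapse duplicate
    # edges, and ignore self-links.
    edges = {(s, t) for s, t in links if s in nodes and t in nodes and s != t}
    out_degree = {u: 0 for u in nodes}
    incoming = {u: [] for u in nodes}
    for s, t in edges:
        out_degree[s] += 1
        incoming[t].append(s)

    scores = {u: init for u in nodes}
    n = len(nodes)
    while True:
        # Assumed update rule: (1 - d)/N + d * sum(PR(v) / outdegree(v)).
        new = {u: (1 - d) / n
                  + d * sum(scores[v] / out_degree[v] for v in incoming[u])
               for u in nodes}
        delta = max(abs(new[u] - scores[u]) for u in nodes)
        scores = new
        if delta < threshold:  # largest per-URL change is below the threshold
            return scores
```

Pages with no outgoing links simply pass on no score in this sketch; redistributing their score evenly is another common choice.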
Write a program called pagerank.py that implements the PageRank algorithm. The program will
receive three arguments on the command line:
● the first one consists of the name of the file containing all the URLs (for the purpose of this
assignment, this file will contain all the URLs you identified in the previous problem, i.e.,
crawler.output);
● the second one consists of the name of the file containing all the links in the form of
(source_URL, URL) pairs (for the purpose of this assignment, this file will contain all the
links you identified in the previous problem, i.e., links.output);
● the third one consists of the convergence threshold for PageRank (for the purpose of this
assignment, the value of this parameter will be 0.001).
The pagerank.py program should be run using a command like this:
% python pagerank.py crawler.output links.output 0.001
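Reading the three arguments might look like the sketch below. The exact layout of the two files is not shown in this handout, so the sketch assumes one URL per line in the first file and one "(source_URL, URL)" pair per line in the second; the names load_urls, load_links, and parse_args are hypothetical. URLs that contain a comma or end in a parenthesis would need a different delimiter than the one assumed here.

```python
def load_urls(path):
    """Read one URL per line, skipping blank lines."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


def load_links(path):
    """Read one '(source_URL, URL)' pair per line (assumed layout)."""
    links = []
    with open(path) as f:
        for line in f:
            line = line.strip().strip("()")
            if not line:
                continue
            # Split at the first comma: left side is the source, right side the target.
            source, _, target = line.partition(",")
            links.append((source.strip(), target.strip()))
    return links


def parse_args(argv):
    """Parse `pagerank.py <urls_file> <links_file> <threshold>`."""
    if len(argv) != 4:
        raise SystemExit(
            "usage: python pagerank.py crawler.output links.output 0.001")
    return argv[1], argv[2], float(argv[3])
```

A script would typically call parse_args(sys.argv) and pass the resulting paths to the two loaders.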