# 这是一篇关于运算科学的计算机编程代写

## Exercise 1

As part of this exercise we will be performing two operations on a large dataset using Numpy. First, considering an array of values, we will compute the percentage differential from one value to the next, as in the example below.

input_array = [43, 63, 72, 64, 16]

percent_diff = [(43-63)/43, (63-72)/63, (72-64)/72, (64-16)/64]

Second, we will compute the cummulative sum on an input array:

input_array = [43, 63, 72, 64, 16]

cumsum = [43, 43+63, 43+63+72, 43+63+72+64, 43+63+72+64+16]

Working code for both functions is provided in the following cells, the task is to re-write both functions to accelerate the computation. Several means of speeding up code have been covered during the lectures. Those include the use of caching, the use of native languages, the use of parallel computing, or the modification of the default algorithm. You can use one or several combined strategies.

The following cell’s code is from https://github.com/rakshitha123/TSForecasting/blob/master/utils/data_loader.py and enables the convert an input dataset in .tsf format into a pandas data frame.

In :

import matplotlib.pyplot as plt

import numpy as np

import pandas as pd

from numba import njit, prange

from datetime import datetime

from distutils.util import strtobool

​​

# Converts the contents in a .tsf file into a dataframe and returns it along

# with other meta-data of the dataset: frequency, horizon, whether the dataset

# contains missing values and whether the series have equal lengths

#

# Parameters

# full_file_path_and_name – complete .tsf file path

# replace_missing_vals_with – a term to indicate the missing values in series

# in the returning dataframe

# value_column_name – Any name that is preferred to have as the name of the

# column containing series values in the returning dataframe

def convert_tsf_to_dataframe(

full_file_path_and_name,

replace_missing_vals_with=“NaN”,

value_column_name=“series_value”,

):

col_names = []

col_types = []

all_data = {}

line_count = 0

frequency = None

forecast_horizon = None

contain_missing_values = None

contain_equal_length = None

found_data_tag = False

found_data_section = False

with open(full_file_path_and_name, “r”) as file:  # , encoding=”cp1252″

for line in file:

# Strip white space from start/end of line

line = line.strip()

if line:

if not line.startswith(“@data”):

line_content = line.split(” “)

if line.startswith(“@attribute”):

if len(line_content) != 3:  # Attributes have both name and type

raise Exception(“Invalid meta-data specification.”)

col_names.append(line_content)

col_types.append(line_content)

else:

if len(line_content) != 2:  # Other meta-data have only values

raise Exception(“Invalid meta-data specification.”)

if line.startswith(“@frequency”):

frequency = line_content

elif line.startswith(“@horizon”):

forecast_horizon = int(line_content)

elif line.startswith(“@missing”):

contain_missing_values = bool(strtobool(line_content))

elif line.startswith(“@equallength”):

contain_equal_length = bool(strtobool(line_content))

else:

if len(col_names) == 0:

raise Exception(“Missing attribute section. ”

“Attribute section must come before data.”)

found_data_tag = True

elif not line.startswith(“#”):

if len(col_names) == 0:

raise Exception(“Missing attribute section. ”

“Attribute section must come before data.”)

elif not found_data_tag:

raise Exception(“Missing @data tag.”)

else:

found_data_section = True

all_series = []

for col in col_names:

all_data[col] = []

full_info = line.split(“:”)​

if len(full_info) != (len(col_names) + 1):

raise Exception(“Missing attributes/values in series.”)​

series = full_info[len(full_info)  1]

series = series.split(“,”)​

if len(series) == 0:

raise Exception(

“A given series should contains a set of comma separated ”

“numeric values. At least one numeric value should be there ”

“in a series. Missing values should be indicated with ? symbol”

)

numeric_series = []​

for val in series:

if val == “?”:

numeric_series.append(replace_missing_vals_with)

else:

numeric_series.append(float(val))

if numeric_series.count(replace_missing_vals_with) == len(numeric_series):

raise Exception(

“All series values are missing. A given series should contains ”

“a set of comma separated numeric values. At least one numeric ”

“value should be there in a series.”

)

all_series.append(pd.Series(numeric_series).array)​

for i in range(len(col_names)):

att_val = None

if col_types[i] == “numeric”:

att_val = int(full_info[i])

elif col_types[i] == “string”:

att_val = str(full_info[i])

elif col_types[i] == “date”:

att_val = datetime.strptime(full_info[i], “%Y-%m-%d %H-%M-%S”)

else:

raise Exception(

“Invalid attribute type.”

)   # Currently, the code supports only numeric, string and date types.

# Extend this as required.

if att_val is None:

raise Exception(“Invalid attribute value.”)

else:

all_data[col_names[i]].append(att_val)​

line_count = line_count + 1​

if line_count == 0:

raise Exception(“Empty file.”)

if len(col_names) == 0:

raise Exception(“Missing attribute section.”)

if not found_data_section:

raise Exception(“Missing series information under data section.”)​

all_data[value_column_name] = all_series

return (

frequency,

forecast_horizon,

contain_missing_values,

contain_equal_length,

)​

# Example of usage

# loaded_data, frequency, forecast_horizon, contain_missing_values, contain_equal_length = convert_tsf_to_dataframe(“TSForecasting/tsf_data/sample.tsf”)

# print(frequency)

# print(forecast_horizon)

# print(contain_missing_values)

# print(contain_equal_length)

The following cell loads a dataset downloaded from https://zenodo.org/record/7371038. The following description is from the dataset webpage:

“This dataset contains 145063 time series representing the number of hits or web traffic for a set of Wikipedia pages from 2015-07-01 to 2022-06-30. This is an extended version of the dataset that was used in the Kaggle Wikipedia Web Traffic forecasting competition. For consistency, the same Wikipedia pages that were used in the competition have been used in this dataset as well. The colons (:) in article names have been replaced by dashes (-) to make the .tsf file readable using our data loaders.

The original dataset contains missing values. They have been simply replaced by zeros.

You can download the dataset from https://zenodo.org/record/7371038/files/web_traffic_extended_dataset_without_missing_values.zip. Note that the file is large: 433 MB. The cells convert the dataset into a Numpy compressed file for easier use in the remainder of this jupyter notebook.

“这个数据集包含145063个时间序列，代表2015-07-01至2022-06-30期间一组维基百科页面的点击量或网络流量。这是在Kaggle维基百科网络流量预测比赛中使用的数据集的扩展版本。为了保持一致性，比赛中使用的维基百科页面也被用在这个数据集中。文章名称中的冒号（:）已被破折号（-）取代，以使.tsf文件可以用我们的数据加载器读取。

In :

# https://zenodo.org/record/7371038

df, frequency, forecast_horizon, contain_missing_values, contain_equal_length = convert_tsf_to_dataframe(

“web_traffic_extended_dataset_without_missing_values.tsf”

)

hits_arr = np.stack(df[“series_value”].apply(np.asarray).tolist())

np.savez_compressed(“web_traffic_extended_dataset_without_missing_values”, hits_arr)

The following cell implement the pct_change1 function, used to compute the percentage difference change beetween succesive values, as aforementioned. The cell is also used to time the function and display the result as a figure.

In [ ]:

def pct_change1(n):

result = np.diff(n) / n[:, :1]

result[np.isnan(result)] = 0  # difference is 0 and both values are 0

result[np.isinf(result)] = 1  # first value was 0 and next is not 0

return result​

pct = pct_change1(hits_arr)​

%timeit pct_change1(hits_arr)​

plt.figure(figsize=(14, 5))

plt.plot(pct[10559, 300:800])

### Question 1 [20 marks]

Modify the following cell to implement the function pct_change2 to accelerate the computation. The np.allclose() function checks that the results obtained using pct_change1 and pct_change2 are near identical.

In [ ]:

def pct_change2(n):

result = None

#TODO

return result​

pct_n = pct_change2(hits_arr)​

%timeit pct_change2(hits_arr)​

print(np.allclose(pct, pct_n))

The cumsum function from numpy can be used to compute the cummulative sum of the hits_arr array, as shown below.

numpy的cumsum函数可以用来计算hits_arr数组的累积和，如下所示。

In [ ]:

cs1 = np.cumsum(hits_arr, 1)

%timeit np.cumsum(hits_arr,1)

### Question 2 [10 marks]

Modify the following cell to implement the function cumsum2 to accelerate the computation. The np.allclose() function enables to check that the results obtained using np.cumsum and cumsum2 are similar.

In [ ]:

def cumsum2(n):

result = None

#TODO

return result

cs2 = cumsum2(hits_arr)

%timeit cumsum2(hits_arr)

np.allclose(cs1, cs2)

## Exercise 2

The aim of the exercise is to develop a effective solution to solve the travelling salesman problem.

The idea of the Travelling Salesman Problem is to find the shortest path necessary to visit all locations and return to the start. For an example of visiting four locations, cycle length is the distance travelled from starting location A to location B, then to location C, then to location D, then back to A. The brute force approach to solving this is consider every possible cycle between the locations, however for N locations there are N! possible cycles so this isn’t really tractable for anything above small values like 10.

The following cell generate a random graph with 100 nodes, links the nodes in indices increasing order (to create edges between pair of nodes) and display the resulting solution. E-mail: vipdue@outlook.com  微信号:vipnxx 