# Python Tutoring | Linear Regression for Gene Expression Prediction

Lab1: Regression

Part 1:  Linear Regression for Gene Expression Prediction (40 points)

There are ~20,000 genes in the human genome. Each one of them is transcribed to mRNA and then translated to proteins, which carry out various tasks inside our body. We can measure the expression levels of these ~20,000 mRNAs in samples collected from different organs. This collection of measurements is called a gene expression profile.

Although our genome is the same across all cell types, the gene expression profile differs because each organ needs different proteins for its survival. One of the regulatory mechanisms that controls the expression level in each cell type is microRNA (miR). MicroRNAs are small molecules that attach to mRNAs, preventing their translation into proteins and also promoting their degradation.

So if miR A targets mRNA B, then when A increases, B decreases. Our goal is to predict mRNA levels (the gene expression profile) using 21 miR features. Note that each of the 20,000 expression levels can serve as the response of a regression with 21 features. To simplify, we have selected a few genes whose expression you will predict.

Your job will be to investigate how well the miR values predict the mRNA values.

We recommend the sklearn.linear_model package for the linear regression experiments, but you may use other packages if you wish.

You will need the following data files:

Load the provided data and implement the code required for the following steps:

You should randomly divide the samples into 80/20 training/test splits, and repeat the experiment 10 times to give mean and standard deviation of the metrics.

Predict each of the well-expressed and poorly expressed genes with a separate linear regression predictor per mRNA. In total you should solve 35 linear regression problems.
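A minimal sketch of one such repeated experiment, assuming `X` is the samples × 21 miR feature matrix and `y` is one gene's expression vector (both names are hypothetical placeholders for the provided data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

def repeated_regression(X, y, n_repeats=10, test_size=0.2):
    """Fit a linear model on random 80/20 splits and collect test R2 and RMSE."""
    r2s, rmses = [], []
    for seed in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed)
        model = LinearRegression().fit(X_tr, y_tr)
        pred = model.predict(X_te)
        r2s.append(r2_score(y_te, pred))
        rmses.append(np.sqrt(mean_squared_error(y_te, pred)))
    # Report mean and standard deviation over the 10 repeats.
    return (np.mean(r2s), np.std(r2s)), (np.mean(rmses), np.std(rmses))
```

Calling this once per selected gene (35 calls in total) yields the mean/std tables requested below.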

Report on the following:

1.  The mean and standard deviation for each of the mRNA predictors for both the R2 and RMSE metrics.
2.  Visualize and compare the performance of the well-expressed gene set to the poorly expressed gene set using R2. Draw histograms of the R2 values for both sets on the same plot; make one histogram comparison for the training data and one for the test data.

Describe the differences you see across the well and poorly expressed gene sets.
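One way to draw the overlaid histograms, assuming `r2_well` and `r2_poor` are lists of per-gene R2 values you collected (hypothetical names):

```python
import matplotlib.pyplot as plt

def plot_r2_histograms(r2_well, r2_poor, title):
    """Overlay R2 histograms for the two gene sets on a single axis."""
    fig, ax = plt.subplots()
    # alpha < 1 keeps both distributions visible where they overlap
    ax.hist(r2_well, bins=15, alpha=0.5, label="well expressed")
    ax.hist(r2_poor, bins=15, alpha=0.5, label="poorly expressed")
    ax.set_xlabel("$R^2$")
    ax.set_ylabel("number of genes")
    ax.set_title(title)
    ax.legend()
    return fig
```

Call it once with the training R2 values and once with the test R2 values to produce the two required comparisons.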

1. In this part, we want to add a categorical feature as the 22nd predictor. Tissue type is an important factor in explaining the gene expression profile. Our samples come from 33 tissue types which are provided to you in a separate file. Use dummy variable coding to include the tissue type in your regression.

In dummy variable coding of a categorical variable X with n levels, we add n − 1 columns to our features. The first level is coded as all zeros, and each remaining level sets exactly one of the columns to 1. For example, a categorical feature "Direction" with four levels (South, West, North, East) is coded as follows:

| Direction | West | North | East |
|-----------|------|-------|------|
| South     | 0    | 0     | 0    |
| West      | 1    | 0     | 0    |
| North     | 0    | 1     | 0    |
| East      | 0    | 0     | 1    |

So with 33 levels for the "Tissue" feature, you need to add 32 columns to your feature (design) matrix. Run the linear regression again with the newly added feature and the 80/20 split, report any change in your model's prediction performance, and explain it.
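A sketch of the dummy coding using pandas, assuming `X_mir` is the miR feature DataFrame and `tissue` is a Series of tissue-type labels, one per sample (both names hypothetical):

```python
import pandas as pd

def add_tissue_dummies(X_mir, tissue):
    """Append n-1 dummy columns for tissue type; drop_first drops the
    reference level, so 33 tissues yield 32 extra columns."""
    dummies = pd.get_dummies(tissue, prefix="tissue",
                             drop_first=True, dtype=float)
    return pd.concat([X_mir.reset_index(drop=True),
                      dummies.reset_index(drop=True)], axis=1)
```

`drop_first=True` implements exactly the coding in the table above: the first (alphabetical) level becomes the all-zeros reference row.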

Part 2: Logistic Regression (40 points)

In this exercise, you will implement logistic regression by gradient descent. You should not use off-the-shelf logistic regression solvers for this problem. This will also exercise your data skills, so you may want to read up on the pandas toolkit if you wish to use Python.

Auxiliary notes: Logistic regression for binary prediction

Problem: you are given a dataset of 400 people: half female, half male; half of the people are basketball players and half are not. The data has three features: height (inches), weight (pounds), and female (0=male, 1=female). The variable you want to predict is basketball player (0=non-player, 1=player).

CSV file

Implement gradient descent for logistic regression. You may want to consult the notes on logistic regression in the regression module for help. Train the model on 80% of the data, reserving 20% for the test set.

1. Train the model first to predict the probability of being a basketball player given height. Evaluate on the test set in a few ways:
• Compute the average loss on the test set:

$$-\frac{1}{N}\sum_{i=1}^{N}\log\Bigl(\mathrm{target}_i\cdot P(\mathrm{prediction}_i)+\bigl(1-\mathrm{target}_i\bigr)\bigl(1-P(\mathrm{prediction}_i)\bigr)\Bigr)$$

(Note this is just a compact way of saying: use P(prediction) when the target is 1, and 1 − P(prediction) when the target is 0.)
• Compute the accuracy on the test set by predicting someone is a basketball player if P(prediction) > 0.5.
• Plot the training data as well as the learned logistic regression function.
2.  Now train the model to be gender dependent by incorporating both the height and female features. Evaluate on the same test set with average loss and accuracy. Plot the logistic regression function across heights for male and female – do the learned functions make sense relative to one another?
3. Incorporate the weight feature (training on (height, weight) and (height, weight, gender)). Evaluate on average loss and accuracy. Does weight help as a feature?
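A minimal from-scratch sketch of the required gradient descent (no off-the-shelf solver), with the loss and accuracy metrics defined above. The learning rate and iteration count are illustrative; with raw heights/weights you will likely want to standardize the features first:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, lr=0.1, n_iters=2000):
    """Batch gradient descent on the mean cross-entropy loss.
    X: (n, d) feature matrix, y: (n,) array of 0/1 labels."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iters):
        p = sigmoid(X @ w + b)       # predicted probabilities
        grad_w = X.T @ (p - y) / n   # gradient of mean cross-entropy w.r.t. w
        grad_b = np.mean(p - y)      # ... and w.r.t. the bias
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def average_loss(X, y, w, b):
    """Mean negative log-likelihood, matching the formula above."""
    # Clip to avoid log(0) when predictions saturate numerically.
    p = np.clip(sigmoid(X @ w + b), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def accuracy(X, y, w, b):
    """Fraction correct when predicting player iff P(prediction) > 0.5."""
    return np.mean((sigmoid(X @ w + b) > 0.5) == y)
```

The same three functions cover all three parts; only the columns passed in as `X` change (height; height + female; height + weight, with or without gender).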

Submission:

Submit all files needed for the TA to grade.  You can choose one of two methods:

• iPython notebook: you can document your code and provide the written answers within the iPython notebook. Please indicate your name at the top. Also, please submit the file and not a link to the file if you are using Google Colab (or other external services).
• Zip archive: make sure to include both your writeup and the code, as well as instructions on how to execute the code.  If you are not using python, please make sure that the TA has the ability to evaluate your code BEFORE starting the assignment.

You may ask colleagues for general understanding assistance but do not share code.  You may start, however, from the hands-on code as a jumping off point.  Please do not copy code from the internet in developing your answer.
