# Data 102 Assignment 4

Overview

Submit your writeup, including all code and plots, as a PDF via Gradescope. We recommend
reading through the entire homework beforehand and carefully using functions for
testing procedures, plotting, and running experiments. Taking the time to test, maintain,
and reuse code will help in the long run!

Data science is a collaborative activity. While you may talk with others about the
homework, please write up your solutions individually. If you discuss the homework with
written answers are legible, as we may deduct points otherwise.

Please note that this homework is slightly shorter than usual, to give you
time to start working on your project.

1 Observational Data on Infant Health

The Infant Health and Development Program (IHDP) was an experiment treating low
birth-weight, premature infants with intensive high-quality childcare from a trained provider.

The goal is to estimate the causal effect of this treatment on the child’s cognitive test
scores. The data does not represent a randomized trial with randomly allocated treatment,
so there may be confounders between treatment and outcome. In this problem, we
devise a propensity score model to control for observed confounders.
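
To make the idea of a propensity score model concrete, here is a minimal sketch on made-up data. It assumes a single confounder `x` and fits the probability of treatment with a hand-written gradient descent loop; the assignment itself asks for scikit-learn, so this standard-library version is only an illustration of what such a model computes. The names `sigmoid` and `fit_propensity`, the learning rate, the step count, and the simulated data are all hypothetical.

```python
import math
import random


def sigmoid(t):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))


def fit_propensity(xs, zs, lr=0.5, steps=500):
    """Fit sigmoid(b0 + b1 * x) to binary treatments by gradient descent.

    Minimizes the mean logistic (cross-entropy) loss. Hypothetical helper:
    a stand-in for a library logistic regression, without regularization.
    """
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, z in zip(xs, zs):
            err = sigmoid(b0 + b1 * x) - z  # gradient of the loss w.r.t. the logit
            g0 += err
            g1 += err * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1


# Made-up data (not IHDP): larger x makes treatment more likely.
rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(500)]
zs = [1 if rng.random() < sigmoid(1.5 * x) else 0 for x in xs]

b0, b1 = fit_propensity(xs, zs)

# Estimated propensity score for each unit: predicted probability of treatment.
e_hat = [sigmoid(b0 + b1 * x) for x in xs]
```

Each entry of `e_hat` lies strictly between 0 and 1, and units with larger `x` receive larger estimated propensities because the simulated treatment depends on `x`.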

(a) (2 points) The CSV file ihdp.csv has 27 columns:

• Column 1 is the treatment $z_i \in \{0, 1\}$, which indicates whether or not the treatment
was given to the infant.
• Column 2 is the outcome $y_i \in \mathbb{R}$, the child’s cognitive test score.
• Columns 3-27 contain 25 features of the mother and child (e.g. the child’s birth
weight, whether or not the mother smoked during pregnancy, her age and race).
Since this dataset was not collected by a randomized trial, these features could
all confound $z_i$ and $y_i$, and are denoted by $x_i \in \mathbb{R}^{25}$.
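
The column layout above maps directly onto a row-by-row split. The following is a rough sketch using Python's `csv` module on a tiny in-memory stand-in for ihdp.csv. The sample rows, the three feature columns (instead of 25), the header names, and the presence of a header row are all assumptions made for illustration, and `split_columns` is a hypothetical helper name.

```python
import csv
import io

# Made-up stand-in for ihdp.csv: treatment, outcome, then feature columns.
# The real file has 25 feature columns; whether it has a header row is assumed.
sample = io.StringIO(
    "treatment,y,x1,x2,x3\n"
    "1,95.2,0.3,1,0\n"
    "0,88.7,-1.1,0,1\n"
)


def split_columns(handle, has_header=True):
    """Split rows into treatments Z, outcomes Y, and feature rows X."""
    reader = csv.reader(handle)
    if has_header:
        next(reader)  # skip the header row
    Z, Y, X = [], [], []
    for row in reader:
        if not row:  # ignore blank lines
            continue
        Z.append(int(float(row[0])))           # column 1: treatment in {0, 1}
        Y.append(float(row[1]))                # column 2: outcome
        X.append([float(v) for v in row[2:]])  # columns 3 onward: features
    return Z, Y, X


Z, Y, X = split_columns(sample)
```

For the real file, the same function could be called on a handle from `open("ihdp.csv", newline="")`. The results here are plain lists; converting them to arrays is a separate step that depends on the numerical library in use.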

In this part, you’ll estimate $\hat{e}(x)$ (the predicted probability that $z_i = 1$) by fitting a
logistic regression model that predicts $z_i$ from $x_i$. Specifically:

1. Read the data in ihdp.csv (e.g. using the csv package in Python) into three
arrays: $Z \in \{0, 1\}^n$ containing the treatments, $Y \in \mathbb{R}^n$ containing the outcomes,
and $X \in \mathbb{R}^{n \times 25}$ containing the features.

2. To fit a logistic regression model, use the scikit-learn package in Python,