ST2195 Coursework Project

Evan John

Instructions to candidates
This project contains two questions. Answer BOTH questions. All questions will be given equal
weight (50%).
Part 1
In this part, you are asked to work with the Markov Chain Monte Carlo algorithm, in
particular the Metropolis-Hastings algorithm. The aim is to simulate random numbers
for the distribution with probability density function given below
$$f(x) = \frac{1}{2} \exp(-|x|),$$
where $x$ takes values in the real line and $|x|$ denotes the absolute value of $x$. More
specifically, you are asked to generate $x_0, x_1, \dots, x_N$ values and store them using the
following version of the Metropolis-Hastings algorithm (also known as random walk
Metropolis) that consists of the steps below:
Random walk Metropolis
Step 1 Set up an initial value $x_0$ as well as a positive integer $N$ and a positive real number $s$.
Step 2 Repeat the following procedure for $i = 1, \dots, N$:
• Simulate a random number $x^*$ from the Normal distribution with mean $x_{i-1}$ and
standard deviation $s$.
• Compute the ratio
$$r(x^*, x_{i-1}) = \frac{f(x^*)}{f(x_{i-1})}.$$
• Generate a random number $u$ from the uniform distribution between 0 and 1.
• If $u < r(x^*, x_{i-1})$, set $x_i = x^*$, else set $x_i = x_{i-1}$.
(a) Apply the random walk Metropolis algorithm using $N = 10000$ and $s = 1$. Use the
generated samples $(x_1, \dots, x_N)$ to construct a histogram and a kernel density plot in
the same figure. Note that these provide estimates of $f(x)$. Overlay a graph of $f(x)$ on
this figure to visualise the quality of these estimates. Also, report the sample mean
and standard deviation of the generated samples (Note: these are also known as the
Monte Carlo estimates of the mean and standard deviation respectively).
Practical tip: To avoid numerical errors, it is better to use the equivalent criterion
$\log u < \log r(x^*, x_{i-1}) = \log f(x^*) - \log f(x_{i-1})$ instead of $u < r(x^*, x_{i-1})$.
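The steps above, together with the log-scale acceptance criterion from the practical tip, can be sketched in Python as follows. This is a minimal illustration using only the standard library, not a reference solution: the function names `log_f` and `random_walk_metropolis`, the starting value `x0 = 0.0` and the seed are all assumptions made for the example, and the histogram and kernel density plot are left out.

```python
import math
import random
import statistics
from typing import List, Optional


def log_f(x: float) -> float:
    """Log of the target density f(x) = 0.5 * exp(-|x|)."""
    return math.log(0.5) - abs(x)


def random_walk_metropolis(n: int, s: float, x0: float = 0.0,
                           seed: Optional[int] = None) -> List[float]:
    """Return the samples x_1, ..., x_N generated by Steps 1 and 2."""
    rng = random.Random(seed)
    samples = []
    current = x0
    for _ in range(n):
        # Propose x* from a Normal distribution centred at the current value.
        proposal = rng.gauss(current, s)
        # Log of the ratio r(x*, x_{i-1}).
        log_r = log_f(proposal) - log_f(current)
        # 1 - rng.random() lies in (0, 1], so its logarithm is always defined.
        log_u = math.log(1.0 - rng.random())
        if log_u < log_r:
            current = proposal  # accept the proposal
        samples.append(current)  # on rejection the previous value is repeated
    return samples


# Hypothetical run with the settings of part (a); x0 and seed are assumptions.
samples = random_walk_metropolis(n=10_000, s=1.0, x0=0.0, seed=2195)
print("sample mean:", statistics.fmean(samples))
print("sample standard deviation:", statistics.pstdev(samples))
```

The list `samples` can then be passed to any plotting library to draw the histogram, the kernel density estimate and the overlay of $f(x)$.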
(b) The operations in part 1(a) are based on the assumption that the algorithm has
converged. One of the most widely used convergence diagnostics is the so-called R
value. In order to obtain a value of this diagnostic, you need to apply the
procedure below:
• Generate more than one sequence of $x_0, \dots, x_N$, potentially using different
initial values $x_0$. Denote each of these sequences, also known as chains, by
$(x_0^{(j)}, x_1^{(j)}, \dots, x_N^{(j)})$ for $j = 1, 2, \dots, J$.
• Define and compute $M_j$ as the sample mean of chain $j$ as
$$M_j = \frac{1}{N} \sum_{i=1}^{N} x_i^{(j)},$$
and $V_j$ as the within sample variance of chain $j$ as
$$V_j = \frac{1}{N} \sum_{i=1}^{N} \left(x_i^{(j)} - M_j\right)^2.$$
• Define and compute the overall within sample variance $W$ as
$$W = \frac{1}{J} \sum_{j=1}^{J} V_j.$$
• Define and compute the overall sample mean $M$ as
$$M = \frac{1}{J} \sum_{j=1}^{J} M_j,$$
and the between sample variance $B$ as
$$B = \frac{1}{J} \sum_{j=1}^{J} (M_j - M)^2.$$
• Compute the R value as
$$R = \frac{B + W}{W}.$$
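As a small worked check of the definitions above, the following sketch computes $M_j$, $V_j$, $W$, $M$, $B$ and the R value for a list of chains. The function name `r_value` and the two toy chains are assumptions made purely for illustration, and the final ratio is taken exactly as it appears in the text.

```python
from typing import List, Sequence


def r_value(chains: Sequence[Sequence[float]]) -> float:
    """Compute the R value from J chains using the formulas in the text."""
    j = len(chains)
    # M_j: sample mean of each chain.
    means: List[float] = [sum(c) / len(c) for c in chains]
    # V_j: within sample variance of each chain (divisor N, as in the text).
    variances = [sum((x - m) ** 2 for x in c) / len(c)
                 for c, m in zip(chains, means)]
    w = sum(variances) / j                         # overall within sample variance W
    m_all = sum(means) / j                         # overall sample mean M
    b = sum((m - m_all) ** 2 for m in means) / j   # between sample variance B
    return (b + w) / w


# Toy chains (made up for the example, not simulation output).
toy_chains = [[0.0, 1.0, 2.0], [1.0, 2.0, 3.0]]
print(r_value(toy_chains))
```

For these toy chains, $M_1 = 1$, $M_2 = 2$, $V_1 = V_2 = 2/3$, so $W = 2/3$, $M = 1.5$, $B = 0.25$ and the ratio is $11/8 = 1.375$. Two identical chains would give $B = 0$ and a value of exactly 1.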
In general, values of R close to 1 indicate convergence, and it is usually desired for R
to be lower than 1.05. Calculate the R for the random walk Metropolis algorithm with
$N = 2000$, $s = 0.001$ and $J = 4$. Keeping $N$ and $J$ fixed, provide a plot of the values
of R over a grid of $s$ values in the interval between 0.001 and 1.
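One way to organise this experiment is sketched below, using only the Python standard library. Several choices here are assumptions rather than part of the task: the four starting values in `STARTS`, the 25 evenly spaced grid points, the seed, and the helper names `rwm_chain` and `r_value`. Plotting is omitted; the pairs in `s_grid` and `r_values` can be handed to any plotting library.

```python
import math
import random
import statistics

N, J = 2000, 4
STARTS = [-3.0, -1.0, 1.0, 3.0]  # assumed initial values, one per chain


def rwm_chain(n, s, x0, rng):
    """Random walk Metropolis chain x_1, ..., x_n for f(x) = 0.5 * exp(-|x|)."""
    chain, x = [], x0
    for _ in range(n):
        proposal = rng.gauss(x, s)
        # log f(x*) - log f(x) simplifies to |x| - |x*| for this density.
        if math.log(1.0 - rng.random()) < abs(x) - abs(proposal):
            x = proposal
        chain.append(x)
    return chain


def r_value(chains):
    """R value as defined in the text: (B + W) / W."""
    means = [statistics.fmean(c) for c in chains]
    # pvariance uses divisor N, matching the definitions of V_j and B.
    w = statistics.fmean([statistics.pvariance(c, mu=m)
                          for c, m in zip(chains, means)])
    b = statistics.pvariance(means)
    return (b + w) / w


rng = random.Random(2195)  # arbitrary seed so the run is repeatable
s_grid = [0.001 + k * (1.0 - 0.001) / 24 for k in range(25)]
r_values = []
for s in s_grid:
    chains = [rwm_chain(N, s, x0, rng) for x0 in STARTS[:J]]
    r_values.append(r_value(chains))

for s, r in zip(s_grid, r_values):
    print(f"s = {s:.3f}  R = {r:.4f}")
```

With well-separated starting values and a very small step size, the chains barely move within 2000 iterations, so the between sample variance dominates and R should be far above 1.05 at the lower end of the grid.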
Part 2
The 2009 ASA Statistical Computing and Graphics Data Expo consisted of flight arrival
and departure details for all commercial flights on major carriers within the USA from
October 1987 to April 2008. This is a large dataset; there are nearly 120 million records
in total, and it takes up 1.6 gigabytes of space when compressed and 12 gigabytes when
uncompressed. The complete dataset, along with supplementary information and variable
descriptions, can be downloaded from the Harvard Dataverse at
https://doi.org/10.7910/DVN/HG7NV7
Choose any subset of five consecutive years (e.g. 1995-1999 or 2004-2008) and use any of
the supplementary information provided by the Harvard Dataverse to answer the following
questions using the principles and tools you have learned in this course:
(a) What are the best times and days of the week to minimise delays each year?
(b) Evaluate whether older planes suffer more delays on a year-to-year basis.
(c) For each year, fit a logistic regression model for the probability of diverted US flights
using as many features as possible from attributes of the departure date, the scheduled
departure and arrival times, the coordinates and distance between departure
and planned arrival airports, and the carrier. Visualize the coefficients across years.
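As an illustration of how question (a) might be approached, the sketch below averages arrival delay per year by day of week and by scheduled departure hour, then picks the smallest average for each year. It rests on several assumptions: the column names `Year`, `DayOfWeek`, `CRSDepTime` and `ArrDelay` follow the Data Expo variable descriptions, mean arrival delay is only one possible definition of "best", and the rows in `TOY_CSV` are invented for the example rather than taken from the dataset.

```python
import csv
import io
from collections import defaultdict

# Invented rows for illustration only; real files have many more columns.
TOY_CSV = """Year,DayOfWeek,CRSDepTime,ArrDelay
2004,1,0630,5
2004,1,0645,-3
2004,1,1730,40
2004,5,1745,65
2004,5,0615,NA
2004,6,0900,2
2005,1,0630,10
2005,6,0915,-4
"""


def mean_delay_by(rows, key):
    """Mean ArrDelay per (year, key(row)), skipping missing delays."""
    totals = defaultdict(lambda: [0.0, 0])
    for row in rows:
        if row["ArrDelay"] in ("", "NA"):
            continue  # cancelled or diverted flights have no arrival delay
        group = (int(row["Year"]), key(row))
        totals[group][0] += float(row["ArrDelay"])
        totals[group][1] += 1
    return {g: total / count for g, (total, count) in totals.items()}


def best_per_year(means):
    """For each year, the group with the smallest mean delay."""
    best = {}
    for (year, group), value in means.items():
        if year not in best or value < best[year][1]:
            best[year] = (group, value)
    return best


rows = list(csv.DictReader(io.StringIO(TOY_CSV)))
by_day = mean_delay_by(rows, lambda r: int(r["DayOfWeek"]))
# CRSDepTime is stored as hhmm, so integer division by 100 gives the hour.
by_hour = mean_delay_by(rows, lambda r: int(r["CRSDepTime"]) // 100)

print(best_per_year(by_day))
print(best_per_year(by_hour))
```

On the full dataset the same grouping would normally be done with a database query or a data-frame library rather than a Python loop; the loop is used here only to keep the example dependency-free.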
General Instructions
• All questions should be answered using R and Python for all tasks.
• Your answers should be provided in a separate structured report of no more than 2
pages for part 1, and no more than 6 pages for part 2. The page limit excludes title,
references and table of contents but includes graphics and tables. The report should be
in PDF format and also contain adequate explanations for readers not familiar with
programming. In addition to the report, you will also be asked to provide your R and
Python code in RMarkdown and Jupyter notebooks, respectively. All the relevant files
must be submitted in the designated Canvas or VLE submission portal.
• For part 2, each report should detail all steps you took starting from raw data up to the
answer for each question. Any databases you set up, data wrangling/cleaning operations
you carry out, and any modelling decisions you make should be clearly described in each
structured report. Each report should also include any relevant graphics and tables as part
of the answer.
• If you are using elements (e.g. code, databases, graphics, etc) from your answer to a
previous question to answer the current one, you will need to refer to those elements.
• You should also supply the code you used to answer each question, in a way that can be
used by someone else to replicate your analyses. You can do this either as separate
scripts or separate RMarkdown/Jupyter notebooks per question, clearly indicating (both
with comments and in the filename) which question each script refers to.