Estimation of conditional density distributions

Assume we have temperature data for a given location and we want to predict today’s temperature at that location using yesterday’s temperature. More generally, the problem discussed in this blog post can be stated as “How to estimate the conditional density of the predicted variable given a value of the conditioning covariate?”

One way to answer this question is to compute a family of regression quantiles and use them to estimate the Cumulative Distribution Function (CDF) for a given value of the covariate. In other words, given data pairs $\{(x_i, y_i)\}_{i=1}^{n}$ and a value $x_0$, we find $\mathrm{CDF}_{x_0}(y)$. From the estimated CDF we can estimate the Probability Density Function (PDF). We can go further and use Monte Carlo-type simulations with the obtained PDFs (this will be discussed in another post).

The experiments in this blog post follow the example in the sub-subsection “Daily Melbourne Temperatures” of the book “Quantile regression” by Roger Koenker.

Consider the temperature time series for Atlanta, GA, USA from 2006.01.01 to 2014.01.13:

location = {"Atlanta", "GA"};
tempData = WeatherData[location, "Temperature", {{2006, 1, 1}, {2014, 1, 12}, "Day"}];
tempData // Length

[Plot: Atlanta temperature time series 2006-2014]

We can see that this time series is heteroscedastic: the range of temperatures is wider in the winter months.

Using the time series data, let us make pairs of yesterday's and today's temperatures:

tempPData = Partition[tempData[[All, 2]], 2, 1];

(We pair up two consecutive temperatures.)
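
As a small illustration of the pairing, here is a sketch with made-up values (the name toyTemps is hypothetical and the numbers are not WeatherData output). Partition with block length 2 and offset 1 produces overlapping pairs, so every value except the first and the last appears once as "today" and once as "yesterday":

```mathematica
(* Toy values for illustration only; not real temperature data *)
toyTemps = {10., 12., 9., 11.};

(* Block length 2, offset 1: consecutive, overlapping pairs *)
Partition[toyTemps, 2, 1]
(* {{10., 12.}, {12., 9.}, {9., 11.}} *)
```

Each pair is {yesterday, today}, which is the {covariate, predicted variable} form used for the regression quantile fit.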

[Plot: Atlanta yesterday-vs-today temperatures]

Let us calculate the regression quantiles for a set of probabilities. (The package QuantileRegression.m used here is discussed in previous posts of this blog, and it can be downloaded from MathematicaForPrediction at GitHub.)

The considered set of quantiles is qs = {0.02, 0.05, 0.1, …, 0.95, 0.98}:

In[397]:= qs = N@Join[{0.02}, FindDivisions[{0, 1}, 20][[2 ;; -2]], {0.98}]
qs // Length

Out[397]= {0.02, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 0.98}
Out[398]= 21

Next we calculate the regression quantiles using 3rd order B-spline basis functions over 5 knots.

In[399]:= AbsoluteTiming[qFuncs = QuantileRegression[tempPData, 5, qs];]

[Plot: Atlanta regression quantiles for yesterday-vs-today temperatures]
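
For reference, a regression quantile for a probability $q$ is conventionally defined as a minimizer of the check (pinball) loss. This is the standard formulation described in Koenker's book; it is stated here as background, not as a description of the internals of QuantileRegression.m. With $f$ ranging over the chosen function family (here, linear combinations of B-spline basis functions):

$$
\hat{f}_q = \arg\min_{f} \sum_{i=1}^{n} \rho_q\bigl(y_i - f(x_i)\bigr),
\qquad
\rho_q(u) = u\,\bigl(q - \mathbb{1}[u < 0]\bigr).
$$

Roughly a fraction $q$ of the data points lie below the curve $\hat{f}_q$, which is what allows the family of curves to be read as points of a conditional CDF.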

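Given a temperature value, say t0 = 2°C, we can estimate CDF(t) from the quantile probabilities and the values at t0 of the corresponding regression quantile functions: the value at t0 of the regression quantile for probability q estimates the temperature that today's temperature stays below with probability q.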

t0 = 2;
xs = Through[qFuncs[t0]];
cdfPairs = Transpose[{xs, qs}];

We can get a first order (linear) approximation of the CDF by simply connecting the consecutive points {xs[[i]], qs[[i]]}, 1 <= i <= |qs|, as shown in the following plot.

[Plot: Atlanta estimated CDF for 2°C]

On the plot, the dashed vertical grid lines mark the quantiles for the probabilities {0.05, 0.25, 0.5, 0.75, 0.95}; the solid gray vertical grid line marks t0.
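
Written out, the linear approximation has a simple closed form. The notation is introduced here only for illustration: $f_j$ denotes the fitted regression quantile function for the probability $q_j$, and the values $f_j(t_0)$ are assumed to be strictly increasing in $j$ (that is, the fitted quantile curves do not cross at $t_0$). Between two consecutive points,

$$
\widehat{\mathrm{CDF}}_{t_0}(t) = q_j + (q_{j+1} - q_j)\,\frac{t - f_j(t_0)}{f_{j+1}(t_0) - f_j(t_0)},
\qquad f_j(t_0) \le t \le f_{j+1}(t_0),
$$

and differentiating with respect to $t$ gives a piecewise-constant density estimate:

$$
\widehat{\mathrm{PDF}}_{t_0}(t) = \frac{q_{j+1} - q_j}{f_{j+1}(t_0) - f_j(t_0)},
\qquad f_j(t_0) < t < f_{j+1}(t_0).
$$

Both estimates are defined only between the values at $t_0$ of the lowest and highest regression quantiles (probabilities 0.02 and 0.98 here).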

Let us define functions for the CDF approximation and for plotting the CDF and PDF estimates:

(* Linear-interpolation CDF estimate at t0; the one-argument form uses the global qs and qFuncs *)
CDFEstimate[t0_] := CDFEstimate[qs, qFuncs, t0];
CDFEstimate[qs_, qFuncs_, t0_] :=
  Interpolation[Transpose[{Through[qFuncs[t0]], qs}], InterpolationOrder -> 1];

(* Plot the CDF estimate and its derivative (the PDF estimate) *)
CDFPDFPlot[t0_?NumberQ, qCDFInt_InterpolatingFunction,
   qs : {_?NumericQ ..} : qs,
   opts : OptionsPattern[]] :=
  Block[{qsGL = {0.05, 0.25, 0.5, 0.75, 0.95}, xsGL},
   (* x-positions of the quantiles that get dashed grid lines *)
   xsGL = Pick[Flatten@qCDFInt["Grid"], MemberQ[qsGL, #] & /@ qs];
   Plot[{qCDFInt[x], qCDFInt'[x]},
    {x, qCDFInt["Domain"][[1, 1]], qCDFInt["Domain"][[1, 2]]},
    GridLines -> {Prepend[
       MapThread[{#2, {Dashed, Blend[{Blue, Green, Pink}, #1]}} &, {qsGL, xsGL}],
       {t0, GrayLevel[0.6]}], None},
    Axes -> False, Frame -> True,
    PlotLabel -> "Estimated CDF and PDF for " <> ToString[t0] <> "\[Degree]C", opts]
  ]
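
The construction used in CDFEstimate can be tried without the QuantileRegression.m package. In the minimal sketch below the names toyQs, toyFuncs, and toyCDF are hypothetical, and the "regression quantiles" are not fitted to any data: they are simply the quantiles of a normal distribution with standard deviation 3, shifted by the covariate. That assumption is made only so that the result can be checked by hand.

```mathematica
(* Hypothetical stand-ins for qs and qFuncs; not fitted to any data *)
toyQs = {0.05, 0.25, 0.5, 0.75, 0.95};
toyFuncs = Table[
   With[{shift = InverseCDF[NormalDistribution[0, 3], q]},
    Function[x, x + shift]],
   {q, toyQs}];

(* Same construction as CDFEstimate: points {quantile value at t0, probability} *)
toyCDF = Interpolation[Transpose[{Through[toyFuncs[2]], toyQs}],
   InterpolationOrder -> 1];

toyCDF[2]   (* the toy median at t0 = 2 is 2, so this evaluates to 0.5 *)
toyCDF'[3]  (* piecewise-constant PDF estimate between the 0.5 and 0.75 quantiles *)
```

Because the interpolation order is 1, the derivative is constant between consecutive quantile values, which is why the PDF estimates in the plots look like step functions.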

Using CDFEstimate and CDFPDFPlot, let us plot the estimated CDFs and PDFs for a collection of temperatures. Note that the quantiles are marked with dashed vertical grid lines; the median is colored green. The values of the conditioning covariate are given in the plot labels and are marked with a solid gray vertical line.

[Plot: Atlanta estimated CDFs and PDFs]

In the grid of CDF-and-PDF plots we can see that:
1. when yesterday’s temperature is low, there is a higher probability that today’s temperature will be higher than yesterday’s,
2. when yesterday’s temperature t0 is within the range [10, 25]°C, the median almost coincides with t0,
3. for medium and high temperatures the distributions resemble skew normal distributions.

As I mentioned at the beginning of the post, I followed the exposition for Melbourne’s temperatures in Koenker’s book. I did the calculations described above with temperature time series data for Melbourne. Here is the plot with the fitted regression quantiles:

[Plot: Melbourne regression quantiles for yesterday-vs-today temperatures]

Here is the grid of plots over a collection of temperatures. Note that they lead to different conclusions than the ones for Atlanta, GA, USA.

[Plot: Melbourne estimated CDFs and PDFs]

