# Estimation of conditional density distributions

Assume we have temperature data for a given location and we want to predict today’s temperature at that location using yesterday’s temperature. More generally, the problem discussed in this blog post can be stated as “How to estimate the conditional density of the predicted variable given a value of the conditioning covariate?”

One way to answer this question is to compute a family of regression quantiles and use them to estimate the Cumulative Distribution Function (CDF) for a given value of the covariate. In other words, given data pairs $\{(x_i, y_i)\}_{i=1}^{n}$ and a value $x_0$, we find $\mathrm{CDF}_{x_0}(y)$. From the estimated CDF we can estimate the Probability Density Function (PDF). We can go further and run Monte Carlo simulations with the obtained PDFs (this will be discussed in another post).
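To make the step from regression quantiles to a CDF and a PDF concrete, here is a sketch in notation that is mine, not the book's or the package's: $Q_p(x)$ denotes the fitted regression quantile function for probability $p$, and $p_1 < \dots < p_m$ are the chosen probabilities. Assuming the quantile curves do not cross at the covariate value $x_0$, and that the CDF is approximated linearly between consecutive quantile values, we get:

```latex
\widehat{\mathrm{CDF}}_{x_0}\bigl(Q_{p_k}(x_0)\bigr) = p_k, \qquad k = 1, \dots, m,

\widehat{\mathrm{PDF}}_{x_0}(y) \approx
  \frac{p_{k+1} - p_k}{Q_{p_{k+1}}(x_0) - Q_{p_k}(x_0)}
  \qquad \text{for } Q_{p_k}(x_0) < y < Q_{p_{k+1}}(x_0).
```

Under these assumptions the PDF estimate is piecewise constant, and it is defined only between the lowest and the highest quantile values.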

The experiments in this blog post follow the example sub-sub-section “Daily Melbourne Temperatures” of the book “Quantile regression” by Roger Koenker.

Consider the temperature time series of Atlanta, GA, USA from 2006.01.01 to 2014.01.12:
```
(* Daily temperatures for Atlanta, GA *)
location = {"Atlanta", "GA"};
tempData = WeatherData[location, "Temperature", {{2006, 1, 1}, {2014, 1, 12}, "Day"}];
tempData // Length
```

We can see that this time series is heteroscedastic — the range of temperatures is wider in the winter months.

Using the time series data, let us make pairs of yesterday's and today's temperatures:
``` tempPData = Partition[tempData[[All, 2]], 2, 1]; ```
(We pair up two consecutive temperatures.)
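As a small illustration of what the pairing does, here is the same `Partition` call on a toy list. The values in `temps` are made up for the example and are not taken from the Atlanta data.

```mathematica
(* Toy temperatures, made up for illustration *)
temps = {10, 12, 11, 15};

(* Sublists of length 2 with offset 1: each pair is {yesterday, today} *)
pairs = Partition[temps, 2, 1]
(* {{10, 12}, {12, 11}, {11, 15}} *)
```

A list of n temperatures gives n - 1 pairs, and each interior value appears once as "today" and once as "yesterday".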

Let us calculate the regression quantiles for a number of quantiles. (The package QuantileRegression.m used here is discussed in previous posts of this blog, and it can be downloaded from MathematicaForPrediction at GitHub.)

`Get["~/MathFiles/MathematicaForPrediction/QuantileRegression.m"]`

The considered set of quantiles is qs = {0.02, 0.05, 0.1, …, 0.95, 0.98}:
```
qs = N@Join[{0.02}, FindDivisions[{0, 1}, 20][[2 ;; -2]], {0.98}]
qs // Length
```

```
{0.02, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 0.98}
21
```

Next we calculate the regression quantiles using 3rd order B-spline basis functions over 5 knots.
```
(* One regression quantile function per element of qs *)
AbsoluteTiming[qFuncs = QuantileRegression[tempPData, 5, qs];]
```

Given a temperature value, say t0=2°C, we can estimate CDF(t) using the quantiles and the values at t0 of the corresponding regression quantile functions.
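```
t0 = 2;
(* Values of all regression quantile functions at t0 *)
xs = Through[qFuncs[t0]];
(* Points {temperature, probability} of the estimated CDF *)
cdfPairs = Transpose[{xs, qs}];
```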

We can get a first order (linear) approximation of the CDF by simply connecting the consecutive points {xs[i], qs[i]}, 1 <= i <= |qs|, as shown in the following plot.

On the plot, the dashed vertical grid lines mark the quantiles {0.05, 0.25, 0.5, 0.75, 0.95}; the solid gray vertical grid line marks t0.
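The linear approximation can be tried out on a toy example that does not need the weather data or the package. The values in `toyXs` and `toyQs` below are hypothetical, chosen only to show how a piecewise-linear CDF and its derivative behave.

```mathematica
(* Hypothetical quantile values at some t0; made up, not from the Atlanta fit *)
toyQs = {0.05, 0.25, 0.5, 0.75, 0.95};
toyXs = {-1., 2., 4., 7., 12.};

(* Piecewise-linear CDF estimate through the points {x, q} *)
toyCDF = Interpolation[Transpose[{toyXs, toyQs}], InterpolationOrder -> 1];

(* At the quantile values the CDF returns the probabilities themselves *)
toyCDF /@ toyXs

(* Estimated probability that the value falls between 2 and 7 *)
toyCDF[7.] - toyCDF[2.]

(* The PDF estimate is the slope of the linear segment containing the point:
   here (0.5 - 0.25)/(4 - 2) = 0.125 *)
toyCDF'[3.]
```

Note that such an estimate is defined only between the lowest and the highest quantile value; evaluating it outside that interval makes `Interpolation` extrapolate, which should not be trusted.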

Let us define functions for the CDF approximation and for plotting the CDF and PDF approximations:
```
(* Piecewise-linear CDF estimate at t0 from the regression quantile functions *)
CDFEstimate[t0_] := CDFEstimate[qs, qFuncs, t0];
CDFEstimate[qs_, qFuncs_, t0_] :=
  Interpolation[Transpose[{Through[qFuncs[t0]], qs}], InterpolationOrder -> 1];
```

```
(* Plot the estimated CDF and its derivative (the PDF estimate) with grid lines
   at selected quantiles (dashed) and at t0 (solid gray) *)
CDFPDFPlot[t0_?NumberQ, qCDFInt_InterpolatingFunction, qs : {_?NumericQ ..} : qs, opts : OptionsPattern[]] :=
  Block[{qsGL = {0.05, 0.25, 0.5, 0.75, 0.95}, xsGL},
   (* Abscissas of the interpolation grid that correspond to the quantiles in qsGL *)
   xsGL = Pick[Flatten@qCDFInt["Grid"], MemberQ[qsGL, #] & /@ qs];
   Plot[{qCDFInt[x], qCDFInt'[x]},
    {x, qCDFInt["Domain"][[1, 1]], qCDFInt["Domain"][[1, 2]]},
    GridLines -> {Prepend[
       MapThread[{#2, {Dashed, Blend[{Blue, Green, Pink}, #1]}} &, {qsGL, xsGL}],
       {t0, GrayLevel[0.6]}], None},
    Axes -> False, Frame -> True,
    PlotLabel -> "Estimated CDF and PDF for " <> ToString[t0] <> "\[Degree]C",
    opts]
  ];
```

Using these definitions, let us plot the estimated CDFs and PDFs for a collection of temperatures. Note that the quantiles are marked with dashed vertical grid lines; the median is colored green. The values of the conditioning covariate are given in the plot labels and are marked with a solid gray vertical line.

In the grid of CDF-and-PDF plots we can see that:
1. the lower yesterday's temperature is, the higher the probability that the next day will be warmer,
2. when yesterday's temperature t0 is within the range [10, 25]°C, the median almost coincides with t0,
3. for medium and high temperatures we have distributions similar to the skew normal distribution.

As I mentioned at the beginning of the post, I followed the exposition for Melbourne's temperatures in Koenker's book. I also did the calculations described above with temperature time series data for Melbourne. Here is the plot with the fitted regression quantiles:

Here is the grid of plots over a collection of temperatures. Note that they lead to different conclusions than the ones for Atlanta, GA, USA.