Notes ggplot

============================ GGPlot Architecture & Design

The basic structure of the GGPlot package is that there is a data definition and basic visual parameter configuration (aka a "template") which can be attached to a "ggplot" object. Then various geometry specifications are added to it, like geom_point and geom_line, and they inherit the definition from their upstream pipeline node, and potentially override any parameter with arguments they were given.

Data parameters and visual parameters are both encapsulated in the "aesthetic" (which makes sense in kind of a bizarre way), and are injected into the pipeline by passing the return value of a call to aes() as an argument into either the ggplot() or geom_*() functions.

It should be noted that different kinds of geoms can have different aesthetic parameters. Aesthetic parameters that don't apply to a given geom are just ignored. The aesthetic can be thought of as a stylesheet in which each geometry simply looks up the visual attributes it needs.
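The inherit/override/ignore behaviour described above can be sketched in a few lines of plain Python. This is only a toy model, not ggplot2's implementation: the SUPPORTED table and the resolve_aes() helper are made up for illustration.

```python
# Toy model of aesthetic inheritance; not ggplot2's actual implementation.

# Hypothetical table of which aesthetics each geom understands.
SUPPORTED = {
    "point": {"x", "y", "color", "size", "shape"},
    "line": {"x", "y", "color", "linetype"},
}


def resolve_aes(plot_aes, geom, geom_aes=None):
    """Merge plot-level and geom-level aesthetics for one geom."""
    merged = dict(plot_aes)          # inherit from the upstream node
    merged.update(geom_aes or {})    # geom-level arguments override
    # Aesthetics the geom does not understand are silently dropped.
    return {k: v for k, v in merged.items() if k in SUPPORTED[geom]}


plot_aes = {"x": "displ", "y": "hwy", "color": "cyl", "shape": "drv"}
point_aes = resolve_aes(plot_aes, "point", {"color": "class"})
line_aes = resolve_aes(plot_aes, "line")
```

Here the point layer overrides color but keeps shape, while the line layer inherits color and quietly loses shape, which it has no use for.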

Useful information can be found in: http://www.ling.upenn.edu/~joseff/rstudy/summer2010_ggplot2_intro.html

Aesthetics

Whereas facets are data-driven layout, the Aesthetic specification system in ggplot is a data-driven stylesheet mechanism. Renderers that reference the aesthetic will use its visual parameters. Additionally, by assigning categorical data nodes out of the datagraph to visual parameters, the Aesthetic hierarchy further expands the number of unique renderer configurations that appear in the plot, just as giving a categorical data node to facet_wrap causes a multiplicity of renderers to be invoked.

The same group= argument for aes() doesn't apply to facet() because whereas there is an obvious way to visually merge two aesthetically different renderers/geoms in the same panel, there is not a similar kind of guidance on how two panels should always line up. Hence, facet_grid and facet_wrap hardcode particular kinds of inter-panel alignment.

Recognizing that the role of aes() is twofold, we can tackle one of its roles, the renderer generation part, by using list comprehensions or some other vectorial expression. This same vectorial expression/expansion mechanism can then be used to specify a multiplicity of panels, in a way that is more flexible than just facet_grid and facet_wrap.

Plots convey information through various aspects of their aesthetics. Some aesthetics that plots use are: x position, y position, size of elements, shape of elements, and color of elements.

The elements in a plot are geometric shapes, like points, lines, line segments, bars, and text.

Some of these geometries have their own particular aesthetics. For instance: points have point shape and point size; lines have line type and line weight; bars have y minimum, y maximum, fill color, and outline color; text has a label value.

There are other basics of these graphics that you can adjust, like the scaling of the aesthetics, and the positions of the geometries.

Statistics

The values represented in the plot are the product of various statistics. If you just plot the raw data, you can think of each point representing the identity statistic. Many bar charts represent the mean or median statistic. Histograms are bar charts where the bars represent the binned count or density statistic.

To add statistical annotations to a plot, you use the stat_*() functions. These process the input data and then render the result with a default geom, which can be changed.
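As a rough illustration of "a plot shows a statistic of the data", here are Python stand-ins for the three statistics mentioned above. The function names only echo the ggplot2 naming; they are not the ggplot2 functions, and the numbers are made up.

```python
from collections import Counter
from statistics import mean


def stat_identity(values):
    """Raw data: every value is plotted as-is."""
    return list(values)


def stat_mean(groups):
    """One bar per group, with height equal to the group mean."""
    return {name: mean(vals) for name, vals in groups.items()}


def stat_bin(values, binwidth):
    """Histogram: count the values falling in each bin [k*w, (k+1)*w)."""
    counts = Counter(int(v // binwidth) for v in values)
    return {k * binwidth: counts[k] for k in sorted(counts)}


hwy = [12, 17, 18, 24, 26, 26, 29, 33]
bins = stat_bin(hwy, binwidth=10)      # keys are the left edges of the bins
means = stat_mean({"4": [26, 29, 32], "8": [12, 17, 19]})
```

Whatever geom draws the result only ever sees the output of the statistic, which is why the default geom can be swapped for another.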

Grouping

If a categorical factor is used for any of the aes() parameters, then the aes node is tagged as having a grouping, and that grouping is used by all downstream geoms and stats to vectorize over the appropriate subsetting of the source data.

Different aesthetic parameters can each have their own grouping; in that case, groups are defined as unique combinations of each set of factors.

A grouping can also be explicitly created with the group= parameter for aes(). In this case, no visual attributes are being specified, but the geoms and stats are being vectorially created over some sharding of the dataset.
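The grouping rule can be sketched as a partition of the rows by a tuple of factor values. This is an assumption-laden toy in plain Python: split_groups() is a hypothetical helper and the rows are invented.

```python
from collections import defaultdict


def split_groups(rows, factors):
    """Partition rows by the unique combination of the given factor columns."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[f] for f in factors)
        groups[key].append(row)
    return dict(groups)


rows = [
    {"cyl": 4, "drv": "f", "hwy": 29},
    {"cyl": 4, "drv": "f", "hwy": 31},
    {"cyl": 6, "drv": "f", "hwy": 26},
    {"cyl": 6, "drv": "r", "hwy": 24},
]

# color=cyl and shape=drv each imply a grouping; together they
# define one group per observed (cyl, drv) combination.
by_aes = split_groups(rows, ["cyl", "drv"])

# An explicit group= only shards the data; it sets no visual attribute.
by_group = split_groups(rows, ["drv"])
```

Each geom or stat would then be run once per key, which is the "vectorizing over the subsets" described above.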

Faceting

Two faceting functions just control layout: facet_wrap() and facet_grid(). facet_wrap() is a one-factor form that uses the given factor to produce a 2D wrapping of a 1D array of plots. A 2D grid of plots can be created using facet_grid(factorA ~ factorB), with factorA laid out along the rows and factorB along the columns.

They will share axes by default; to free this restriction, just pass in the 'scales="free"' parameter. If you only want one or the other axis to be free, then use "free_y" or "free_x".
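The two layout schemes above boil down to simple index arithmetic. A minimal sketch in plain Python, with hypothetical helper names (wrap_layout, grid_layout) and example factor levels chosen only for illustration:

```python
import math


def wrap_layout(levels, ncol):
    """Place a 1D sequence of panels row by row into a grid of ncol columns."""
    return {lvl: divmod(i, ncol) for i, lvl in enumerate(levels)}


def grid_layout(row_levels, col_levels):
    """One panel per (row level, column level) pair."""
    return {
        (r, c): (i, j)
        for i, r in enumerate(row_levels)
        for j, c in enumerate(col_levels)
    }


classes = ["2seater", "compact", "midsize", "minivan", "pickup"]
wrapped = wrap_layout(classes, ncol=3)     # panel -> (row, column)
n_rows = math.ceil(len(classes) / 3)       # the last row may be partly empty
grid = grid_layout(["4", "f", "r"], [4, 6])
```

The wrap form has no meaning attached to rows or columns, whereas in the grid form each row and each column corresponds to one factor level.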

Scales

scale_*() functions share a common set of arguments: name; limits (min, max); breaks (labeled breaks for the data); labels (labels for the breaks); trans (transformations to use on the data).
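To make the roles of limits and trans concrete, here is a minimal sketch of a continuous scale in plain Python. make_scale() is a hypothetical helper, not part of ggplot2; it maps a data value to a position between 0 and 1 along the axis.

```python
import math


def make_scale(limits, trans=None):
    """Return a function mapping data values to positions in [0, 1]."""
    f = trans or (lambda v: v)                 # identity if no transformation
    lo, hi = f(limits[0]), f(limits[1])        # limits are transformed too

    def scale(v):
        return (f(v) - lo) / (hi - lo)

    return scale


linear = make_scale((1, 100))
log_scale = make_scale((1, 100), trans=math.log10)

halfway_linear = linear(50.5)     # middle of the axis on a linear scale
halfway_log = log_scale(10)       # 10 sits mid-axis between 1 and 100 on a log scale
```

Breaks and labels would then just be a list of data values passed through the same function, each paired with the text to print at that position.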

Scale functions are formatted as: scale_AESTHETIC-NAME_SCALE-NAME()

So:

p <- ggplot(mpg, aes(displ, hwy)) + geom_point()

p + scale_x_continuous(name = "Engine Displacement in Liters")
#or
p + xlab("Engine Displacement in Liters")

p + scale_x_continuous(limits = c(2,4))
#or
p + xlim(2, 4)

p + scale_x_continuous(trans = "log10")
#or
p + scale_x_log10()

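It's important to note that functions like xlim() actually throw away data: observations outside the limits are dropped before any statistics are computed. To zoom in without discarding data, use coord_cartesian(xlim = c(2, 4)) instead.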
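Why this matters is easiest to see with a statistic. The sketch below, in plain Python with invented numbers and hypothetical helper names, compares a mean computed after limits have discarded rows with a mean computed on everything and merely viewed through a smaller window.

```python
from statistics import mean

displ = [1.8, 2.0, 2.8, 3.1, 4.2, 5.3]
hwy = [29, 31, 26, 27, 23, 20]


def mean_with_limits(xs, ys, lo, hi):
    """Scale limits: rows outside [lo, hi] are discarded before the stat."""
    kept = [y for x, y in zip(xs, ys) if lo <= x <= hi]
    return mean(kept)


def mean_with_zoom(xs, ys, lo, hi):
    """Zooming changes only the visible window; the stat sees all rows."""
    return mean(ys)


limited = mean_with_limits(displ, hwy, 2, 4)   # based on 3 of the 6 rows
zoomed = mean_with_zoom(displ, hwy, 2, 4)      # based on all 6 rows
```

The two results differ, so a smoother or summary drawn on a limited plot can look quite unlike the same layer drawn on the full data.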

For color and fill scales:

p <- p + aes(color = factor(cyl))
p + scale_color_hue(name = "Cylinders")
    
scale_color_brewer(palette = "palette_name")
scale_fill_gradient(high="colorname", low="colorname")

Additional notes

These are random notes from working out how ggplot is put together (and learning a bit more concretely about the Grammar of Graphics (GG) itself).

The geom_abline example shows that if the data supplied to a geom function is a vector where the geom itself needs only a single value (for example, several intercepts for one line), the geom is broadcast across it, drawing one instance per element. I wonder how R is implementing this under the hood, or if it's some special-case code.
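One way such broadcasting could work is sketched below in plain Python. It says nothing about what ggplot2 actually does internally; broadcast() is a hypothetical helper that expands scalar parameters to match the vector ones and yields one parameter set per line to draw.

```python
def broadcast(**params):
    """Expand scalar parameters so every parameter has the same length."""
    lengths = {len(v) for v in params.values() if isinstance(v, (list, tuple))}
    if len(lengths) > 1:
        raise ValueError("vector parameters must share one length")
    n = lengths.pop() if lengths else 1
    columns = {
        name: list(v) if isinstance(v, (list, tuple)) else [v] * n
        for name, v in params.items()
    }
    # One dict of parameters per geom instance to draw.
    return [dict(zip(columns, vals)) for vals in zip(*columns.values())]


lines = broadcast(intercept=[10, 20, 30], slope=-1)
single = broadcast(intercept=5, slope=2)
```

With three intercepts and one slope this yields three parallel lines; with only scalars it degenerates to the ordinary single-line case.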

Some functions have parameters that control the widget's appearance, and which can oftentimes accept any aesthetic.

Data(mydata):
PanelGrid:
  cols: f1   # These are attributes of the PanelGrid itself
  rows: f2

  # These aesthetic parameters are templates for the geoms contained
  # within the PanelGrid.  There is a separate instance of each geom
  # for each of the factors in each of the aesthetic parameters below.
  # When multiple parameters are faceted like this, there is a cartesian
  # product of possibilities created.
  color: f3
  size: f4

  # To avoid the Cartesian product/combinatoric explosion of different
  # factors, use the set algebra operators:
  color: f3 + f4
  size: f3 + f4

  # Alternatively:
  color, size, linedash: f3 / f4

  ## It is also possible to define an explicit named grouping 

  # Basic plot:
  Dot:
    x: m1
    y: m2

  # Inheritance from other plots, via naming
  scatter = Dot(x=m1, y=m2)
  Wedge(scatter):   # inherits x,y from scatter
    r1 = m3
    r2 = m4
    color = m5

  # Composition - referencing other instances of a data series
  Area:
    left: m1
    right: m1+10
    
    bottom: m2+parent(PanelGrid).index

  # Related panels - walk up the containment hierarchy

with bokeh.Data(mydata):
    grid = bokeh.facet_grid(cols=factor1, rows=factor2)
    grid.style(color=factor3, size=factor4)

    # Alternatives:
    # grid.style(bokeh.style(color=factor3+factor4))
    # grid.style(bokeh.style(color=bokeh.blend(f3, f4)))

    d = bokeh.dot(x=m1, y=m2)
    a = bokeh.area(left=d.x, top=d.y)
    grid.add(d, a)

    # Alternatively:
    d = grid.dot(x=m1, y=m2)
    grid.area(left=d.x, top=d.y)
        

# Alternatively, without using withhacks:
scene = bokeh.scene(mydata)
grid = scene.facet_grid(cols="factor1", rows="factor2")

Creating a dot plot and applying it to a rendering site with greater dimensionality will automatically vectorize it over the outer dimensions:

scene = bokeh.scene(mydata)

# Assuming mydata has fields f1, f2, m1, m2, then the following
# code would produce a single scatter plot with all the m1 and m2
# values from all f1 and f2 values:
scene.add(bokeh.dot(x="m1", y="m2"))

# The following would create several colored scatter plots in the
# same panel, with colors corresponding to the factors of f1
d = bokeh.dot(x="m1", y="m2")
scene.style(color="f1")
scene.add(d)

Since so much of the code operates on objects that are representing the scene graph, there is an explicit command to trigger rendering of the scene:

scene.render()

This causes the scene graph to be processed, new viewmodels to be created, and actual data to be loaded and rendered.

For animated and interactive purposes, this sort of one-shot rendering is not the most efficient model, and it should instead be broken down into two different things:

scene.create_pipeline()
scene.update_pipeline()

The latter is meant to mostly deal with just the data updates in the system, and does not need to do some of the object creation that is in create_pipeline. It mainly pumps new data to the actual output renderers, and then requests a repaint from the host graphics/GUI system.
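The split between the two calls can be sketched with a toy class. Everything here is hypothetical: the Scene class is a stand-in for the scene object in these notes, not a real bokeh API, and update_pipeline() is given a new_data argument purely for illustration.

```python
class Scene:
    """Toy scene graph; a stand-in for the scene object in these notes."""

    def __init__(self, data):
        self.data = data
        self.renderers = None
        self.frames = []          # what each repaint would have drawn

    def create_pipeline(self):
        # Expensive, one-time work: build a renderer slot for each series.
        self.renderers = {name: [] for name in self.data}

    def update_pipeline(self, new_data):
        # Cheap, repeated work: push fresh data, then request a repaint.
        for name, values in new_data.items():
            self.renderers[name] = list(values)
        self.frames.append({k: list(v) for k, v in self.renderers.items()})

    def render(self):
        # One-shot rendering is just both phases back to back.
        self.create_pipeline()
        self.update_pipeline(self.data)


scene = Scene({"m1": [1, 2, 3]})
scene.create_pipeline()
built_once = scene.renderers
scene.update_pipeline({"m1": [2, 3, 4]})
scene.update_pipeline({"m1": [3, 4, 5]})
```

An animation loop would call create_pipeline() once and then update_pipeline() on every tick, so the renderer objects are reused rather than rebuilt for each frame.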
