Statistics, Julia package: check data for a Simpson's statistical paradox.

wherrera10/Simpsons.jl

Simpsons.jl

A Julia module to check data for Simpson's statistical paradox.

Usage

using Simpsons

has_simpsons_paradox(df, cause, effect, factor; continuous_threshold = 5, cmax = 5, verbose = true)

Returns true if the data in DataFrame df, when grouped by factor, exhibits Simpson's paradox. The cause and effect columns are converted to Int columns if they are not already numeric. A continuous factor column (one with continuous_threshold or more distinct levels) is grouped into at most cmax clusters to avoid producing too many groups. If verbose is true, prints the regression slope directions for the overall data and for each group.
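
The slope-reversal idea described above can be illustrated with a minimal sketch in plain Julia. This is not the package's implementation: ols_slope and slopes_reverse are hypothetical helper names, and the rule used here (every subgroup slope must have the opposite sign to the overall slope) is an assumption; the package may apply a different criterion.

```julia
# Least-squares slope of y on x (first-degree fit), using only Base.
function ols_slope(x::AbstractVector{<:Real}, y::AbstractVector{<:Real})
    mx = sum(x) / length(x)
    my = sum(y) / length(y)
    sxy = sum((x .- mx) .* (y .- my))
    sxx = sum((x .- mx) .^ 2)
    return sxy / sxx
end

# Hypothetical helper: true when every subgroup slope has the opposite
# sign to the slope fitted on the pooled data (an assumed criterion).
function slopes_reverse(x, y, groups)
    overall = sign(ols_slope(x, y))
    overall == 0 && return false
    for g in unique(groups)
        idx = findall(==(g), groups)
        sign(ols_slope(x[idx], y[idx])) == -overall || return false
    end
    return true
end

# Three subgroups, each with slope -1, whose centers rise from left to right.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
y = [9.0, 8.0, 7.0, 12.0, 11.0, 10.0, 15.0, 14.0, 13.0]
g = [1, 1, 1, 2, 2, 2, 3, 3, 3]

println("overall slope: ", ols_slope(x, y))
println("reversal in every subgroup: ", slopes_reverse(x, y, g))
```

In this toy data the pooled fit rises while each subgroup falls, which is the pattern the function above is documented to detect.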


simpsons_analysis(df, cause_column, effect_column; verbose = true, show_plots = true)

Analyzes the DataFrame df, assuming the cause is in cause_column and the effect is in effect_column. Outputs the results, including any Simpson's-paradox-type first-degree slope reversals found in subgroups. Plots are shown if show_plots is true (the default).


make_paradox(nsubgroups = 3, N = 1024)

Returns a dataframe containing N rows of random data in 3 columns, :x (cause), :y (effect), and :z (cofactor), that exhibits Simpson's paradox.
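
As a rough illustration of how such data can be constructed, the sketch below builds paradox-style data with only the Random standard library. The name toy_paradox, the group spacing, and the noise ranges are all assumptions made for illustration; this is not how make_paradox itself generates its data, and it returns a NamedTuple of vectors rather than a DataFrame.

```julia
using Random

# Hypothetical generator (not the package's make_paradox): subgroup centers
# move up and to the right, while y decreases with x inside each subgroup.
function toy_paradox(nsubgroups::Integer = 3, N::Integer = 1024;
                     rng = MersenneTwister(42))
    z = [mod1(i, nsubgroups) for i in 1:N]   # subgroup label for each row
    u = 4 .* rand(rng, N) .- 2               # within-group offset in [-2, 2)
    e = rand(rng, N) .- 0.5                  # bounded noise in [-0.5, 0.5)
    x = 10 .* z .+ u                         # cause: centered at 10 * label
    y = 10 .* z .- u .+ e                    # effect: falls as u grows
    return (x = x, y = y, z = z)
end

d = toy_paradox(3, 300)
println("rows: ", length(d.x), ", subgroups: ", sort(unique(d.z)))
```

Because the subgroup centers lie on a rising line while the within-group trend falls, pooling the rows tends to give a positive overall slope even though each subgroup slopes downward.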


plot_clusters(df, cause, effect)

Plots, as subplots, clusterings of the dataframe df, with cause plotted against effect and points color-coded by cluster. The clustering uses kmeans on all fields of the dataframe, with cluster counts from 2 to 5.


plot_kmeans_by_factor(df, cause_column, effect_column, factor_column)

Plots clusterings of the dataframe with cause on the X axis and effect on the Y axis, using kmeans on factor_column to form between 2 and 5 clusters.


find_clustering_elbow(dataarray::AbstractMatrix{<:Real}, cmin = 1, cmax = 5; fclust = kmeans, kwargs...)

Find the "elbow" of the totalcost versus cluster number curve, where cmin <= elbow <= cmax. Note that in pathological cases where the actual minimum of the totalcosts occurs at a cluster count less than that of the curve "elbow", the function will return either cmin or the actual cluster count at which the totalcost is at minimum, whichever is larger.
Returns a tuple: the cluster count and the ClusteringResult at the "elbow" optimum.
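
The selection rule described above can be sketched on a plain vector of total costs. The elbow heuristic used here (largest second difference of the cost curve) is an assumption chosen for illustration and may differ from what the package does; elbow_count is a hypothetical helper that returns only the cluster count, not a ClusteringResult.

```julia
# Hypothetical helper: costs[k] is the total cost for cmin + k - 1 clusters.
# The elbow is taken as the interior point with the largest second difference
# (an assumed heuristic). If the minimum cost occurs at a smaller cluster
# count than the elbow, that count (but no less than cmin) is returned.
function elbow_count(costs::AbstractVector{<:Real}, cmin::Integer = 1)
    n = length(costs)
    best = cmin + argmin(costs) - 1          # cluster count with minimum cost
    n < 3 && return best                     # too short to have an elbow
    d2 = [costs[i - 1] - 2 * costs[i] + costs[i + 1] for i in 2:(n - 1)]
    elbow = cmin + argmax(d2)                # d2[j] belongs to count cmin + j
    return best < elbow ? max(cmin, best) : elbow
end

println(elbow_count([100.0, 60.0, 20.0, 18.0, 17.0]))  # typical curve
println(elbow_count([5.0, 40.0, 10.0, 9.0, 8.0]))      # pathological curve
```

In the second call the minimum cost sits at a lower cluster count than the bend in the curve, so the fallback rule from the description applies.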


Examples

using Simpsons

# Create a dataframe with cause :x, effect :y, and cofactor :z columns
dfp = make_paradox()

# Test for a Simpson's paradox, where the regression direction of :x with :y 
#    reverses if the data is split by factor :z.
has_simpsons_paradox(dfp, :x, :y, :z)  # true with this data

# Analyze the data and plot its clusterings.
# To see the plots, run this in the REPL so the display does not close prematurely.
simpsons_analysis(dfp, :x, :y)



Installation

Install the package using the package manager (press ] at the julia> prompt to enter pkg> mode):

(v1) pkg> add Simpsons