
[Tracking] Improvements to measures #17

Open
ablaom opened this issue May 11, 2020 · 32 comments

Comments

@ablaom
Member

ablaom commented May 11, 2020

edit See this important issue

The measures part of MLJBase could do with some TLC. It is not the shiniest part of the MLJ code base, written in a bit of a hurry because nothing much could go forward without something in place, and the existing packages came up short.

I think the API is more-or-less fine, but the way things are implemented is less than ideal, leading to:

(i) code redundancy
(ii) less functionality: measures that could support weights or implement reports_each_observation don't

Recall that reports_each_observation=true for a measure means that m(v1, v2) returns a vector of measurements; otherwise a single scalar is returned. So it doesn't really make sense for auc, for example, to report each observation (which it doesn't). However, mae should (but doesn't).

I propose we make the following assumption that will allow us to resolve these issues for the majority of measures:

If a measure m(v1, v2) has reports_each_observation=true, then it is understood that it is the sum or mean value of some scalar version m(s1, s2).

For such measures, then, we need only implement the scalar method m(s1, s2) and we can generate the other methods m(v1, v2), m(v1, v2, w) automatically.

For other measures, such as auc and the rms family, m(v1, v2) (and optionally m(v1, v2, w)) must be explicitly implemented, as at present.

In addition to the docs, there is a lot about the measure design in this discussion.

Details

To "automatically generate" the extra methods, we could do something like this:

# fallbacks for measures: vector and weighted methods generated from the scalar method
(m::Measure)(yhat::AbstractVector, y::AbstractVector) = _eval(Val(reports_each_observation(m)), m, yhat, y)
(m::Measure)(yhat::AbstractVector, y::AbstractVector, w) = _eval(Val(reports_each_observation(m)), m, yhat, y, w)

# measures not reporting each observation must implement these methods explicitly, as at present:
_eval(::Val{false}, m, args...) = error("$(typeof(m)) must implement its own vector method(s)")

# otherwise, broadcast the scalar method over the data and aggregate:
_eval(::Val{true}, m, yhat, y) = broadcast(m, yhat, y) |> aggregation(m)
_eval(::Val{true}, m, yhat, y, w) = (w .* broadcast(m, yhat, y)) |> aggregation(m)

# measures reporting each observation automatically support weights, via the fallback above:
supports_weights(m::Measure) = _sw(Val(reports_each_observation(m)), m)
_sw(::Val{false}, m) = false
_sw(::Val{true}, m) = true
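
For illustration, here is what a scalar-only measure would then look like (the MeanAbsoluteError type here is made up; Measure, reports_each_observation and aggregation are assumed to be as in the snippet above):

using Statistics: mean

struct MeanAbsoluteError <: Measure end
reports_each_observation(::MeanAbsoluteError) = true
aggregation(::MeanAbsoluteError) = mean

# only the scalar method needs implementing:
(::MeanAbsoluteError)(s1::Real, s2::Real) = abs(s1 - s2)

mae = MeanAbsoluteError()
mae(0.5, 1.0)                        # scalar call: 0.5
mae([1.0, 2.0], [1.5, 4.0])          # vector method supplied by the fallback
mae([1.0, 2.0], [1.5, 4.0], [2, 1])  # weighted method supplied by the fallback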

@tlienart
@azev77

@ablaom
Member Author

ablaom commented May 11, 2020

Decision:

What do we do about measures like mape where we want to drop some terms where the computation is unstable? That is, what do we do if we are now reporting a value for every observation?

One option is to return missing there and make sure the aggregators apply skipmissing. There is no skipnan, which might be more natural.
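
For concreteness, a rough sketch of the missing option for a per-observation MAPE (the mape_terms helper and the tolerance are made up for illustration only):

using Statistics: mean

# per-observation MAPE terms, returning `missing` where the computation is unstable:
mape_terms(yhat, y; tol=eps()) =
    map(yhat, y) do ŷᵢ, yᵢ
        abs(yᵢ) < tol ? missing : abs((ŷᵢ - yᵢ)/yᵢ)
    end

# aggregation then skips the missings:
mean(skipmissing(mape_terms([1.0, 2.0, 3.0], [1.1, 0.0, 2.5])))  # zero-denominator term dropped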

A related question is whether instead of returning a single value, in the case of reports_each_observation=false (eg, auc) we instead report a constant vector. This would eliminate some bothersome case distinctions, but might also be confusing.

@ablaom
Member Author

ablaom commented May 11, 2020

Some other improvements on my wish list:

  1. Export all measure types (such as RMS, CrossEntropy, and so forth) and always use the explicit instantiations (such as RMS(), CrossEntropy(eps=1e-7), BrierScore(distribution=Normal), etc) in documentation, rather than rms, cross_entropy. I think keeping the aliases is fine, but their use in docs has hidden the fact that some measures depend on parameters, and that these parameters must be decided on instantiation of the measure, not when it is called on data. This is in line with the LossFunctions.jl package. I think this point has confused several people. edit This is essentially done.

  2. Following a suggestion of @juliohm, we replace the orientation trait, which takes values :score, :loss, or :none, with objective, taking values :max, :min or :none. We could then introduce the following new traits that might have some value:

    • is_loss: true only for measures that you minimise and which vanish in the case of "perfect" predictions

    • is_score: true only for measures that you maximise, which take values in [0, 1] and have unit value in the case of "perfect" predictions.

  3. Add functionality to take "products" of measures, for computing multi-target losses (a rough sketch follows this list).
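
Regarding item 3, a very rough sketch of what a "product" of measures could look like (the ProductMeasure type, the component measures and the tuple-of-columns representation are all illustrative, not an existing API):

struct ProductMeasure{T<:Tuple}
    components::T
end

# apply the i-th component measure to the i-th target column and collect the results:
(p::ProductMeasure)(yhat_columns, y_columns) =
    map((m, ŷ, y) -> m(ŷ, y), p.components, yhat_columns, y_columns)

# illustrative component measures:
rmse(ŷ, y) = sqrt(sum(abs2, ŷ .- y)/length(y))
mae(ŷ, y) = sum(abs, ŷ .- y)/length(y)

multi = ProductMeasure((rmse, mae))
multi(([1.0, 2.0], [0.0, 1.0]), ([1.5, 2.5], [0.0, 0.5]))  # one value per target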

@azev77

azev77 commented May 11, 2020

A few features that would be nice:

  1. similar to models(matching(X,y)) I would like measures() to give me all measures available for regression (continuous y).
  2. Nice decomposition of performance measures:
    deterministic (predictions) vs probabilistic
    a "subtype" of deterministic
    Scale-dependent measures: e := ŷ - y (mse, rmse, mae ...)
    Measures based on percentage errors: p := 100*e/y (rmsp, mape ...)
    {note: this way you only deal w/ zero denominators once & it applies to all percent based measures}

So a user can easily find:
all measures for continuous y
all scale-dependent measures for continuous y
etc

@ablaom
Member Author

ablaom commented May 11, 2020

A few features that would be nice:

  1. similar to models(matching(X,y)) I would like measures() to give me all measures available for regression (continuous y).

Good idea: JuliaAI/MLJBase.jl#301

  2. Nice decomposition of performance measures:
    deterministic (predictions) vs probabilistic
    a "subtype" of deterministic

Not sure I understand. We have a prediction_type trait. Could you elaborate?

Scale-dependent measures: e := ŷ - y (mse, rmse, mae ...)

I don't see why not. Not sure "scale-dependent" is the best description. Is this terminology common? How about a trait called difference_based and, if that is true, the API expects m(difference) (difference a scalar) to be implemented, and that's all? I'm supposing that these measures would all be of the reports_each_observation=true kind.

Measures based on percentage errors: p := 100*e/y (rmsp, mape ...)

Sure. Similar to above. percentage_based could be the trait name. (Is it common to use a percentage and not just proportion?)
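
For concreteness, a minimal sketch of how such traits might plug into the scalar fallback idea above (all names illustrative, not an existing API):

abstract type Measure end

# hypothetical traits with conservative defaults:
difference_based(::Measure) = false
percentage_based(::Measure) = false

# a difference-based measure need only implement the one-argument method m(difference):
struct SquaredError <: Measure end
difference_based(::SquaredError) = true
(::SquaredError)(difference::Real) = difference^2

# fallback generating the two-argument scalar method from the one-argument one:
(m::Measure)(s1::Real, s2::Real) =
    difference_based(m) ? m(s1 - s2) : error("scalar method not implemented for $(typeof(m))")

SquaredError()(3.0, 1.0)  # 4.0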

@azev77

azev77 commented May 14, 2020

Not sure I understand. We have a prediction_type trait. Could you elaborate?

Just like there are many models/measures in MLJ.jl, there are many distributions in Distributions.jl.
When the Arcsine(a,b) distribution was added it was given the ContinuousUnivariateDistribution "type"

struct Arcsine{T<:Real} <: ContinuousUnivariateDistribution
    a::T
    b::T
    Arcsine{T}(a::T, b::T) where {T<:Real} = new{T}(a, b)
end

Then automatically that is a subtype of ContinuousDistribution & UnivariateDistribution & Distribution.
Thus Arcsine(a,b) is an element in ALL of the following:

using Distributions
subtypes(Distribution)
subtypes(UnivariateDistribution)
subtypes(ContinuousDistribution)
subtypes(ContinuousUnivariateDistribution)

I'm throwing ideas out here, but could it help if measures was similarly organized?
Measure the umbrella type (like Distribution)
RegressionMeasure a subtype (like ContinuousDistribution)
ScaleDependentRegressionMeasure a subtype of RegressionMeasure
PercentBasedRegressionMeasure a subtype of RegressionMeasure
etc
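
In Julia that hierarchy would look something like this (a sketch only; all names illustrative):

abstract type Measure end
abstract type RegressionMeasure <: Measure end
abstract type ScaleDependentRegressionMeasure <: RegressionMeasure end
abstract type PercentBasedRegressionMeasure <: RegressionMeasure end

struct MAE <: ScaleDependentRegressionMeasure end
struct MAPE <: PercentBasedRegressionMeasure end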

Then if a user wants to find all measures etc:

subtypes(Measure)
subtypes(RegressionMeasure)
subtypes(ScaleDependentRegressionMeasure)
subtypes(PercentBasedRegressionMeasure)

@azev77

azev77 commented May 14, 2020

I don't see why not. Not sure "scale-dependent" is the best description. Is this terminology common?

You're right, it is not a flattering description (though informative).
I'm using terminology from the most cited paper on forecast accuracy.
Perhaps DifferenceRegressionMeasure PercentRegressionMeasure or whatever you think is right (and most familiar to users)...

@azev77

azev77 commented May 14, 2020

There is also an issue w/ asymmetry for percent errors:
mape(ŷ, y) != mape(y,ŷ)
a possible solution could be to require keyword args for those types of measures?
I have mixed feelings about this.
I like parsimony.
Keyword args might be clunky, maybe have a convention that predictions go before observations in MLJ...

@ablaom
Member Author

ablaom commented May 18, 2020

There is also an issue w/ asymmetry for percent errors:
mape(ŷ, y) != mape(y,ŷ)

Right. The API specifies that yhat goes first. We could give MAPE a field compare_with_prediction, which defaults to false in the keyword constructor (just like eps exists and defaults to eps()), but I'm not sure there would be much call for this option. Do you have a use case in mind?
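
A rough sketch of what that keyword constructor could look like (illustrative only, not the actual MLJBase implementation):

using Statistics: mean

struct MAPE
    eps::Float64
    compare_with_prediction::Bool
end

# keyword constructor with defaults, analogous to the existing eps default:
MAPE(; eps=eps(Float64), compare_with_prediction=false) = MAPE(eps, compare_with_prediction)

function (m::MAPE)(yhat::AbstractVector, y::AbstractVector)
    ref = m.compare_with_prediction ? yhat : y   # which vector supplies the denominators
    mean(abs((ŷᵢ - yᵢ)/rᵢ) for (ŷᵢ, yᵢ, rᵢ) in zip(yhat, y, ref) if abs(rᵢ) > m.eps)
end

MAPE()([1.1, 2.0], [1.0, 2.5])
MAPE(compare_with_prediction=true)([1.1, 2.0], [1.0, 2.5])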

@ablaom
Member Author

ablaom commented May 18, 2020

Regarding having a hierarchy of types: we don't really need this. You can do queries based on the traits. I think there is a tendency towards traits, because other packages can extend without your package being a dependency, and so forth. For example, Distributions is quite large and MLJ is kind-of forced to include it as a dependency because it doesn't use traits. I think if they were to start over, Distributions would probably use traits. And if I had my time over, I would possibly have used them more in MLJ (for model types) - but I think that ship has sailed.

Also, adding traits is much easier than changing type hierarchies you didn't quite get right.

@oxinabox May want to comment here.

@ablaom
Member Author

ablaom commented May 18, 2020

You can see all the traits in the current API with this example:

julia> info(rms)
root mean squared; aliases: `rms`.
(name = "rms",
 target_scitype = Union{AbstractArray{Continuous,1}, AbstractArray{Count,1}},
 supports_weights = true,
 prediction_type = :deterministic,
 orientation = :loss,
 reports_each_observation = false,
 aggregation = MLJBase.RootMeanSquare(),
 is_feature_dependent = false,
 docstring = "root mean squared; aliases: `rms`.",
 distribution_type = missing,)

@azev77

azev77 commented May 18, 2020

Can the current API tell me which measures work w/ regression (continuous y) versus classification?

@ablaom
Member Author

ablaom commented May 18, 2020

Sure.

Measures for a Finite univariate target (a.k.a. "classification"):

julia> measures(m -> AbstractVector{Finite} <: m.target_scitype)
19-element Array{NamedTuple{(:name, :target_scitype, :supports_weights, :prediction_type, :orientation, :reports_each_observation, :aggregation, :is_feature_dependent, :docstring, :distribution_type),T} where T<:Tuple,1}:
 (name = area_under_curve, ...)            
 (name = accuracy, ...)                    
 (name = balanced_accuracy, ...)           
 (name = cross_entropy, ...)               
 (name = FScore, ...)                      
 (name = false_discovery_rate, ...)        
 (name = false_negative, ...)              
 (name = false_negative_rate, ...)         
 (name = false_positive, ...)              
 (name = false_positive_rate, ...)         
 (name = misclassification_rate, ...)      
 (name = negative_predictive_value, ...)   
 (name = positive_predictive_value, ...)   
 (name = true_negative, ...)               
 (name = true_negative_rate, ...)          
 (name = true_positive, ...)               
 (name = true_positive_rate, ...)          
 (name = BrierScore{UnivariateFinite}, ...)
 (name = confusion_matrix, ...)      

Measures for a Continuous univariate target (aka "Regression"):

julia> measures(m -> AbstractVector{Continuous} <: m.target_scitype)
15-element Array{NamedTuple{(:name, :target_scitype, :supports_weights, :prediction_type, :orientation, :reports_each_observation, :aggregation, :is_feature_dependent, :docstring, :distribution_type),T} where T<:Tuple,1}:
 (name = l1, ...)                
 (name = l2, ...)                
 (name = mae, ...)               
 (name = mape, ...)              
 (name = rms, ...)               
 (name = rmsl, ...)              
 (name = rmslp1, ...)            
 (name = rmsp, ...)              
 (name = HuberLoss(), ...)       
 (name = L1EpsilonInsLoss(), ...)
 (name = L2EpsilonInsLoss(), ...)
 (name = LPDistLoss(), ...)      
 (name = LogitDistLoss(), ...)   
 (name = PeriodicLoss(), ...)    
 (name = QuantileLoss(), ...)    

@azev77

azev77 commented May 18, 2020

Ahhh! that's what I was looking for. Thanks!
I guess this will be easier to realize once measures() is refactored...

@ablaom
Member Author

ablaom commented May 25, 2020

There is also EvalMetrics.jl to look at; see JuliaAI/MLJBase.jl#316

@OkonSamuel
Member

A related question is whether instead of returning a single value, in the case of reports_each_observation=false (eg, auc) we instead report a constant vector. This would eliminate some bothersome case distinctions, but might also be confusing.

This would eliminate some type instabilities in the evaluate method

@tlienart

tlienart commented Jun 16, 2020

There is also EvalMetrics.jl to look at; see JuliaAI/MLJBase.jl#316

I don't think it's a serious contender after looking at their code in some detail (too narrow a focus when we would like something as generic as possible); some of their core methods could possibly be adapted (they explicitly said they were happy with that).

@OkonSamuel
Member

What's the status of the integration with EvalMetrics.jl?

@tlienart

I don't think we should; possibly we can use some of their code for a few specific metrics, but last I checked it's not really interesting for us (e.g. not generic enough).

@ablaom
Member Author

ablaom commented Aug 19, 2020

JuliaAI/MLJBase.jl#395

@ablaom
Member Author

ablaom commented Sep 3, 2020

Comment from @ven-k on slack:


While defining the struct for losses, including the y slightly improved the time taken. For ex,

struct MSE{T<:Float32}
    y::Vector{T}
end

and

struct MSE end

gave similar benchmarks, but the mean time of the former was 0.01 to 0.1 μs less than the latter.
And wouldn't this be more intuitive as we can define an object for a target

mse = MSE(y)

and pass only yhat in each epoch as mse(yhat).

Also, adding to the above, we could have a wrapper function mse(yhat, y) = MSE(y)(yhat) to support mse(yhat, y).
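
For reference, a self-contained sketch of the suggestion (names illustrative; this is not the MLJ measure API):

struct MSE{T<:Real}
    y::Vector{T}
end

# calling the object on predictions reuses the stored target:
(m::MSE)(yhat) = sum(abs2, yhat .- m.y)/length(m.y)

# wrapper supporting the usual two-argument form:
mse(yhat, y) = MSE(y)(yhat)

y = [1.0, 2.0, 3.0]
m = MSE(y)
m([1.1, 1.9, 3.2])        # per-epoch call passing only yhat
mse([1.1, 1.9, 3.2], y)   # equivalent two-argument call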

@ablaom
Member Author

ablaom commented Nov 2, 2020

LossFunctions fix: We can make measures from LossFunctions behave exactly like all the others when called, by importing their names into scope (instead of using them) and exporting versions that satisfy our API.

@azev77

azev77 commented Nov 3, 2020

I just tried measures(matching(y)) and it works like a charm!!!

There is a growing literature on Probabilistic predictions for regression models (ie predicting a conditional distribution).
ngboost.py is one of the contenders.
As explained in these slides, a common way to score probabilistic predictions is w/ Negative-LogLikelihood.
Is this measure (NLL or LL) worth adding to MLJ?
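
For what it's worth, a minimal sketch of such a measure for distribution-valued predictions of a Continuous target (illustrative only, not an existing MLJ measure):

using Distributions, Statistics

# mean negative log-likelihood of probabilistic predictions:
nll(yhat::AbstractVector{<:Distribution}, y::AbstractVector{<:Real}) = -mean(logpdf.(yhat, y))

yhat = [Normal(1.0, 0.5), Normal(2.0, 0.5)]   # predicted conditional distributions
y = [1.2, 1.7]                                # observed targets
nll(yhat, y)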

PS: here are all 47 measures I currently get

using MLJ;
a=measures()
[println(a[i]) for i in 1:length(measures())]
(name = area_under_curve, ...)
(name = accuracy, ...)
(name = balanced_accuracy, ...)
(name = cross_entropy, ...)
(name = FScore, ...)
(name = false_discovery_rate, ...)
(name = false_negative, ...)
(name = false_negative_rate, ...)
(name = false_positive, ...)
(name = false_positive_rate, ...)
(name = l1, ...)
(name = l2, ...)
(name = log_cosh, ...)
(name = mae, ...)
(name = mape, ...)
(name = matthews_correlation, ...)
(name = misclassification_rate, ...)
(name = negative_predictive_value, ...)
(name = positive_predictive_value, ...)
(name = rms, ...)
(name = rmsl, ...)
(name = rmslp1, ...)
(name = rmsp, ...)
(name = true_negative, ...)
(name = true_negative_rate, ...)
(name = true_positive, ...)
(name = true_positive_rate, ...)
(name = BrierScore{UnivariateFinite}, ...)
(name = DWDMarginLoss(), ...)
(name = ExpLoss(), ...)
(name = L1HingeLoss(), ...)
(name = L2HingeLoss(), ...)
(name = L2MarginLoss(), ...)
(name = LogitMarginLoss(), ...)
(name = ModifiedHuberLoss(), ...)
(name = PerceptronLoss(), ...)
(name = SigmoidLoss(), ...)
(name = SmoothedL1HingeLoss(), ...)
(name = ZeroOneLoss(), ...)
(name = HuberLoss(), ...)
(name = L1EpsilonInsLoss(), ...)
(name = L2EpsilonInsLoss(), ...)
(name = LPDistLoss(), ...)
(name = LogitDistLoss(), ...)
(name = PeriodicLoss(), ...)
(name = QuantileLoss(), ...)
(name = confusion_matrix, ...)

@ablaom
Member Author

ablaom commented Nov 4, 2020

I believe we already have negative log-likelihood, aka log-loss. It is called cross_entropy and, yes, it is a proper scoring loss.

search: cross_entropy

  cross_entropy

  Cross entropy loss with probabilities clamped between eps() and 1-eps(); aliases: cross_entropy.

  ce = CrossEntropy(; eps=eps())
  ce(ŷ, y)

  Given an abstract vector of distributions ŷ and an abstract vector of true observations y, return the corresponding cross-entropy
  loss (aka log loss) scores.

  Since the score is undefined in the case that the true observation has predicted probability zero, probabilities are clamped between
  eps and 1-eps, where eps can be specified.

  If sᵢ is the predicted probability for the true class yᵢ then the score for that example is given by

  -log(clamp(sᵢ, eps, 1-eps))

  For more information, run info(cross_entropy).
julia> yhat = UnivariateFinite(["yes", "no"], rand(5), pool=missing, augment=true)
5-element MLJBase.UnivariateFiniteArray{Multiclass{2},String,UInt8,Float64,1}:
 UnivariateFinite{Multiclass{2}}(yes=>0.374, no=>0.626)
 UnivariateFinite{Multiclass{2}}(yes=>0.532, no=>0.468)
 UnivariateFinite{Multiclass{2}}(yes=>0.428, no=>0.572)
 UnivariateFinite{Multiclass{2}}(yes=>0.691, no=>0.309)
 UnivariateFinite{Multiclass{2}}(yes=>0.539, no=>0.461)

julia> y = rand(classes(yhat), 5)
5-element Array{CategoricalArrays.CategoricalValue{String,UInt8},1}:
 "no"
 "no"
 "yes"
 "no"
 "yes"

julia> cross_entropy(yhat, y)
5-element Array{Float64,1}:
 0.4691627141887623
 0.7594675442682963
 0.8484769383284205
 1.1752213731506886
 0.6185977143266518

@azev77

azev77 commented Nov 4, 2020

@ablaom cross_entropy doesn't give the log-likelihood for the following:

using MLJ
X,y=@load_boston
train, test = partition(eachindex(y), .7, rng=333);

@load LinearRegressor pkg = GLM
mdl = LinearRegressor()
mach = machine(mdl, X, y)
fit!(mach, rows=train, verbosity=0)
ŷ = predict(mach, rows=test)

cross_entropy(ŷ, y[test])

ERROR: MethodError: no method matching (::MLJBase.CrossEntropy{Float64})(::Array{Distributions.Normal{Float64},1}, ::Array{Float64,1})
Closest candidates are:
  Any(::MLJBase.UnivariateFiniteArray{S,V,R,P,1}, ::AbstractArray{T,1} where T) where {S, V, R, P} at /Users/AZevelev/.julia/packages/MLJBase/Ov46j/src/measures/finite.jl:64
  Any(::AbstractArray{var"#s577",1} where var"#s577"<:UnivariateFinite, ::AbstractArray{T,1} where T) at /Users/AZevelev/.julia/packages/MLJBase/Ov46j/src/measures/finite.jl:57
Stacktrace:
 [1] top-level scope at none:1

@ablaom
Member Author

ablaom commented Nov 4, 2020

Ah yes. cross_entropy is for Finite targets only. And yes, it has an extension to the continuous case (https://www-jstor-org.ezproxy.auckland.ac.nz/stable/2629907?seq=3&socuuid=dcad1753-575a-42c3-a7b8-fdd39a9f7589&socplat=email#metadata_info_tab_contents ) but it is not yet implemented. Indeed no measure for probabilistic predictors of Continuous targets is yet implemented 😢 See also JuliaAI/MLJBase.jl#395

@azev77

azev77 commented Nov 4, 2020

For the continuous case, I doubt it would be called cross_entropy, just log-likelihood is prob fine

@ablaom ablaom pinned this issue Nov 15, 2020
@ablaom
Member Author

ablaom commented Nov 16, 2020

JuliaAI/MLJBase.jl#450

@ablaom
Member Author

ablaom commented Jun 28, 2021

Lighthouse has some measures we may want to include: JuliaAI/MLJBase.jl#586

@ablaom ablaom changed the title Improvements to measures [Tracking] Improvements to measures Aug 25, 2021
@ablaom
Member Author

ablaom commented Aug 25, 2021

Community discussion on mitigating metric code fragmentation

FluxML/FluxML-Community-Call-Minutes#38

@tlienart

tlienart commented Aug 25, 2021

(sorry super old message but...

One option is to return missing there and make sure the aggregators apply skipmissing. There is no skipnan, which might be more natural.

You can use

skipnan(x) = Iterators.filter(!isnan, x)

(see also JuliaLang/julia#35162)
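
For example (assuming mean is brought in with using Statistics):

mean(skipnan([1.0, NaN, 3.0]))  # == 2.0; the NaN is dropped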

Going through the thread, the option to get auc to return a vector seems pretty weird to me. If it returns one internally to eliminate type instability, fine, but not to the user; maybe a way to do this is to implement a show for the measure.

@ablaom
Member Author

ablaom commented Aug 25, 2021

A related question is whether instead of returning a single value

No there doesn't seem to be much stomach for this suggestion.

make sure the aggregators do skipmissing

Done. I had forgotten about NaN's though.

@pat-alt

pat-alt commented Dec 6, 2022

Would be nice to have various metrics from CalibrationErrors.jl by @devmotion added. I may chip in myself once I've sorted out this one and by then will have hopefully understood how it works (but will be some time before I get to it).

@ablaom ablaom transferred this issue from JuliaAI/MLJBase.jl Jan 17, 2024
@github-project-automation github-project-automation bot moved this to priority high / involved in General Aug 30, 2024
@ablaom ablaom moved this from priority high / involved to tracking/discussion/metaissues/misc in General Dec 23, 2024