TODO: Modify this readme once the manuscript is done, and change it to something more descriptive
Project objective: Obtain as robust a subset as possible of the most important wavebands for predicting soil gene abundance values.
- Python instead of R
- Only elastic net models (random forest had bad results previously, and elastic net is a generalization of lasso models so those are implicitly included in consideration)
- Finish implementing basic modeling in scikit-learn
- Figure out why the existing elastic net model isn't getting good results. Probably have to do hyperparameter tuning from scratch instead of transferring in prior hyperparameters (HPs).
- Try to get as robust a model as possible
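
A minimal sketch of tuning the elastic net hyperparameters from scratch, assuming a scikit-learn `Pipeline` with standardization followed by `ElasticNet`. The data here is synthetic (random "wavebands" with a toy target), and the grid values are placeholders, not recommendations for the real spectra.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the spectra: 80 samples x 50 "wavebands".
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 50))
# Toy target: only the first three columns carry signal.
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 1.0 * X[:, 2] + rng.normal(scale=0.1, size=80)

# Scaling lives inside the pipeline so it is refit on each CV training fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("enet", ElasticNet(max_iter=10_000)),
])

# Placeholder grid: alpha is the penalty strength, l1_ratio mixes L1 and L2
# (l1_ratio=1.0 is plain lasso).
param_grid = {
    "enet__alpha": np.logspace(-3, 1, 9),
    "enet__l1_ratio": [0.1, 0.5, 0.9, 1.0],
}

cv = KFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="r2")
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, best_score)
```

Because `l1_ratio=1.0` is in the grid, lasso is covered by the same search.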
- Implement feature selection methods to give a set of the top x wavebands
- Filter methods: These are applied to the data based on its statistical properties. No modeling needed.
- Correlation filter threshold?
- Chi-squared threshold? (This one is suspect, since the chi-squared test may be meant for categorical rather than numerical data. Need to look into this.)
- Embedded methods: These use internal properties of the models themselves
- Coefficients for elastic net
- Wrapper methods: These are model-agnostic, and (according to a few sources) generally the most robust out of the three types.
- Recursive feature elimination
- Permutation importance
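
A rough sketch of how each of the methods listed above could yield a set of the top x wavebands, assuming scikit-learn and synthetic data. The helper `top_k`, the value of `k`, and the elastic net settings are all hypothetical choices for illustration. The correlation filter here ranks by absolute Pearson correlation with the target instead of applying a threshold.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.inspection import permutation_importance
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 200 samples x 30 "wavebands"; columns 0-2 carry the signal.
rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(200, 30)))
y = 3.0 * X[:, 0] - 2.5 * X[:, 1] + 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)
k = 3  # the "top x" to keep (placeholder value)


def top_k(scores, k):
    """Return the indices of the k largest scores as a sorted list."""
    return sorted(np.argsort(scores)[-k:].tolist())


# Filter method: absolute Pearson correlation with the target, no model needed.
corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
filter_set = top_k(corr, k)

# Embedded method: magnitude of the fitted elastic net coefficients.
model = ElasticNet(alpha=0.05, l1_ratio=0.5, max_iter=10_000).fit(X, y)
embedded_set = top_k(np.abs(model.coef_), k)

# Wrapper method 1: recursive feature elimination around a fresh elastic net.
rfe = RFE(ElasticNet(alpha=0.05, l1_ratio=0.5, max_iter=10_000),
          n_features_to_select=k).fit(X, y)
rfe_set = sorted(np.flatnonzero(rfe.support_).tolist())

# Wrapper method 2: permutation importance of the fitted model
# (scored on the training data here for brevity; held-out data is preferable).
perm = permutation_importance(model, X, y, n_repeats=10, random_state=0)
perm_set = top_k(perm.importances_mean, k)

print(filter_set, embedded_set, rfe_set, perm_set)
```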
- Find an algorithmic way to reach a consensus among the waveband sets from part 2
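
One possible consensus rule, sketched under the assumption that each method returns a list of waveband identifiers: count how many methods selected each waveband and keep those that reach a vote threshold. The function `consensus`, the method names, and the waveband values below are all hypothetical.

```python
from collections import Counter


def consensus(selections, min_votes):
    """Keep wavebands chosen by at least `min_votes` of the selection methods."""
    # set(bands) ensures a method cannot vote twice for the same waveband.
    votes = Counter(band for bands in selections.values() for band in set(bands))
    return sorted(band for band, n in votes.items() if n >= min_votes)


# Made-up waveband sets, one per feature selection method.
selections = {
    "correlation_filter": [410, 550, 680, 720],
    "elastic_net_coef": [550, 680, 720, 905],
    "rfe": [550, 680, 1450],
    "permutation": [410, 550, 680],
}

majority = consensus(selections, min_votes=3)
print(majority)  # [550, 680]
```

Setting `min_votes` to the number of methods gives a strict intersection, and setting it to 1 gives the union, so the threshold controls how conservative the consensus is.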
- Repeat part 1, but on the results of part 3
- Analyze results
- Write up paper (doubles as work for both MLSC and SoutheastCon)
- Create presentation
- gives an example of hyperparameter searching over a pipeline, which is already what we need to do. But further, it even tests multiple dimensionality reduction methods simultaneously. Is this basically what we want? I still need to look into it.
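
A hedged sketch of the idea in the note above: one `GridSearchCV` over a scikit-learn `Pipeline` whose reduction step is itself a searched parameter, so several dimensionality reduction methods are compared in the same search. This is not a copy of the example the note refers to. The choice of `PCA` versus `SelectKBest`, the grid values, and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 120 samples x 40 "wavebands"; columns 0 and 1 carry the signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 40))
y = 2.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=120)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("reduce", "passthrough"),  # placeholder step, filled in by the grid
    ("enet", ElasticNet(max_iter=10_000)),
])

# Each dict swaps in a different reduction step with its own size parameter.
param_grid = [
    {"reduce": [PCA()], "reduce__n_components": [5, 10], "enet__alpha": [0.01, 0.1]},
    {"reduce": [SelectKBest(f_regression)], "reduce__k": [5, 10], "enet__alpha": [0.01, 0.1]},
]

search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
search.fit(X, y)

best_reducer = type(search.best_params_["reduce"]).__name__
print(best_reducer, search.best_score_)
```

Note that `PCA` produces mixtures of wavebands while `SelectKBest` keeps original wavebands, so only the latter maps directly onto a waveband subset.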