Add sklearn regressors, optimize hyperparameter search space and enhance cross-validation #41
Conversation
* The default hyperparameter search spaces are specified so that they give reasonably "good" solutions for most cases.
* sklearn regressors are added to components.py. However, estimator.py still needs to be modified to evaluate regression.
* The naming of variables is made consistent across the board.
* Fix a bug that made online learners (e.g. SGD) fail to run when the trial timeout is not specified.
* Make partial fit an option.
* Allow specifying the validation size.
* Misc. changes.
* Make the hyperopt estimator accept regressors.
* Fix bugs and assertion errors for regressors.
* Fix the PCA number-of-components error.
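The PR's internal search-space code is not shown in this thread. As a rough plain-sklearn analogue of searching a regressor's hyperparameter space (the dataset, parameter ranges, and use of `RandomizedSearchCV` instead of hyperopt are illustrative choices, not the PR's code — hyperopt can additionally express conditional spaces):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVR

# Synthetic regression data standing in for a real dataset.
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

# A flat search space over SVR hyperparameters; ranges are illustrative.
param_distributions = {
    "C": np.logspace(-2, 2, 20),
    "gamma": np.logspace(-4, 0, 20),
    "kernel": ["rbf", "linear"],
}
search = RandomizedSearchCV(SVR(), param_distributions, n_iter=10,
                            cv=3, random_state=0)
search.fit(X, y)
print(sorted(search.best_params_.keys()))  # → ['C', 'gamma', 'kernel']
```

The same idea — sample hyperparameters, fit, score, keep the best — is what the hyperopt estimator automates with a smarter search strategy.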
* Add a time series lag selector for SVR, which treats lag sizes as hyperparameters. The lag sizes can be specified for both endogenous and exogenous predictors.
* Define a lag selector class for KNN regression, but it has not been added to components.py yet.
* Optimize the SVM hyperparameter search space by considering the dependencies between parameters. This has only been applied to the SVR lag selector so far.
* Some code refactoring to make it easier to maintain.
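The lag-selector class itself is not shown in the thread; the core idea is to turn a series into a supervised-learning matrix whose width (the number of lags) is the tunable hyperparameter. A minimal sketch of that construction (`make_lagged_features` is a hypothetical helper, not the PR's API):

```python
import numpy as np

def make_lagged_features(y, n_lags):
    """Stack the previous n_lags values as features for each target y[t]."""
    X = np.column_stack([y[i:len(y) - n_lags + i] for i in range(n_lags)])
    return X, y[n_lags:]

y = np.arange(10.0)
X, target = make_lagged_features(y, n_lags=3)
print(X.shape)  # → (7, 3): row t holds y[t-3], y[t-2], y[t-1]
```

With this framing, hyperopt can search over `n_lags` exactly as it searches over any other hyperparameter of the downstream SVR or KNN regressor.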
* Add a KNN regression lag selector.
* Fix a bug where the underlying learner in a Hyperopt object was not identified as a lag selector.
* Refactor lag selectors as preprocessors.
* Refactor the hyperopt-sklearn estimator to improve modularity.
* Refactor the SVM, KNN and trees-ensemble functions to improve modularity.
* Misc. minor enhancements to code readability.
* Improve cross-validation by adding K-fold, shuffle-and-split and leave-one-out splitters, with a shuffle option and stratification for classification.
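The splitters named in this commit map directly onto sklearn's `model_selection` classes. As an illustration of the four strategies side by side (the dataset and classifier are arbitrary choices, not the PR's code):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import (KFold, LeaveOneOut, ShuffleSplit,
                                     StratifiedKFold, cross_val_score)
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

splitters = {
    "k-fold": KFold(n_splits=5, shuffle=True, random_state=0),
    "shuffle-and-split": ShuffleSplit(n_splits=5, test_size=0.2, random_state=0),
    "stratified": StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    "leave-one-out": LeaveOneOut(),
}
for name, cv in splitters.items():
    scores = cross_val_score(clf, X, y, cv=cv)
    print(f"{name}: mean accuracy {scores.mean():.2f}")
```

Stratification matters for classification because it keeps the class proportions similar across folds; leave-one-out is the n_splits = n_samples limit and is only practical on small datasets.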
* Some test code has become obsolete since skdata and hpsklearn have evolved.
* Minor bug and code format fixes.

NOTE: there is still a "load iris.csv" error when running the tests.
* Rename test_demo.py to test_demo.py.bak since it is not compatible with the current skdata APIs.
* Mask TestSpace since it failed with: AttributeError: 'NoneType' object has no attribute 'multinomial'.
* Add trial_timeout in test_stories.py so that it finishes in a reasonable time.
* Fix the PCA number-of-components issue. Setting it to a float may cause the train and test sets to end up with different numbers of components, so it is reverted to the original solution.
* Limit the PCA number of components to the number of features. This is done dynamically at run time.
* Add a missing name function for RBM.
* Remove the assertion check for preprocessings.

NOTES: test_demo.py and TestSpace are currently masked. All other tests passed.
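The PCA fix amounts to clamping a searched `n_components` value so it never exceeds what the data can support. A minimal sketch of that run-time cap (the variable names are illustrative, not the PR's code):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)  # 150 samples, 4 features

requested = 10  # a searched hyperparameter may exceed the feature count
n_components = min(requested, X.shape[1])  # cap dynamically at run time
pca = PCA(n_components=n_components)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)  # → (150, 4)
```

Without the cap, `PCA(n_components=10)` would raise on 4-feature data; with a float `n_components` (explained-variance fraction), train and test splits could resolve to different component counts, which is the inconsistency the commit reverts away from.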
This is exciting! I will look at it some time this week, please ping me if …
On Aug 1, 2016 9:48 AM, "Li Shen" [email protected] wrote:
Hi @jaberg, what's the status of code review? Any feedback? Thanks! Li
Hi Li! I will try to make time for this next week, but I'm not actively …
On Fri, Aug 5, 2016 at 9:45 AM, Li Shen [email protected] wrote:
Lots of improvements here, great job! I'll look through these changes and try them out more this weekend, but I'm not as familiar with the code as I used to be. I like your idea of switching from skdata to sklearn.datasets for the tests, as problems with importing skdata sometimes arise, and fewer dependencies is always nice.
* Add back test_demo.py and change the Iris dataset source from skdata to sklearn.
* Change the display of the loss for each trial from graphical to textual output.
@jaberg @bjkomer
Looks great!! All the tests are working fine on my machine, and all of my old code still works. Nothing is jumping out at me that I think needs to be changed.
* Add AdaBoost and GradientBoosting regressors.
* Fix an error in user-supplied KNN parameters.
* Fix an issue where the number of PCA components may be larger than the data allows.
* Modify the parameters for test_preproc in test_stories.py so that it finishes in a reasonable time.
* Misc. enhancements to the test messages.
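The two new regressors wrap sklearn's ensemble classes. As a quick illustration of the underlying estimators these components expose to the search (the dataset and `n_estimators` value are arbitrary, not the PR's defaults):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit both boosting ensembles and report held-out R^2.
for Reg in (AdaBoostRegressor, GradientBoostingRegressor):
    model = Reg(n_estimators=50, random_state=0).fit(X_train, y_train)
    print(Reg.__name__, round(model.score(X_test, y_test), 2))
```

In the hyperopt-sklearn setting, parameters such as `n_estimators`, `learning_rate` and tree depth become part of the searched space rather than fixed values.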
Just added AdaBoost and GradientBoosting regressors and fixed a few minor bugs. I had to tune down some parameters for test_preproc in test_stories.py because it was taking too long to finish. If I use trial_timeout, the preprocessing may not finish properly, which would affect the follow-up training and testing.
Added a new test for time series forecasting. This new test serves two purposes:
I hope these additional tests and modifications will strengthen the reliability of the code in this PR. If nothing else needs to be changed, I hope you can accept this PR, since I believe many people can benefit from the new additions. @bjkomer @jaberg
Excellent work! You're right, a lot of people can benefit from this, so I'll get it merged now. If anything comes up that needs to be changed, that can always be dealt with in another PR.
Contributions in this PR: