
Conversation

@lishen (Contributor) commented Jul 27, 2016

Contributions in this PR:

  • Add sklearn regressors (a usage sketch follows this list).
  • Optimize hyperparameter search space for classifiers/regressors so that the learners give good performance in most cases.
  • Add K-fold, leave-one-out, and shuffle-and-split cross-validation, with a shuffle option and stratification for classification.
  • Add a lag selector as preprocessing for time series forecasting problems.
  • Refactoring to improve code modularity and readability.
  • Fix obsolete tests so that they run under the current version of hpsklearn.
  • Misc. bug fixes and enhancements.
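To make the regression support concrete, here is a minimal usage sketch. It assumes the estimator accepts a `regressor` search space as this PR describes; the `any_regressor` helper name, the `best_model()` accessor, and the exact constructor arguments are assumptions rather than confirmed API.

```python
# Hedged sketch: search sklearn regressors with hyperopt via hpsklearn.
# `any_regressor`, the `regressor=` keyword, and `best_model()` are
# assumptions about this PR's API rather than confirmed names.
from hyperopt import tpe
from hpsklearn import HyperoptEstimator, any_regressor
from sklearn.datasets import load_boston

boston = load_boston()
X, y = boston.data, boston.target
n_train = int(0.8 * len(X))          # simple holdout split for brevity
X_train, y_train = X[:n_train], y[:n_train]
X_test, y_test = X[n_train:], y[n_train:]

estim = HyperoptEstimator(
    regressor=any_regressor('reg'),  # search over the registered regressors
    algo=tpe.suggest,                # TPE search algorithm from hyperopt
    max_evals=25,
    trial_timeout=60,
)
estim.fit(X_train, y_train)
print(estim.score(X_test, y_test))
print(estim.best_model())
```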

mlmlm and others added 14 commits October 6, 2014 23:08
* The default hyperparameter search spaces are specified so that they
  give reasonably good solutions in most cases.
* sklearn regressors are added to components.py. However, estimator.py
  still needs to be modified to evaluate regression.
* Variable naming is made consistent across the board.
* Fix a bug that made online learners (e.g. SGD) fail to run when the
  trial timeout was not specified.
* Make partial fit an option.
* Allow specifying the validation size.
* Misc. changes
* Make the hyperopt estimator accept regressors.
* Fix bugs and assertion errors for regressors.
* Fix PCA number of components error.
* Add time series lag selector for SVR, which treats lag sizes as
  hyperparameters. The lag sizes can be specified for both endogenous
  and exogenous predictors.
* Define a lag selector class for KNN regression, but it has not been
  added to components.py yet.
* Optimize the SVM hyperparameter search space by considering
  dependencies among the hyperparameters. This has only been applied
  to the SVR lag selector so far.
* Some code refactoring to make it easier to maintain.
* Add KNN regression lag selector.
* Fix a bug where the underlying learner in a hyperopt object was not
  identified as a lag selector.
* Refactor lag selectors as preprocessors.
* Refactor hyperopt-sklearn estimator to improve modularity.
* Refactor SVM, KNN and trees ensemble functions to improve modularity.
* Misc. minor enhancements to code readability.
* Improve cross-validation by adding K-fold, shuffle-and-split, and
  leave-one-out, with a shuffle option and stratification for
  classification (see the splitter sketch after these notes).
* Some test code has become obsolete since skdata and hpsklearn have
  evolved.
* Minor bug and code format fixes.

NOTE: there is still an error loading iris.csv when running the tests.
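For reference, the cross-validation options named in these notes correspond to standard sklearn splitters. The sketch below uses plain sklearn (0.18+ import paths; older releases keep these in sklearn.cross_validation) and is only an illustration, not this PR's internal implementation.

```python
# Hedged sketch: the CV strategies named above, expressed with plain
# sklearn splitters (illustrative only, not the PR's internal code).
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, ShuffleSplit, LeaveOneOut

X = np.random.rand(20, 3)
y = np.array([0, 1] * 10)   # balanced binary labels for stratification

kfold = KFold(n_splits=5, shuffle=True, random_state=0)             # K-fold with shuffle
strat = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)   # stratified, for classification
shuffled = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)  # shuffle-and-split
loo = LeaveOneOut()                                                 # leave-one-out

for train_idx, valid_idx in strat.split(X, y):
    X_train, X_valid = X[train_idx], X[valid_idx]
    y_train, y_valid = y[train_idx], y[valid_idx]
    # fit and score a candidate model on this fold here
```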
* Rename test_demo.py to test_demo.py.bak since it is not compatible with the
  current APIs of skdata.
* Mask TestSpace since it failed. Error message: >>>> AttributeError:
  'NoneType' object has no attribute 'multinomial' <<<<
* Add trial_timeout in test_stories.py so that it can finish in reasonable time.
* Fix the PCA number-of-components issue. Setting it to a float may cause the
  train and test sets to have different numbers of components. It is now
  reverted to the original solution.
* Limit the PCA number of components to the number of features. This is done
  dynamically at run time (a small sketch follows these notes).
* Add a missing name function for RBM.
* Remove the assertion check for preprocessing steps.

NOTES: test_demo.py and TestSpace are currently masked. All other tests passed.
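A minimal sketch of what the dynamic clamping described above amounts to; the function and variable names are illustrative, not taken from the PR.

```python
# Hedged sketch: clamp the requested number of PCA components to what the
# data can support, at fit time (names are illustrative, not the PR's code).
from sklearn.decomposition import PCA

def fit_pca(X, n_components):
    # PCA cannot have more components than features (or samples).
    n_components = min(n_components, X.shape[0], X.shape[1])
    return PCA(n_components=n_components).fit(X)
```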
@lishen (Contributor, Author) commented Jul 27, 2016

@jaberg Please offer some insights as to why TestSpace did not work.

@jaberg @bjkomer I can switch from skdata to sklearn.datasets for the test_demo. I think it will improve the reliability of the test code. Please let me know if you want me to do so.

@lishen (Contributor, Author) commented Aug 1, 2016

I know I have made a lot of changes to the code. Please let me know if there is anything I can do to make it clearer. I can add some additional tests for the regression and other changes. I can also do some rebasing to clean up some commits. Let me know what you think. @jaberg @bjkomer

@jaberg (Contributor) commented Aug 1, 2016

This is exciting! I will look at it some time this week, please ping me if
I forget.


@lishen (Contributor, Author) commented Aug 5, 2016

Hi @jaberg, what's the status of code review? Any feedback? Thanks!

Li

@jaberg (Contributor) commented Aug 6, 2016

Hi Li!

I will try to make time for this next week, but I'm not actively
maintaining or using this code these days. Maybe @bjkomer will have some
thoughts about it too? What do you think, Brent?


@bjkomer (Member) commented Aug 6, 2016

Lots of improvements here, great job! I'll look through these changes and try them out more this weekend but I'm not as familiar with the code as I used to be.

I like your idea of switching from skdata to sklearn.datasets for the tests, as problems importing skdata sometimes arise, and fewer dependencies are always nice.

lishen added 3 commits August 8, 2016 16:33
* Add back test_demo.py and change the Iris dataset from skdata
  to sklearn.
* Change the display of the loss for each trial from graphical to
  textual output.
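The textual output essentially prints the loss recorded for each trial. A minimal sketch, assuming access to the underlying `hyperopt.Trials` object (how the estimator exposes it is an assumption here); `Trials.losses()` is standard hyperopt API.

```python
# Hedged sketch: print per-trial losses from a hyperopt Trials object
# instead of plotting them (Trials.losses() is standard hyperopt API).
def report_losses(trials):
    for i, loss in enumerate(trials.losses()):
        if loss is not None:   # failed or unfinished trials report None
            print('trial %3d: loss = %.4f' % (i, loss))
```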
@lishen (Contributor, Author) commented Aug 8, 2016

@jaberg @bjkomer
Some updates here:

  • I added back test_demo.py and changed the Iris dataset source from skdata to sklearn.
  • I changed the display of the loss in the demo from graphical to textual output. I tried to use the original code, which is supposed to show a plot of the losses vs. iteration, but was not successful. Textual output is much simpler, should be universally available on all platforms, and makes the code less dependent on other packages.
  • I added a regression demo using the sklearn Boston dataset so you can be sure the regressors work well with hyperopt.
  • I tried rebasing some of the commits but wasn't successful because there are parallel histories in the repo, so I just left the commits as is. The only thing I tried to rebase was the time series lag selector, because I changed its design. Anyway, the code works fine; I just tried to make the commit history more concise.

@bjkomer (Member) commented Aug 8, 2016

Looks great!! All the tests are working fine on my machine, and all of my old code still works. Nothing is jumping out at me that I think needs to be changed.

* Add AdaBoost and GradientBoosting regressors (a hedged search-space sketch
  follows these notes).
* Fix an error in user-supplied KNN parameters.
* Fix an issue where the number of PCA components may be larger than the data
  allows.
* Modify the parameters for test_preproc in test_stories.py so that it can
  finish in a reasonable time.
* Misc. enhancements to the test messages.
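For context, a hedged sketch of what a hyperopt search space for the new GradientBoosting regressor might look like, written directly against hyperopt's `hp` primitives; the parameter ranges and the synthetic data are illustrative assumptions, not the values chosen in this PR.

```python
# Hedged sketch: a hyperopt search space and objective for
# GradientBoostingRegressor (ranges are illustrative, not the PR's defaults).
import numpy as np
from hyperopt import hp, fmin, tpe, Trials
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

space = {
    'n_estimators': hp.quniform('n_estimators', 50, 300, 10),
    'learning_rate': hp.loguniform('learning_rate', np.log(0.01), np.log(0.3)),
    'max_depth': hp.quniform('max_depth', 2, 8, 1),
    'subsample': hp.uniform('subsample', 0.5, 1.0),
}

X = np.random.rand(100, 5)   # synthetic data, for illustration only
y = np.random.rand(100)

def objective(params):
    model = GradientBoostingRegressor(
        n_estimators=int(params['n_estimators']),   # quniform yields floats
        learning_rate=params['learning_rate'],
        max_depth=int(params['max_depth']),
        subsample=params['subsample'],
        random_state=0,
    )
    # hyperopt minimizes, so return the negative cross-validated R^2
    return -cross_val_score(model, X, y, cv=3).mean()

trials = Trials()
best = fmin(objective, space, algo=tpe.suggest, max_evals=10, trials=trials)
print(best)
```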
@lishen (Contributor, Author) commented Aug 12, 2016

Just added AdaBoost and GradientBoosting regressors. Fixed a few minor bugs.

I had to tune down some parameters for test_preproc in test_stories.py because it was taking too long to finish. If I use trial_timeout, the preprocessing may not finish properly, which affects the follow-up training and testing.

All tests passed. @bjkomer @jaberg

@lishen (Contributor, Author) commented Aug 12, 2016

Added a new test for time series forecasting. This new test serves two purposes:

  1. Illustrate the use of lag selectors and exogenous data.
  2. Demonstrate a way to format time series data for use with sklearn + hyperopt.
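On the second point, the sketch below shows one conventional way to turn a series into a lagged feature matrix that any sklearn regressor can consume. It only illustrates the data format; the lag selector in this PR treats the lag sizes as hyperparameters instead of fixing them as done here, and the helper name is hypothetical.

```python
# Hedged sketch: build a lagged design matrix from an endogenous series y
# and an exogenous series x, so an sklearn regressor can forecast y[t].
import numpy as np

def make_lagged_matrix(y, x, n_lags_endo=3, n_lags_exo=2):
    n = len(y)
    start = max(n_lags_endo, n_lags_exo)
    rows, targets = [], []
    for t in range(start, n):
        endo = y[t - n_lags_endo:t]   # past values of the target series
        exo = x[t - n_lags_exo:t]     # past values of the exogenous input
        rows.append(np.concatenate([endo, exo]))
        targets.append(y[t])
    return np.asarray(rows), np.asarray(targets)

y = np.sin(np.linspace(0, 10, 200))
x = np.cos(np.linspace(0, 10, 200))
X_lagged, y_target = make_lagged_matrix(y, x)
print(X_lagged.shape, y_target.shape)   # (197, 5) (197,)
```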

I hope these additional tests and modifications will strengthen the reliability of the code in this PR. If nothing else needs to be changed, I hope you can accept this PR since I believe many people can benefit from the new additions. @bjkomer @jaberg

@bjkomer (Member) commented Aug 14, 2016

Excellent work! You're right, a lot of people can benefit from this, so I'll get it merged now. If anything comes up that needs to be changed, it can always be dealt with in another PR.

@bjkomer bjkomer merged commit 2f41ab9 into hyperopt:master Aug 14, 2016