
Commit 97548d0: Initial commit

0 parents, 22 files changed: +915 -0 lines

.flake8

Lines changed: 44 additions & 0 deletions

```ini
#########################
# Flake8 Configuration  #
# (.flake8)             #
#########################
[flake8]
ignore =
    # asserts are ok when testing.
    S101
    # pickle
    S301
    # pickle
    S403
    S404
    S603
    # Line break before binary operator (flake8 is wrong)
    W503
    # Ignore the spaces black puts before columns.
    E203
    # allow path extensions for testing.
    E402
    DAR101
    DAR201
    # flake and pylance disagree on linebreaks in strings.
    N400
exclude =
    .tox,
    .git,
    __pycache__,
    docs/source/conf.py,
    build,
    dist,
    tests/fixtures/*,
    *.pyc,
    *.bib,
    *.egg-info,
    .cache,
    .eggs,
    data
max-line-length = 120
max-complexity = 20
import-order-style = pycharm
application-import-names =
    seleqt
    tests
```
.github/workflows/test.yml

Lines changed: 41 additions & 0 deletions

```yaml
name: Tests

on: [ push, pull_request ]

jobs:
  tests:
    name: Tests
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [ ubuntu-latest ]
        python-version: [ 3.11.0 ]
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v2
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install dependencies
        run: pip install nox
      - name: Test with pytest
        run: nox -s test
  lint:
    name: Lint
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: [ 3.11.0 ]
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v2
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install dependencies
        run: pip install nox
      - name: Run flake8
        run: nox -s lint
      - name: Run mypy
        run: nox -s typing
```
.gitignore

Lines changed: 163 additions & 0 deletions

```
.vscode/
.pytest_cache/

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/#use-with-ide
.pdm.toml

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
```
README.md

Lines changed: 101 additions & 0 deletions

# Cluster Analysis Exercise - k-Means(++) and Data-Compression

Today we will dive into unsupervised learning and experiment with *clustering*. k-means(++) is a popular choice, maybe even the most widely used clustering algorithm. We will therefore explore it in more depth in the following three tasks. In this exercise, we will continue to use `sklearn`, which implements a variety of clustering algorithms in the `sklearn.cluster` package.

### Task 1: A detailed look into k-Means

The goal of this exercise part is to peek under the hood of the algorithm in order to empirically explore its strengths and weaknesses. Initially, we will use synthetic data to develop a basic understanding of its performance characteristics.

In the literature on cluster analysis, k-means often refers not only to the clustering algorithm but also to the underlying optimization problem:

$$\min_{C \subset \mathbb{R}^d, |C| = k} \underbrace{\sum_{x \in P} \min_{c \in C} \lVert x - c \rVert^2}_{\text{Inertia}}$$

When k-means refers to the problem formulation, the algorithm itself is often called Lloyd's algorithm. It repeats two steps until convergence:
***
1. Assign each data point to its nearest cluster center based on the squared Euclidean norm.

2. Update each center by computing the mean of the points assigned to its cluster and using that mean as the new cluster center.
***
This algorithm always converges to a solution. However, there is no guarantee that this solution is the global optimum, and convergence may be slow. The algorithm also requires an initial guess for the cluster centers, usually obtained by randomly selecting some of the data points as initial centers. It is therefore good practice to run the algorithm several times with different initial center guesses to find a solution that is hopefully close to the best one.

Fortunately, the implementation of k-means in `sklearn` takes care of all these details and provides a simple interface to control all these aspects.

Navigate to `src/ex1_kmeans.py`. Implement the first part of the `plot_kmeans_clustering` function as follows (a minimal sketch follows the list):

1. Load the input data from the given path. You can now run the file and examine the data.
2. k-means clustering is scale sensitive, so we generally need to rescale the input data before clustering. Note that our `plot_kmeans_clustering` function has a `standardize` parameter that is set to `False` by default. If `standardize` is set to `True`, standardize the data according to $x_i = \frac{x_i - \mu}{\sigma}$, where $\mu$ is the sample mean and $\sigma$ is the sample standard deviation. `sklearn.preprocessing` may be helpful.
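
One way these two steps could look, assuming the data can be loaded with NumPy (the file format and the helper name `load_and_standardize` are illustrative, not part of the exercise skeleton):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def load_and_standardize(path: str, standardize: bool = False) -> np.ndarray:
    """Load 2D points and optionally z-score them feature-wise."""
    data = np.load(path)  # assumed: a .npy file of shape (n_samples, 2)
    if standardize:
        # StandardScaler estimates mu and sigma per feature and applies
        # x <- (x - mu) / sigma, i.e. exactly the formula above.
        data = StandardScaler().fit_transform(data)
    return data
```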

Now we want to perform k-means clustering. Implement the `perform_kmeans_clustering` function following these steps (a sketch follows the list):
3. Use `sklearn.cluster.KMeans` to train on the given data. Set the parameter `init`, which controls the initialization of the cluster centers, to `random`. There is a better way to set this value, but we will discuss that in Task 3.
4. Retrieve the cluster centers and predict the cluster index for each point.
5. Return the inertia as a float, plus the cluster centers and the predicted cluster indices, each as an array.
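
A minimal sketch of this function; the signature is assumed here and may differ from the skeleton in the repo:

```python
import numpy as np
from sklearn.cluster import KMeans

def perform_kmeans_clustering(data: np.ndarray, n_clusters: int):
    """Cluster `data` with randomly initialized k-means.

    Returns (inertia, centers, labels).
    """
    kmeans = KMeans(n_clusters=n_clusters, init="random", n_init=10)
    labels = kmeans.fit_predict(data)   # cluster index for every point
    centers = kmeans.cluster_centers_   # shape (n_clusters, n_features)
    return float(kmeans.inertia_), centers, labels
```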

Go back to the `plot_kmeans_clustering` function and finish the remaining TODOs:

6. Call the `perform_kmeans_clustering` function three times. Visualize the data points, cluster centers and the assignment of data points to cluster centers in a scatter plot. To do this, use a for-loop and `scatter_clusters_2d` to plot all the results in one figure (have a closer look at matplotlib subplots and axes; a sketch follows the list).
7. Return the figure object.

8. Now have a look at the `main` function. The default number $k$ of clusters is not optimal. Experiment with different values and set the number of k-means clusters you want to use.
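
A possible plotting loop, continuing the sketch above; the signature assumed for the exercise helper `scatter_clusters_2d` (axis, data, centers, labels) is a guess, so check the helper in the repo:

```python
import matplotlib.pyplot as plt

# `data` and `k` are assumed to be defined as in the previous steps.
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for ax in axes:
    # Each call uses a fresh random initialization, so results may differ.
    inertia, centers, labels = perform_kmeans_clustering(data, n_clusters=k)
    scatter_clusters_2d(ax, data, centers, labels)
    ax.set_title(f"inertia = {inertia:.2f}")
```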

Note: Data preprocessing benefits greatly from expert knowledge of the field/application in which the data was measured. Some preprocessing methods may not be applicable in certain settings.

#### Decision Boundaries

Recall that k-means assigns a given point $x$ to a center $c_i$ if there is no center $c_j$ with a smaller squared Euclidean distance. This corresponds to the cells of a Voronoi diagram and could look like this:

![Voronoi diagram](./figures/Euclidean_Voronoi_diagram.svg)

Each cell in this diagram is the set of points which are closest to one particular center:

$$R_j = \{x \in X \mid d(x, c_j) \leq d(x, c_i) \text{ for all } i \neq j\}.$$

A Voronoi diagram can be used to visualize the boundaries of the k-means clusters, and it is also a helpful tool for understanding the algorithm itself.

9. Navigate to the `plot_decision_boundaries` function and load, preprocess and cluster the synthetic data using the function `perform_kmeans_clustering` again.
10. Use `Voronoi` and `voronoi_plot_2d` from the `scipy.spatial` package to visualize the boundaries of the k-means clusters. Again use the `ax` object of the plot and `scatter_clusters_2d` (see the sketch after this list).
11. Test your code with the test framework of vscode or by typing `nox -r -s test` in your terminal.
12. (Optional) Which assumptions/limitations of the k-Means algorithm are illustrated by this visualization?
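
A minimal sketch of the Voronoi overlay, reusing the assumed signatures from the sketches above:

```python
import matplotlib.pyplot as plt
from scipy.spatial import Voronoi, voronoi_plot_2d

inertia, centers, labels = perform_kmeans_clustering(data, n_clusters=k)

fig, ax = plt.subplots(figsize=(6, 6))
vor = Voronoi(centers)  # one Voronoi cell per cluster center
voronoi_plot_2d(vor, ax=ax, show_vertices=False)
scatter_clusters_2d(ax, data, centers, labels)  # overlay points and centers
```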

### Task 2: Data compression - Color Quantization

A common application of clustering is data compression, where large amounts of data are reduced by extracting and keeping only the "important" aspects (i.e., cluster centers, covariance matrices and weights). This technique is useful if you cannot store or transmit all the measurements, or if a tool/function you want to use in your analysis has a runtime that makes it infeasible to run on thousands or millions of data points. k-means itself can take a long time on large datasets; if you want to speed things up, consider `MiniBatchKMeans`, which performs k-means clustering on multiple random subsets of the entire data.

In this task the goal is to reduce the storage requirement of an image with width $w$ and height $h$ from $3\cdot w\cdot h$ values to $w\cdot h + 3\cdot k$ values via clustering:

1. Open the file `src/ex2_image_compression.py`. The image is loaded in the `main` function using the `load_image` function. Inspect the `input_img` variable and print the information about its dimensions.

Implement the `compress_colorspace` function using the k-means algorithm (a sketch follows the list):
2. Reshape the input image into $(w\cdot h, 3)$ to perform clustering on colors.
3. Use `MiniBatchKMeans` to cluster the image into $k$ clusters.
4. Return a compressed image in which the number of unique colors is reduced from $256^3$ to $k$ via k-means clustering. The compressed image must have the same shape as the original one.

5. Use `compress_colorspace` in your `main` function to compress the image for $k \in \{2,8,64,256\}$ and plot the results using `imshow`. Set the corresponding value of $k$ as the title of each result.

6. Test your code with the test framework of vscode or by typing `nox -r -s test` in your terminal.
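
One possible solution sketch, assuming an `(h, w, 3)` integer image (the function name matches the exercise, the body is illustrative):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def compress_colorspace(image: np.ndarray, n_colors: int) -> np.ndarray:
    """Reduce an (h, w, 3) image to `n_colors` unique colors via k-means."""
    h, w, _ = image.shape
    pixels = image.reshape(-1, 3)        # (w*h, 3): one RGB sample per pixel
    kmeans = MiniBatchKMeans(n_clusters=n_colors)
    labels = kmeans.fit_predict(pixels)  # nearest center per pixel
    # Replace every pixel by the color of its cluster center.
    compressed = kmeans.cluster_centers_[labels]
    return compressed.reshape(h, w, 3).astype(image.dtype)
```

Storing the per-pixel labels (one index each) plus the $k$ center colors is exactly the $w\cdot h + 3\cdot k$ representation described above.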

### Task 3 (Optional): k-Means++

As mentioned above, Lloyd's algorithm requires an initial set of centers. Looking more closely at the sklearn documentation, the `init` argument lets us use uniformly randomly selected points, something called "k-means++", or a user-defined array as the initial set of centers. The main contribution of k-means++ is a clever strategy for choosing the initial centers:
***
1. Choose a point $x_1 \in P$ uniformly at random, set $C^1 = \{ x_1 \}$.
2. **for** $i = 2$ to $k$ **do**:
3. $\qquad$ Draw a point $x_i \in P$ according to the probability distribution

$$\frac{\min_{c \in C^{i-1}} \lVert x-c \rVert_2^2}{\sum_{y \in P} \min_{c \in C^{i-1}} \lVert y - c \rVert_2^2}$$

4. $\qquad$ Set $C^{i} = C^{i-1} \cup \{x_i\}$.
5. **end for**
***

Navigate into `src/ex3_kmeans_plus_plus.py` and have a look at the code.

1. Implement the `uniform_sampling` function by drawing points uniformly from the dataset.
2. Implement the `d2_sampling` function using the $D^2$ sampling algorithm described above (a sketch follows the list).
3. Compare the results on the two datasets by executing the script with `python ./src/ex3_kmeans_plus_plus.py`. Which advantages does $D^2$ sampling provide as an initialization?
**Hint**: (Weighted) sampling with and without replacement can be performed using `np.random.choice`.
4. Test your code with the test framework of vscode or by typing `nox -r -s test` in your terminal.
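
One way to write the $D^2$ sampling step with `np.random.choice`; the signature is assumed and may differ from the skeleton:

```python
import numpy as np

def d2_sampling(data: np.ndarray, k: int) -> np.ndarray:
    """Pick k initial centers via D^2 sampling (the k-means++ seeding step)."""
    n = len(data)
    centers = [data[np.random.choice(n)]]  # first center: uniform over the data
    for _ in range(k - 1):
        # Squared distance of every point to its nearest chosen center.
        diffs = data[:, None, :] - np.asarray(centers)[None, :, :]
        d2 = (diffs ** 2).sum(axis=-1).min(axis=1)
        probs = d2 / d2.sum()  # the D^2 distribution from the pseudocode
        centers.append(data[np.random.choice(n, p=probs)])
    return np.asarray(centers)
```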

### Task 4 (Optional): Comparison between k-Means and Gaussian Mixture Models

Navigate into `src/ex4_gmm.py` and have a look at the code. We create a synthetic dataset with three classes (the same one we used in the lecture) and want to compare k-means clustering and GMMs. If you want, you can use the diabetes dataset from Day 04 instead. Implement the TODOs in the file.
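
For orientation, fitting both models side by side might look like this, assuming `X` holds the synthetic features (the TODOs in the file define the actual workflow):

```python
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

kmeans_labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)

gmm = GaussianMixture(n_components=3, covariance_type="full").fit(X)
gmm_labels = gmm.predict(X)
# Unlike k-means, a GMM also yields soft assignments and covariances:
responsibilities = gmm.predict_proba(X)  # shape (n_samples, 3)
```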

data/images/saint_sulpice.jpg

991 KB (binary file not shown)

Three further binary files (12.6 KB, 26.7 KB, 15.8 KB) not shown.
