IE MBD OCT 2018 | ADVANCED PYTHON | Group H | May 2019
Leandro Handal | Martin Hofbauer | Ashley O'Mahony | Gerald Walravens
ie_pandas is a Python package developed for a group project. It allows the user to create DataFrame objects, in a pandas style.
All the files of this project are saved in a GitHub repository.
DataFrame objects can be created from:
- a list
- a numpy array
- a list of lists
- a dictionary of lists
- a dictionary of numpy arrays
DataFrame can handle data of types:
- int: integer (1, 2, 3...)
- float: decimal (1.0, 1.1, 1.2...)
- bool: boolean (True, False)
- string: alphanumeric (a, b, c...)
Different columns can hold different data types, but all values within a single column must share the same data type.
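The consistency rule can be illustrated with a small check. This is only a sketch, not the validation code used by ie_pandas: the helper name column_kind is hypothetical, and it assumes the column is a plain Python list.

```python
def column_kind(values):
    """Return the shared type name of a column, or raise if types are mixed.

    Hypothetical helper for illustration; not part of ie_pandas.
    """
    # type(True) is bool, so bool and int columns are told apart here.
    kinds = {type(v) for v in values}
    if len(kinds) != 1:
        names = sorted(k.__name__ for k in kinds)
        raise TypeError(f"column must hold exactly one type, found: {names}")
    return kinds.pop().__name__


print(column_kind([1, 2, 3]))         # int
print(column_kind([True, False]))     # bool

try:
    column_kind([1, "a", 2.5])
except TypeError as err:
    print("rejected:", err)
```

An empty list is also rejected by this sketch, because no type can be inferred from it.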
DataFrame objects are structured with:
- .df: contains the data of the DataFrame object, stored in a dictionary.
- .colindex: contains the column names, stored in a list of strings.
- .rowindex: contains the row names, stored in a list of strings.
The data of DataFrame objects can be accessed by:
- df[1] or df['C1']: a dictionary-style way to call columns, using the column numerical index or the column name. Column contents are returned in a numpy array.
- .get_row(1) or .get_row('R1'): a method to call rows, using the row numerical index or the row name. Row contents are returned in a list.
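To make the structure and the two access paths concrete, here is a minimal stand-in class. It is a sketch under assumptions, not the ie_pandas source: the class name MiniFrame is hypothetical, numerical indexes are assumed to be 0-based, and default row names are assumed to be the row positions as strings.

```python
import numpy as np


class MiniFrame:
    """Hypothetical stand-in mimicking the attributes described above."""

    def __init__(self, data, rowindex=None):
        # .df: column name -> numpy array holding the column values
        self.df = {name: np.array(col) for name, col in data.items()}
        # .colindex: column names as a list of strings
        self.colindex = list(self.df)
        n_rows = len(next(iter(self.df.values())))
        # .rowindex: row names (assumed default: positions as strings)
        self.rowindex = rowindex or [str(i) for i in range(n_rows)]

    def __getitem__(self, key):
        # An integer key is a column position, anything else a column name.
        name = self.colindex[key] if isinstance(key, int) else key
        return self.df[name]

    def get_row(self, key):
        # An integer key is a row position, anything else a row name.
        pos = key if isinstance(key, int) else self.rowindex.index(key)
        # .item() converts numpy scalars back to plain Python values.
        return [self.df[name][pos].item() for name in self.colindex]


frame = MiniFrame({"C1": [1, 2, 3], "C2": [1.5, 2.5, 3.5]},
                  rowindex=["R1", "R2", "R3"])
print(frame[0])             # column by position, as a numpy array
print(frame["C2"])          # column by name, as a numpy array
print(frame.get_row(1))     # row by position, as a list
print(frame.get_row("R2"))  # row by name, as a list
```

Columns come back as numpy arrays because each column has a single data type, while rows come back as lists because a row can mix types.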
Calculations can be done on DataFrame numerical columns (data types int, float or bool):
- .sum(): provides the sum of the elements of each column,
- .min(): provides the minimum value of the elements of each column,
- .max(): provides the maximum value of the elements of each column,
- .mean(): provides the mean value of the elements of each column,
- .median(): provides the median value of the elements of each column.
Calculation results are returned in a list.
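The behaviour described above (string columns skipped, one result per numerical column, results in a list) can be sketched with plain numpy. This is an illustration under assumptions rather than the package's own code: the helper aggregate is hypothetical, and numerical columns are detected through the numpy dtype kind.

```python
import numpy as np

columns = {
    "int": np.array([1, 2, 3, 4, 5]),
    "float": np.array([1.1, 2.2, 3.3, 4.4, 5.5]),
    "bool": np.array([True, False, True, False, True]),
    "str": np.array(["a", "b", "c", "d", "e"]),
}


def aggregate(columns, func):
    """Apply func to every numerical column and collect the results in a list.

    Hypothetical helper; dtype kinds b, i, u, f stand for
    bool, signed int, unsigned int and float.
    """
    return [func(col).item()
            for col in columns.values()
            if col.dtype.kind in "biuf"]


sums = aggregate(columns, np.sum)    # one value per numerical column
maxima = aggregate(columns, np.max)
means = aggregate(columns, np.mean)
print(sums)
print(maxima)
print(means)
```

Boolean columns count True as 1 and False as 0, so the sum of a boolean column is the number of True values.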
The Jupyter Notebook Demo.ipynb provides some usage examples of DataFrame.
ie_pandas is developed in Python 3.7 and uses the package numpy (version 1.16 at the time of development).
To install the package with Terminal (macOS):
- Navigate to the directory ie_pandas with cd
- Run pip install --editable .
- Check if the package has been installed using conda list
To import the package in Python, use from ie_pandas import DataFrame.
The environment only_np.yml has been prepared to avoid uncontrolled dependencies during development. Available in /environment, it contains:
- Python | version 3.7.3
- Pip | version 19.0.3
- Numpy | version 1.16.2
- Pytest | version 4.4.1
- Pytest-cov | version 2.6.1
- Black | version 19.3b0
- Jupyter | version 1.0.0
To install the environment in Anaconda with Terminal (macOS), run conda env create -f only_np.yml.
Once the environment is installed in Anaconda, it should be activated with conda activate only_np.
To install the package with Terminal (macOS):
- Navigate to the directory ie_pandas with cd
- Run pip install --editable .
- Check if the package has been installed using conda list
For modifications to the package code to take effect, re-install the package with pip install -e . from the ie_pandas directory.
To import the package in Python, use from ie_pandas import DataFrame.
For testing, install the packages with Terminal (macOS):
- pytest with pip install pytest
- pytest-cov with pip install pytest-cov
To run the tests and check their coverage, run:
pytest --cov=ie_pandas 'test' --cov-report term-missing -vv
Creating a DataFrame by passing a dictionary of lists, letting ie_pandas create default integer indexes:
obj = {'str':['a', 'b', 'c', 'd', 'e'],
'int':[1, 2, 3, 4, 5],
'float':[1.1, 2.2, 3.3, 4.4, 5.5],
'bool':[True, False, True, False, True]}
df = DataFrame(obj)
df
Out[1]: str int float bool
0 a 1 1.1 True
1 b 2 2.2 False
2 c 3 3.3 True
3 d 4 4.4 False
4 e 5 5.5 True
Creating a DataFrame by passing a list of lists, defining column names:
obj = [['a', 'b', 'c', 'd', 'e'],
[1, 2, 3, 4, 5],
[1.1, 2.2, 3.3, 4.4, 5.5],
[True, False, True, False, True]]
df = DataFrame(obj,
colindex = ['STRING', 'INTEGER', 'FLOAT', 'BOOLEAN'])
df
Out[1]: STRING INTEGER FLOAT BOOLEAN
0 a 1 1.1 True
1 b 2 2.2 False
2 c 3 3.3 True
3 d 4 4.4 False
4 e 5 5.5 True
Creating a DataFrame by passing a dictionary of numpy arrays, defining column and row names:
import numpy as np
obj = {'str':np.array(['a', 'b', 'c', 'd', 'e']),
'int':np.array([1, 2, 3, 4, 5]),
'float':np.array([1.1, 2.2, 3.3, 4.4, 5.5]),
'bool':np.array([True, False, True, False, True])}
df = DataFrame(obj,
colindex = ['STRING', 'INTEGER', 'FLOAT', 'BOOLEAN'],
rowindex = ['A', 'B', 'C', 'D', 'E'])
df
Out[1]: STRING INTEGER FLOAT BOOLEAN
A a 1 1.1 True
B b 2 2.2 False
C c 3 3.3 True
D d 4 4.4 False
E e 5 5.5 True
Creating a DataFrame by passing a dictionary of lists as rows, defining column names:
obj = {'REC1':[1, 1, 1, 1],
'REC2':[2, 2, 2, 2],
'REC3':[3, 3, 3, 3],
'REC4':[4, 4, 4, 4]}
df = DataFrame(obj,
axis = 1,
colindex = ['W', 'X', 'Y', 'Z'])
df
Out[1]: W X Y Z
REC1 1 1 1 1
REC2 2 2 2 2
REC3 3 3 3 3
REC4 4 4 4 4
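With axis = 1, each dictionary entry is read as a row rather than a column. One way to picture the conversion is a transposition with zip. This is a sketch of the idea only, not the package's implementation, and the variable names are chosen for illustration.

```python
# Row-oriented input: each key is a row name, each list holds one row.
rows = {
    "REC1": [1, 1, 1, 1],
    "REC2": [2, 2, 2, 2],
    "REC3": [3, 3, 3, 3],
    "REC4": [4, 4, 4, 4],
}
colindex = ["W", "X", "Y", "Z"]

# The dictionary keys become the row names.
rowindex = list(rows)

# zip(*rows.values()) regroups the i-th element of every row into one tuple,
# which is exactly the i-th column.
columns = {name: list(values)
           for name, values in zip(colindex, zip(*rows.values()))}

print(rowindex)
print(columns["W"])
```

Each resulting column holds one value per row, in the order of the dictionary keys, which matches the table shown above.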