New version after major refactoring
JavaScriptDude authored May 5, 2022
1 parent 5aa62af commit bc20ec9
Showing 9 changed files with 467 additions and 132 deletions.
58 changes: 30 additions & 28 deletions README.md
Expand Up @@ -12,27 +12,29 @@ python3 -m pip install multisort
None

### Performance
Average over 10 iterations with 500 rows.
Average over 10 iterations with 1000 rows.
Test | Secs
---|---
cmp_func|0.0054
pandas|0.0061
reversor|0.0149
msorted|0.0179
superfast|0.0005
multisort|0.0035
pandas|0.0079
cmp_func|0.0138
reversor|0.037

As you can see, `cmp_func` is by far the fastest methodology as long as the table is around 500 rows by 5 columns. However, for larger data sets, `pandas` is the performance winner and scales extremely well. In such large data set cases, where performance is key, `pandas` should be the first choice.
Hands down the fastest is the `superfast` methodology shown below. You do not need this library to accomplish this, as it's just core Python.

The surprising thing from testing is that `cmp_func` far outperforms `reversor`, which is the only other methodology for multi-column sorting that can handle `NoneType` values.
`multisort` from this library gives reasonable performance for large data sets; e.g. it's faster than pandas up to about 5,500 records. It is also much simpler to read and write, and it has error handling that does its best to give useful error messages.

### Note on `NoneType` and sorting
If your data may contain `None`, it would be wise to ensure your sort algorithm is tuned to handle it. This is because `sorted` uses `<` comparisons, which are not supported by `NoneType`. For example, the following error will result: `TypeError: '<' not supported between instances of 'NoneType' and 'str'`.
If your data may contain `None`, it would be wise to ensure your sort algorithm is tuned to handle it. This is because `sorted` uses `<` comparisons, which are not supported by `NoneType`. For example, the following error will result: `TypeError: '<' not supported between instances of 'NoneType' and 'str'`. All examples given on this page are tuned to handle `None` values.
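
The failure described in this note, and one common core-Python workaround, can be sketched as follows. This is an illustration only, not code from this library: `none_safe` is a hypothetical helper name and the sample values are made up. The sketch places `None` values last.

```python
grades = ["B", None, "A", "C"]

# Plain sorted() compares items with `<`, which None does not support.
try:
    sorted(grades)
    raised = False
except TypeError:
    raised = True


def none_safe(value):
    """Hypothetical helper: wrap a value in a tuple that always compares."""
    # The first element orders non-None values (False) ahead of None (True);
    # the second element is only compared when the first elements are equal.
    return (value is None, value)


safe_sorted = sorted(grades, key=none_safe)
print(raised)       # True
print(safe_sorted)  # ['A', 'B', 'C', None]
```

The tuple key works because tuple comparison stops at the first differing element, so a `None` is never compared against a string with `<`.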

### Methodologies
Method|Descr|Notes
---|---|---
cmp_func|Multi column sorting in the model `java.util.Comparator`|Fastest for small to medium size data
reversor|Enable multi column sorting with column specific reverse sorting|Medium speed. [Source](https://stackoverflow.com/a/56842689/286807)
msorted|Simple one-liner designed after `multisort` [example from python docs](https://docs.python.org/3/howto/sorting.html#sort-stability-and-complex-sorts)|Slowest of the bunch but not by much
multisort|Simple one-liner designed after `multisort` [example from python docs](https://docs.python.org/3/howto/sorting.html#sort-stability-and-complex-sorts)|Second fastest of the bunch but most configurable and easy to read.
cmp_func|Multi column sorting in the model `java.util.Comparator`|Reasonable speed
reversor|Enable multi column sorting with column specific reverse sorting|Medium speed. [Source](https://stackoverflow.com/a/56842689/286807)
superfast|NoneType safe sample implementation of multi column sorting as mentioned in [example from python docs](https://docs.python.org/3/howto/sorting.html#sort-stability-and-complex-sorts)|Fastest by orders of magnitude but a bit more complex to write.
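
The multi-pass technique from the Python sorting how-to that the `superfast` row refers to can be sketched roughly like this. It is a minimal illustration, not this library's code: the sample rows and the `none_last` helper are assumptions made for the example. Because Python's sort is stable, sorting by the least significant column first and the most significant column last yields a multi-column ordering.

```python
rows = [
    {"name": "ann", "grade": "B", "attend": 90},
    {"name": "bob", "grade": None, "attend": 85},
    {"name": "cat", "grade": "A", "attend": 70},
    {"name": "dan", "grade": "B", "attend": 60},
    {"name": "eve", "grade": "A", "attend": None},
]


def none_last(column):
    """Hypothetical helper: build a None-safe key function for one column."""
    def key(row):
        value = row[column]
        # (False, value) sorts before (True, None) in ascending order.
        return (value is None, value)
    return key


# Pass 1: secondary column, ascending.
rows_sorted = sorted(rows, key=none_last("attend"))
# Pass 2: primary column, descending. The sort is stable, so rows with the
# same grade keep the attend order from pass 1. Note that reverse=True also
# moves the None group to the front.
rows_sorted.sort(key=none_last("grade"), reverse=True)

names = [row["name"] for row in rows_sorted]
print(names)  # ['bob', 'dan', 'ann', 'cat', 'eve']
```

Each pass is a single built-in sort with a cheap key function, which is one plausible reason this style benchmarks well; the cost is that the caller has to order the passes and handle `None` by hand.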




Expand All @@ -49,39 +51,39 @@ rows_dict = [
]
```

### `msorted`
### `multisort`
Sort rows_dict by _grade_, descending, then _attend_, ascending and put None first in results:
```
from multisort import msorted
rows_sorted = msorted(rows_dict, [
('grade', {'reverse': False, 'none_first': True})
from multisort import multisort
rows_sorted = multisort(rows_dict, [
('grade', {'reverse': False})
,'attend'
])
```

Sort rows_dict by _grade_, descending, then _attend_ and call upper() for _grade_:
```
from multisort import msorted
rows_sorted = msorted(rows_dict, [
('grade', {'reverse': False, 'clean': lambda s:None if s is None else s.upper()})
from multisort import multisort
rows_sorted = multisort(rows_dict, [
('grade', {'reverse': False, 'clean': lambda s: None if s is None else s.upper()})
,'attend'
])
```
`msorted` parameters:
`multisort` parameters:
option|dtype|description
---|---|---
`key`|int or str|Key to access data. int for tuple or list
`spec`|str, int, list|Sort specification. Can be as simple as a column key / index
`reverse`|bool|Reverse order of final sort (default = False)

`msorted` `spec` options:
`multisort` `spec` options:
option|dtype|description
---|---|---
reverse|bool|Reverse sort of column
clean|func|Function / lambda to clean the value
none_first|bool|If True, None will be at top of sort. Default is False (bottom)
clean|func|Function / lambda to clean the value. These calls can cause a significant slowdown.
required|bool|Default True. If false, will substitute None or default if key not found (not applicable for list or tuple rows)
default|any|Value to substitute if required==False and key does not exist or None is found. Can be used to achieve similar functionality to pandas `na_position`
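
How a substituted default can control where missing values land, in the spirit of pandas `na_position`, can be sketched in plain Python. This is not the library's implementation: `column_key` and the sample rows are hypothetical, and the sketch assumes the default is comparable with the other values in the column.

```python
def column_key(key, default=None):
    """Hypothetical helper: key function that substitutes a default."""
    def _key(row):
        value = row.get(key)  # a missing key yields None
        if value is None:
            value = default
        return value
    return _key


rows = [
    {"name": "ann", "attend": 90},
    {"name": "bob", "attend": None},
    {"name": "cat"},
    {"name": "dan", "attend": 60},
]

# A very small default pushes missing values to the top...
na_first = sorted(rows, key=column_key("attend", default=float("-inf")))
# ...and a very large default pushes them to the bottom.
na_last = sorted(rows, key=column_key("attend", default=float("inf")))

print([row["name"] for row in na_first])  # ['bob', 'cat', 'dan', 'ann']
print([row["name"] for row in na_last])   # ['dan', 'ann', 'bob', 'cat']
```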



Expand Down Expand Up @@ -134,7 +136,7 @@ rows_obj = [
]
```

### `msorted`
### `multisort`
(Same syntax as with 'dict' example)


Expand Down Expand Up @@ -177,11 +179,11 @@ rows_tuple = [
(COL_IDX, COL_NAME, COL_GRADE, COL_ATTEND) = range(0,4)
```

### `msorted`
### `multisort`
Sort rows_tuple by _grade_, descending, then _attend_, ascending and put None first in results:
```
from multisort import msorted
rows_sorted = msorted(rows_tuple, [
from multisort import multisort
rows_sorted = multisort(rows_tuple, [
(COL_GRADE, {'reverse': False, 'none_first': True})
,COL_ATTEND
])
Expand Down Expand Up @@ -218,6 +220,6 @@ rows_sorted = sorted(rows_tuple, key=cmp_func(cmp_student), reverse=True)
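
For readers unfamiliar with comparator-style sorting, the general pattern behind `functools.cmp_to_key` (which `cmp_func` aliases in this commit) can be sketched as below. The comparator body, the `compare_values` helper and the sample tuples are assumptions written for this illustration; they are not the `cmp_student` function from the README.

```python
from functools import cmp_to_key

(COL_IDX, COL_NAME, COL_GRADE, COL_ATTEND) = range(0, 4)

rows = [
    (0, "ann", "B", 90),
    (1, "bob", None, 85),
    (2, "cat", "A", 70),
    (3, "dan", "B", 60),
    (4, "eve", "A", 95),
]


def compare_values(a, b):
    """Three-way compare that orders None after every other value."""
    if a is None and b is None:
        return 0
    if a is None:
        return 1
    if b is None:
        return -1
    return (a > b) - (a < b)


def compare_students(a, b):
    """Hypothetical comparator: grade ascending, then attend descending."""
    by_grade = compare_values(a[COL_GRADE], b[COL_GRADE])
    if by_grade != 0:
        return by_grade
    # Swapping the arguments reverses the order for this column
    # (it would also move any None in this column to the front).
    return compare_values(b[COL_ATTEND], a[COL_ATTEND])


rows_sorted = sorted(rows, key=cmp_to_key(compare_students))
names = [row[COL_NAME] for row in rows_sorted]
print(names)  # ['eve', 'cat', 'ann', 'dan', 'bob']
```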
### Tests / Samples
Name|Descr|Other
---|---|---
tests/test_msorted.py|msorted unit tests|-
tests/test_multisort.py|multisort unit tests|-
tests/performance_tests.py|Tunable performance tests using asyncio | requires pandas
tests/hand_test.py|Hand testing|-
20 changes: 20 additions & 0 deletions dot.vscode/launch.json
@@ -0,0 +1,20 @@
{
    // Use IntelliSense to learn about possible attributes.
    // Hover to view descriptions of existing attributes.
    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Python: Current File",
            "type": "python",
            "request": "launch",
            "console": "integratedTerminal",
            "justMyCode": true,
            // "program": "tests/hand_test.py",
            // "program": "tests/performance_tests.py",
            // "program": "tests/perf_tests_2.py",
            "program": "tests/test_multisort.py",
            // "args": ["DictTests.test_list_of_dicts"]
        }
    ]
}
6 changes: 6 additions & 0 deletions dot.vscode/settings.json
@@ -0,0 +1,6 @@
{
    "python.envFile": "${workspaceFolder}/dev.env",
    "python.linting.pylintEnabled": false,
    "python.linting.flake8Enabled": true,
    "python.linting.enabled": true
}
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[tool.poetry]
name = "multisort"
version = "0.1.1"
version = "0.1.2"
description = "NoneType Safe Multi Column Sorting For Python"
license = "MIT"
authors = ["Timothy C. Quinn"]
Expand Down
2 changes: 1 addition & 1 deletion src/multisort/__init__.py
@@ -1 +1 @@
from .multisort import msorted, cmp_func, reversor
from .multisort import multisort, cmp_func, reversor
140 changes: 86 additions & 54 deletions src/multisort/multisort.py
Expand Up @@ -11,7 +11,7 @@
cmp_func = cmp_to_key


# .: msorted :.
# .: multisort :.
# spec is a list one of the following
# <key>
# (<key>,)
Expand All @@ -21,70 +21,102 @@
# <opts> dict. Options:
# reverse: opt - reversed sort (defaults to False)
# clean: opt - callback to clean / alter data in 'field'
# none_first: opt - If True, None will be at top of sort. Default is False (bottom)
class Comparator:
@classmethod
def new(cls, *args):
if len(args) == 1 and isinstance(args[0], (int,str)):
_c = Comparator(spec=args[0])
def multisort(rows, spec, reverse:bool=False):
key=clean=rows_sorted=default=None
col_reverse=False
required=True
for s_c in reversed([spec] if isinstance(spec, (int, str)) else spec):
if isinstance(s_c, (int, str)):
key = s_c
else:
_c = Comparator(spec=args)
return cmp_to_key(_c._compare_a_b)
if len(s_c) == 1:
key = s_c[0]
elif len(s_c) == 2:
key = s_c[0]
s_opts = s_c[1]
assert not s_opts is None and isinstance(s_opts, dict), f"Invalid Spec. Second value must be a dict. Got {getClassName(s_opts)}"
col_reverse = s_opts.get('reverse', False)
clean = s_opts.get('clean', None)
default = s_opts.get('default', None)
required = s_opts.get('required', True)

def __init__(self, spec):
if isinstance(spec, (int, str)):
self.spec = ( (spec, False, None, False), )
else:
a=[]
for s_c in spec:
if isinstance(s_c, (int, str)):
a.append((s_c, None, None, False))
else:
assert isinstance(s_c, tuple) and len(s_c) in (1,2),\
f"Invalid spec. Must have 1 or 2 params per record. Got: {s_c}"
if len(s_c) == 1:
a.append((s_c[0], None, None, False))
elif len(s_c) == 2:
s_opts = s_c[1]
assert not s_opts is None and isinstance(s_opts, dict), f"Invalid Spec. Second value must be a dict. Got {getClassName(s_opts)}"
a.append((s_c[0], s_opts.get('reverse', False), s_opts.get('clean', None), s_opts.get('none_first', False)))

self.spec = a

def _compare_a_b(self, a, b):
if a is None: return 1
if b is None: return -1
for k, desc, clean, none_first in self.spec:
def _sort_column(row): # Throws MSIndexError, MSKeyError
ex1=None
try:
try:
va = a[k]; vb = b[k]
v = row[key]
except Exception as ex:
va = getattr(a, k); vb = getattr(b, k)

except Exception as ex:
raise KeyError(f"Key {k} is not available in object(s) given a: {a.__class__.__name__}, b: {a.__class__.__name__}")
ex1 = ex
v = getattr(row, key)
except Exception as ex2:
if isinstance(row, (list, tuple)): # failfast for tuple / list
raise MSIndexError(ex1.args[0], row, ex1)

if clean:
va = clean(va)
vb = clean(vb)
elif required:
raise MSKeyError(ex2.args[0], row, ex2)

if va != vb:
if va is None: return -1 if none_first else 1
if vb is None: return 1 if none_first else -1
if desc:
return -1 if va > vb else 1
else:
return 1 if va > vb else -1
if default is None:
v = None
else:
v = default

if default:
if v is None: return default
return clean(v) if clean else v
else:
if v is None: return True, None
if clean: return False, clean(v)
return False, v

try:
if rows_sorted is None:
rows_sorted = sorted(rows, key=_sort_column, reverse=col_reverse)
else:
rows_sorted.sort(key=_sort_column, reverse=col_reverse)


except Exception as ex:
msg=None
row=None
key_is_int=isinstance(key, int)

if isinstance(ex, MultiSortBaseExc):
row = ex.row
if isinstance(ex, MSIndexError):
msg = f"Invalid index for {row.__class__.__name__} row of length {len(row)}. Row: {row}"
else: # MSKeyError
msg = f"Invalid key/property for row of type {row.__class__.__name__}. Row: {row}"
else:
msg = ex.args[0]

raise MultiSortError(f"""Sort failed on key {"int" if key_is_int else "str '"}{key}{'' if key_is_int else "' "}. {msg}""", row, ex)


return reversed(rows_sorted) if reverse else rows_sorted


return 0
class MultiSortBaseExc(Exception):
def __init__(self, msg, row, cause):
self.message = msg
self.row = row
self.cause = cause

class MSIndexError(MultiSortBaseExc):
def __init__(self, msg, row, cause):
super(MSIndexError, self).__init__(msg, row, cause)

class MSKeyError(MultiSortBaseExc):
def __init__(self, msg, row, cause):
super(MSKeyError, self).__init__(msg, row, cause)

def msorted(rows, spec, reverse:bool=False):
if isinstance(spec, (int, str)):
_c = Comparator.new(spec)
else:
_c = Comparator.new(*spec)
return sorted(rows, key=_c, reverse=reverse)
class MultiSortError(MultiSortBaseExc):
def __init__(self, msg, row, cause):
super(MultiSortError, self).__init__(msg, row, cause)
def __str__(self):
return self.message
def __repr__(self):
return f"<MultiSortError> {self.__str__()}"

# For use in the multi column sorted syntax to sort by 'grade' and then 'attend' descending
# dict example:
Expand Down