-
Notifications
You must be signed in to change notification settings - Fork 59
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Koala is getting increasingly difficult to manage -- re-write? #249
Comments
The current example of use: import logging
from new_koala import ModelCompiler
from new_koala import Model
from new_koala import Evaluator
logging.basicConfig(level=logging.DEBUG)
filename = 'reader.xlsm'
compiler = ModelCompiler()
new_model = compiler.parse_excel_file(filename)
new_model.persist_to_json_file('brad_graph.json.gzip')
reconstituted_model = Model()
reconstituted_model.construct_from_json_file('brad_graph.json.gzip')
# reconstituted_model.draw_graph()
reconstituted_model.build_code()
evaluator = Evaluator(reconstituted_model)
val1 = evaluator.evaluate('First!A2')
print("value 'evaluated' for First!A2 without a formula:", val1)
val2 = evaluator.evaluate('Seventh!C1')
print("value 'evaluated' for Seventh!C1 with a formula:", val2)
val3 = evaluator.evaluate('Ninth!B1')
print("value 'evaluated' for Ninth!B1 with a defined name:", val3)
val4 = evaluator.evaluate('Hundred')
print("value 'evaluated' for Hundred with a defined name:", val4)
val5 = evaluator.evaluate('Tenth!C1')
print("value 'evaluated' for Tenth!C1 with a defined name:", val5)
val6 = evaluator.evaluate('Tenth!C2')
print("value 'evaluated' for Tenth!C2 with a defined name:", val6)
val7 = evaluator.evaluate('Tenth!C3')
print("value 'evaluated' for Tenth!C3 with a defined name:", val7) Current output (defined name "Hundred" is cell Eighth!B1):
|
Features I can see a pathway to support in the new approach:
|
Bradley van Ree <[email protected]> writes:
Is there an appetite to try and use a re-write for Koala moving
forward or is it best this becomes a new project?
I guess it depends on what you think should happen with existing
installs.
1. If you prefer people upgrade, then release this as a new major
version, increment the major version, and add instructions on what
changes are required to upgrade. Consider releasing alpha versions and
beta versions first as appropriate.
2. If you prefer people stay, then create a new project.
My suspicion is option 1 is probably going to be the preferred one.
Also I am curious, if this new rewrite is any faster then before?
--
Brian May <[email protected]>
https://linuxpenguins.xyz/brian/
|
Thanks for the input Brian. In short, I'd be leaning toward something like "option 1" as you've described. But my needs and intentions for a Koala style solution appear to be different from others here which is why I'm asking the question. It depends on how committed people are to the current use cases and what Koala is expected to become. I think I have been observing a movement toward viewing Koala as "replacing a spreadsheet" eg; openpyxl with a "calculate" button. But this is not the direction I've taken. I have been viewing Excel workbooks as definitions of a calculation (a model or part thereof). Where the model is an abstraction that just so happens to be expressed in the form of an Excel spreadsheet. So, in that sense, (as I see it) Koala isn't simply replacing the Excel calculation engine (eg; openpyxl with a "calculate" button) but is more a toolkit which uses Excel as an interface for defining parts or whole mathematical systems which can use Python as a gateway to evaluation. If used in one way it would achieve the purposes of openpyxl with a "go" button. But would offer significantly more for those who are inextricably bound to Excel but need more from their math. Now that I've written that, it really sounds like I've got an independent project... (again, depending on what Koala is understood to be and which direction it is heading) To be 100% honest, I've not been measuring for speed. My sole measure has been to get a solution which satisfies the features of roughly Koala 0.0.31 which can actually be worked on. My assumption has been that increasing speed (and other code progression) will be easier to achieve if;
The question of speed is interesting, though. Which element of the solution needs to be faster Depends on what you need to be faster, and what is meant by fast. I have broken the process down into steps.
After either step 1 and/or 2, the Model can be persisted to/from JSON files. So these steps only really need to be done once (unless the spreadsheet changes). And if we can implement Evaluator.evaluate_function("=Average(Sheet1!A1:A5)"), it may only ever need to be done once because this function could "freestyle"/ad-hoc/"interactively" calculate against the pre-existing model. Step 3, I'm guessing, is actually where you're asking for "speed". Step 1: Reading and translating Excel into PythonI am now using xml.etree.ElementTree to read the interesting parts of the Excel file rather than using custom regex and string processing. Also, providing you don't change the spreadsheet, translating the Excel workbook into a Python compatible state only has to happen once. You can persist the Python compatible state as JSON (I'm using JSON and jsonpickle). And, equally (obvjously?), load the Python compatible state from the JSON. So if you've got a very complex spreadsheet that took ages to translate, I expect it'll be easier/quicker to load from JSON. I'm hoping (for my needs), once the input and output concepts are implemented, the library will be able to load everything associated with one formula (ergo, one specified cell). This will optimize the Python compatible state. I can't see an easy way to optimize reading the Excel file in that situation, but the resulting Python compatible state would be optimized to contain only the cells, defined names, ranges and formulas associated in calculating that specified cell. Maybe this has to happen in tandem with "step 2"..? Step 2: Converting the formulas into PythonThis is reasonably quick because Step 1 has already parsed cells and recorded all the discovered formulas. I'm not convinced the process to tokenize and convert to Python code is a particularly bad bottleneck. Step 3: Evaluating the resultsI have broken the Excellib functions out into their own objects so it's possible to have multiple ways to define a function. There is a trap, Excel is occasionally incorrect in its calculations and people now count on those bugs so we may want strict Excel compatible modes along side mathematically correct modes. Many of the existing Excellib definitions ignore the faults in Excel and so, pedantically, may produce results different to Excel. We may also want implementations in vanilla Python, numpy/pandas, wrapped C++ libraries, and Jython. Currently this element of the process is going to rely on dictionary lookups to extract the values from cells and then package them in a way which the Excellib function can handle (this makes set_value() really easy to implement). As far as I'm aware, dictionaries in Python are optimized. I'm happy if someone wants to alter this paradigm to optimize for speed. But my current focus is on getting a maintainable project running so dicts are well good enough for me ATM. I'm expecting to use dataframes as the structures representing ranges to pass to the Excellib functions. Which aspect of speed are you interested in? |
Equally... There's no reason why the AST interpreting the formulas needs to write Python code. It can write any code you like. BUT the code it writes does need align with a Model object in the target language, the "Excellib" function library in the target language, and have a Evaluator object written in the target language. |
hello @bradbase, nice recap. As you can noticed, i'm not really active on koala and I'm really open to how it should be re-thought and maintained. Would your project be open-source ? |
Hi @vallettea, Thank you for your input. It is most appreciated. Hopefully my discussion here will answer questions from others as well. My understanding of koala has come from the doco. The presentation, the abstract found on the anthill site and the other writing in the docs directory along side reading the code. Knowing the history, and having faced the same challenges, explains everything. And as I've formed an understanding I've seen great inspiration. I absolutely agree with you about Excel being sub-optimal for defining a mathematical model. The limitations are excruciating. I hate it too. Finding ways around Excel has lead me to become an "expert"(/?) in it. Over the last 10 years I have been working as an analyst is the energy sector, programmer and a consulting to a demographics company with some time spent in telephone billing (accounts receivable - read big data). The skills these companies have valued the most has been in the space of mathematical modelling and communicating mathematical modelling. What I have observed is; each case the domain expert is usually not skilled in programming. This causes some challenges;
These challenges are compounded by the domain expert's time being too valuable for them to sufficiently learn programming. So companies turn to hiring analysts, programmers and contractors to get what they need from the investment they've made in the math expressed in Excel. When they learn it'll take 3 years to translate Excel to code which will tie them to a new world of technical debt and still not assure them of correctness and accuracy, it's deemed unworthy. Microsoft are trying to alleviate this problem taking Excel to the cloud. This may (or may not) eventuate in helping. Faster processing of Excel is only part of the answer. What is really useful is a common language to express math which can be understood by engineers, managers and programmers both within and across companies. The most important of all these positions, however, are auditors. The question of proof that the audited and approved Excel model has been precisely translated into another language so often comes down to "trust me" which is obscene. Accepting the significant deficiencies, currently Excel is the one language of expression which "everyone" knows. So, I have seen that using Koala can resolve the above challenges. They keys are:
The advantage of above has already been exampled in my project FlyingKoala. I can accept koala has been written to essentially replace the Excel calculation engine to gain speed and remove the translation problem. Eg; feature compatible with Excel, just faster -- openpyxl with a "go" button. But the features koala had (around 0.0.31 and before) were properly genius. Where you could;
These are the things that turn koala from "Excel but faster" into a really powerful tool to leverage domain experts and ease communication between all stakeholders. Sadly much of this was broken post version 0.0.31. My interests are such:
The best worked example of the principles I have documented: https://flyingkoala.readthedocs.io/en/latest/worked_example_horticulture.html Thanks again for your questions and input. |
Once the one-click installer is up, I'm will develop commercial solutions for companies to leverage FlyingKoala with PLPython and Python within MS products (eg; MSSQL Server and PowerBI). |
For those following: First order approximation of Range support using Dataframes as the unit for calculation in the "Excel_lib" functions (currently only SUM).
|
@vallettea and @brianmay and anyone lurking, the code I've written so far; https://github.com/bradbase/koala_xlcalculator It's not yet feature compatible with koala, but will need to be for it to be useful to anyone. EDIT: koala_xlcalculator is now known as xlcalculator and can be found here: |
Hello @bradbase I'm quite impressed by your work and I really think that there is an opportunity in this kind of product especially with someone like you who knows the internal needs of the sector. I haven't had time to benchmark flyingkoala but i think that if is does what does koala2 with pretty good performances, we should consider telling people to use flyingkoala instead. |
Hello @vallettea No problem on timing. The world is in considerable turmoil at the moment and these projects are the least of anyone's worries. The time you've taken to respond is especially appreciated. (for posterity COVID-19 era) Your feedback means a great deal. Thank you. In case our timing doesn't align and to be a pedant: FlyingKoala 0.0.3b is currently associated with Koala2 0.0.31. The code I've put forward as potential inspiration for "Koala2 re-write" is currently found as koala_xlcalcualtor and intended for eventual use with FlyingKoala . It is currently a long way from being feature complete against Koala2. I am working on that as a priority. For those who tread here: It's sounding like koala_xlcalcualtor may be on collision course to be re-named Koala3 once I (or the royal "we"?) reach feature parity. |
For those lurking: I have just release FlyingKoala 0.0.4b0. It's got a bunch of bugfixes and changes to make the add-in optional and actually compatible with Koala2 0.0.31. |
Hi Guys, We will start with Cryptomarkets but the idea could be easily translated for traditional markets. Ex. Buy Bitcoin with USD if BTC price is Up 2% within 45 Min...then... which MUST be written in a language compatible with the rest of the infrastucture. Ex. Do the previus buy only if the price is above the Max price of last 3 days. Do not execute the rule on friday... We end up with the only option: The final user needs to know at least basic excel logic. Problem is... can excel be wrapped in our infrustructure as it is ??? Ciao |
@LFoglietti, Thank you for your support and validation. I have a feeling the mechanics for the solution you are talking about may be encapsulated in FlyingKoala as it is a combination of two projects, xlwings and Koala2. The intent of FlyingKoala is to enable people like your target market to put systems like this together for themselves. Maybe the scale you are talking about requires assistance but I like to think the basics of defining the individual algorithms and setting up a set of rules would be achievable by the traders and analysts themselves. A worked example using two formula from horticulture can illustrate the concepts. If you are in financial analytics and don't know about xlwings it's worth becoming familiar. Here is their website. xlwings is a solution coming from the financial trading analytics world. At its core it is a COM wrapper but is incredibly inspired. The practicality of the solution provides two-way communication between Python and Excel on all important facets. I have found great use for xlwings well beyond finance and even well beyond quantitative analytics into qualitative analytics and even commercial data management. When coupling xlwings with Koala... Things become surprisingly powerful. Disclaimer: I am not affiliated with Zoomer Analytics or xlwings. The only involvement I have had with xlwings to date (besides using it) is I pioneered Anaconda support in the xlwings project. |
@bradbase, I looked into your worked example. Let's assume my analysis' result is: In terms of excel, every time I get a new data, it would be: This would be the difficult part because: Devil is often in details: like copy and pasting array formulas... Pm me [email protected] if you are interested in the project. Ciao |
@LFoglietti Thank you for the extension of the example. I have used Koala2 <= 0.0.31 in the way you describe. Using Koala2 version <= 0.0.31, or when feature compatible the re-write koala_xlcalculator, you can save the state. In Koala2 0.0.31 you can specify to "focus" on a particular output cell so the resulting state only relates to the cells involved with the formula you are interested in. This provides you with the ability to read the definition of a formula, convert it to Python and persist that as a "math object". The input terms are still referred by their cell address (or defined names), but they are the inputs to the equation which gets evaluated in Python. This "math object" can then be used either as a lambda (in a programming sense) or in a lambda (in a cloud processing sense). All the above can happen completely independently of a Microsoft environment. The FlyingKoala solution joins Koala2 and xlwings. xlwings is there to to provide;
FlyingKoala actually uses the "math object" essentially as a lambda (in the programmatic sense). There is absolutely no reason why you need Excel except for defining the formulas. Once the formulas are defined you can then apply the formula to any data you have. I have put introspection code into Koala2 so that a "math object" can be queried for its terms (input parameters). The method is ExcelParser.getOperandRanges(). The code in FlyingKoala.py demonstrates the use of this. FlyingKoala is literally only 185 lines of code. It's easy to see how it works. The "money-shot" is in a combination of the methods generate_model_graph() and evaluate_koala_model(). ** I keep specifying Koala2 0.0.31 because after that version there was a shift in the code which broke the ability to persist the model and to properly use the output cell mechanism. ** Also, I have been remiss to mention other solutions. If you don't already know about them, checkout PyCell, PyXll and the pair of solutions Formula and Schedula. |
@bradbase I agree with you. In my opinion the problem can be divided in to:
Just to give you an idea: our main evaluation guidelines are:
At the moment we have a Beta working in c# .Net. |
@LFoglietti It is marvelous communicating with you. So few people understand these things at such a level. I will PM you with respect to this as I know I'm losing some detail. As we have gone so deep I think it may be good to have some of this conversation in the open for those lurking. We are genuinely at the core. I've had a cursory look at FlexCel user guide. It's currently late where I am and I've spent a lot of hours today coding functions and tests (which is absolutely boring and mind-numbing) so I can't take it in as I'd like. I'll look again after some rest. That said; Koala2 in its "pure" form (or, maybe the concept) might be a great start as a Python answer to FlexCel. In my opinion (happy to be corrected by the Koala gurus here) Koala2 currently focuses on the idea of an Excel replacement. There is focus on being able to read and write Excel files as well as calculate them. The library being used for managing Excel files is OpenPyXL which is a gold standard for Excel files in Python (there are some others which specifically define for reading or writing). Openpyxl, however, has two deficiencies that I've identified (there may be more);
In reality, unless you are interactive with the spreadsheet, who cares about pivot tables...
Calculation has been solved by Koala, PyCell and Formula/Schedula and soon to be Koala_xlcalculator.
Koala is in the sweet spot for Excel file management + calculation. My needs are different from that of Koala's direction in that I don't care to manage Excel files. I'm trying to get the math from a domain expert to transposing data without need of a developer. Whether that data is on the desktop, on a DB or in the cloud. I find one major challenge with Koala and that is the code is very difficult to manage. It's completely understandable due to the heritage of the project, but there's a great deal of clean-up to be done. My figuring is a clean calculator module will assist and may well make integration with openpyxl, xlrd or similar easier to achieve. And that feeling seems to be validated. I've spent less time writing a "new" Koala (leaning heavily on Koala) than I have trying to fix things in the existing code and I've gotten much further. The reality is there is need for both Koala and FlyingKoala as they address different yet heavily related problem domains. I'm yet to fully get my head around FlexCel to appreciate where it lives but the quickstart introduction appears to show intent to be similar to Koala2. |
In Flexcell there is also a tool for automatically showing needed Delphi/C++ code for generating specific .XLS/.XLSX file cells. It reminds the macro recorder in Excel. You upload a file and you get the C code to replicate it. My considerations, so far, are that Python, at this stage, canNOT compete with other languages/Environment like C#.Net. as far as transforming Excel in to a "Python Math object" is concerned (2). Also considering Java, you can find excellent commercial solution like the Grapecity family products. My question is.
Contributions will be appreciated. |
Outside technology... My strength as a software developer is to promote to non-developers capabilities that they need but can't get access to without my skill. The more I empower non-developers the more value I am to them. I am of most value if I can facilitate them to empower themselves without need to come into direct contact with me. The solution I'm championing is not solely addressing a technical problem. The problem space is a combination of personal skills and inter-human communications. That said, from a technical perspective; You are 100% correct in suggesting other languages may be better at various things with respect to interacting with Excel files and replicating Excel's functionality. It would be marvelous if those other languages had some of the strengths of Python but each language has succeeded for their own reasons. I disagree with the statement that there's no request for such an ability in Python. Koala exists. It had a reason. With its existence there's a sustained enthusiastic following. PyXll is making money from this. xlwings exists. PyCel, Formula/Schedula. Pyopenxl and xlrd both exist because people are trying to interact with Excel files in Python. The reason the libraries you speak of exist in those other languages is the same which demand one in Python. What happens when your workforce doesn't have the skills in Java or C#.net? Even if they do, what happens when development is not the core purpose of your business and Java or C#.net environments are not going to be authorized?
There are many, many reasons why Python would want this capability. The strongest need is outside the scope of math literate people who can code. I am finding the struggle is the only people who understand the technology are math literate analysts and developers. Obviously they don't need it. The people who can benefit from this tech have their hands tied behind their back because they aren't necessarily code literate. Do you (or I) want to learn how Excel does its calculations and knowingly implement incorrect calculation..? Absolutely not. But there have been people who have done that for the libraries and tools you've spoken about. Why does Python need to miss out? Although it may not get much traction I see there is value in having a library of objects which calculate in the same way as Excel. Again with auditing in mind. If your Python code (or even Java, C#.net) can't produce the same answers as the Excel model which has been authorized by the bank you are introducing risk of liability. Currently I don't know of a single tool which can categorically prove a complex mathematical calculation differs between Excel and a programming language due to Excel's deficiencies. For that you need excel compatible/non-compatible "modes". The way to do that is write objects which implement multiple methods addressing "Excel compatibility" or straight/correct calculation and switch between them. There is opportunity for that in the way I've structured koala_xlcalculator. |
@bradbase , Having said that. I am talking about an essential piece of software in a commercial app. Used by thousands of people. |
This is probably hijacking the intent of this thread. To find the answer it's a case of reading the documentation of each of the projects and seeing for yourself whether they align. I've read the doco for FlexCel and am confused as to where it fits. I'm happy to continue this conversation, but off-channel somewhere. |
It was a very nice conversation and you are right: we are getting out of topic. Assuming I can manipualte Excel using another language ( C#, Java, Python...) we have 3 main scenarios for a simple worksheet where A1=5, A2=10 A3=Sum(A1:A2) =15 1) Read & Write 2) Read & Write & UDF 3) Complete Integration In Python: Koala, Koala2, FlyingKoala, Formula are library going towards a complete integration. Will Python continue its efforts towards a complete integration or it s happy to stay in scenario 2 |
Koala is striving for option 3 and, to large extent, already demonstrates the achievement of this goal. Will it continue to..? I expect so. Python cannot really achieve options 1 or 2 as it's not possible to integrate the language at a deep enough level. Currently those options are faked using COM. In the future we are likely to be able to integrate at that level through the web based API (REST?). Excel is heading toward the cloud in a very serious way and that is the interaction pathway becoming available. |
I forgot to mention the following for Python. Thankyou @bradbase for your analisys and good luck with your project. |
@vallettea, @brianmay, @strichter, @kmonson and others, I have written and now released xlcalculator. It is heavily based on Koala2 and is written for Python 3. I have done a performance comparison between PyCel, Koala2 0.0.33 and xlcalculator. The performance scripts can be found in this repo. I intend to write a software review to comprehensively explore at least PyCel, Koala2 and xlcalculator but in short what I've been seeing;
xlcalculator is now the back-end calculator for FlyingKoala and is exceptionally performant. A note: I have extracted the Excel function implementations into a library on their own, so anyone can use xlfunctions to run Excel compatible calculations. The only thing now is to get people to use it so I can find all the bugs :D |
Congrats ! Looks awesome. |
@LFoglietti Has Tradestation the trade rule interface you're looking for? |
Hi Everyone,
I have a vested interest in increasing Koala's maintainability. My project FlyingKoala relies on Koala and I am trying to start a business off the back of it. Some of changes to Koala in the last 12 months, which have helped move Kola forward, have removed features I rely upon and have made it very difficult to re-instate the features I need.
I am in a situation where I have been putting effort into re-writing Koala so we (/I) can have a modernized code base to work from while supporting the features we all want and making the code maintainable.
The original intent was to just remove all the things that are in there for no reason, and keep as much as possible. This was quickly proven not realistically possible. So I've used the "good" bits -- the stuff that does the AST and evaluation -- and have been slowly re-writing the surrounding code.
I have a version now which supports:
I am about 50% through supporting ranges. Ranges are being recognized in defined names and formulas and are being resolved to collections of cell addresses. I am today beginning work on having those collections of cell addresses packaged and passed to the formula functions.
Once Ranges are supported, the basics of Koala's features will have been successfully re-written. I have ignored OFFSET and INDEX for now. I am yet to test (or consciously support) unicode.
The ignore, input/output are not yet supported but need to be for feature parity (and are unbelievably useful).
And I'm yet to implement a method to set value.
Besides re-using a small amount of code for parsing formulas and executing the subsequent Python code, this is essentially a completely new codebase.
Is there an appetite to try and use a re-write for Koala moving forward or is it best this becomes a new project?
I'm happy to make the code available for critique.
The text was updated successfully, but these errors were encountered: