Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add lua filter #110

Open
wants to merge 21 commits into
base: main
Choose a base branch
from
Open

Add lua filter #110

wants to merge 21 commits into from

Conversation

arl
Copy link
Collaborator

@arl arl commented Dec 9, 2020

❓ What

Add a LUA filter that allows to add filter logic (implement the Process function) in lua.
Needs further testing/benchmarking/polishing but first results shows a relatively good performance with respect to the same filter coded in Go (5% slower for a filter with a single lua operation).

Every Lua baker filter compiles a given script and will run a user-specific function (which should have the following prototype):

-- rec is a record object
-- next is function next(record)
function myLuaFilterFunction(rec, next)

lua script knows the record type which is the lua-equivalent of the Record Go interface. record has the following methods:

  • get(int) string
  • set(int, string)
  • clear()
  • copy() record

In the lua context (called lua state), some utility functions are already defined and can be used:

  • validateRecord(record) bool, int
  • createRecord record
  • fieldByName(string) int|nil different from Go which returns (baker.FieldIndex, bool)
  • fieldNames which is a lua table, same as fieldNames Go slice

🔨 How to test

  1. List all steps necessary;
  2. To test this pull request.

✅ Checklists

This section contains a list of checklists for common uses, please delete the checklists that are useless for your current use case (or add another checklist if your use case isn't covered yet).

  • Is there unit/integration test coverage for all new and/or changed functionality added in this PR?
  • Have the changes in this PR been functionally tested?
  • Have new components been added to the related all.go files?
  • Have new components been added to the documentation website?
  • Has make gofmt-write been run on the code?
  • Has make govet been run on the code? Has the code been fixed accordingly to the output?
  • Have the changes been added to the CHANGELOG.md file?
  • Have the steps in CONTRIBUTING.md been followed to update a Go module?

@arl arl force-pushed the add-lua-filter branch 2 times, most recently from 81400af to 8b1247f Compare December 12, 2020 12:15
@codecov-io
Copy link

codecov-io commented Dec 12, 2020

Codecov Report

Merging #110 (e5da03b) into main (3ff24dd) will increase coverage by 0.79%.
The diff coverage is 82.29%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #110      +/-   ##
==========================================
+ Coverage   49.43%   50.22%   +0.79%     
==========================================
  Files          56       57       +1     
  Lines        3880     3976      +96     
==========================================
+ Hits         1918     1997      +79     
- Misses       1752     1764      +12     
- Partials      210      215       +5     
Impacted Files Coverage Δ
filter/lua.go 82.29% <82.29%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3ff24dd...e5da03b. Read the comment docs.

@arl arl changed the base branch from main to fieldname-as-slice December 13, 2020 21:11
@arl arl marked this pull request as ready for review December 27, 2020 10:24
@arl arl force-pushed the fieldname-as-slice branch 2 times, most recently from 7c42f57 to f22308b Compare January 7, 2021 11:51
Base automatically changed from fieldname-as-slice to main January 7, 2021 17:38
@arl arl requested a review from tommyblue March 3, 2021 18:37
filter/lua.go Outdated Show resolved Hide resolved
filter/lua.go Outdated Show resolved Hide resolved
filter/lua.go Outdated Show resolved Hide resolved
filter/lua.go Outdated Show resolved Hide resolved
filter/lua.go Show resolved Hide resolved
filter/lua.go Outdated Show resolved Hide resolved
filter/lua.go Show resolved Hide resolved
filter/lua.go Outdated Show resolved Hide resolved

This filter is based on [GopherLua](github.com/yuin/gopher-lua), which is an LUA5.1 virtual machine.

To use this filter you need to declare a function in an lua file. This function serves the same purpose
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe it's worth telling the user that the function name must be the same as FilterName conf

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I see how can the user could be confused here since the config h as 2 fields: a filename and a function name.

filter/lua.go Outdated Show resolved Hide resolved
filter/lua.go Outdated
func (t *LUA) Process(rec baker.Record, next func(baker.Record)) {
atomic.AddInt64(&t.nprocessed, 1)

t.mu.Lock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure to lock all the Process functions? This means that even if the Lua function is pure it will not exploits any concurrent execution on different disjoined records. Moreover, it will imply that adding a single LuaFilter on a possibly long and complex filter-chain will inhibit any parallel execution, blocking all the execution on a single lock.
IHMO we need to create a new state each time to avoid the lock.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unlocking before next would solve the filterchain parallelization issue, no?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should refocus the review on correctness and semantics of the user-api.
Creating a new state each time is probably quite expensive, it might be beneficial in some cases, in might not be in others and I'm not sure one can be certain it's the way to go to obtain the best overall performance in a majority of cases.

Happy to discuss optimizations later once we get the API right though

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unlocking before next would solve the filterchain parallelization issue, no?

the user may not call next so I'm not sure I understand

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the lua function is in a file, the user could have written some statements before the function. This statement may be needed for the filter to work for example. The current implementation allows that by creating the state once and locking it each time any goroutine enters the lua filter. Recreating the state each time, apart from the potential cost, removes this possibility, since the lua file would need to be reinterpreted each time, that means for each record, at least if I correctly understood what you meant.

Another possible API I thought of was to ask the user to code the lua function directly in the TOML, but you lose syntax highlighting when coding in lua, you'll need to escape your code, etc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that it will be better to ask the user to write pure functions, actually, I will see strange the other way around. Moreover, we could use this approach, sharing-lua-byte-code-between-lstates, to compile one time the Lua file and use it with different states. Although, it is thought that calling NewState each time seems a bit too costly.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying to refocus the discussion a bit.

The idea behind the lua filter is to lower the bar in using baker and coding some filters fast. If I have to code pure lua functions, that means I won't be able to do simple stuff one could expect to be able to do in a filter, such as counting records for example.

Apart from being faster (which, as always with performance, would needs to be proved with numbers) I'm not sure I see any other benefit in requiring the user to write pure functions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we have to clarify, before going into the implementation details, how we want to execute the LUA code of the user. I see two possible alternatives:

  1. sequentially execute the LUA code on the records
  2. parallelize the LUA code on multiple records

The two alternatives have pros and cons and both could drastically change both the performance and the expressiveness of the user. We could also think of a configurable behavior of the filter, but in both cases, the user must be aware of what he/she can and cannot do.

Copy link
Collaborator Author

@arl arl Mar 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree in any case the user must know the tradeoffs that have been made so they can chose if they match their use case so those have to be documented.

Coming back to @tommyblue comments about the position of unlock, it seems right to release the lock before calling next to allow concurrent execution of the rest of the filter chain. Will fix this and perform the benchmark/tests with this 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any way to offload the locking to the user inside the Lua code? This way they can indicate if they need synchronization or not.

Suppress one allocation, reducing their number from 5 to 4, by
pre-allocating the userdata we use to wrap the record passed to
the lua filter.

    name          old time/op    new time/op    delta
    LUAProcess-8     638ns ± 4%     559ns ± 6%  -12.31%  (p=0.000 n=10+10)

    name          old alloc/op   new alloc/op   delta
    LUAProcess-8      152B ± 0%      104B ± 0%  -31.58%  (p=0.000 n=10+10)

    name          old allocs/op  new allocs/op  delta
    LUAProcess-8      5.00 ± 0%      4.00 ± 0%  -20.00%  (p=0.000 n=10+10)
Reduce allocation from 4 to 2 by preallocating the lua function
which is a wrapper to the Go 'next' function we receive.

    name          old time/op    new time/op    delta
    LUAProcess-8     559ns ± 6%     459ns ± 7%  -17.97%  (p=0.000 n=10+10)

    name          old alloc/op   new alloc/op   delta
    LUAProcess-8      104B ± 0%       24B ± 0%  -76.92%  (p=0.000 n=10+10)

    name          old allocs/op  new allocs/op  delta
    LUAProcess-8      4.00 ± 0%      2.00 ± 0%  -50.00%  (p=0.000 n=10+10)
@codecov-commenter
Copy link

Codecov Report

Merging #110 (68a1060) into main (40a57bc) will decrease coverage by 3.75%.
The diff coverage is 81.25%.

❗ Current head 68a1060 differs from pull request most recent head 4fe5623. Consider uploading reports for the commit 4fe5623 to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##             main     #110      +/-   ##
==========================================
- Coverage   53.95%   50.20%   -3.76%     
==========================================
  Files          60       57       -3     
  Lines        4003     3976      -27     
==========================================
- Hits         2160     1996     -164     
- Misses       1646     1764     +118     
- Partials      197      216      +19     
Impacted Files Coverage Δ
filter/lua.go 81.25% <81.25%> (ø)
filter/set_string_from_url.go 37.50% <0.00%> (-19.03%) ⬇️
topology.go 60.89% <0.00%> (-13.41%) ⬇️
config.go 65.00% <0.00%> (-13.00%) ⬇️
filter/metadata_lastmodified.go 76.19% <0.00%> (-12.70%) ⬇️
input/list.go 60.78% <0.00%> (-10.51%) ⬇️
filter/notnull.go 70.58% <0.00%> (-9.42%) ⬇️
filter/clear_fields.go 56.25% <0.00%> (-8.04%) ⬇️
filter/partial_clone.go 65.00% <0.00%> (-7.23%) ⬇️
filter/string_match.go 69.56% <0.00%> (-6.63%) ⬇️
... and 26 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a209176...4fe5623. Read the comment docs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants