Skip to content

Latest commit

 

History

History
298 lines (214 loc) · 10.7 KB

README.md

File metadata and controls

298 lines (214 loc) · 10.7 KB

Stuart

(He's little). A pure Lua rewrite of Apache Spark 2.2.0, designed for embedding and edge computing.

Build Status

Contents

Installation

$ luarocks install stuart

Usage

Reading a text file

Create a "Stuart Context", then count the number of lines in this README:

$ lua
Lua 5.2.4  Copyright (C) 1994-2015 Lua.org, PUC-Rio

local SparkContext = require 'stuart.Context'
local sc = SparkContext:new()
local rdd = sc:textFile('README.md')
print(rdd:count())
151

Working with lists of values

rdd = sc:parallelize({1,2,3,4,5,6,7,8,9,10}, 3)
filtered = rdd:filter(function(x) return x % 2 == 0 end)
print('evens: ' .. table.concat(filtered:collect(), ','))
evens: {2,4,6,8,10}

Working with lists of pairs

rdd = sc:parallelize({{4,'Gnu'}, {4,'Yak'}, {5,'Mouse'}, {4,'Dog'}})
countsByKey = rdd:countByKey()
print(countsByKey[4])
3
print(countsByKey[5])
1

Streaming with a socket text datasource

Start by installing network support:

$ luarocks install luasocket

Next, start a local network service with netcat:

$ nc -lk 9999

Start a Spark Streaming job to read from the network service:

local sc = require 'stuart'.NewContext()
local ssc = require 'stuart'.NewStreamingContext(sc, 0.5)

local dstream = ssc:socketTextStream('localhost', 9999)
dstream:foreachRDD(function(rdd)
  print('Received RDD: ' .. rdd:collect())
end)

ssc:start()
ssc:awaitTerminationOrTimeout(10)
ssc:stop()

Then type some input into netcat:

abc
123

Streaming with a custom receiver

This custom receiver acts like a SocketInputDStream, and reads lines of text from a socket.

local class = require 'middleclass'
local socket = require 'socket'
local stuart = require 'stuart'
local Receiver = require 'stuart.streaming.Receiver'

-- MyReceiver ------------------------------

local MyReceiver = class('MyReceiver', Receiver)

function MyReceiver:initialize(ssc, hostname, port)
  Receiver.initialize(self, ssc)
  self.hostname = hostname
  self.port = port or 0
end

function MyReceiver:onStart()
  self.conn = socket.connect(self.hostname, self.port)
end

function MyReceiver:onStop()
  if self.conn ~= nil then self.conn:close() end
end

function MyReceiver:run(durationBudget)
  local timeOfLastYield = socket.gettime()
  local rdds = {}
  local minWait = 0.02 -- never block less than 20ms
  while true do
    local elapsed = socket.gettime() - timeOfLastYield
    if elapsed > durationBudget then
      coroutine.yield(rdds)
      rdds = {}
      timeOfLastYield = socket.gettime()
    else
      self.conn:settimeout(math.max(minWait, durationBudget - elapsed))
      local line, err = self.conn:receive('*l')
      if not err then
        rdds[#rdds+1] = self.ssc.sc:makeRDD({line})
      end
    end
  end
end

-- Spark Streaming Job ------------------------------

sc = stuart.NewContext()
ssc = stuart.NewStreamingContext(sc, 0.5)

local receiver = MyReceiver:new(ssc, 'localhost', 9999)
local dstream = ssc:receiverStream(receiver)
dstream:foreachRDD(function(rdd)
  print('Received RDD: ' .. rdd:collect())
end)
ssc:start()
ssc:awaitTerminationOrTimeout(10)
ssc:stop()

Embedding

Modules named stuart.interface.* provide interfaces to hardware or a host OS, designed to make it easy for you to preload your own custom module that is specific to your host application or device. These interfaces are seen as a public API, and so any changes to them will increment the SemVer versioning accordingly.

stuart.interface.clock

Used to measure time, which is required by the StreamingContext cooperative multitasking. On an OS, implementation defaults to LuaSocket gettime() with 4 decimals of precision. Falls back on Lua os.time(os.clock('*t')) with 0 digits of precision (whole seconds).

stuart.interface.sleep

Function used to sleep, when all receivers don't use their full timeslice allotments. Used to prevent pegging the CPU on systems where that makes sense, such as a host OS.

Compatibility

Stuart is compatible with:

See the stuart-hardware project for edge hardware specific integration guides.

Libraries for Stuart

These companion libraries, unlike Stuart, do not yet support Lua 5.3.

To embed Stuart into a Go app, use:

Roadmap

  • eLua ("Embedded Lua") compatibility. eLua is a 5.1 baremetal VM for microcontrollers. An eLua scheduler is required that avoids coroutines, because large eLua jobs require being transpiled to C to execute from firmware, and coroutines are impractical to transpile.
  • Support PMML Import via a stuart-pmml companion library
  • Support a Redis scheduler that partitions RDDs across Redis servers, and sends Lua closures into Redis for execution.
  • Support OpenCL or CUDA schedulers that send Lua closures into a GPU for execution.

Design

Stuart is designed for real-time and embedding, and so it follows some rules:

  • It does not perform deferred evaluation of anything; all compute costs are paid upfront for predictable throughput.
  • It uses pure Lua and does not include native C code. This maximizes portability and opportunity to be cross-compiled. Any potential C code optimizations are externally sourced through the module loader. For example, Stuart links to lunajson, but it also detects and uses cjson when that native module is present.
  • It does not execute programs (like ls or dir to list files), because there may not even be an OS.
  • It should be able to eventually do everything that Apache Spark does.

Why Spark?

While many frameworks deliver streaming analytics capabilities, Spark leads the pack in numbers of trained data scientists, numbers of SaaS environments where Spark models can be built and trained, numbers of contributors moving the platform forward, numbers of universities teaching it, and net commercial investment.

Why Lua?

Depoyment. Amalgamated Lua jobs with inlined module dependencies solves the Spark job deployment problem, and obviates the need for any shared filesystem or brittle classpath coordination. Redis Scripting showcases the power of SHA1 content hashing for Lua job distribution.

Packaging. Lua jobs, like JavaScript, are easy to minify, and statically analyze to strip out unused modules and function calls. Your job script only need be as large as the number of Spark capabilities it makes use of.

Portability. Because Lua is a tiny language that elegantly supports classes and closures, it serves as a better source of truth for functional algorithms than Scala. This makes it relatively easy for Stuart jobs to be transpiled into Scala, Java, Python, Go, C, or maybe even CUDA, or to be interpreted by a VM in any of those same environments, which significantly extends Spark's reach by divorcing it from the JVM.

Embedding. Lua is arguably one of the most crash-proof language runtimes, making it attractive for industrial automation, sensors, wearables, and microcontrollers. Whereas JVM-based analytics tend to require an operator.

GPUs. If you are thinking about pushing closures into a GPU, Lua seems like a reasonable choice, and one of the easier languages to transpile into OpenCL or CUDA.

Torch. Torch is the original deep-learning library ecosystem, 15+ years mature, and with deep ties to university and leading commercial interests. It runs on mobile phones, and serves as a fantastic case in point for why Lua makes sense for analytics jobs. A data scientist should be able to use Spark and Torch side-by-side, and maybe even from the same Spark Streaming control loop.

Building

The LuaRocks built-in build system is used for packaging.

$ luarocks make rockspecs/stuart-<version>.rockspec
stuart <version> is now built and installed in /usr/local (license: Apache 2.0)

Testing

Testing with lua-cjson:

$ luarocks install busted
$ luarocks install lua-cjson
$ luarocks intall moses
$ busted -v --defer-print
17/11/12 08:46:51 INFO Running Stuart (Embedded Spark) version 2.2.0
...
141 successes / 0 failures / 0 errors / 0 pending : 12.026833 seconds

Testing with lunajson:

$ luarocks remove lua-cjson
$ busted -v --defer-print
17/11/12 08:46:51 INFO Running Stuart (Embedded Spark) version 2.2.0
...
139 successes / 0 failures / 0 errors / 2 pending : 12.026833 seconds

Pending → ...
util.json can decode a scalar using cjson
... cjson not installed

Pending → ...
util.json can decode an object using cjson
... cjson not installed

Testing with a WebHDFS endpoint:

$ WEBHDFS_URL=webhdfs://localhost:50075/webhdfs busted -v --defer-print

Testing with a Specific Lua Version

Various Dockerfiles are made available in the root directory to provide a specific Lua VM for the test suite:

  • Test-Lua5.1.Dockerfile
  • Test-Lua5.2.Dockerfile
  • Test-Lua5.3.Dockerfile
$ docker build -f Test-Lua5.3.Dockerfile -t test .
$ docker run -it test busted -v --defer-print
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
159 successes / 0 failures / 0 errors / 5 pending : 10.246418 seconds