There are many ways to use omniparser programmatically in your code/service/app. The basic usage is covered in Getting Started and copied here for completeness:
schema, err := omniparser.NewSchema("your schema name", strings.NewReader("your schema content"))
if err != nil { ... }
transform, err := schema.NewTransform("your input name", strings.NewReader("your input content"), &transformctx.Ctx{})
if err != nil { ... }
for {
    output, err := transform.Read()
    if err == io.EOF {
        break
    }
    if err != nil { ... }
    // output contains a []byte of the ingested and transformed record.
    // Also transform.RawRecord() gives you access to the raw record.
    fmt.Println(transform.RawRecord().Checksum())
}
Note this out-of-box omniparser setup contains only the omni.2.1 schema handler, meaning only schemas whose parser_settings.version is omni.2.1 are supported. The omni.2.1 schema handler's supported file formats include: delimited (CSV, TSV, etc.), EDI, XML, JSON, and fixed-length. The omni.2.1 schema handler's supported built-in custom_funcs are listed here.
If the built-in custom_funcs aren't enough, you can add your own custom functions by doing this (note the linked sample does more than just adding a new custom_func):
schema, err := omniparser.NewSchema(
    "your schema name",
    strings.NewReader("your schema content"),
    omniparser.Extension{
        CreateSchemaHandler: omniv21.CreateSchemaHandler,
        CustomFuncs: customfuncs.Merge(
            customfuncs.CommonCustomFuncs, // global custom_funcs
            v21.OmniV21CustomFuncs,        // omni.2.1 custom_funcs
            customfuncs.CustomFuncs{
                "normalize_severity": normalizeSeverity, // <====== your own custom_funcs
            })})
if err != nil { ... }
transform, err := schema.NewTransform("your input name", strings.NewReader("your input content"), &transformctx.Ctx{})
if err != nil { ... }
for {
    output, err := transform.Read()
    if err == io.EOF {
        break
    }
    if err != nil { ... }
    // output contains a []byte of the ingested and transformed record.
}
Each custom_func must be a Golang function with the first param being *transformctx.Ctx. There can be an optional second param of *idr.Node type (like in this sample). If omniparser detects that a custom_func's second param is of *idr.Node type, then the current contextual IDR node will be passed into the function automatically. The remaining params can be of any type, as long as they match the types of the data fed into the function in transform_declarations.
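For illustration, here is a minimal sketch of what the normalize_severity custom_func registered above might look like. The function body and the severity mapping are hypothetical, and the (string, error) return shape is an assumption mirroring many built-in custom_funcs; only the first-param and remaining-params rules come from the description above (strings and fmt from the standard library are assumed imported):
// Hypothetical custom_func body; the mapping values below are made up.
// First param must be *transformctx.Ctx; the remaining params must match the
// data fed in by transform_declarations.
func normalizeSeverity(_ *transformctx.Ctx, severity string) (string, error) {
    switch strings.ToUpper(strings.TrimSpace(severity)) {
    case "FATAL", "CRITICAL":
        return "HIGH", nil
    case "WARNING", "WARN":
        return "MEDIUM", nil
    case "INFO", "DEBUG":
        return "LOW", nil
    default:
        return "", fmt.Errorf("unknown severity: %s", severity)
    }
}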
While the built-in omni.2.1 schema handler already supports the most popular file formats in a typical ETL pipeline, new file format(s) can be added to the schema handler, so it can ingest new formats while reusing the same extensible/capable transform (transform_declarations) logic.
At a high level, a FileFormat is a component that knows how to ingest a data record, in streaming fashion, from a certain file format, and convert it into an idr.Node tree for later processing and transform. Typically, a new FileFormat may require some additional information in a schema (usually in a file_declaration section), so the omni.2.1 schema handler gives a new custom FileFormat a chance to validate a schema. Then the schema handler asks the new FileFormat to create a format-specific reader, whose job is to consume the input stream and convert each record into the IDR format. See this example for how to add a new FileFormat.
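Conceptually, a FileFormat therefore needs two capabilities: validating the format-specific portion of a schema, and creating a reader that turns each record from the input stream into an *idr.Node tree. The sketch below only illustrates that shape; the type and method names are placeholders, not omniparser's actual fileformat interfaces, which you should take from the linked example and the omni.2.1 schema handler code (io and the idr package are assumed imported):
// Placeholder names only -- not omniparser's actual fileformat interfaces.
type myFileFormat struct{}

// ValidateSchema checks the format-specific portion of the schema
// (typically the file_declaration section) and rejects malformed ones.
func (f *myFileFormat) ValidateSchema(schemaContent []byte) error {
    // ... parse schemaContent and validate its file_declaration section ...
    return nil
}

// CreateReader wraps an input stream with a format-specific reader.
func (f *myFileFormat) CreateReader(inputName string, input io.Reader) (*myReader, error) {
    return &myReader{inputName: inputName, input: input}, nil
}

// myReader consumes the input stream and converts each record into an IDR tree.
type myReader struct {
    inputName string
    input     io.Reader
}

// Read returns the next record as an *idr.Node tree, or io.EOF when the input is exhausted.
func (r *myReader) Read() (*idr.Node, error) {
    // ... read the next record from r.input and build its idr.Node tree ...
    return nil, io.EOF
}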
To complete omniparser's full extensibility picture, we allow adding completely new schema handlers, whether they're for major schema version upgrades that break backward compatibility, or for brand-new parsing/transform paradigms. In fact, we utilize this customizability ourselves for supporting legacy omniparser schemas (schema versions that are older than omni.2.1 and not compatible with it): take a glimpse at https://github.com/jf-tech/omniparserlegacy.
The most canonical use case of omniparser is a (micro)service that is part of a larger ETL pipeline: it receives different input files/streams from different external integration points, then performs schema-driven (thus codeless) parsing and transform to standardize the inputs into internal formats for the later loading (L) stage of ETL.
Because omniparser's parsing and transform are schema driven and involve little or no coding, it enables faster and at-scale ETL integration, possibly done by non-coding engineers or support staff:
First, your service needs a schema cache component that loads and refreshes all the schemas from a schema repository (could be a REST API, a database, or some storage). These schemas are parsed, validated (by omniparser.NewSchema calls) and cached.
As different integration partners' input streams come in, the service will, based on some criteria such as partner IDs, select which schema to use for a particular input. Once schema selection is completed, the service calls schema.NewTransform to create an instance of a transform operation for this particular input, performs the parsing and transform, and sends the standardized output to a later stage in the ETL pipeline. A sketch of this flow is shown below.
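The following is a minimal, hypothetical sketch of that flow. The schema repository shape, the partner-ID keying, and the emit callback are all made up for illustration; only omniparser.NewSchema, schema.NewTransform and transform.Read are the actual omniparser calls shown earlier (sync, strings, io and fmt from the standard library are assumed imported):
// Hypothetical service-side sketch; schema repository and partner IDs are made up.
type schemaCache struct {
    mu      sync.RWMutex
    schemas map[string]omniparser.Schema // keyed by partner ID
}

// refresh reloads and re-validates all schemas from some repository
// (a REST API, a database, or some storage).
func (c *schemaCache) refresh(repo map[string]string /* partnerID -> schema content */) error {
    parsed := map[string]omniparser.Schema{}
    for partnerID, content := range repo {
        s, err := omniparser.NewSchema(partnerID, strings.NewReader(content))
        if err != nil {
            return err
        }
        parsed[partnerID] = s
    }
    c.mu.Lock()
    defer c.mu.Unlock()
    c.schemas = parsed
    return nil
}

// process selects the partner's schema, runs the input through it, and hands
// each standardized output record to the next ETL stage via emit.
func (c *schemaCache) process(partnerID string, input io.Reader, emit func([]byte) error) error {
    c.mu.RLock()
    schema, ok := c.schemas[partnerID]
    c.mu.RUnlock()
    if !ok {
        return fmt.Errorf("no schema found for partner '%s'", partnerID)
    }
    transform, err := schema.NewTransform(partnerID, input, &transformctx.Ctx{})
    if err != nil {
        return err
    }
    for {
        output, err := transform.Read()
        if err == io.EOF {
            break
        }
        if err != nil {
            return err
        }
        if err := emit(output); err != nil {
            return err
        }
    }
    return nil
}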
Omniparser is currently implemented only in Golang (we do want to port it to other languages, at least Java, in the near future). If your service or environment is not in Golang, the only way to utilize it is to sidecar it, by either running it as a standalone service or shell-exec'ing omniparser, both of which involve omniparser's CLI.
Recall in Getting Started we demonstrated omniparser CLI's transform command. You can shell-exec it from your service (see the sketch after this list). Keep in mind the following if you want to go down this path:
- you will have to pre-compile the omniparser CLI binary (which needs to be platform/OS specific) and ship it with your service, and
- you will need to copy the input file down locally in your service before invoking the CLI, and then intercept stdout/stderr from the CLI and its exit code in order to get the results.
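For illustration, a minimal sketch of shell-exec'ing the CLI from Go using os/exec. The binary path, the file paths and the -s/-i flags here are assumptions based on the transform command demonstrated in Getting Started; double-check them against the CLI's actual help output (os/exec, bytes and fmt are assumed imported):
// Hypothetical shell-exec sketch; binary path, file paths and flags are assumptions.
func runOmniparserCLI(schemaPath, inputPath string) ([]byte, error) {
    cmd := exec.Command("./omniparser", "transform", "-s", schemaPath, "-i", inputPath)
    var stdout, stderr bytes.Buffer
    cmd.Stdout = &stdout
    cmd.Stderr = &stderr
    if err := cmd.Run(); err != nil {
        // Non-zero exit code: surface stderr for diagnostics.
        return nil, fmt.Errorf("omniparser CLI failed: %v: %s", err, stderr.String())
    }
    // stdout carries the transformed output.
    return stdout.Bytes(), nil
}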
Omniparser CLI has another command, server, which simply launches the CLI into an HTTP listening service that exposes a REST API:
- method: POST
- request Content-Type: application/json
- request JSON:
{ "schema": "... the schema content, required ...", "input": "... the input to be parsed and transformed, required ...", "properties": { ... JSON string map used for `external` transforms, optional ...} }
Keep in mind the following if you want to go down this path (a sketch of a client call follows the list):
- you will need to host this CLI-turned omniparser service somewhere accessible to your service, and
- you lose the benefit of omniparser's stream processing, which enables parsing infinitely large input, because now you need to send the input as a single string in the input field of the HTTP POST request.
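For illustration, a minimal sketch of calling such a server from Go. The server URL is a placeholder (use whatever address/route your deployed server actually listens on); only the request JSON fields come from the spec above (net/http, encoding/json, bytes, io and fmt are assumed imported):
// Hypothetical client sketch; serverURL is a placeholder for your deployment.
func callOmniparserServer(serverURL, schema, input string) ([]byte, error) {
    reqBody, err := json.Marshal(map[string]interface{}{
        "schema": schema, // required: the schema content
        "input":  input,  // required: the input to be parsed and transformed
        // "properties": map[string]string{...}, // optional: for `external` transforms
    })
    if err != nil {
        return nil, err
    }
    resp, err := http.Post(serverURL, "application/json", bytes.NewReader(reqBody))
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    body, err := io.ReadAll(resp.Body)
    if err != nil {
        return nil, err
    }
    if resp.StatusCode != http.StatusOK {
        return nil, fmt.Errorf("omniparser server returned %d: %s", resp.StatusCode, body)
    }
    return body, nil
}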
There are many components inside omniparser that can be useful in your code, even if you don't want to use omniparser as a whole for parsing and transforming input files/data. Here is a selected list of these components:
- DateTimeToRFC3339(), DateTimeLayoutToRFC3339(), DateTimeToEpoch(), EpochToDateTimeRFC3339(): parsing and formatting date/time stamps isn't trivial at all, especially when time zones are involved. These functions can be used independent of omniparser and are very useful when your Golang code deals with date/time a lot.
- Omniparser uses github.com/dop251/goja as the native Golang JavaScript engine. Yes, you can directly use goja, but you'll have to deal with performance-related VM caching and error handling. Instead you can directly use omniparser's JavaScript function.
We have an in-depth doc talking about IDR, which proves to be really useful in many document parsing situations, even outside of the omniparser realm. The idr package contains the IDR node/tree definitions, creation, caching, recycling and releasing mechanisms, serialization helpers, XPath assisted navigation and querying, and two powerful stream readers for JSON and XML inputs.
In particular, the JSON/XML readers are two powerful parsers, capable of ingesting JSON/XML data in streaming fashion assisted by XPath style target filtering, thus enabling processing of arbitrarily large inputs.
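For illustration, a rough sketch of streaming a large JSON input with the idr package's JSON stream reader. The constructor and method names used here (NewJSONStreamReader, Read, Release) and their signatures are written from memory and should be verified against the IDR doc; the file name and XPath filter are made up:
// Assumption-heavy sketch: verify the reader API against the IDR doc.
f, err := os.Open("large_input.json") // made-up file name
if err != nil { ... }
defer f.Close()
// Only nodes matching the XPath-style filter are materialized, so arbitrarily
// large inputs can be processed without loading them fully into memory.
reader, err := idr.NewJSONStreamReader(f, "/orders/*") // made-up XPath
if err != nil { ... }
for {
    n, err := reader.Read()
    if err == io.EOF {
        break
    }
    if err != nil { ... }
    // ... process the *idr.Node ...
    reader.Release(n) // recycle the node when done, per IDR's releasing mechanism
}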
Use NewReader() to create a CSV reader that does:
- header column validation
- header/data row jumping
- XPath based data row filtering
- mis-escaped quote replacement
- context-aware error messages
For more reader specific settings/configurations, check the CSV in Depth page.
Use NewReader() to create a fixed-length reader that does:
- row based or header/footer based envelope parsing
- XPath based data row filtering
- context-aware error messages
For more reader specific settings/configurations, check the Fixed-Length in Depth page.
Use NewReader() to create an EDI reader that does:
- segment min/max validation
- XPath based data row filtering
- context-aware error messages
For more reader specific settings/configurations, check the EDI in Depth page.
Use NewNonValidatingReader() to create a non-validating EDI segment reader. Sometimes users might not want the full EDI reader that does all the packing/unpacking and structural/hierarchical validations; they simply need an EDI segment reader that reads out all the raw segments and their elements/components.
Usage example:
r := edi.NewNonValidatingReader(
    strings.NewReader("....."),
    &edi.FileDecl{
        SegDelim:  ...,
        ElemDelim: ...,
        ...,
        // No need to set SegDecls. Just all the needed top level edi.FileDecl settings.
    })
for {
    seg, err := r.Read()
    if err == io.EOF {
        break
    }
    if err != nil { ... }
    // seg contains the raw segment data, and is of edi.RawSeg type.
}
See IDR notes about the JSON/XML readers above.