Skip to content

Latest commit

 

History

History
150 lines (103 loc) · 4.01 KB

README.md

File metadata and controls

150 lines (103 loc) · 4.01 KB

kibior: easy scientific data handling, searching and sharing with Elasticsearch

Project Status: Active Build Status

Version: 0.1.1

TL;DR

What kibior is a R package dedicated to ease the pain of data handling in science, and more notably with biological data.
Where kibior is using Elasticsearch as database and search engine.
Who kibior is built for data science and data manipulation, so when any data-related action or need is involved, notably sharing data. It mainly targets bioinformaticians, and more broadly, data scientists.
When Available now from this repository, or CRAN repository.
Public instances Use the $get_kibio_instance() method to connect to Kibio and access known datasets. See Kibio datasets at the end of this document for a complete list.
Cite this package In R session, run citation("kibior")
Publication 10.1093/bioinformatics/btab157

Main features

This package allows:

  • Pushing, pulling, joining, sharing and searching tabular data between an R session and one or multiple Elasticsearch instances/clusters.
  • Massive data query and filter with Elasticsearch engine.
  • Multiple living Elasticsearch connections to different addresses.
  • Method autocompletion in proper environments (e.g. R cli, RStudio).
  • Import and export datasets from an to files.
  • Server-side execution for most of operations (i.e. on Elasticsearch instances/clusters).

How

Install

# Get from CRAN
install.packages("kibior")

# or get the latest from Github
devtools::install_github("regisoc/kibior")

Run

# load
library(kibior)

# Get a specific instance
kc <- Kibior$new("server_or_address", port)

# Or try something bigger...
kibio <- Kibior$get_kibio_instance()
kibio$list()

Examples

Here is an extract of some of the features proposed by KibioR. See Introduction vignette for more advanced usage.

Example: push datasets

# Push data (R memory -> Elasticsearch)
dplyr::starwars %>% kc$push("sw")
dplyr::storms %>% kc$push("st")

Example: pull datasets

# Pull data with columns selection (Elasticsearch -> R memory)
kc$pull("sw", query = "homeworld:(naboo || tatooine)", 
              columns = c("name", "homeworld", "height", "mass", "species"))
# see vignette for query syntax

Example: copy datasets

# Copy dataset (Elasticsearch internal operation)
kc$copy("sw", "sw_copy")

Example: delete datasets

# Delete datasets
kc$delete("sw_copy")

Example: list, match dataset names

# List available datasets
kc$list()

# Search for index names starting with "s"
kc$match("s*")

Example: get columns names and list unique keys in values

# Get columns of all datasets starting with "s"
kc$columns("s*")

# Get unique values of a column
kc$keys("sw", "homeworld")

Example: some Elasticsearch basic statistical methods

# Count number of lines in dataset
kc$count("st")

# Count number of lines with query (name of the storm is Anita)
kc$count("st", query = "name:anita")

# Generic stats on two columns
kc$stats("sw", c("height", "mass"))

# Specific descriptive stats with query
kc$avg("sw", c("height", "mass"), query = "homeworld:naboo")

Example: join

# Inner join between:
#   1/ a Elasticsearch-based dataset with query ("sw"), 
#   2/ and a in-memory R dataset (dplyr::starwars) 
kc$inner_join("sw", dplyr::starwars, 
              left_query = "hair_color:black",
              left_columns = c("name", "mass", "height"),
              by = "name")