Commit

Merge branch 'labs-main'
a0x8o committed Dec 5, 2023
2 parents 25606ed + 5294bac commit d25c9dc
Showing 189 changed files with 45,215 additions and 4,100 deletions.
2 changes: 2 additions & 0 deletions .github/actions/python_build/action.yml
@@ -12,6 +12,8 @@ runs:
run: |
cd python
pip install build wheel pyspark==${{ matrix.spark }} numpy==${{ matrix.numpy }}
+pip install numpy==${{ matrix.numpy }}
+pip install --no-build-isolation --no-cache-dir --force-reinstall gdal==${{ matrix.gdal }}
pip install .
- name: Test and build python package
shell: bash
15 changes: 12 additions & 3 deletions .github/actions/r_build/action.yml
@@ -23,8 +23,8 @@ runs:
name: Download and unpack Spark
shell: bash
run: |
-wget -P /usr/spark-download/raw https://archive.apache.org/dist/spark/spark-3.2.1/spark-3.2.1-bin-hadoop2.7.tgz
-tar zxvf /usr/spark-download/raw/spark-3.2.1-bin-hadoop2.7.tgz -C /usr/spark-download/unzipped
+wget -P /usr/spark-download/raw https://archive.apache.org/dist/spark/spark-${{ matrix.spark }}/spark-${{ matrix.spark }}-bin-hadoop3.tgz
+tar zxvf /usr/spark-download/raw/spark-${{ matrix.spark }}-bin-hadoop3.tgz -C /usr/spark-download/unzipped
- name: Create R environment
shell: bash
run: |
@@ -50,16 +50,25 @@
run: |
cd R
Rscript --vanilla generate_docs.R
+env:
+SPARK_HOME: /usr/spark-download/unzipped/spark-${{ matrix.spark }}-bin-hadoop3
- name: Build R package
shell: bash
run: |
cd R
Rscript --vanilla build_r_package.R
-- name: Test R package
+env:
+SPARK_HOME: /usr/spark-download/unzipped/spark-${{ matrix.spark }}-bin-hadoop3
+- name: Test SparkR package
shell: bash
run: |
cd R/sparkR-mosaic
Rscript --vanilla tests.R
+- name: Test sparklyr package
+shell: bash
+run: |
+cd R/sparklyr-mosaic
+Rscript --vanilla tests.R
- name: Copy R artifacts to GH Actions run
shell: bash
run: |
2 changes: 2 additions & 0 deletions .github/actions/scala_build/action.yml
@@ -23,6 +23,8 @@ runs:
pip install databricks-mosaic-gdal==${{ matrix.gdal }}
sudo tar -xf /opt/hostedtoolcache/Python/${{ matrix.python }}/x64/lib/python3.9/site-packages/databricks-mosaic-gdal/resources/gdal-${{ matrix.gdal }}-filetree.tar.xz -C /
sudo tar -xhf /opt/hostedtoolcache/Python/${{ matrix.python }}/x64/lib/python3.9/site-packages/databricks-mosaic-gdal/resources/gdal-${{ matrix.gdal }}-symlinks.tar.xz -C /
+pip install numpy==${{ matrix.numpy }}
+pip install gdal==${{ matrix.gdal }}
- name: Test and build the scala JAR - skip tests is false
if: inputs.skip_tests == 'false'
shell: bash
82 changes: 63 additions & 19 deletions LICENSE
@@ -1,25 +1,69 @@
-DB license
+Databricks License
+Copyright (2022) Databricks, Inc.

-Copyright (2022) Databricks, Inc.
+Definitions.

+Agreement: The agreement between Databricks, Inc., and you governing
+the use of the Databricks Services, as that term is defined in
+the Master Cloud Services Agreement (MCSA) located at
+www.databricks.com/legal/mcsa.

+Licensed Materials: The source code, object code, data, and/or other
+works to which this license applies.

-Definitions.
+Scope of Use. You may not use the Licensed Materials except in
+connection with your use of the Databricks Services pursuant to
+the Agreement. Your use of the Licensed Materials must comply at all
+times with any restrictions applicable to the Databricks Services,
+generally, and must be used in accordance with any applicable
+documentation. You may view, use, copy, modify, publish, and/or
+distribute the Licensed Materials solely for the purposes of using
+the Licensed Materials within or connecting to the Databricks Services.
+If you do not agree to these terms, you may not view, use, copy,
+modify, publish, and/or distribute the Licensed Materials.

+Redistribution. You may redistribute and sublicense the Licensed
+Materials so long as all use is in compliance with these terms.
+In addition:

+- You must give any other recipients a copy of this License;
+- You must cause any modified files to carry prominent notices
+stating that you changed the files;
+- You must retain, in any derivative works that you distribute,
+all copyright, patent, trademark, and attribution notices,
+excluding those notices that do not pertain to any part of
+the derivative works; and
+- If a "NOTICE" text file is provided as part of its
+distribution, then any derivative works that you distribute
+must include a readable copy of the attribution notices
+contained within such NOTICE file, excluding those notices
+that do not pertain to any part of the derivative works.

-Agreement: The agreement between Databricks, Inc., and you governing the use of the Databricks Services, which shall be, with respect to Databricks, the Databricks Terms of Service located at www.databricks.com/termsofservice, and with respect to Databricks Community Edition, the Community Edition Terms of Service located at www.databricks.com/ce-termsofuse, in each case unless you have entered into a separate written agreement with Databricks governing the use of the applicable Databricks Services.
+You may add your own copyright statement to your modifications and may
+provide additional license terms and conditions for use, reproduction,
+or distribution of your modifications, or for any such derivative works
+as a whole, provided your use, reproduction, and distribution of
+the Licensed Materials otherwise complies with the conditions stated
+in this License.

-Software: The source code and object code to which this license applies.
+Termination. This license terminates automatically upon your breach of
+these terms or upon the termination of your Agreement. Additionally,
+Databricks may terminate this license at any time on notice. Upon
+termination, you must permanently delete the Licensed Materials and
+all copies thereof.

-Scope of Use. You may not use this Software except in connection with your use of the Databricks Services pursuant to the Agreement. Your use of the Software must comply at all times with any restrictions applicable to the Databricks Services, generally, and must be used in accordance with any applicable documentation. You may view, use, copy, modify, publish, and/or distribute the Software solely for the purposes of using the code within or connecting to the Databricks Services. If you do not agree to these terms, you may not view, use, copy, modify, publish, and/or distribute the Software.
+DISCLAIMER; LIMITATION OF LIABILITY.

-Redistribution. You may redistribute and sublicense the Software so long as all use is in compliance with these terms. In addition:
-
-You must give any other recipients a copy of this License;
-You must cause any modified files to carry prominent notices stating that you changed the files;
-You must retain, in the source code form of any derivative works that you distribute, all copyright, patent, trademark, and attribution notices from the source code form, excluding those notices that do not pertain to any part of the derivative works; and
-If the source code form includes a "NOTICE" text file as part of its distribution, then any derivative works that you distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the derivative works.
-You may add your own copyright statement to your modifications and may provide additional license terms and conditions for use, reproduction, or distribution of your modifications, or for any such derivative works as a whole, provided your use, reproduction, and distribution of the Software otherwise complies with the conditions stated in this License.

-Termination. This license terminates automatically upon your breach of these terms or upon the termination of your Agreement. Additionally, Databricks may terminate this license at any time on notice. Upon termination, you must permanently delete the Software and all copies thereof.

-DISCLAIMER; LIMITATION OF LIABILITY.

-THE SOFTWARE IS PROVIDED “AS-IS” AND WITH ALL FAULTS. DATABRICKS, ON BEHALF OF ITSELF AND ITS LICENSORS, SPECIFICALLY DISCLAIMS ALL WARRANTIES RELATING TO THE SOURCE CODE, EXPRESS AND IMPLIED, INCLUDING, WITHOUT LIMITATION, IMPLIED WARRANTIES, CONDITIONS AND OTHER TERMS OF MERCHANTABILITY, SATISFACTORY QUALITY OR FITNESS FOR A PARTICULAR PURPOSE, AND NON-INFRINGEMENT. DATABRICKS AND ITS LICENSORS TOTAL AGGREGATE LIABILITY RELATING TO OR ARISING OUT OF YOUR USE OF OR DATABRICKS’ PROVISIONING OF THE SOURCE CODE SHALL BE LIMITED TO ONE THOUSAND ($1,000) DOLLARS. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+THE LICENSED MATERIALS ARE PROVIDED “AS-IS” AND WITH ALL FAULTS.
+DATABRICKS, ON BEHALF OF ITSELF AND ITS LICENSORS, SPECIFICALLY
+DISCLAIMS ALL WARRANTIES RELATING TO THE LICENSED MATERIALS, EXPRESS
+AND IMPLIED, INCLUDING, WITHOUT LIMITATION, IMPLIED WARRANTIES,
+CONDITIONS AND OTHER TERMS OF MERCHANTABILITY, SATISFACTORY QUALITY OR
+FITNESS FOR A PARTICULAR PURPOSE, AND NON-INFRINGEMENT. DATABRICKS AND
+ITS LICENSORS TOTAL AGGREGATE LIABILITY RELATING TO OR ARISING OUT OF
+YOUR USE OF OR DATABRICKS’ PROVISIONING OF THE LICENSED MATERIALS SHALL
+BE LIMITED TO ONE THOUSAND ($1,000) DOLLARS. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
+OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
+ARISING FROM, OUT OF OR IN CONNECTION WITH THE LICENSED MATERIALS OR
+THE USE OR OTHER DEALINGS IN THE LICENSED MATERIALS.
1 change: 1 addition & 0 deletions R/.gitignore
@@ -1,2 +1,3 @@
**/.Rhistory
**/*.tar.gz
+/sparklyr-mosaic/metastore_db/
9 changes: 1 addition & 8 deletions R/build_r_package.R
@@ -1,13 +1,6 @@
-spark_location <- "/usr/spark-download/unzipped/spark-3.2.1-bin-hadoop2.7"
-Sys.setenv(SPARK_HOME = spark_location)
-
+spark_location <- Sys.getenv("SPARK_HOME")
library(SparkR, lib.loc = c(file.path(spark_location, "R", "lib")))


library(pkgbuild)
library(sparklyr)



build_mosaic_bindings <- function(){
## build package
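The R scripts now resolve Spark via SPARK_HOME rather than a hard-coded Spark 3.2.1 path. A minimal sketch of the resulting contract between the workflow and the scripts (the 3.4.1 path below is illustrative, standing in for a matrix.spark value; it is not taken from this diff):

# sketch only: the action's env: block supplies SPARK_HOME (version is hypothetical)
Sys.setenv(SPARK_HOME = "/usr/spark-download/unzipped/spark-3.4.1-bin-hadoop3")
# ...and build_r_package.R / generate_docs.R read it back instead of hard-coding a path
spark_location <- Sys.getenv("SPARK_HOME")
library(SparkR, lib.loc = c(file.path(spark_location, "R", "lib")))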
95 changes: 50 additions & 45 deletions R/generate_R_bindings.R
@@ -8,14 +8,14 @@ library(methods)

parser <- function(x){
#split on left bracket to get name
-splitted = strsplit(x, "(", fixed=T)[[1]]
+splitted <- strsplit(x, "(", fixed=T)[[1]]
# extract function name
-function_name = splitted[1]
+function_name <- splitted[1]
# remove the trailing bracket
-args = gsub( ")", '',splitted[2], fixed=T)
-args = strsplit(args, ", ", fixed=T)[[1]]
-args = lapply(args, function(x){strsplit(x, ": ", fixed=T)}[[1]])
-output = list(
+args <- gsub( ")", '',splitted[2], fixed=T)
+args <- strsplit(args, ", ", fixed=T)[[1]]
+args <- lapply(args, function(x){strsplit(x, ": ", fixed=T)}[[1]])
+output <- list(
"function_name" = function_name
,"args"=args
)
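For reference, a quick sketch of what parser() yields for one signature line, assuming the function returns the output list built above (st_buffer is an illustrative input, not taken from this diff):

# illustrative usage of parser(); expected values shown in comments
sig <- "st_buffer(geom: Column, radius: Double)"
parsed <- parser(sig)
parsed$function_name   # "st_buffer"
parsed$args            # list(c("geom", "Column"), c("radius", "Double"))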
@@ -24,8 +24,8 @@ parser <- function(x){

############################################################
build_generic <- function(input){
-function_name = input$function_name
-args = lapply(input$args, function(x){x[1]})
+function_name <- input$function_name
+args <- lapply(input$args, function(x){x[1]})
paste0(
'#\' @rdname ', function_name, '
setGeneric(
@@ -35,21 +35,9 @@ build_generic <- function(input){
')
}


-build_generic2 <- function(input){
-function_name = input$function_name
-args = lapply(input$args, function(x){x[1]})
-paste0(
-'#\' @rdname ', function_name, '
-setGeneric(
-name="',function_name,'"
-,def=function(',paste0(args, collapse=','), ') {standardGeneric("',function_name, '")}
-)
-')
-}
############################################################
build_column_specifiers <- function(input){
-args = lapply(input$args, function(x){x[1]})
+args <- lapply(input$args, function(x){x[1]})
build_column_specifier <- function(arg){
return(paste0(arg, '@jc'))
}
@@ -62,29 +50,32 @@ build_column_specifiers <- function(input){
}
############################################################
build_method<-function(input){
-function_name = input$function_name
-arg_names = lapply(input$args, function(x){c(x[1])})
+function_name <- input$function_name
+arg_names <- lapply(input$args, function(x){c(x[1])})
#this handles converting non-Column arguments to their R equivalents
argument_parser <- function(x){
if(x[2] == 'Int'){
-x[2] = "numeric"
+x[2] <- "numeric"
}
else if(x[2] == 'String'){
-x[2] = "character"
+x[2] <- "character"
}
else if(x[2] == 'Double'){
-x[2] = "numeric"
+x[2] <- "numeric"
}
+else if(x[2] == 'Boolean') {
+x[2] <- "logical"
+}
x
}
# convert scala type to R types
-args = lapply(input$args, argument_parser)
+args <- lapply(input$args, argument_parser)
# take a copy for building the docs
-param_args = args
+param_args <- args
# wrap the strings in speech marks
-args = lapply(args, function(x){c(x[1], paste0("'", x[2], "'"))})
+args <- lapply(args, function(x){c(x[1], paste0("'", x[2], "'"))})
# collapse down to a single string
-args = lapply(args, function(x){paste0(x, collapse= ' = ')})
+args <- lapply(args, function(x){paste0(x, collapse= ' = ')})
column_specifiers <- build_column_specifiers(input)
docstring <- paste0(
c(paste0(c("#'", function_name), collapse=" "),
@@ -116,48 +107,62 @@ build_method<-function(input){
############################################################
get_function_names <- function(scala_file_path){
#scala_file_path = "~/Documents/mosaic/src/main/scala/com/databricks/labs/mosaic/functions/MosaicContext.scala"
-scala_file_object = file(scala_file_path)
+scala_file_object <- file(scala_file_path)

-scala_file = readLines(scala_file_object)
+scala_file <- readLines(scala_file_object)
closeAllConnections()
# find where the methods start
-start_string = " object functions extends Serializable {"
-start_index = grep(start_string, scala_file, fixed=T) + 1
+start_string <- " object functions extends Serializable {"
+start_index <- grep(start_string, scala_file, fixed=T) + 1
# find the methods end - will be the next curly bracket
# need to find where the matching end brace for the start string is located.
# counter starts at 1 as the start string includes the opening brace
-brace_counter = 1
+brace_counter <- 1

for(i in start_index : length(scala_file)){
# split the string into characters - returns a list so unlist it
line_characters <- unlist(strsplit(scala_file[i], ''))
# count the number of brace opens
-n_opens = sum(grepl("{", line_characters, fixed=T))
+n_opens <- sum(grepl("{", line_characters, fixed=T))
# count the number of brace closes
-n_closes = sum(grepl("}", line_characters, fixed=T))
+n_closes <- sum(grepl("}", line_characters, fixed=T))
# update the counter
brace_counter <- brace_counter + n_opens - n_closes
if (brace_counter == 0) break

}
-methods_to_bind = scala_file[start_index:i]
+methods_to_bind <- scala_file[start_index:i]
# remove any line that doesn't start with def
-def_mask = grepl('\\s+def .*', methods_to_bind)
-methods_to_bind = methods_to_bind[def_mask]
+def_mask <- grepl('\\s+def .*', methods_to_bind)
+methods_to_bind <- methods_to_bind[def_mask]
# parse the string to get just the function_name(input:type...) pattern
-methods_to_bind = unlist(lapply(methods_to_bind, function(x){
+methods_to_bind <- unlist(lapply(methods_to_bind, function(x){
substr(x
, regexpr("def ", x, fixed=T)[1]+4 # get the starting point to account for whitespace
, regexpr("): ", x, fixed=T)[1] # get the end point of where the return is.
)
}
))
-sort(methods_to_bind, T)
+sort_methods_by_argcount(methods_to_bind)
}

+############################################################
+sort_methods_by_argcount <- function(methods) {
+# split each signature on the opening bracket to get the method name
+method_names <- sapply(strsplit(methods, "\\("), function(x) x[1])
+# count commas as a proxy for the number of arguments
+argcount <- sapply(strsplit(methods, ","), function(x) length(x) - 1)
+
+# sort alphabetically by method name, then by argument count
+order_indices <- order(method_names, argcount)
+
+# return the sorted vector
+methods_sorted <- methods[order_indices]
+return(methods_sorted)
+}
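The old sort(methods_to_bind, T) only reverse-sorted the signatures lexically; sort_methods_by_argcount groups overloads by method name and then orders them by comma count, a proxy for argument count. A quick sketch with made-up signatures (illustrative only, not part of the diff):

# illustrative usage of sort_methods_by_argcount()
m <- c("st_buffer(geom: Column, radius: Column, unit: String)",
       "st_area(geom: Column)",
       "st_buffer(geom: Column, radius: Column)")
sort_methods_by_argcount(m)
# "st_area(geom: Column)"
# "st_buffer(geom: Column, radius: Column)"
# "st_buffer(geom: Column, radius: Column, unit: String)"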

############################################################
build_sparklyr_mosaic_function <- function(input){
-function_name = input$function_name
+function_name <- input$function_name
paste0(

"#' ", function_name, "\n\n",
@@ -191,7 +196,7 @@ main <- function(scala_file_path){
##########################
##########################
# build sparkr functions
-function_data = get_function_names(scala_file_path)
+function_data <- get_function_names(scala_file_path)
parsed <- lapply(function_data, parser)


@@ -223,9 +228,9 @@ main <- function(scala_file_path){
# supplementary files
sparkr_supplementary_files <- c("sparklyr-mosaic/enableMosaic.R", "sparklyr-mosaic/sparkFunctions.R")
copy_supplementary_file(sparkr_supplementary_files, "sparklyr-mosaic/sparklyrMosaic/R/")

}


args <- commandArgs(trailingOnly = T)
if (length(args) != 1){
stop("Please provide the MosaicContext.scala file path to generate_sparkr_functions.R")
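For context, main() is driven from the command line and takes the MosaicContext.scala path as its single argument; a hypothetical invocation (the path is an example, not from this diff):

# Rscript --vanilla generate_R_bindings.R src/main/scala/com/databricks/labs/mosaic/functions/MosaicContext.scala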
4 changes: 1 addition & 3 deletions R/generate_docs.R
@@ -1,6 +1,4 @@
-spark_location <- "/usr/spark-download/unzipped/spark-3.2.1-bin-hadoop2.7"
-Sys.setenv(SPARK_HOME = spark_location)
-
+spark_location <- Sys.getenv("SPARK_HOME")
library(SparkR, lib.loc = c(file.path(spark_location, "R", "lib")))
library(roxygen2)

4 changes: 1 addition & 3 deletions R/install_deps.R
@@ -1,5 +1,3 @@
options(repos = c(CRAN = "https://packagemanager.posit.co/cran/__linux__/focal/latest"))

install.packages("pkgbuild")
install.packages("roxygen2")
install.packages("sparklyr")
install.packages(c("pkgbuild", "testthat", "roxygen2", "sparklyr"))