these DataFrames according to the business logic. The Sparkle
application then automatically writes the output of this
transformation to the desired destination.

Sparkle follows a streamlined approach, designed to reduce effort in
data transformation workflows. Here’s how it works:

1. **Specify Input Locations and Types**: Easily set up input locations
and types for your data. Sparkle’s configuration makes this effortless,
removing typical setup hurdles and letting you get started
with minimal overhead.

```python
...
config=Config(
    ...,
    kafka_input=KafkaReaderConfig(
        KafkaConfig(
            bootstrap_servers="localhost:9119",
            credentials=Credentials("test", "test"),
        ),
        kafka_topic="src_orders_v1",
    ),
),
readers={"orders": KafkaReader},
...
```

2. **Define Business Logic**: This is where developers spend most of their time.
Using Sparkle, you create transformations on input DataFrames, shaping data
according to your business needs (a fuller sketch follows this list).

```python
# Override process function from parent class
def process(self) -> DataFrame:
    return self.input["orders"].read().join(
        self.input["users"].read()
    )
```

3. **Specify Output Locations**: Sparkle automatically writes transformed data to
the specified output location, streamlining the output step to make data
available wherever it’s needed.

```python
...
config=Config(
    ...,
    iceberg_output=IcebergConfig(
        database_name="all_products",
        table_name="orders_v1",
    ),
),
writers=[IcebergWriter],
...
```
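
As promised in step 2, here is a slightly fuller sketch of a `process`
method that reads both registered inputs and joins them on an explicit
key. The `user_id` column is an illustrative assumption; the snippet in
step 2 omits the join key, which Spark treats as a cartesian product.

```python
def process(self) -> DataFrame:
    # Read both registered inputs as DataFrames.
    orders = self.input["orders"].read()
    users = self.input["users"].read()
    # Join on an explicit key to avoid an accidental cross join;
    # "user_id" is an assumed column name, shown for illustration.
    return orders.join(users, on="user_id", how="inner")
```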


This structure lets developers concentrate on meaningful transformations while
Sparkle takes care of configurations, testing, and output management.

## Connectors 🔌

Sparkle offers specialized connectors for common data sources and sinks,
making data integration easier. These connectors are designed to
enhance—not replace—the standard Spark I/O options,
streamlining development by automating complex setup requirements.
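
For a sense of what that automation replaces, consuming
schema-registry-encoded Avro from Kafka in plain Spark means wiring the
stream and the wire-format decoding by hand. A rough sketch, assuming an
existing `spark` session and an Avro schema JSON string
`order_schema_json` (both placeholders):

```python
from pyspark.sql.avro.functions import from_avro
from pyspark.sql.functions import expr

# Plain Spark: subscribe to the topic manually.
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9119")
    .option("subscribe", "src_orders_v1")
    .load()
)

# Confluent's wire format prefixes each record with a magic byte and a
# 4-byte schema id, so decoding skips the first 5 bytes of the value.
orders = raw.select(
    from_avro(
        expr("substring(value, 6, length(value) - 5)"),
        order_schema_json,  # placeholder: Avro schema as a JSON string
    ).alias("order")
)
```

Sparkle's Kafka Reader performs the subscription, credential wiring, and
schema lookup behind the `readers={...}` registration shown earlier.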

### Readers

1. **Iceberg Reader**: Simplifies reading from Iceberg tables,
making integration with Spark workflows a breeze (see the sketch
after this list).

2. **Kafka Reader (with Avro schema registry)**: Ingest streaming data
from Kafka with seamless Avro schema registry integration, supporting
data consistency and schema evolution.
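
Conceptually, the Iceberg Reader wraps what would otherwise be a direct
table read in Spark; a minimal sketch, assuming a configured Iceberg
catalog (the `glue_catalog` name is illustrative):

```python
# Plain Spark equivalent of an Iceberg table read; Sparkle's reader
# resolves the catalog and database from its Config instead.
orders_df = spark.table("glue_catalog.all_products.orders_v1")
```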

### Writers

1. **Iceberg Writer**: Easily write transformed data to Iceberg tables,
ideal for time travel and partitioned data storage (see the sketch
after this list).

2. **Kafka Writer**: Publish data to Kafka topics with ease, supporting
real-time analytics and downstream consumers.
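
Under the hood, an Iceberg append in Spark 3 is a one-liner with the
`writeTo` API, and time travel is a read option; the Iceberg Writer
layers the `IcebergConfig` (database, table, bucket) shown earlier on
top of this. The snapshot id below is a placeholder:

```python
# Append a transformed DataFrame to the Iceberg table (plain Spark 3 API).
df.writeTo("all_products.orders_v1").append()

# Iceberg time travel: read the table as of an earlier snapshot.
snapshot = (
    spark.read.option("snapshot-id", 10963874102873)  # placeholder id
    .format("iceberg")
    .load("all_products.orders_v1")
)
```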

## Getting Started 🚀

Sparkle is currently under heavy development, and we are continuously
improving it.

To stay updated on our progress and access the latest information,
follow us on [LinkedIn](https://nl.linkedin.com/company/datachefco)
and [GitHub](https://github.com/DataChefHQ/Sparkle).

## Example

This is the simplest example: an Orders pipeline that reads records
from a Kafka topic and writes them to an Iceberg table:

```python
from sparkle.config import Config, IcebergConfig, KafkaReaderConfig
from sparkle.config.kafka_config import KafkaConfig, Credentials
from sparkle.writer.iceberg_writer import IcebergWriter
from sparkle.application import Sparkle
from sparkle.reader.kafka_reader import KafkaReader

from pyspark.sql import DataFrame


class CustomerOrders(Sparkle):
    def __init__(self):
        super().__init__(
            config=Config(
                app_name="orders",
                app_id="orders-app",
                version="0.0.1",
                database_bucket="s3://test-bucket",
                checkpoints_bucket="s3://test-checkpoints",
                iceberg_output=IcebergConfig(
                    database_name="all_products",
                    table_name="orders_v1",
                ),
                kafka_input=KafkaReaderConfig(
                    KafkaConfig(
                        bootstrap_servers="localhost:9119",
                        credentials=Credentials("test", "test"),
                    ),
                    kafka_topic="src_orders_v1",
                ),
            ),
            readers={"orders": KafkaReader},
            writers=[IcebergWriter],
        )

    def process(self) -> DataFrame:
        return self.input["orders"].read()
```
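
How a Sparkle application is launched is not covered here; the entry
point below is purely hypothetical, assuming a `run()` method on the
application class:

```python
# Hypothetical launcher: run() is an assumed entry point, shown only
# for illustration; consult the Sparkle docs for the real invocation.
if __name__ == "__main__":
    CustomerOrders().run()
```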

## Contributing 🤝

We welcome contributions from the community! If you're interested in
contributing, check out the project on
[GitHub](https://github.com/DataChefHQ/Sparkle).
