Contoso dog walking event analysis

Contoso is a company that keeps track of pets and runs real-time analytics on top of the walk data it collects.

Each owner (Owner) has multiple dogs (Dog). Each dog is taken for a walk three times a day by a pet sitter (PetSitter). When a walk is done, a message (DogWalk) is emitted, which is processed by the real-time streaming platform.
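To make the entities concrete, here is a minimal sketch of a DogWalk event payload. The field names Id, DogId, PetSitterId, and WalkDurationInMinutes match those used by the Stream Analytics query later in this document; the EmittedAtUtc field and everything else is an illustrative assumption, not the actual EventGenerator schema.

```python
import json
import uuid
from datetime import datetime, timezone

def make_dog_walk_event(dog_id: int, pet_sitter_id: int, duration_minutes: int) -> str:
    """Build a DogWalk event as a JSON string, ready to be sent to Event Hubs."""
    event = {
        "Id": str(uuid.uuid4()),            # unique event id
        "DogId": dog_id,                    # joined against the Dog reference data
        "PetSitterId": pet_sitter_id,       # joined against the PetSitter reference data
        "WalkDurationInMinutes": duration_minutes,
        "EmittedAtUtc": datetime.now(timezone.utc).isoformat(),  # assumed timestamp field
    }
    return json.dumps(event)

# Example: one of the three daily walks for dog 12, taken by pet sitter 3
print(make_dog_walk_event(12, 3, 25))
```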

Architecture

The overall architecture is the following:

Architecture diagram

  1. A console application named EventGenerator emits events to Event Hubs.

  2. These events are processed by a Stream Analytics job.

    2a. Cosmos DB can be used to ensure exactly-once message processing.

  3. The Dog, Owner, and PetSitter reference data are read from blob storage. Note that it's a best practice to include a date pattern in the reference data path in order to support potential updates.

  4. The console application also emits events with unknown DogId and PetSitterId values. These records are routed to a blob storage output.

  5. The successfully enriched data are stored in a different blob storage container.

  6. The data are streamed into a Power BI streaming dataset, which feeds a streaming dashboard.

  7. Stream Analytics can optionally output to another Event Hub, which can then trigger additional processing, either through a code-based approach in Azure Functions or a designer-based approach hosted in Logic Apps.
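Steps 3 through 5 amount to an enrich-and-route operation: events whose DogId and PetSitterId resolve against the reference data go to the enriched output, and the rest go to the missing-references output. A minimal sketch of that logic, where plain dictionaries stand in for the blob reference data (the field names beyond those in the query are illustrative):

```python
def route_events(events, dogs, pet_sitters):
    """Split events into fully enriched and partially matched, mirroring steps 3-5."""
    enriched, missing = [], []
    for event in events:
        dog = dogs.get(event["DogId"])
        sitter = pet_sitters.get(event["PetSitterId"])
        if dog is not None and sitter is not None:
            # Augment the event with the matched reference data
            enriched.append({**event, "PetName": dog["Name"], "PetSitterName": sitter["Name"]})
        else:
            # At least one reference is unknown: route to the missing-refs output
            missing.append(event)
    return enriched, missing

# Reference data stand-ins (in the real pipeline these come from blob storage)
dogs = {1: {"Name": "Rex"}}
sitters = {7: {"Name": "Ada"}}
events = [{"DogId": 1, "PetSitterId": 7}, {"DogId": 99, "PetSitterId": 7}]
ok, bad = route_events(events, dogs, sitters)
print(len(ok), len(bad))  # → 1 1
```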

Installation instructions

Infrastructure as Code (IaC)

There are two approaches to deploying the necessary resources (see Deployed components).

Option A) Deploy via portal and upload reference data manually

Deploy To Azure Visualize

After deploying, upload the reference data in the corresponding storage container.

Option B) Deploy using PowerShell

Open a PowerShell session, navigate to the Deployment folder, and log in to Azure:

Connect-AzAccount -Subscription $SubscriptionId  -Tenant $TenantId

After that, you can deploy the resources and upload the reference data by running the following command, providing the resource group name where you want to deploy the resources and the demo name, which acts as a prefix for the resources to be deployed.

NOTE: DemoName must comply with storage account naming rules, so it can contain only lowercase letters and numbers.

.\deploy.ps1
Supply values for the following parameters:
ResourceGroupName: 
DemoName: 

NOTE: To execute unsigned PowerShell scripts on your host, you may have to disable signature checking:

Set-ExecutionPolicy -Scope Process -ExecutionPolicy Bypass

NOTE: This script also generates the secrets.json file needed in the EventGenerator project folder.

Configure Azure Stream Analytics job

Normally you would automate the deployment of the Azure Stream Analytics job. For this demo, however, you will configure the job manually, which gives you the opportunity to familiarize yourself with Azure Stream Analytics.

Configure PowerBI output

In the portal, open the Stream Analytics job and navigate to Outputs (1). Select the output-pbi output (2) and click Renew authorization (3) to authorize the Azure Stream Analytics job to push data to the streaming dataset named demo-streaming-dataset. After authorizing, you will be able to configure the target group workspace (4) where you want to deploy the streaming dataset. Click Save (5) to persist your changes.

Configure Power BI

Configure the streaming job

In the portal, open the Stream Analytics job and navigate to Query (1). Paste the following query in the query editor (2) and save the query (3).

With JoinedData As (
    SELECT Input.Id, Input.WalkDurationInMinutes, Input.PetSitterId,
      Dogs.Name as PetName, Dogs.Height, Dogs.Weight, Dogs.Length, Dogs.OwnerId,
      UDF.FormatNumber(Dogs.OwnerId, 5) As PartitionKey,
      PetSitters.FirstName + ' ' + PetSitters.LastName as PetSitterName, PetSitters.BirthDay as PetSitterBirthday, PetSitters.Rating, PetSitters.AverageWalkTimeInMinutes,
      Owners.FirstName + ' ' + Owners.LastName as OwnerName, Owners.BirthDay as OwnerBirthday
    FROM [input-event-hub] Input TIMESTAMP BY EventEnqueuedUtcTime
    Left Outer JOIN [ref-data-dogs] Dogs ON Input.DogId = Dogs.Id
    Left Outer JOIN [ref-data-petsitters] PetSitters On Input.PetSitterId = PetSitters.Id
    Left Outer JOIN [ref-data-owners] Owners ON Dogs.OwnerId = Owners.Id
), 
FullyEnrichedData As (
    -- Keep only the records that are fully matched to the reference data
    Select * from JoinedData where OwnerId Is Not Null and PetSitterName Is Not Null and OwnerName Is Not Null
),
PartiallyEnrichedData As (
    -- Keep the records with at least one missing reference for later investigation
    Select * from JoinedData where OwnerId Is Null or PetSitterName Is Null or OwnerName Is Null
)

-- Store in permanent store and output in powerBI
Select * Into [output-permanent-store-enriched] From FullyEnrichedData;
Select * Into [output-pbi] From FullyEnrichedData;

-- Store the incomplete data to the missing store in order to investigate later
Select * Into [output-permanent-store-missing] From PartiallyEnrichedData

Configure query

For common query patterns in Azure Stream Analytics, refer to this page.
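The query calls UDF.FormatNumber(Dogs.OwnerId, 5) to build a zero-padded partition key. In Azure Stream Analytics, user-defined functions like this are written in JavaScript; assuming the function left-pads the id with zeros to the given width, its behavior can be sketched as:

```python
def format_number(value: int, width: int) -> str:
    """Left-pad a numeric id with zeros, e.g. 42 -> '00042' for width 5.

    Illustrative stand-in for the (assumed) behavior of the ASA UDF.FormatNumber.
    """
    return str(value).zfill(width)

print(format_number(42, 5))  # → 00042
```

Fixed-width keys like this keep partition names lexicographically sortable in blob storage.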

Running the demo

With all the infrastructure deployed and the Stream Analytics job configured, it's time to run the demo.

Start the stream analytics job

In the portal, open the Stream Analytics job and click Start. Ensure that Now is selected as the job output start time and click the Start button.

Start streaming job

Run the emulator

Create a file named secrets.json in the EventGenerator project folder. The file should have the following format:

{
  "EventHubName": "name",
  "ConnectionString": "Endpoint=sb://name.servicebus.windows.net/;SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=blablabla"
}

where you can retrieve the connection string following these instructions.

NOTE: If you used deployment option B (PowerShell), the secrets.json file is automatically generated for you.
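Since secrets.json is plain JSON, any client can read it. The EventGenerator itself is a console application, so the following Python snippet is only an illustrative sketch of loading the file (the sample values mirror the format above):

```python
import json
from pathlib import Path

# Write a sample secrets.json (normally generated by deploy.ps1 or created by hand)
secrets_path = Path("secrets.json")
secrets_path.write_text(json.dumps({
    "EventHubName": "name",
    "ConnectionString": "Endpoint=sb://name.servicebus.windows.net/;...",
}))

# Load the configuration the way a client application would
secrets = json.loads(secrets_path.read_text())
print(secrets["EventHubName"])  # → name
```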

Build and run the EventGenerator console application.

Emulator running

Verifying results

In the storage account you should be able to see in the dog-walks container two folders:

  • MissingRefs contains the records where the owner, the dog, or the pet sitter was not found in the reference data.
  • Owners contains the enriched records partitioned by the OwnerId.

All records are stored with the date pattern {yyyy}/{mm}/{dd} in their path, so that records can be retrieved faster by date.
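The date-partitioned layout can be sketched as follows; the folder names match the ones above, while the blob file name is an illustrative assumption:

```python
from datetime import date

def blob_path(folder: str, day: date, blob_name: str) -> str:
    """Build a date-partitioned path like Owners/2024/03/09/events.json."""
    return f"{folder}/{day.year:04d}/{day.month:02d}/{day.day:02d}/{blob_name}"

print(blob_path("Owners", date(2024, 3, 9), "events.json"))  # → Owners/2024/03/09/events.json
```

Prefixing blobs by date lets downstream readers list only the prefix for the day they care about instead of scanning the whole container.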

Storage Explorer

In Power BI you should be able to see that a streaming dataset has been created.

Streaming dataset

You can create a dashboard following the instructions of this article.

Streaming dashboard

References