Data source extensions should be composable #116

Merged

Member

@Apollo3zehn Apollo3zehn commented Jul 2, 2024

This merge request adds a pipeline feature to Nexus so that multiple IDataSources can be chained. This solves two problems:

    1. Catalogs and resources can now be altered. For example, with the data source Nexus.Sources.Transform, which was developed in parallel, users of Nexus can rename resources, derive units from channel names, assign default groups, etc.
    2. Catalogs can be augmented with additional resources, i.e. two or more data sources can now be responsible for a single catalog. This is useful for plugins that derive data from raw data where the derived data (i.e. new resources) should be located in the same catalog. It is also useful when data is located in the same folder structure but in different file formats. Normally, data from all file types in the same folder belongs together but is handled by different plugins / data sources.

To distinguish which data source should handle which data requests, every resource gets an integer property assigned under the path nexus/pipeline-position:

[Screenshot: resource properties containing nexus/pipeline-position]

This position is set by Nexus when the individual data sources return their resource catalogs. It is then later used to distribute ReadRequests to the corresponding data sources.
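
As a rough illustration of the two steps described above (tagging resources with their pipeline position, then using that tag to distribute read requests), here is a minimal Python sketch. It is not the Nexus implementation: resources are modeled as plain dicts, and the function names `tag_resources` and `group_requests_by_position` are hypothetical. It also assumes that each data source's newly contributed resources are tagged exactly once.

```python
from collections import defaultdict

# Assumed key name inside the "nexus" properties object.
PIPELINE_POSITION_KEY = "pipeline-position"


def tag_resources(resources, pipeline_position):
    """Return copies of the resources with the pipeline position injected
    under properties["nexus"] (hypothetical helper, not Nexus code)."""
    tagged = []
    for resource in resources:
        properties = dict(resource.get("properties") or {})
        nexus = dict(properties.get("nexus") or {})
        nexus[PIPELINE_POSITION_KEY] = pipeline_position
        properties["nexus"] = nexus
        tagged.append({**resource, "properties": properties})
    return tagged


def group_requests_by_position(resources_to_read):
    """Group resource IDs by the pipeline position stored in their metadata,
    so each data source only receives the requests it is responsible for."""
    groups = defaultdict(list)
    for resource in resources_to_read:
        position = resource["properties"]["nexus"][PIPELINE_POSITION_KEY]
        groups[position].append(resource["id"])
    return dict(groups)


# The first data source (position 0) provides raw resources,
# the second one (position 1) adds a derived resource.
raw = tag_resources([{"id": "T1"}, {"id": "V1"}], pipeline_position=0)
derived = tag_resources([{"id": "T1_mean"}], pipeline_position=1)

routing = group_requests_by_position(raw + derived)
print(routing)  # {0: ['T1', 'V1'], 1: ['T1_mean']}
```

Because the position is derived purely from the order of the data sources in the pipeline, the same pipeline definition always yields the same routing.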

At the same time, this additional piece of metadata is useful for making the data processing pipeline more traceable, so that users can always find out which software version and which configuration led to a specific set of data. In the future we should add Git support and create a commit every time the configuration changes. The current commit ID will then become part of the catalog metadata (#119).

A frequent change to the Nexus source code is the renaming of DataSourceRegistration to Pipeline. This was necessary because a set of catalogs is no longer provided by a single DataSourceRegistration but by multiple DataSourceRegistrations which together compose a pipeline.

There are also many changes caused by the "format on save" feature, i.e. superfluous whitespace has often been removed, or I have reformatted some individual lines of code without changing their meaning.

Here are some comments on the individual changed files:

.github/workflows
- Use a specific pyright version because the new one causes type checking errors (however, this means we need to solve the Python type errors in the near future, #124)

.vscode/settings.json
- Exclude .razor files from editor.formatOnSave because formatting them produces incorrect files

notes/plugin-pipeline.excalidraw
- A drawing which shows the pipeline feature, can be ignored

openapi.json
- This is an auto-generated file for the Swagger UI (https://nexus.iwes.fraunhofer.de/api), can be ignored

src/Nexus.UI/Components/CatalogAboutView.razor
- Since we now have a list of data sources (the pipeline), the data source info page has been adapted to display data per data source

src/Nexus.UI/Core/AppState.cs
- The CatalogInfo type (contains display info for the UI) had to be adapted so that info about all pipeline members (data sources) can be provided to the UI

src/Nexus.UI/Core/NexusDemoClient.cs
- Same as before

src/Nexus.UI/ViewModels/FakeResourceCatalogViewModel.cs
- same as before

src/Nexus/API/CatalogsController.cs
- same as before
- mainly renaming from DataSourceRegistration to Pipeline
- line 293 (old) / 301 (new): I made the extension method JsonElement.GetStringValue a bit more efficient by reducing the number of string.Split operations which means that now the first parameter is an array instead of a path-like string. This change will occur in other files as well
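
The idea behind this change can be sketched in Python as follows. This is only an illustration with nested dicts standing in for JsonElement. The function names and the example property `meta/unit` are hypothetical, and the actual C# signature may differ. The point is that a path passed as pre-split segments can be stored once as a constant, whereas a path-like string has to be split on every call.

```python
def get_string_value(properties, path_segments):
    """Walk nested dicts along path_segments and return the string found
    there, or None if the path does not exist or the value is not a string."""
    current = properties
    for segment in path_segments:
        if not isinstance(current, dict) or segment not in current:
            return None
        current = current[segment]
    return current if isinstance(current, str) else None


def get_string_value_from_path(properties, path):
    """Older style: the path-like string is split on every single call."""
    return get_string_value(properties, path.split("/"))


# Hypothetical property layout, for illustration only.
properties = {"meta": {"unit": "m/s", "rank": 3}}

# The segments are defined once and reused for every lookup.
UNIT_PATH = ["meta", "unit"]

print(get_string_value(properties, UNIT_PATH))                # m/s
print(get_string_value_from_path(properties, "meta/unit"))    # m/s
print(get_string_value(properties, ["meta", "missing"]))      # None
```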

src/Nexus/API/SourcesController.cs
- Previously the user-specific DataSourceRegistration configuration was part of the project.json file in the Nexus configuration folder. This has been factored out and is now part of the user-specific folders (also in the Nexus configuration folder):

[Screenshot: user-specific folders in the Nexus configuration folder]

The file pipelines.json contains all user-configured pipelines. The pipelines themselves are managed by the newly created service PipelineService, and the file system interaction is handled by the already existing DatabaseService. Both services are injected into this file (src/Nexus/API/SourcesController.cs).

The REST API code in this file has been adapted to let users interact with pipelines instead of data source registrations.

src/Nexus/API/UsersController.cs
- The type InternalDataSourceRegistration became superfluous and has been removed. Now DataSourceRegistration is used everywhere instead

src/Nexus/Core/CatalogContainer.cs
- This file mainly follows the name changes and the fact that we now have to handle arrays instead of single object DataSourceRegistrations

src/Nexus/Core/Models_NonPublic.cs
- As described above, the DataSourceRegistrations were previously part of project.json, which the type UserConfiguration belonged to. Now that DataSourceRegistrations live in their own pipelines.json files, the type UserConfiguration is not required anymore

src/Nexus/Core/Models_Public.cs
- see the "same as before" comments above
- There is now a DataSourcePipeline type which is similar to the old DataSourceRegistration type except that now we have a list of DataSourceRegistrations

src/Nexus/Extensibility/DataSource/DataSourceController.cs
- This is the core of the changes: Here Nexus has to handle the new pipeline approach, i.e. it has to answer questions like "What to do with multiple GetTimeRange() return values?" (because we now have multiple data sources), and more.
- The solution for multiple GetAvailability() responses is to calculate the average
- The old extension method GetCatalogAsync has been renamed to EnrichCatalogAsync because every data source now receives the catalog returned by the data source located before it in the pipeline. The first data source gets an empty catalog.
- Data sources only receive read requests for resources which belong to their own pipeline position
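
The catalog chaining and the availability averaging described above can be sketched in Python as follows. This is a simplified, synchronous illustration and not the DataSourceController code: the classes `RawSource` and `DerivedSource`, the method names and the dict-based catalog are all hypothetical, and an unweighted mean is assumed for the average.

```python
class RawSource:
    """Hypothetical first pipeline member: contributes a raw resource."""

    def enrich_catalog(self, catalog):
        resources = catalog["resources"] + [{"id": "T1"}]
        return {**catalog, "resources": resources}

    def get_availability(self, begin, end):
        return 1.0


class DerivedSource:
    """Hypothetical second pipeline member: adds one derived resource
    for every resource already present in the incoming catalog."""

    def enrich_catalog(self, catalog):
        derived = [{"id": r["id"] + "_mean"} for r in catalog["resources"]]
        return {**catalog, "resources": catalog["resources"] + derived}

    def get_availability(self, begin, end):
        return 0.5


def build_catalog(pipeline, catalog_id):
    """Start with an empty catalog and pass the result of each data source
    on to the next data source in the pipeline."""
    catalog = {"id": catalog_id, "resources": []}
    for source in pipeline:
        catalog = source.enrich_catalog(catalog)
    return catalog


def get_availability(pipeline, begin, end):
    """Combine the availability values of all pipeline members by averaging
    (unweighted mean, which is an assumption of this sketch)."""
    values = [source.get_availability(begin, end) for source in pipeline]
    if not values:
        raise ValueError("The pipeline must contain at least one data source.")
    return sum(values) / len(values)


pipeline = [RawSource(), DerivedSource()]

catalog = build_catalog(pipeline, "/SAMPLE/CATALOG")
print([r["id"] for r in catalog["resources"]])  # ['T1', 'T1_mean']
print(get_availability(pipeline, begin=None, end=None))  # 0.75
```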

src/Nexus/Extensions/Sources/Sample.cs
- mainly just adapt to other code changes
- Line 150 (old) / 149 (new): there is now a new tuple parameter called originalResourceName in the method ReadAsync. It became necessary because with the pipeline approach resource IDs can be modified by data sources which come later in the pipeline. So the data source which originally provided a resource with a specific ID (= name) can no longer rely on the resource ID in the ReadAsync method. Therefore Nexus ensures that every resource has an original-name property:

[Screenshot: resource properties containing the original-name property]

This property can be deliberately set by a data source or - in case the data source doesn't do this - Nexus will do it for you so that this value is never null. So the originalResourceName will now always be part of a ReadRequest.
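
A minimal Python sketch of this behavior is shown below. It is only an illustration: resources are plain dicts, the helper names are hypothetical, and the exact location of original-name within the property hierarchy is an assumption.

```python
# Assumed property key; where exactly it lives in the hierarchy is an assumption.
ORIGINAL_NAME_KEY = "original-name"


def ensure_original_name(resource):
    """Keep an original name set by the data source; otherwise fall back to
    the current resource ID so that the value is never missing or None."""
    properties = dict(resource.get("properties") or {})
    if properties.get(ORIGINAL_NAME_KEY) is None:
        properties[ORIGINAL_NAME_KEY] = resource["id"]
    return {**resource, "properties": properties}


def rename(resource, new_id):
    """What a later pipeline member might do: change the resource ID."""
    return {**resource, "id": new_id}


# The first data source provides "temp_1" without setting an original name.
provided = ensure_original_name({"id": "temp_1"})

# A later data source renames the resource.
renamed = rename(provided, "temperature")

# The read request still carries the name the first data source knows.
original_resource_name = renamed["properties"][ORIGINAL_NAME_KEY]
print(renamed["id"], original_resource_name)  # temperature temp_1
```

The originating data source can therefore locate its data by the original name, regardless of how later pipeline members rename the resource.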

src/Nexus/Extensions/Writers/Csv.cs
- follow previous code changes

src/Nexus/Program.cs
- register the PipelineService for DI

src/Nexus/Services/AppStateManager.cs
- Data source registrations are now managed by PipelineService, so remove the unnecessary code from here

src/Nexus/Services/CatalogManager.cs
- follow previous code changes

src/Nexus/Services/DataControllerService.cs
- follow previous code changes

src/Nexus/Services/DataService.cs
- follow previous code changes

src/Nexus/Services/DatabaseService.cs
- Extend this service with functionality to handle pipeline data

src/Nexus/Services/PipelineService.cs
- The pipeline service (handles creation, deletion and retrieval of pipelines per user)

src/Nexus/wwwroot/css/app.css
- auto-generated by Tailwind (can be ignored)

src/clients/dotnet-client/NexusClient.g.cs
- auto-generated (can be ignored)

src/clients/python-client/nexus_api/_nexus_api.py
- auto-generated (can be ignored)

src/extensibility/dotnet-extensibility/DataModel/DataModelExtensions.cs

- Since Nexus now actively relies on the presence of the original-name resource property, there is a helper method to create it. This already existed in the project Nexus.Sources.StructuredFile but has been moved over into this project
- All code which ensures the presence of mandatory catalog and resource properties has been moved over here to a central place
- The catalog properties now look a bit different. This is to make the .json object a bit more compact

new:
[Screenshot: new catalog properties]

old:
[Screenshot: old catalog properties]

src/extensibility/dotnet-extensibility/DataModel/PropertiesExtensions.cs
- As mentioned before, the number of string.Split() operations has been reduced to make property access more efficient. Internally catalog and resource properties are represented by a JsonElement and unfortunately it is a bit of work to access nested JSON data. That is the reason why this class exists.

src/extensibility/dotnet-extensibility/DataModel/ResourceCatalog.cs
- follow previous code changes

src/extensibility/dotnet-extensibility/Extensibility/DataSource/DataSourceTypes.cs
- follow previous code changes

src/extensibility/dotnet-extensibility/Extensibility/DataSource/IDataSource.cs
- GetCatalogAsync has been renamed to EnrichCatalogAsync and the parameters changed

src/extensibility/python-extensibility/nexus_extensibility/_extensibility_data_source.py
- mirror C# changes to Python

tests/Nexus.Tests/DataSource/DataSourceControllerFixture.cs
- this unit test fixture prepares test data, i.e. it prepares data source registrations (now two instead of one because we want to test the new pipeline behavior)

tests/Nexus.Tests/DataSource/DataSourceControllerTests.cs
- Tests have been adapted to the pipeline feature

tests/Nexus.Tests/DataSource/SampleDataSourceTests.cs
- follow previous code changes

tests/Nexus.Tests/DataSource/TestSource.cs
- A data source for use in the tests which modifies existing resources and adds a new resource to the catalog. This data source is placed in pipeline position 1, i.e. after the actual data source

tests/Nexus.Tests/Other/CatalogContainersExtensionsTests.cs
- Mainly code format changes
- follow previous code changes

tests/Nexus.Tests/Other/PackageControllerTests.cs
- follow previous code changes

tests/Nexus.Tests/Services/CatalogManagerTests.cs
- Tests have been adapted to the pipeline feature

tests/Nexus.Tests/Services/DataControllerServiceTests.cs

- Tests have been adapted to the pipeline feature

tests/Nexus.Tests/Services/DataServiceTests.cs
- follow previous code changes

tests/Nexus.Tests/Services/PipelineServiceTests.cs
- tests for the PipelineService

tests/Nexus.Tests/Services/TokenServiceTests.cs
- fix warnings

tests/TestExtensionProject/TestDataSource.cs
- follow previous code changes

@Apollo3zehn Apollo3zehn linked an issue Jul 2, 2024 that may be closed by this pull request
@Apollo3zehn Apollo3zehn marked this pull request as ready for review July 12, 2024 12:35
@Apollo3zehn Apollo3zehn requested a review from Conundraah July 12, 2024 13:56
@Apollo3zehn Apollo3zehn self-assigned this Jul 12, 2024
@Apollo3zehn Apollo3zehn force-pushed the 57-data-source-extensions-should-be-composable-see-text branch from 3ca0426 to 5ce8564 Compare July 22, 2024 18:12
@Conundraah
Contributor

[QUESTIONS]:

"To distinguish which data source should handle which data requests, every resource gets an integer property assigned under the path nexus/pipeline-position:"

This part is not really clear to me: Where exactly is nexus/pipeline-position located? Is it in the data source's accompanying resource.json file?

"This position is set by Nexus when the individual data sources return their resource catalogs. It is then later used to distribute ReadRequests to the corresponding data sources."

Is the pipeline position assigned temporarily or locally? Does it change every time the individual data sources return their resource catalogs?

@Apollo3zehn
Member Author

Apollo3zehn commented Jul 23, 2024

[QUESTIONS]:

"To distinguish which data source should handle which data requests, every resource gets an integer property assigned under the path nexus/pipeline-position:"

"This part is not really clear to me: Where exactly is nexus/pipeline-position located? Is it in the data source's accompanying resource.json file?"

With nexus/pipeline-position I mean the JSON hierarchy as shown in the screenshot below:

[Screenshot: resource metadata with the pipeline-position property inside the nexus object]

Here you can see that the pipeline-position property is within the object nexus, i.e. the path is nexus/pipeline-position. So this property is part of the resource's metadata (JSON) and it is injected by Nexus itself, so the data sources do not need to worry about it.

This position is set by Nexus when the individual data sources return their resource catalogs. It is then later used to distribute ReadRequests to the corresponding data sources."

"Is the pipeline position assigned temporarily or locally? Does it change every time the individual data sources return their resource catalogs?"

It is assigned on the fly by Nexus and only stored in memory. The assignment occurs in a method named EnsureAndSanitizeMandatoryProperties:

[PIPELINE_POSITION_KEY] = pipelinePosition,

The value only changes when the pipeline definition (pipelines.json) changes; otherwise Nexus will always assign the same number. The first data source in a pipeline is assigned 0, the second data source 1, and so on.

Resolved review threads:
- src/Nexus/Services/DatabaseService.cs (outdated)
- src/Nexus/API/UsersController.cs (outdated)
- src/Nexus/API/UsersController.cs
- src/Nexus/API/SourcesController.cs (outdated)
- src/Nexus/Extensibility/DataSource/DataSourceController.cs (outdated)
- tests/Nexus.Tests/Services/DataServiceTests.cs (outdated)
@Apollo3zehn Apollo3zehn merged commit 64c9bc6 into dev Jul 29, 2024
3 checks passed
@Apollo3zehn Apollo3zehn deleted the 57-data-source-extensions-should-be-composable-see-text branch July 29, 2024 11:53
Development

Successfully merging this pull request may close these issues.

Data source extensions should be composable (see text)
2 participants