All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Added new parsing method to handle new data format
- Added schema versioning check for compatibility
- Added HTTP status codes for public key fetch failures
- Added more descriptive messages for EOF related errors
- Added OpenCHAMI middleware for logging and authentication
- Removed scope check from certain endpoints
- Removed arm64 build from goreleaser
- Updated go module dependencies to use temporary jwtauth with OpenCHAMI
- Added error check to jwtauth.NewKeySet
- Added jwtauth.NewKeySet to use entire JWKS for verification
- Added access token scope checks
- Added curl to container image
- Changed go-chi to v5
- Added `SMD_JWKS_URL` environment variable and updated default value
- Added `jwksURL` parameter to fetch public key for validation
- Added more middleware options, including one to strip slashes in URLs
- Fixed formatting strings
- Removed extra unnecessary logging
- Split routes into public and protected endpoints
- Fixed JWT verification implementation
- Added middleware to verify JWTs in router
- Removed required `xname` check for `doPartitionMembersPost`
- Linter Errors
- Support local builds
- Support goreleaser for builds/releases
- Move to github.com/bikeshack/hms-smd
- Container now based on wolfi
- Deprecated built-in kafka listener for redfish events
- Node architecture discovery
- Added RTS management switch discovery
- Reduced verbosity of V2 readiness/liveness requests in HSM logs.
- Fixed HSM seeing empty memory slots as populated in EX hardware.
- Added CT test coverage for most remaining APIs.
- Fixed several CT test bugs and made various improvements.
- Moved hardware-sensitive CT tests into separate stage.
- CASMHMS-5863 - Added Reacquire() function to the reservations client library.
- CASMHMS-5902: Linting of language in API spec (no content changes); corrected markdown formatting issue in changelog
- CASMHMS-5626 - Removed v1 API
- HSN NIC numbering for devices running proliant iLO redfish.
- Update service values test to allow custom role and subrole configurations.
- Restored previous name of GitHub Actions workflow file for test images.
CASMHMS-5747 - Refactored HSM CT tests for HMTH, including:
- Update HSM CT tests to use latest hms-test:4.0.0 image
- Break out HSM CT tests into non-disruptive, disruptive, and destructive test buckets
- Add many new API tests that execute in the runCT environment in the build pipeline
- Fixes to Swagger specification to reflect actual API behavior
- CASMHMS-5726 - Fixed syntax error in migration down step 22
- CASMHMS-5675 - HSM now discovers HSN NICs under '/redfish/v1/Chassis//Devices' for Proliant iLO redfish implementations.
- CASMHMS-5675 - HSM now ignores non-HSN NICs that show up for Proliant iLO redfish implementations.
- CASMHMS-5355 - /locks/reservations/check now reports the reservations that were not found.
- CASMHMS-5387 - HSM now discovers the power URL for Intel nodes.
- CASMHMS-5625 - Removed much of the redundant v1 API code in preparation for removal in CSM 1.4.
- CASMHMS-5610 - Fixed POST /Inventory/RedfishEndpoints returning 500 instead of 409 for conflicts.
- CASMHMS-5373 - Fixed locking bug preventing 'Flexible' requests from working.
- Updated CT tests to hms-test:3.1.0 image as part of Helm test coordination.
- CASMHMS-5591 - Added indexes to the hwinv_hist table to speed up migration step 20.
- CASMHMS-4972 - HSM now ignores A100 NV Switch and Baseboard when discovering GPUs on Proliant iLO devices.
- HSM now uses the 'Class' from SLS
- Converted image builds to be via github actions, updated the image links to be in artifactory.algol60.net
- Added a runCT.sh script that can run the tavern tests and smoke tests in a docker-compose environment.
- Refactored CT tests and their directory structure.
- Renamed dockerfiles.
- CASMHMS-5511 - Fixed issue causing FRU data to get improperly populated for empty locations.
- CASMHMS-4974 - Added HSM GET /v2/status/locks API.
- CASMINST-4069 - Updated HSM components test for new expected BMC management roles.
- Added flexible-model methods to the service-reservation package.
- CASMHMS-5278 - Added HSM SCN subscription database unit tests.
- CASMHMS-5365 - Updated HSM CT tests for custom roles and subroles.
- CASMHMS-5353 - Added CheckDeputyKeys() method to service_reservations package.
- CASMHMS-5348 - Fixed NULL value issue with POST /Inventory/EthernetInterfaces
- CASMHMS-633 - Added HSM hardware inventory unit tests.
- CASMHMS-632 - Added HSM component endpoint unit tests.
- CASMHMS-634 - Added HSM Redfish endpoint unit tests.
- Replaced golang Sarama kafka interface with Confluent.
- CASMTRIAGE-2801 - Added support for HPE PDUs to ComponentEndpoints CT test.
- Added valid state transitions section to the HSM swagger doc.
- Corrections to the HSM swagger doc including correcting typos, updating parameter descriptions to include valid values, updating parameter descriptions to include if they can be specified multiple times, and properly marking fields as required.
- Support for HPE PDUs
- CASMHMS-5205 - Rename HSM CT smoke tests to swap execution order.
- CASMHMS-5198 - Updated image refs in the chart.
- CASMHMS-5272 - Added support for AuxiliaryController Redfish subtype to ComponentEndpoints CT test.
- CASMHMS-5239 - HSM now kicks off re-discovery for nodeBMCs when a power on redfish event is received for its slot.
- CASMHMS-5233 - HSM correctly ignores duplicate xnames given as arguments to `POST /Inventory/Discover`
- CASMHMS-5055 - Added SMD CT test RPM.
- CASMHMS-5226 - Add priority value to postgres cluster resource
- CASMHMS-4951 - Changed HSM to use NAME and ProductPartNumber fields in place of empty Model and PartNumber fields for GPUs discovered on HPE hardware.
- CASMHMS-4954 - Changed HSM to use the NAME field in place of an empty Model field for Enclosures.
- Changed cray-service version to ~6.0.0
- Changed the docker image to run as the user nobody
- CASMHMS-5039 - Added support for power capping for Bard Peak nodes.
- Workaround for discovery for Bard Peak to correctly discover node BMCs.
- Bulk postgres operations trying to operate on the same row multiple times.
- CASMHMS-5041 - Set the 'Name' field in the power control struct for Apollo 6500.
- CASMHMS-5036 - Updated the discovery status CT smoke test with troubleshooting steps.
- CASMHMS-4835 - Changed HSM postgres operations to use bulk Inserts and Updates when working with multiple entries.
- Added GitHub configuration files.
- CASMTRIAGE-1808 - Updated the ComponentEndpoints CT test for multiple accelerator components.
- CASMHMS-4885 - Set pod priority for HSM.
- CASMHMS-4990 - Add "HPE" to the match list for Cray manufacturer.
- github transition phase 3. Remove stash references.
- Added Jenkins file and Makefile for migrating hms-smd to github.
- CASMHMS-4927 - smd-init prunes previously bloated hwinv_hist database tables of redundant hardware history events.
- CASMHMS-4927 - FRU history events are only generated if a change occurred.
- CASMHMS-4971 - Fixed HSM crashing when discovering Bard Peak nodes
- CASMHMS-4930 - Enabled automatic postgres backups in the helm chart.
- CASMINST-2680 - Updated CT tests for when ncn-m001 is not part of the management cluster.
- CASMHMS-4898 - Updated base container images for security updates.
- CASMHMS-4884 - Fixed HSM crashing when manually adding power supplies via POST /Inventory/Hardware
- CASMINST-2511 - Update the ComponentEndpoints CT test to make InterfaceEnabled an optional EthernetNICInfo field and add it to RedfishSystemInfo.
- CASMHMS-4842 - HSM now joins a client group with its replicas to share a pool of redfish events from the kafka bus
- CASMHMS-4865 - Fixed component filtering when locking components.
- CASMHMS-4706 - Added support for power capping HPE Apollo 6500.
- CASMPET-4148 - Change smd-postgres pvc size to 100GB
- CASMHMS-4834 - Modified Insert, Delete, and Update postgres operations on the v2 locking interface to use bulk operations.
- CASMHMS-4836 - Support for parsing redfish events from HPE iLO nodes
- Changed kubernetes values.yaml for podAntiAffinity from istio-ingressgateway
- Updated docker-compose files to pull images from Artifactory instead of DTR.
- CASMHMS-4796 - HSM no longer takes out row exclusive locks in postgres.
- CASMHMS-4796 - Reuses http transport whenever possible.
- CASMHMS-4796 - Pod resources are increased for both HSM and postgres.
- CASMHMS-4796 - Readiness probe timeout is increased.
- CASMHMS-4796 - Set GOMAXPROCS to tune HSM to the CPU resource limits.
- CASMHMS-4796 - Unset SetConnMaxLifetime() so postgres connections can be reused.
- CASMHMS-4796 - Set indexes on role/subrole rows in the components table
- CASMHMS-4810 - Allow valid nodeAccel type xnames for more than 8 GPUs
- CASMHMS-4811 - Added anti-affinity for HSM to avoid (if possible) scheduling on the same nodes as the Istio gateways.
- CASMHMS-4794 - Disabled CT test for the DiscoveryStatus API.
- CASMHMS-4751 - Increased the wait-for-postgres resource limit
- CASMHMS-4719 - Fix HSM postgres slowness during discovery floods on large (2000+ nodes) systems.
- CASMHMS-4719 - Changed FRU tracking to be simpler and avoid long-running sql queries.
- CASMHMS-4700 - HSM now discovers GPUs in PCI slots on HPE hardware
- CASMHMS-4713 - Fix HTTP response leaks
- CASMHMS-4693 - Update HSM Hardware Inventory CT test to allow empty drive bays.
- CASMHMS-4709 - Update HSM Hardware Inventory CT test to allow more ProcessorType data values.
- CASMHMS-4593 - PATCH /v2/Inventory/EthernetInterfaces/ now allows ComponentID only patches
- CASMHMS-4579 - Update the cray-service chart to 2.4.5.
- CASMHMS-4605 - Update the loftsman/docker-kubectl image to use a production version.
- Added a note in HSM v1 and v2 Swagger about v1 deprecation.
- Added User-Agent header to outbound HTTP requests.
- Updated to MIT License
- Updated HMS libraries to latest
- Code changes to test.go code for updates to hms-cert
- CASMHMS-4334 Fixed issue with Processor discovery
- Updated license file.
- CASMHMS-4295 - Changed partitions API to restrict partition names to the p# or p#.# (hard.soft) naming convention for partitions so they will work correctly with other APIs.
- CASMHMS-4260 - Change NodeHsnNic hardware inventory data to show as NodeHsnNicFRUInfo instead of HSNNICFRUInfo.
- CASMHMS-4246 - Fixed HSM using invalid MAC addresses to generate EthernetInterface entries.
- CASMHMS-4240 - Change NodeAccel hardware inventory data to show as NodeAccelFRUInfo instead of ProcessorFRUInfo.
- CASMHMS-4237 - Update NodeAccelRiserFRUInfoRF definitions: remove Manufacturer, add Producer and EngineeringChangeLevel
- CASMHMS-4224 Added the discovery for NetworkAdapters (NodeHsnNic HMS types) to HSM
- CASMHMS-4087 Added the NodeAccelRiser type to represent GPUSubsystem baseboards, i.e. Redstone
- CASMHMS-4211 - Added final CA bundle configmap handling to Helm chart.
- CASMHMS-4158 - The V2 API for Component Ethernet Interfaces now supports associating multiple IP addresses to a single MAC Address. The new IP Address structure has an optional Network field to note which network an IP Address is a part of. Added new APIs to manipulate the IPAddresses.
- The V1 Component Ethernet Interfaces API remains unchanged, except for one new behavior: a PATCH with a new IPAddress on a component ethernet interface that has multiple IP addresses now returns a Bad Request status code, as this is an ambiguous situation.
- CASMHMS-4077 - HSM now periodically updates the timestamp of currently running discovery jobs.
- CASMHMS-4077 - Much of the HSM manual rediscovery path has been parallelized
- CASMHMS-3848 - HSM now queries HBTD for heartbeat status of nodes it discovers in the 'On' state to see if they should be 'Ready'.
- CASMHMS-3232 - HSM now retries sending failed SCNs.
- CASMHMS-4148 - Update HMS vendor code for security fix.
- Set grpc go module to v1.29.1 to resolve smd-related grpc/etcd incompatibility issue.
- CASMHMS-4144 - Update to latest cray-service base chart v2.2.0 to pick up postgres labels.
- CASMHMS-4105 - Updated base Golang Alpine image to resolve libcrypto vulnerability.
- Added a V2 of SMD; V1 is now on the deprecation path. We have added a new locking and reservations API
- CASMHMS-4111 - Added a POST to the /Inventory/Hardware REST endpoint to generically add hw inventory entries from external sources.
- CASMHMS-4111 - Removed HSNInterfaces APIs and functionality
- Added support for TLS certs for Redfish endpoint communications.
- CASMHMS-4026 - HSM now correctly resyncs its ComponentEndpoint cache when a redfish event comes from a PDU controller.
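
The CASMHMS-4295 entry above restricts partition names to the p# or p#.# (hard.soft) convention. A minimal Go sketch of such a check follows. The regular expression and the helper name `isValidPartitionName` are assumptions made for illustration, not HSM's actual implementation; in particular, whether HSM accepts multi-digit numbers or leading zeros is not stated in this changelog.

```go
package main

import (
	"fmt"
	"regexp"
)

// partitionNameRe is an assumed pattern for the p# / p#.# (hard.soft)
// convention: the letter "p", a number, and an optional ".number" suffix.
var partitionNameRe = regexp.MustCompile(`^p[0-9]+(\.[0-9]+)?$`)

// isValidPartitionName is a hypothetical helper that reports whether
// name follows the assumed convention.
func isValidPartitionName(name string) bool {
	return partitionNameRe.MatchString(name)
}

func main() {
	// Try a few names against the assumed pattern.
	for _, name := range []string{"p1", "p1.2", "partition1", "p1.2.3"} {
		fmt.Printf("%-10s valid=%v\n", name, isValidPartitionName(name))
	}
}
```

Under this assumed pattern, `p1` and `p1.2` are accepted, while `partition1` and `p1.2.3` are rejected.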
These are changes to charts in support of:
- moving to Helm v1/Loftsman v1
- the newest 2.x cray-service base chart
  - upgraded to support Helm v3
  - modified containers/init containers, volume, and persistent volume claim value definitions to be objects instead of arrays
- the newest 0.2.x cray-jobs base chart
  - upgraded to support Helm v3
- CASMHMS-3997 - Updated hms-smd to use latest trusted baseOS images.
- CASMHMS-4018 - Added code to process GPU info from redfish correctly
- CASMHMS-3975 - Added a mechanism for restarting orphaned discovery jobs
- CASMHMS-3509 - Added the hms-base config file into the HSM chart
- CASMHMS-3807 - Changed PDU discovery behavior to discover outlets as CabinetPDUPowerConnector HMS type.
- CASMHMS-3914 - Changed HSM to skip node discovery for CMCs with special NodeBMC xname xXcCsSb999
- CASMHMS-3888 - Changed PDU discovery behavior to allow Cabinet PDU controllers to have more than 1 Cabinet PDU.
- CASMHMS-3871 - Added PowerStatusChange to the list of valid redfish event types for HSM to process.
- CASMHMS-3818 - CT functional test updates for /State/Components SubRoles and /SCN States.
- CASMHMS-3815 - Bumped the resource limits and made the compose file work.
- CASMHMS-1466 - Added partition query parameters to /Inventory/Hardware
- CASMHMS-1466 - Added the 'parent_node' column to the hwinv_loc table to be able to associate lower components with partitions of their parents
- CASMHMS-1466 - Added a schema view that includes partition information with hwinv data.
- CASMHMS-1466 - Added the 'laststatus' query parameter to /Inventory/RedfishEndpoints to allow queries to be filtered based on discovery status.
- CASMHMS-2921 - FRU tracking of sC
- CASMHMS-2919 - FRU tracking of nC
- CASMHMS-3617 - Changed 'PATCH /Inventory/EthernetInterfaces' to include 'CompID' as a patchable value.
- CASMHMS-3462 - HSNInterfaces REST API which includes GET/POST/DELETE /Inventory/HSNInterfaces and GET/PATCH/DELETE /Inventory/HSNInterfaces/{xname}
- CASMHMS-3575 - Disabled CT test for /Defaults/NodeMaps since it is deprecated in favor of SLS.
- CASMHMS-3553 - Updated HSM /State/Components CT test cases for optional 'SubRole' and 'Subtype' fields.
- CASMHMS-3506 - HSM now treats Ready/Warning StateData patches as only affecting components in the Ready state.
- CASMHMS-3526 - fixed job cleanup.
- CASMHMS-3531 - Updated HSM /State/Components CT test case for optional 'SoftwareStatus' field.
- CASMHMS-3532 - Updated HSM /Subscriptions/SCN CT test case for new subscription keys.
- Re-inventory triggered by redfish events now only generates "Scanned" hardware history events.
- removed cray-smd-loader job per CASMHMS-3392
- Added a locking mechanism for the HSM jobList to prevent crashes.
- Updated the cray-service chart version.
- Changed smd-init to downgrade as well as upgrade schemas
- smd-init is now built in the same container image as HSM
- Added a job to delete the previously run smd-init and smd-loader jobs for upgrade/downgrade
- Added a persistent storage volume for storing all previously applied schema migration steps
- replicaCount now set to 3 in helm chart for resiliency
- Added a REST API for storing and querying for component ethernet interfaces
- CASMHMS-2966 - Update hms-smd build to use trusted baseOS.
- Update version of hms-base to 1.7.3, which includes changes for CASMHMS-3403: modifications to xname validation for CMMRectifiers
- Increased the size of the fru_id column from varchar(63) to varchar(255) in the hwinv_by_loc, hwinv_by_fru, and hwinv_hist HSM database tables.
- Added more robust fruid validation to the fruid generation function.
- CASMHMS-3241 - Update Redfish endpoint CT test for optional IPAddress field.
- HSM now detaches FRUs associated with a disabled RedfishEndpoint from their locations.
- HSM generates "removed" events in hardware history when RedfishEndpoints are disabled
- Fixed a bug in the hardware history cleanup logic causing all history to get deleted each day.
- Fixed a bug in node standby polling jobs causing them to match powerstate strings case-sensitively.
- CASMHMS-3096 - added FRU tracking support for power supplies, specifically CMMRectifiers and NodeEnclosurePowerSupplies
- HSM now sets components associated with a disabled RedfishEndpoint to 'Empty'
- HSM now correctly processes the NULL partition parameter for GET /groups//members
- Added the IPAddress field to the RedfishEndpoints API as a patchable and a queryable field.
- CASMHMS-3211 - Update Redfish endpoint CT test for chassis and router BMCs.
- Added a configmap volume mount to the cray-smd deployment to mount as an updatable configfile.
- Added a config file watcher to pick up any new roles/subroles defined in the config file.
- Added /service/values/* REST APIs to list valid values for hms-base enums.
- Changed the valid component role and subrole values to be extendable via configfile.
- CASMHMS-3163 - Add additional cleanup actions for test interrupts to HSM group and partition CT tests.
- CASMHMS-3097 - Update Redfish Pkg by standardizing FRUID initialization.
- CASMHMS-2929 - Update Redfish Pkg by adding SerialNumber to Processor data.
- CASMHMS-3137 - Update HSM CT test for /State/Components to include new 'Class' field.
- Transitioning a component from Ready to On is no longer a valid state transition
- Redfish events are now processed concurrently
- 405 responses to include Allow header with list of allowed HTTP methods
- Information under the /State/Components REST API now includes the component Class (River/Mountain).
- Fixed SLS URL.
- Made Docker compose work. Running `docker-compose up -d` in the root directory now gives you a working HSM with Vault.
- HSM now delays discovery when processor info is not populated when discovering nodes.
- Update discovery functions in pkg/redfish to use a default flag of "OK"
- Create standard FRUID initialization/validation function, apply to Memory and Chassis
- HSM segfault when generating hardware history entries.
- Updated FRUID initialization code for MemoryMods to use unique identifier
- Added SMD_HWINVHIST_AGE_MAX_DAYS environment variable to control when FRU history entries should be cleaned up. This defaults to 365.
- HSM generates FRU historical data.
- CASMHMS-3007 - redact passwords from redfish struct output.
- Added PATCH /hsm/v1/Inventory/RedfishEndpoints/{xname}
- Database version checking now looks for installed schema versions greater than or equal to the expected schema.
- Added functionality to hmsds to store hardware inventory historical data.
- Added /hsm/v1/Inventory/Hardware/History REST endpoint (GET/DELETE)
- Added /hsm/v1/Inventory/Hardware/History/{xname} REST endpoint (GET/DELETE)
- Added /hsm/v1/Inventory/HardwareByFRU/History REST endpoint (GET)
- Added /hsm/v1/Inventory/HardwareByFRU/History/{fruid} REST endpoint (GET/DELETE)
- CASMHMS-1009 - added support for disks
- CASMHMS-2908 - RedfishEndpoints API test workaround for Intel firmware v1.93 UANs failing discovery CASMHMS-2767.
- CASMHMS-2860 - Updated CT test for Hardware FRU tracking API additions.
- Updated imports to use new hms-base, hms-compcredentials, hms-securestorage, and hms-msgbus repos in place of deprecated hms-common versions.
- Liveness probe & settings
- Only log probes when DEBUG or higher
- Increased k8s initialDelaySeconds and periodSeconds
- Added query parameters to /hsm/v1/Inventory/Hardware REST endpoint
- Added query parameters to /hsm/v1/Inventory/HardwareByFRU REST endpoint
- Added query parameters to /hsm/v1/Inventory/Hardware/Query/{xname} REST endpoint
- Implemented /hsm/v1/Inventory/Hardware/Query/{xname} to accept more xnames than just "s0"
- Increased size of postgresql volume to 30Gi.
- Additional functional Tavern API tests for CT framework.
- Functional Tavern API tests for CT framework.
- Updated version of hms-common.
- Redfish node discovery now waits for all info to be loaded from BIOS
- Improved retry logic in loader to essentially retry forever.
- Subroles to HSM
- HSM now reloads node hwinv when nodes power on.
- Added an Enabled field to ComponentEndpoints as a reference to the same field in the parent RedfishEndpoint.
- Workaround added for gigabyte nodes with missing Ethernet Interfaces
- Istio preventing HSM from receiving redfish events
- Reduced HSM's default log verbosity
- Nodes staying in the Standby state when they don't send redfish events.
- Support for using SLS to get NID and Role assignments for nodes
- The CrayAlerts registry to the list of valid registries for ResourcePowerStateChanged redfish events
- GET /hsm/v1/service/ready REST API for HSM health checks
- Liveness and readiness probes for the HSM deployment now point to GET /hsm/v1/service/ready
- The hmsds log level now gets set to match HSM's log level.
- Missing query parameters, enabled and softwarestatus, in the swagger doc.
- Added Oids to the PowerControl struct
- Discovery of EPO redfish endpoints for chassis.
- POST to /hsm/v1/State/Components
- PUT to /hsm/v1/State/Components/
- PowerControl data discovery for non-mountain components
- GET /locks returns all locks instead of only the first.
- REST API for PowerMaps.
- Redfish credentials from REST API output.
- Power Control Info discovery for mountain nodes
- REST API for component locking.
- Gigabyte node enclosure discovery.
- Support for parsing redfish events from updated Gigabyte nodes
- Added new loader utility which is used to load HSM's Node NID map.
- Changes from hms-common were picked up to include the addition of the 'Management' role.
- The 'Management' role to the HSM swagger document.
- Vault operations were added to smd. Configurable via the 'SMD_RVAULT' and 'SMD_WVAULT' environment variables.
- Vault environment variables, 'VAULT_ADDR' and 'VAULT_SKIP_VERIFY', to values.yaml to point HSM to a Vault instance.
- Product specification to the jenkins file
- Unused Mariadb code
- yamllint errors and warnings
- Segfault when database transactions can't be started
- Temp file creation in testing is now OS independent.
- AllowableValues for outlet power control
- Schema change for redfish resetTypes
- Fixed bug in chart with incorrect `imagesHost` setting.
- Postgres is now the default and only supported backing store
- cray-smd now uses helm and the Postgres operator.
- cray-smd-init has been re-written to install/upgrade the schema for postgres using golang-migrate.
- Add rediscovery for RedfishEndpoints on PUT updates, with related bug fix.
- Fix xname normalization issues, group/partition normalization issues
- Fix bad 500 status responses that don't pass through an HMS error and return 400 like they should for a bad request. These aren't internal DB errors and we don't want to report them that way.
- Added support for PDU discovery.
- Added /ServiceEndpoints/* REST endpoints to HSM for querying for information on discovered redfish services.
- Added discovery logic to HSM to discover redfish services.
- Added storage logic to hmsds to store discovered redfish service information.
- Added logic to HSM to check for correct schema version.
- Changed the table view for service_endpoint_info to correct extracting FQDN info for the redfish endpoint.
- Brought in `redfish`, `sharedtest`, and `sm` packages to this repo as they're really specific to HSM.
- Brought in `hmsds` package to the internal part of this repo as it shouldn't even be used by any other services.
- Checked in vendor code for 3rd party dependencies.
- Updated Dockerfile to now copy over new `pkg` and `internal` folders when building.
- Old version (v1.0.0) of hms-common code.
- This is the initial release. It contains everything that was in `hms-services` at the time with the major exception of being `go mod` based now.