I was asked to showcase my debugging process using a tricky bug I recently encountered and fixed. One particularly annoying bug that I fixed recently involved a typo in a configuration of an Azure Functions application. Since the bug in question occurred in a closed source application, I prepared a contrived code sample basing on a similar configuration bug recently discovered in Civilization VI .
I wrote and tested this code sample using the following software. Be advised
that azure-functions-core-tools
is being actively developed, so the code
sample may not necessarily work with a different version.
dotnet-sdk
, version used:2.1.101
VS Code
, version used:1.21.1
azure-functions-core-tools
, version used2.0.1-beta.24
mono
, version used:5.8.0.129
I was working on an Azure Functions application that was calling some external libraries written in C# and F#. The configuration for the application was stored in the environment variables set on the target deployment. After one particular deployment, I noticed that integration tests starting failing, even though the function seemed straightforward and all unit tests passed prior to deployment.
In the original bug I fixed, the failure was quite complicated, as it involved
interplay of multiple libraries and a NullReferenceException
caused within F#
code. In this sample, I simplified it to returning a to better illustrate the
underlying cause.
For further simplicity, I ommited the unit tests and the integration tests and I will be
testing the Azure Function locally using curl
.
We expected that a GET
request to /api/sample
should return a JSON similar to
the following:
{
"scienceReport":"Our current science yield is 10",
"timeStamp":"2018-03-21T13:39:48.809166+00:00"
}
where scienceReport
field is configured by setting the YIELD_SCIENCE
environment variable on target deployment, and timeStamp
is the current time
stamp.
However, in our case the integration tests are failing as the GET
request to
/api/sample
endpoint returns the following JSON:
{
"scienceReport":"No Science Yield!",
"timeStamp":"2018-03-21T13:39:48.809166+00:00"
}
After quickly checking the target deployment configuration everything seems fine, so we will have to take a closer look at the problem.
After checking out the source code for the application, we need to see if we can reproduce the problem locally.
In a real-life scenario we would use azure-function-core-tools
to download the
configuration from the affected Function App deployment and place it in a
local.settings.json
file. The environment variables from the target
deployment are stored in the Values
entry in local.settings.json
file, and
function host will automatically use them to populate the function application
environment.
For simplicity, I already provided local.settings.json
that contains the configuration of the target deployment.
Now, let's build and start the function application:
$ cd XTeamDebuggingSample
$ dotnet restore
$ dotnet build
$ cd src/FunctionsApp/bin/Debug/netstandard2.0/
$ func host start
Alternatively, we could open the folder containing XTeamDebuggingSample.sln
in
VS Code and run the following tasks in the following order:
- Build All
- Run Function Host
After starting the function host, we should see the following message in the terminal:
Listening on http://localhost:7071/
Hit CTRL-C to exit...
Http Functions:
SampleTrigger: http://localhost:7071/api/sample
Now, we can open a new terminal window and use curl to test the function and reproduce the problem on our machine:
$ curl localhost:7071/api/sample
{
"scienceReport":"No Science Yield!",
"timeStamp":"2018-03-21T13:54:11.063377+00:00"
}
The GET
request send to the /api/sample
endpoint returned a JSON object with
an incorrect value for the scienceReport
key. After reading the code for
SampleTrigger
function, we find the following code for constructing the
response (SampleTrigger.cs:55-65
):
string report;
if (_science.Yield > 0) {
report = Report.Report(_science.Yield);
} else {
report = "No Science Yield!";
}
var response = new { ScienceReport = report, TimeStamp = DateTime.Now};
return new OkObjectResult(response);
and we are able to locate the code that initialises _science
(SampleTrigger.cs:22-31
):
static SampleTrigger() {
var builder = new ConfigurationBuilder()
.AddEnvironmentVariables();
var configuration = builder.Build();
var yield = configuration.GetValue<int>("YIELD_SCIENCE");
_science = new Science(yield);
}
Given that, we have two potential culprits:
- either there is a bug in the implementation of the
Science
class, or - there is a bug in the configuration.
Since we already noted that the unit tests for external libraries were passing,
which should include the unit tests for the Science
class, therefore we will
investigate the configuration first.
We set up a breakpoint on the following statement on the line 30 of SampleTrigger.cs
:
_science = new Science(yield);
Since the code in question is in the static constructor of the SampleTrigger
class, we need to restart the function host before attaching the debugger. The
static constructor for the SampleTrigger
class will be executed the first time
we make a request to the api/sample
endpoint, therefore we need to attach the
debugger before making any requests to the function application.
We will attach the debugger using the .NET Core Attach
launch configuration in
VS Code. We will attach to the func
process, which represents the process of
the function host.
After attaching the debugger, we make another curl
request to the api/sample
endpoint:
$ curl localhost:7071/api/sample
In VS Code we should now see that the breakpoint was hit.
Using either the Variables pane, or the debugger console we can inspect the state of the local variables:
- after inspecting the
yield
variable, we notice that it is set to0
. - after inspecting what keys got loaded to the configuration we notice that the
configuration does not contain an entry for
YIELD_SCIENCE
, but an entry forYEILD_SCIENCE
.
Note that, if we want to list the entries in the configuration
object from
the debugger console, we need to evaluate the following expression:
((Microsoft.Extensions.Configuration.EnvironmentVariables.EnvironmentVariablesConfigurationProvider)System.Linq.Enumerable.First(configuration.Providers)).Data.Keys
After checking the configuration file we can indeed confirm that there is a typo
in the YEILD_SCIENCE
entry:
{
"IsEncrypted": false,
"Values": {
"AzureWebJobsStorage": "",
"YEILD_SCIENCE": 10
}
}
It seems that we have found the culprit---there is a typo in our configuration!
Now, a case when a critical part of configuration is missing should correspond to a hard failure, as in most cases it is impossible for the system to recover. In some cases it may seem that having a well known default configuration is a good back-up option, however it may lead to silent failures for incorrect configuraiton, as in our case.
Therefore we will update the SampleTrigger
static constructor to raise an
exception when the configuration is missing.
The updated constructor is as follows (SampleTrigger.cs:35-45
):
// FIX 1: Make lack of configuartion a hard failure
static SampleTrigger() {
var builder = new ConfigurationBuilder()
.AddEnvironmentVariables();
var configuration = builder.Build();
var yield = configuration.GetValue<int?>("YIELD_SCIENCE")
?? throw new InvalidOperationException("Missing configuration for YIELD_SCIENCE");
_science = new Science(yield);
}
Before updating the configuration, we need to make sure that the fix we implemented does the job and will throw an exception for the current, incorrect configuration.
We re-build everything and restart the function host. If the build command fails
due to a locked dll file, we need to stop function host before building. After
all of that, we make another curl
request to the /api/sample
endpoint:
$ curl localhost:7071/api/sample
After making the request, we check that the function host output to see that the
static constructor for SampleTrigger
threw an exception:
A ScriptHost error has occurred
[21/03/2018 14:38:58] Exception while executing function: SampleTrigger. FunctionsApp: The type initializer for 'FunctionsApp.SampleTrigger' threw an exception. FunctionsApp: Missing configuration for YIELD_SCIENCE.
Once we made sure that our fix throws an exception when the configuration is misssing we can proceed and update the configuration.
Now, we can update the configuration and check if we receive the desired
response from the GET
request to the /api/sample
endpoint.
We update the configuration in local.settings.json file
:
{
"IsEncrypted": false,
"Values": {
"AzureWebJobsStorage": "",
// "YEILD_SCIENCE": 10
// FIX 2: Fix the configuration file
"YIELD_SCIENCE": 10
}
}
(Note that JSON files should not contain comments, they were used in this case for demonstration purposes only)
Again, we rebuild everything, restart function host, and make another curl
request:
$ curl localhost:7071/api/sample
(NOTE This failure is caused by how Azure Functions are handling some external libraries, including the libraries need by the runtime for the F# language. This problem may seem unrelated to this code sample, however it is pervasive when working with pre-compiled functions applications, therefore it is worth mentioning. Thankfully, there seems to be solution to this problem available soon)
Now we get the following surprising failure:
[21/03/2018 14:50:01] A ScriptHost error has occurred
[21/03/2018 14:50:01] Exception while executing function: SampleTrigger. FSharpLib: Could not load file or assembly 'FSharp.Core, Version=4.4.3.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a'. Could not find or load a specific file. (Exception from HRESULT: 0x80131621). System.Private.CoreLib: Could not load file or assembly 'FSharp.Core, Version=4.4.3.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a'.
After quickly checking the src/FunctionsApp/bin/Debug/netstandard2.0/bin
directory we can confirm that there is a FSharp.Core.dll
with the requested
version:
$ cd XTeamDebuggingSample
$ cd src/FunctionsApp/bin/Debug/netstandard2.0/bin
$ monodis --assembly FSharp.Core.dll | grep Version
Version: 4.4.3.0
However, after checking the libraries provided with the Azure Function Runtime,
we can see that it comes with a different version of FSharp.Core.dll
:
$ cd /usr/local/Cellar/azure-functions-core-tools/2.0.1-beta.24/
$ monodis --assembly FSharp.Core.dll | grep Version
Version: 4.4.1.0
(NOTE that the path to the runtime is specific to macOS)
After some investigation, it turned out that since the last time anyone worked
on this code the FSharp.Core.dll
was updated from 4.4.1.0
to 4.4.3.0
, whereas
Azure Functions will permit only the version of FSharp.Core.dll
that comes with
the runtime.
Therefore, we need to update the FSharpLib.fsproj
file to fix the version of
FSharp.Core
package to provide FSharp.Core.dll
with version 4.4.1.0
:
<PropertyGroup>
<TargetFramework>netstandard2.0</TargetFramework>
<!-- FIX 3: Fix FSharp.Core to match version in Azure Functions Runtime -->
<FSharpCoreImplicitPackageVersion>4.1.18</FSharpCoreImplicitPackageVersion>
</PropertyGroup>
Finally, after re-building and re-starting the function host we can verify that we receive the expected response from the GET
request to the /api/sample
endpoint:
$ curl localhost:7071/api/sample
{
"scienceReport":"Our current science yield is 10",
"timeStamp":"2018-03-21T15:02:42.695296+00:00"
}
I hope that this code sample and associated description illustrates my debugging process.
In a real-life scenario, there are some further steps that I would perform that include:
- updating test cases, to potentially include the case with missing configuration,
- updating documentation, to potentially include information about newly found
problem with
FSharp.Core
being restricted by Azure Functions runtime, and - submitting everything for code review.
Unfortunately, the above steps are outside of the scope for this code sample.