Azure Function Trigger Cheat Sheet
This is a quick reference covering general topics for the most common Azure Function triggers I deal with on a weekly basis. Knowing some of the high-level terminology and architecture has helped me *maybe* sound somewhat knowledgeable in conversations with colleagues and customers. It's by no means a comprehensive list or troubleshooting guide, but hopefully it can be used as a quick read/refresher.
- Azure Eventhub/Azure IoT Hub
- Azure Storage Blobs
- Azure Storage Queues
Azure Event Grid
Azure Event Grid allows you to easily build applications with event-based architectures. First, select the Azure resource you would like to subscribe to, and then give the event handler or WebHook endpoint to send the event to. Event Grid has built-in support for events coming from Azure services, like storage blobs and resource groups. Event Grid also has support for your own events, using custom topics.
Protocol – EventGrid calls the Function app with HTTPS (443)
Additional info –
Validation occurs to make sure the endpoint (function) is ready to receive events. The system key (not a function or host key) is used to authenticate EventGrid to the Azure function. The validation can occur in one of two ways:
- Use an HTTP trigger and handle the validation handshake in your function code.
- Use the EventGrid WebJobs extension – no additional code is needed for the validation.
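A minimal sketch of the HTTP-trigger validation handshake, assuming the request body has already been read into a string (the handler name is my own; the `Microsoft.EventGrid.SubscriptionValidationEvent` type and `validationResponse` echo are documented Event Grid behavior):

```python
import json

# Event type Event Grid sends during the subscription validation handshake.
VALIDATION_EVENT_TYPE = "Microsoft.EventGrid.SubscriptionValidationEvent"

def handle_eventgrid_body(body: str):
    """Return the validation response dict for a handshake event, else None."""
    for event in json.loads(body):
        if event.get("eventType") == VALIDATION_EVENT_TYPE:
            # Echoing validationCode back tells Event Grid the endpoint is ready.
            return {"validationResponse": event["data"]["validationCode"]}
    return None  # normal events: fall through to regular processing
```

Your HTTP trigger would return this dict as the JSON response body with a 200 status when it is not None.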
Controlling throughput – Use the entries in the host.json for HTTP functions.
- If you are seeing HTTP 429 responses, adjust the maxOutstandingRequests, maxConcurrentRequests, and dynamicThrottlesEnabled settings to increase throughput.
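As a rough sketch, those HTTP knobs live under the http section of host.json in the v2 runtime (the values below are illustrative, not recommendations):

```json
{
  "version": "2.0",
  "extensions": {
    "http": {
      "maxOutstandingRequests": 200,
      "maxConcurrentRequests": 100,
      "dynamicThrottlesEnabled": false
    }
  }
}
```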
Examples of getting the Keys:
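One way to grab the Event Grid system key is the Function host's admin API; a hedged sketch of building that request URL (the key name `eventgrid_extension` is an assumption that may vary by runtime version, and the master key is required as the `code` parameter):

```python
def eventgrid_systemkey_url(app_name: str, master_key: str) -> str:
    """Build the admin API URL that returns the Event Grid extension's system key.

    'eventgrid_extension' is the system key name assumed here for the v2
    runtime; verify it against your runtime version before relying on it.
    """
    return (f"https://{app_name}.azurewebsites.net"
            f"/admin/host/systemkeys/eventgrid_extension?code={master_key}")
```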
- Troubleshooting, scaling, and the general behavior (aside from the validation) are the same as with a standard HTTP trigger.
- Validation has to occur before EventGrid will send messages to the Function App. If the validation fails, EventGrid will keep retrying until it succeeds, or the user may need to reconfigure the function as a webhook.
Restricting Access – There is no clear documentation on which IPs EventGrid requests will come from, so adding IP restrictions on the function app may ultimately break the connection.
Azure Blob Triggers
Protocol – The Function app calls the Blob endpoint using HTTPS
- Under the covers, the runtime scans for existing blobs using the blob logs and checks for existing blobs in the container, then adds a message to a queue which triggers the function app. The runtime may time out trying to get the blobs if there is a significant number of blobs or log entries.
- For endpoints that need high throughput, customers should use EventGrid instead.
- The blob trigger does not work well with containers that have 10k or more blobs.
Controlling Throughput – Use host.json entries from the Azure Storage Queue
- Make sure Blob logging is enabled on the storage account
- New blob storage logging is coming in the function runtime (in the release around 7/26/2019)
- Creating a new blob trigger on an existing blob container that has millions of blobs may not process all of the blobs. The recommendation would be to create a service that writes all of those blob names to a queue and use a queue-triggered function.
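The backfill recommendation above can be sketched as a small helper; `list_blob_names` and `send_message` are hypothetical injection points standing in for the storage SDK's blob-listing and queue-send calls:

```python
def backfill_blobs_to_queue(list_blob_names, send_message):
    """Enqueue the name of every existing blob so a queue-triggered
    function can chew through the backlog.

    list_blob_names: callable yielding blob names (e.g. a wrapper around
    the container client's list operation) -- hypothetical injection point.
    send_message: callable that sends one queue message.
    Returns the number of messages enqueued.
    """
    count = 0
    for name in list_blob_names():
        send_message(name)  # the queue-triggered function receives the blob name
        count += 1
    return count
```

Injecting the two callables keeps the loop testable without a live storage account.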
Restricting Access – Storage Firewalls in general do not work well with Azure Functions – this may change in the future. (7/2019)
Azure Eventhub/IoTHub Trigger
Protocol – The function app connects to EventHub using AMQP (port 5671). We do not support HTTP at this time.
- The Eventhub trigger provides at least once execution for each message.
- An Eventhub is a persistent queue where messages exist until their expiration time (configured on the Eventhub). Messages are not deleted by the runtime; they persist even after they are processed.
- Partitions – Each eventhub has between 1 and 32 partitions. Only one client (i.e., instance of the function app) can connect to a partition at a time. Therefore, if you have 16 partitions you can only have 16 active instances. This is not a limitation of the function app but a design within EventHubs. The partition count cannot be changed after the EventHub is created, so to change it you must recreate the EventHub.
- Consumer groups allow multiple clients to process the same messages at the same time
- The Eventhub SDK, which we use in the runtime, uses blobs to checkpoint its progress in each partition
- The EventHub SDK uses leases as it reads messages from the EventHub partition. You may see errors about lost leases, Epoch exceptions, or blob not found. These are expected.
Recommendations for high throughput
- Batching vs single-message processing – receive batches of events (EventData arrays) instead of processing single messages for high throughput and efficiency
- For small batch sizes checkpoint less frequently if you see performance issues
- For large batches checkpoint frequently to avoid the reprocessing of messages
- Make sure messages are being pushed to all partitions. This is the default when using the client SDK unless you specify a specific partition.
- For additional controls see the host.json
- Make sure all partitions are being used – check the scale controller logs (not customer facing at this time 7/30/2019)
- Has the customer tested their function for the load and number of partitions they have configured?
- There is not a supported way to manually move the checkpoint ahead (7/30/2019), although you can try to do it manually depending on how good your Bing/Google skills are. :)
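For reference, a hedged example of the relevant host.json entries in the v2 runtime (values illustrative; the exact layout may differ by runtime version):

```json
{
  "version": "2.0",
  "extensions": {
    "eventHubs": {
      "batchCheckpointFrequency": 1,
      "eventProcessorOptions": {
        "maxBatchSize": 64,
        "prefetchCount": 256
      }
    }
  }
}
```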
Azure Storage Queues
Protocol – The Azure function calls the storage queue using HTTPS
- Control throughput via the host.json (batchSize, newBatchThreshold, maxPollingInterval)
- Messages that fail processing a configured number of times will end up in the poison queue.
- Messages are “hidden” from the queue when the runtime picks them up, using a visibility timeout. If the function completes processing in the allotted time, the runtime deletes the message from the queue. If the function fails, the message’s dequeue count increases until the message is added to the poison queue. If the message does not complete in the allotted time, it becomes visible in the queue again for an instance to retry.
- Check for exceptions returned by storage.
- Log the message ID and dequeue count, and trace messages in Application Insights, when trying to track why messages may be ending up in the poison queue.
- Set batchSize and newBatchThreshold to values that produce optimal throughput.
- Move messages to the poison queue when necessary to avoid reprocessing the same messages over and over.
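A hedged example of those queue settings in a v2 host.json (values illustrative, not recommendations):

```json
{
  "version": "2.0",
  "extensions": {
    "queues": {
      "batchSize": 16,
      "newBatchThreshold": 8,
      "maxPollingInterval": "00:00:02",
      "maxDequeueCount": 5,
      "visibilityTimeout": "00:00:30"
    }
  }
}
```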
HTTP Trigger
Protocol – You guessed it…. The function is called via HTTP (80) or HTTPS (443). :) As a best practice, HTTPS should always be used.
- HTTP-triggered functions follow the exact same route as regular web apps: Front End -> worker. So the general behavior of HTTP requests holds true in terms of TLS, custom domains, SSL, etc.
- All requests have to complete within 240 seconds, otherwise the Front Ends will send a 502 to the client indicating a timeout. This will not show up in the function logs or in Application Insights.
- If the function needs to execute longer than 240 seconds use the long running pattern described in the doc below.
- If the app is returning 429s, this is most likely due to the built-in functionality to help control throughput. You can modify or disable some settings to control the behavior.
- Keys – They can be set manually using the new ARM APIs, in the portal, or the runtime can generate them
- Function – Allows access to just the function
- Admin – The master key (all-access key)
- Networking restrictions – Typically IP based
- Authentication and Authorization
Azure Service Bus Trigger
Protocol – The function calls Service Bus using AMQP (port 5671)
- Namespace – the main container that makes up the Service Bus
- Queue – Store for messages
- Topic – Typically used when you have multiple subscribers reading messages
- Subscription – A concept that plays hand in hand with topics to allow multiple subscribers (clients)
- PeekLock – The Functions runtime receives a message in PeekLock mode. It calls Complete on the message if the function finishes successfully, or calls Abandon if the function fails. If the function runs longer than the PeekLock timeout, the lock is automatically renewed as long as the function is running.
- maxAutoRenewDuration – How long the runtime will continue to renew the lock automatically. Setting this value too short may cause the message dequeue count to increase unexpectedly
- Session support (FIFO) is in preview. (7/30/2019)
- Using the values in the host.json (prefetchCount and maxConcurrentCalls)
- Messages that are not processed successfully after the max delivery count is exceeded will end up in the deadletter queue. Configure a reasonable max delivery count so messages that are going to keep failing are deadlettered without unnecessary retries.
- If messages are unexpectedly ending up in the deadletter queue, or messages are seeing unexpected retries, log the message ID and dequeue count for each execution.
- Also make sure maxAutoRenewDuration is long enough for the messages in the batch, as well as the prefetched messages, to be processed. Prefetched messages are stored in memory on the local machine. For example, a batch size of 2, a prefetch count of 100, messages that take 60 seconds to process, and a maxAutoRenewDuration of 5 minutes will not allow adequate time for the prefetched messages to finish processing before the lock on the message expires.
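The arithmetic in that example can be sanity-checked with a small helper. This is a sketch with my own names, and the worst-case model is deliberately rough (it assumes every prefetched message waits behind all the others at a fixed per-message cost):

```python
def worst_case_prefetch_wait(prefetch_count, max_concurrent_calls,
                             seconds_per_message):
    """Rough worst-case seconds before the last prefetched message finishes.

    Prefetched messages sit in memory while only max_concurrent_calls are
    processed in parallel, so the last one waits for everything ahead of it.
    """
    return (prefetch_count / max_concurrent_calls) * seconds_per_message

def renew_duration_is_safe(prefetch_count, max_concurrent_calls,
                           seconds_per_message, max_auto_renew_seconds):
    # maxAutoRenewDuration must cover the worst-case wait, or locks expire
    # and dequeue counts climb unexpectedly.
    return worst_case_prefetch_wait(
        prefetch_count, max_concurrent_calls, seconds_per_message
    ) <= max_auto_renew_seconds
```

With the numbers from the text (prefetch 100, concurrency 2, 60 s per message, 5-minute renew window), the worst case is 3000 seconds, far past the 300-second renew limit.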
Azure CosmosDB Trigger
Protocol – The function app uses the SQL Core API to call CosmosDB
- Change Feed – Change feed support in Azure Cosmos DB works by listening to an Azure Cosmos DB container for changes and adding the details of the changes to a lease container. Detailed doc on the change feed concept: https://docs.microsoft.com/en-us/azure/cosmos-db/change-feed#features-of-change-feed
- Change feed processor – entity in the Azure CosmosDB SDK that simplifies the reading of the change feed and allows for dynamic scaling. It guarantees at least once delivery.
- Monitored Container – Container being monitored for inserts and changes
- Lease Container – Maintains the state of changes. It can be created manually, or the runtime can create it if the CreateLeaseCollectionIfNotExists flag is set. Similar to Eventhubs, checkpointing is managed by the runtime.
- Partitions – If the lease container has multiple partitions (or divisions of the container), an id must be provided to the function.
- Choosing the right connection option, configurable in the host.json –
- Gateway Mode (default) – uses HTTP (443) and is recommended for functions
- Direct Mode – Uses TCP and HTTPS over ports 10000-20000
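A hedged example of selecting the connection mode in a v2 host.json (Gateway/Https shown; Direct mode would use Tcp):

```json
{
  "version": "2.0",
  "extensions": {
    "cosmosDB": {
      "connectionMode": "Gateway",
      "protocol": "Https"
    }
  }
}
```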
- function.json configurations
- The scaling is dynamic (based on Matias's comment here), driven by the load and how Cosmos sets up the partition keys for the monitored collection. So the function should be able to scale appropriately in Consumption.
Timer Trigger
- Timer triggers are triggered off of CRON expressions
- You can change the site's time zone using WEBSITE_TIME_ZONE, which affects how the configured CRON expression is interpreted
- Application Insights sampling is a common cause of logs not showing up in Application Insights. Disable sampling for further investigation if executions appear to be missing.
- If a timer is not running at all, make sure it's not sharing a storage account with another function app that uses timer triggers; if it is, make sure to specify an ID.
- Function executing off time – Usually it's one of two issues:
- The function is configured with runOnStartup, which means any time an instance starts up and another instance does not have a lock on the timer blob, it will execute the timer function (it may be for every function starting up, not 100% sure; either way it's probably not what you want).
- If the timer function never completed due to a crash, an expected/unexpected host shutdown, a restart of the app, etc., the timer trigger will not be marked as completed, so when the host starts up it will recognize the timer as past due, causing it to fire at that point in time. This can cause really unpredictable timings, especially in the Consumption plan where the function does not have Always On. To disable this feature, set useMonitor to false.
- The next timer will not execute unless the previous timer has completed.
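Tying the timer settings together, a hedged function.json sketch (binding name is my own; the six-field NCRONTAB expression below fires every five minutes):

```json
{
  "bindings": [
    {
      "type": "timerTrigger",
      "direction": "in",
      "name": "myTimer",
      "schedule": "0 */5 * * * *",
      "runOnStartup": false,
      "useMonitor": true
    }
  ]
}
```

Setting runOnStartup to false and leaving useMonitor at its default avoids the surprise executions described above, at the cost of possibly missing a past-due run after a host restart.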