Apigee Hybrid off the Grid: What happens when the internet goes down?

Objective

Apigee Hybrid consists of a control plane and a runtime plane. The control plane is hosted on GCP and is securely accessible over the internet. The runtime plane can be hosted on any supported Kubernetes platform on GCP, AWS, Azure, or on the customer's premises (using Anthos, RKE, or OpenShift). The runtime plane needs consistent connectivity to the control plane. The objective of this blog is to describe the impact on Apigee hybrid runtime components when connectivity to the Apigee hybrid control plane is lost.

Apigee Hybrid components external connection

The runtime plane communicates with the control plane over the internet. The table below lists the GCP URLs that the hybrid runtime plane uses for communication:

 

| Apigee Hybrid Component      | GCP URL Accessed                                     |
|------------------------------|------------------------------------------------------|
| Ingress                      | NA                                                   |
| Message Processor            | NA                                                   |
| Synchronizer                 | apigee.googleapis.com, iamcredentials.googleapis.com |
| UDCA (Analytics)             | apigee.googleapis.com, storage.googleapis.com        |
| Apigee Connect               | apigeeconnect.googleapis.com                         |
| Prometheus (metrics)         | monitoring.googleapis.com                            |
| Fluentd (logging)            | logging.googleapis.com                               |
| MART                         | iamcredentials.googleapis.com                        |
| Message Processor (Optional) | cloudtrace.googleapis.com                            |
| Watcher                      | apigee.googleapis.com                                |

 

The following image shows the ports used for external communications with the hybrid runtime plane:

[Image: arch-external-connections.png]

Loss of connectivity to Apigee Hybrid control plane

The connection to the control plane may be lost for various reasons, including:

  • Network issues in the datacenter
  • Issues connecting to the proxy server (if used)
  • Firewall issues preventing access to the control plane
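
To tell which of these causes is in play, it can help to test reachability of each control plane hostname from inside the cluster network. The following is a minimal sketch, not an official Apigee tool. It assumes the endpoints are reached over HTTPS on port 443 and that a plain TCP connection is a meaningful test; if traffic goes through a forward proxy, a direct TCP check will not reflect the proxy path. The function names `tcp_connect` and `check_hosts` are hypothetical.

```python
import socket
from typing import Callable, Dict, Iterable

# Hostnames taken from the table of GCP URLs earlier in this article.
CONTROL_PLANE_HOSTS = [
    "apigee.googleapis.com",
    "iamcredentials.googleapis.com",
    "storage.googleapis.com",
    "apigeeconnect.googleapis.com",
    "monitoring.googleapis.com",
    "logging.googleapis.com",
    "cloudtrace.googleapis.com",
]


def tcp_connect(host: str, port: int, timeout: float) -> None:
    """Open and immediately close a TCP connection; raises OSError on failure."""
    with socket.create_connection((host, port), timeout=timeout):
        pass


def check_hosts(
    hosts: Iterable[str],
    port: int = 443,  # assumption: HTTPS on the default port
    timeout: float = 3.0,
    connect: Callable[[str, int, float], None] = tcp_connect,
) -> Dict[str, bool]:
    """Return {host: reachable} for every host, using the given connect function."""
    results: Dict[str, bool] = {}
    for host in hosts:
        try:
            connect(host, port, timeout)
            results[host] = True
        except OSError:
            # Covers DNS failures, refused connections and timeouts.
            results[host] = False
    return results
```

Calling `check_hosts(CONTROL_PLANE_HOSTS)` from a pod or node in the runtime cluster returns a dictionary of hostnames to booleans. The `connect` parameter exists only so the check can be exercised without network access.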

Impact on runtime components

Synchronizer

Functionality 

Synchronizer retrieves data from the control plane and stores it in Cassandra, which is a shared backend used by all synchronizers. After data replication in Cassandra, a zip file is created locally for use by message processors.

In a multi-region setup, Cassandra is shared by all synchronizers. To prevent redundant data downloads, only the first synchronizer to poll and discover that the data is unavailable locally will retrieve it from the control plane. Subsequent synchronizers will then download the data from Cassandra.
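
The "first synchronizer downloads, the rest read from Cassandra" behaviour resembles a read-through cache. The sketch below illustrates the idea only. `SharedStore` is a hypothetical in-memory stand-in for Cassandra, `sync_config` and `fetch_from_control_plane` are invented names, and the real Synchronizer's coordination between concurrent pollers is not shown.

```python
from typing import Callable, Dict, Optional


class SharedStore:
    """Hypothetical in-memory stand-in for the shared Cassandra backend."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def get(self, version: str) -> Optional[bytes]:
        return self._data.get(version)

    def put(self, version: str, payload: bytes) -> None:
        self._data[version] = payload


def sync_config(
    version: str,
    store: SharedStore,
    fetch_from_control_plane: Callable[[str], bytes],
) -> bytes:
    """Return the config for `version`, downloading it only if the store lacks it."""
    payload = store.get(version)
    if payload is None:
        # Only the first caller to miss reaches the control plane.
        payload = fetch_from_control_plane(version)
        store.put(version, payload)
    return payload
```

In this model, a synchronizer that finds the data already in the shared store never calls the control plane, so it keeps working even when the control plane is unreachable.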

The configuration data downloaded by the Synchronizer includes:

  • Proxy bundles
  • Shared flow deployments
  • Flow hooks
  • Environment information
  • Target server definitions
  • TLS settings
  • Data masks
  • Data collectors
  • Hybrid trace configuration

Impact

  • Downloads of newer configuration data (listed above) will fail.
  • Existing configuration will continue to work. If pods are restarted, the data is pulled from Cassandra to resume runtime operations.
  • No Impact on Apigee Northbound & Southbound Traffic
  • You will also see errors like the one below in the pod logs, and the pods will go into CrashLoopBackOff:

 

{"level":"SEVERE","thread":"NIOThread@1","mdc":{},"className":"com.apigee.probe.ProbeAPI","method":"getResponse","severity":"SEVERE","message":"probe failed with details ProbeStatusResponse{isProbeSuccessful=false, failureMessages=[Probe ControlPlaneErrorMonitor failed due to Error in connecting to control plane more than 5 times consecutively.]}","formattedDate":"2024-05-15T05:54:17.724Z","logger":"ProbeAPI"}
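
The log message says the probe fails after the control plane has been unreachable "more than 5 times consecutively". A probe of that shape can be sketched as follows. This illustrates the counting logic only: the class name and structure are hypothetical, and the link between the failing probe and the pod restarts is an inference from the message and the CrashLoopBackOff state, not something confirmed here.

```python
class ConsecutiveFailureProbe:
    """Reports unhealthy once failures in a row exceed a threshold."""

    def __init__(self, threshold: int = 5) -> None:
        self.threshold = threshold  # 5 is taken from the log message above
        self.consecutive_failures = 0

    def record(self, success: bool) -> None:
        """Record the outcome of one control plane call."""
        if success:
            self.consecutive_failures = 0  # any success resets the streak
        else:
            self.consecutive_failures += 1

    def is_healthy(self) -> bool:
        # "more than 5 times" means the sixth consecutive failure trips the probe.
        return self.consecutive_failures <= self.threshold
```

A single successful call resets the counter, so intermittent errors do not trip the probe. Only a sustained outage does.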

 

Message Processor

Functionality 

The apigee-runtime is responsible for processing incoming API requests, executing policies, and forwarding requests to the appropriate target services. To carry out these tasks, the runtime interacts with the synchronizer and Cassandra.

The apigee-runtime continuously polls the synchronizer to get the latest configuration, which contains proxies, resources, target servers, and related entities such as trace data and encryption keys. The runtime data is stored in the Cassandra database.

The apigee-runtime configuration is set at the environment level, and each environment has one or more apigee-runtime pods depending on the number of replicas.
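
One way to picture why a failed poll does not disturb traffic for proxies that are already deployed is a poller that keeps its last successfully loaded configuration. This is a simplified sketch under that assumption, not the runtime's actual code. `RuntimeConfigHolder` and `DownloadError` are hypothetical names.

```python
from typing import Any, Callable, Optional


class DownloadError(Exception):
    """Raised by the download function when the configuration cannot be fetched."""


class RuntimeConfigHolder:
    """Keeps the last successfully downloaded configuration."""

    def __init__(self, download: Callable[[], Any]) -> None:
        self._download = download
        self.current: Optional[Any] = None  # configuration currently being served
        self.failed_polls = 0

    def poll_once(self) -> bool:
        """Try to refresh; on failure keep serving the existing configuration."""
        try:
            latest = self._download()
        except DownloadError:
            self.failed_polls += 1
            return False
        self.current = latest
        self.failed_polls = 0
        return True
```

In this model a failed poll only increments a counter. The configuration already loaded stays in place until a later poll succeeds.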

Impact

  • New APIs won't be deployed, because the synchronizer cannot download the bundles.
  • Apigee Debug will be impacted, because debug signals won't reach the runtime pods.
  • No Impact on Apigee Northbound & Southbound Traffic for existing APIs
  • You will also see errors like the ones below in the pod logs:

{"level":"SEVERE","thread":"Apigee-Timer-1","mdc":{"action":"RUNTIME-SYNC","env":"test1","org":"apigee-hybrid-378710"},"className":"com.apigee.hybrid.runtime.contract.load.sync.context.HttpContractDownloader","method":"lambda$download$0","severity":"SEVERE","message":"Failed to get version. Cause: Not Found [CONTEXT ratelimit_period=\"10 MINUTES [skipped: 13]\" ]","formattedDate":"2024-05-15T06:02:36.977Z","logger":"API-SECURITY-CONTRACT-REPLICATION"}

{"level":"SEVERE","thread":"Apigee-Timer-1","mdc":{},"className":"com.apigee.threadpool.PollTask","method":"runTask","severity":"SEVERE","message":"Error during refresh [CONTEXT ratelimit_period=\"10 MINUTES [skipped: 13]\" ]","formattedDate":"2024-05-15T06:02:36.977Z","logger":"API-SECURITY-CONTRACT-REPLICATION","exceptionStackTrace":"com.apigee.hybrid.runtime.contract.replication.DownloadException{ code = sync.replicators.DownloadError, message = Error downloading Version zip file : cause Not Found, associated contexts = []}\n"}
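
Because the runtime and synchronizer pods emit one JSON object per log line, entries like the ones above can be pulled out of a log dump with a few lines of code. The sketch below assumes the field names visible in the samples (`level`, `formattedDate`, `logger`, `message`). The function name `severe_entries` is hypothetical.

```python
import json
from typing import Iterable, List, Tuple


def severe_entries(lines: Iterable[str]) -> List[Tuple[str, str, str]]:
    """Return (formattedDate, logger, message) for each SEVERE JSON log line."""
    found: List[Tuple[str, str, str]] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if isinstance(entry, dict) and entry.get("level") == "SEVERE":
            found.append(
                (
                    entry.get("formattedDate", ""),
                    entry.get("logger", ""),
                    entry.get("message", ""),
                )
            )
    return found
```

The function accepts any iterable of strings, for example an open file containing saved pod logs.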

 

UDCA

Functionality 

The Universal Data Collection Agent (UDCA) is a service running in the runtime plane that extracts analytics, debug, and deployment status data and sends it to the UAP (Unified Analytics Platform) on GCP.

Impact

  • Analytics data won't be sent to GCP immediately.
    • Files are buffered in the pod if the analytics upload fails due to a network issue.
    • Failed files are retried for upload after 60 seconds.
    • Failed files are discarded after 3 retries.
  • No Impact on Apigee Northbound & Southbound Traffic for existing APIs
  • You will also see errors like the ones below in the pod logs, and the pods will go into CrashLoopBackOff:

{"level":"error","ts":1715752630.1903248,"caller":"log/logger.go:85","msg":"Encountered http error while uploading file \"api.xxxx-xxxx.test1.MP-UDCA-CHANNEL_0\". Details: http error received with code = xxx for service = \"DATALOCATION\" with message = \"unable to generate signed url. details: {\\n  \\\"error\\\": {\\n    \\\"code\\\": xxx,\\n    \\\"message\\\": \\\"xxx \\\\\\\"organizations/xxxx/environments/test1/datalocation\\\\\\\" (or it may not exist)\\\",\\n    \\\"status\\\": \\\"xxx\\\"\\n  }\\n}\\n\"","stacktrace":"edge-internal.git.corp.google.com/uap/aau/log.Errorf\n\t/go/src/edge-internal/uap/aau/log/logger.go:85\nedge-internal.git.corp.google.com/uap/aau/handler.(*handler).HandleError\n\t/go/src/edge-internal/uap/aau/handler/handler.go:132\nedge-internal.git.corp.google.com/uap/aau/handler.(*handler).Run\n\t/go/src/edge-internal/uap/aau/handler/handler.go:72"}

{"level":"info","ts":1715752630.1903994,"caller":"log/logger.go:65","msg":"Reverting file \"/opt/apigee/data/api/staging/1715471290207.api.xxxxx.test1.65fef1f5-baa7-47f7-966e-f4cb5b0c86f1_0.gz\" to original name \"/opt/apigee/data/api/api.xxxxx.test1.MP-UDCA-CHANNEL_0.gz\""}

{"level":"info","ts":1715752630.190494,"caller":"log/logger.go:65","msg":"Updating retry count for file \"api.apigee-xx-xx.test1.MP-UDCA-CHANNEL_0\" to 1"}
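
The buffer, retry, and discard behaviour listed under Impact can be modelled as a small retry queue. The sketch below is an illustration, not UDCA's implementation. It reads "discarded after 3 retries" as one initial attempt plus three retries, and it leaves the 60-second wait between passes to the caller. `UploadBuffer` and its methods are hypothetical names.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

RETRY_INTERVAL_SECONDS = 60  # from the text; the caller waits this long between passes
MAX_RETRIES = 3              # from the text


class UploadBuffer:
    """Buffers files locally and retries failed uploads a limited number of times."""

    def __init__(self, upload: Callable[[str], None], max_retries: int = MAX_RETRIES) -> None:
        self._upload = upload
        self._max_retries = max_retries
        self._pending: Deque[Tuple[str, int]] = deque()  # (filename, retry count)
        self.discarded: List[str] = []

    def add(self, filename: str) -> None:
        self._pending.append((filename, 0))

    def run_cycle(self) -> None:
        """Make one upload pass over every buffered file."""
        for _ in range(len(self._pending)):
            filename, retries = self._pending.popleft()
            try:
                self._upload(filename)
            except ConnectionError:
                retries += 1  # mirrors "Updating retry count ... to 1" in the log
                if retries > self._max_retries:
                    self.discarded.append(filename)
                else:
                    self._pending.append((filename, retries))

    def pending(self) -> List[str]:
        return [name for name, _ in self._pending]
```

Under this reading, analytics files survive outages shorter than a few retry intervals, while a longer outage causes the affected files to be dropped.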

Prometheus (metrics)

Functionality 

All Apigee hybrid components deployed on Kubernetes clusters expose an HTTP/HTTPS Prometheus endpoint that the Apigee metrics pods can scrape. Each application publishes its metrics in OpenCensus format and should support either one-way TLS or mTLS.

Apigee uses an OpenTelemetry collector to scrape metrics from the Kubernetes pods over HTTP(S) and send customer-facing metrics to the customer project.

The OpenTelemetry collector sends the following customer-facing metrics:

 

| Monitored Resource           | Metric Name                                 |
|------------------------------|---------------------------------------------|
| apigee.googleapis.com/Proxy  | apigee.googleapis.com/proxy/request_count   |
| apigee.googleapis.com/Proxy  | apigee.googleapis.com/proxy/response_count  |
| apigee.googleapis.com/Proxy  | apigee.googleapis.com/proxy/latencies       |
| apigee.googleapis.com/Target | apigee.googleapis.com/target/request_count  |
| apigee.googleapis.com/Target | apigee.googleapis.com/target/response_count |
| apigee.googleapis.com/Target | apigee.googleapis.com/target/latencies      |

 

Impact

  • No Impact on Apigee Northbound & Southbound Traffic for existing APIs
  • Apigee Hybrid Metrics data won't be pushed to GCP Monitoring

Fluentd (logging)

Functionality 

Apigee logging pods are deployed as a DaemonSet across the Kubernetes clusters. The DaemonSet contains one fluentd container augmented with plugins for SD storage, Prometheus, and tailing files. The container is responsible for tailing log files from the container logs directory for all containers with the apigee-* prefix.

The logs are then pushed to GCP Logging.

Impact

  • No Impact on Apigee Northbound & Southbound Traffic for existing APIs
  • Apigee Hybrid Logging data won't be pushed to GCP Logging
    • Fluent Bit has internal buffering and retry logic.
    • The retry logic is not exposed through Apigee overrides.

MART

Functionality 

Data that belongs to your Apigee organization and is accessed during runtime API calls is stored in Cassandra in the runtime plane.

This data includes:

  • Application configurations
  • Key Management System (KMS) data
  • Cache
  • Key Value Maps (KVMs)
  • API products
  • Developer apps

To access and update that data (for example, to add a new KVM or to remove an environment), you can use the Apigee hybrid UI or the Apigee APIs.

The MART server (Management API for Runtime data) processes the API calls against the runtime datastore.

Impact

  • No Impact on Apigee Northbound & Southbound Traffic for existing APIs
  • New or updated data of the types listed above cannot be written to Cassandra from the control plane.

Apigee Connect

Functionality 

With Apigee Connect, the Apigee hybrid management plane can securely connect to the MART service in the runtime plane, eliminating the need to expose the MART endpoint on the internet. 

Impact

  • No Impact on Apigee Northbound & Southbound Traffic for existing APIs
  • Connectivity between MART and the Apigee control plane is affected, because MART depends on Apigee Connect.

Watcher

Functionality 

Watcher is responsible for periodically executing tasks in the Apigee runtime k8s cluster. Tasks currently performed by watcher:

  • Ingress configuration: Watcher regularly retrieves the Apigee organization's ingress configuration from the control plane and creates ApigeeRoutes for each environment group. ApigeeRoute components are then processed by the ApigeeRoute controller to generate Istio gateways and virtual services in the cluster.
  • Ingress status: Watcher determines the ingress status. This status is then forwarded to the control plane, which utilizes it to calculate the deployment status of a proxy.
  • MP deployment status: The Watcher component is responsible for gathering deployment status updates from all the runtime pods. It consolidates this information and forwards the combined status to the control plane.
  • Pods availability metrics: Watcher keeps a record of all the pods running in the cluster. It then calculates ready and total replicas per application type and exposes these counts for Prometheus to scrape.
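
The pods availability calculation in the last task amounts to grouping pods by application type and counting ready versus total replicas. A minimal sketch follows. The `Pod` structure and the `replica_counts` function are hypothetical, and the real Watcher reads pod state from the Kubernetes API rather than from an in-memory list.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class Pod(NamedTuple):
    name: str
    app: str     # application type, for example "apigee-runtime"
    ready: bool  # whether the pod currently reports ready


def replica_counts(pods: Iterable[Pod]) -> Dict[str, Dict[str, int]]:
    """Return {app: {"ready": n, "total": m}} for the given pods."""
    counts: Dict[str, Dict[str, int]] = defaultdict(lambda: {"ready": 0, "total": 0})
    for pod in pods:
        counts[pod.app]["total"] += 1
        if pod.ready:
            counts[pod.app]["ready"] += 1
    return dict(counts)
```

These counts are computed entirely from cluster state, so in this model they remain available locally even when the control plane is unreachable. Only the status updates sent to the control plane are affected.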

Impact

  • No Impact on Apigee Northbound & Southbound Traffic for existing APIs
  • Newer ingress configuration won't be available to the runtime.
  • The Apigee UI will not show the correct deployment status for APIs.

Summary

Apigee Hybrid is built for resilience, not isolation. While it can tolerate temporary disruptions, extended offline periods may impair its core functionality, especially the ability to deploy new or modified proxies. This highlights the need for robust network infrastructure and a focus on maintaining consistent connectivity. By understanding these details, organizations can better plan for and mitigate the risks associated with network disruptions.

 
