Health Monitoring REST API¶

Knox provides a RESTful API for monitoring the core service. It exposes the health of the Knox service, including service status (up/down) and other health metrics. This feature is a work in progress that began as an extensible framework supporting basic functionality. It currently supports two APIs: A) pinging the service and B) retrieving time-based statistics for all API calls.

Health Monitoring Setup¶

The basic setup involves two steps: A) add the configurations that enable metrics collection and reporting, and B) write a topology file and place it in the topologies directory.

Service Configurations¶

First, make sure the gateway settings that collect metrics and report them to JMX are turned on in gateway-site.xml. The following two properties serve this purpose.

<property>
   <name>gateway.metrics.enabled</name>
   <value>true</value>
   <description>Boolean flag indicates whether to enable the metrics collection</description>
</property>
<property>
   <name>gateway.jmx.metrics.reporting.enabled</name>
   <value>true</value>
   <description>Boolean flag indicates whether to enable the metrics reporting using JMX</description>
</property>
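A quick way to confirm both flags before restarting the gateway is to read them back from the file. The sketch below is illustrative only: it assumes gateway-site.xml wraps its `<property>` entries in a Hadoop-style `<configuration>` root element, and the helper name `metrics_flags` is hypothetical, not part of Knox.

```python
import xml.etree.ElementTree as ET

# The two property names come from the configuration shown above.
REQUIRED_FLAGS = (
    "gateway.metrics.enabled",
    "gateway.jmx.metrics.reporting.enabled",
)


def metrics_flags(xml_text):
    """Return {flag_name: bool} for the two metrics flags.

    Assumes <property> elements with <name>/<value> children anywhere under
    the root element. A flag that is absent is reported as False.
    """
    root = ET.fromstring(xml_text)
    found = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip().lower()
        if name in REQUIRED_FLAGS:
            found[name] = value == "true"
    return {flag: found.get(flag, False) for flag in REQUIRED_FLAGS}
```

Calling `metrics_flags(open("conf/gateway-site.xml").read())` (the path is an assumption about your installation layout) returns a dictionary in which both entries should be `True` for the health metrics to be collected and reported.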

health.xml Topology¶

To enable the health monitoring REST service, add a new topology file (for example, health.xml). The following example is configured for testing the basic functionality of the Knox service. For real deployments, a more restrictive authentication mechanism is strongly recommended.

<topology>

    <gateway>

        <provider>
            <role>authentication</role>
            <name>ShiroProvider</name>
            <enabled>true</enabled>
            <param>
                <!--
                Session timeout in minutes. This is really an idle timeout and
                defaults to 30 minutes if the property value is not defined.
                The current client authentication expires if the client idles
                continuously for longer than this value.
                -->
                <name>sessionTimeout</name>
                <value>30</value>
            </param>
            <param>
                <name>main.ldapRealm</name>
                <value>org.apache.knox.gateway.shirorealm.KnoxLdapRealm</value>
            </param>
            <param>
                <name>main.ldapContextFactory</name>
                <value>org.apache.knox.gateway.shirorealm.KnoxLdapContextFactory</value>
            </param>
            <param>
                <name>main.ldapRealm.contextFactory</name>
                <value>$ldapContextFactory</value>
            </param>
            <param>
                <name>main.ldapRealm.userDnTemplate</name>
                <value>uid={0},ou=people,dc=hadoop,dc=apache,dc=org</value>
            </param>
            <param>
                <name>main.ldapRealm.contextFactory.url</name>
                <value>ldap://localhost:33389</value>
            </param>
            <param>
                <name>main.ldapRealm.contextFactory.authenticationMechanism</name>
                <value>simple</value>
            </param>
            <param>
                <name>urls./**</name>
                <value>authcBasic</value>
            </param>
        </provider>

        <provider>
            <role>authorization</role>
            <name>AclsAuthz</name>
            <enabled>false</enabled>
            <param>
                <name>knox.acl</name>
                <value>admin;*;*</value>
            </param>
        </provider>

        <provider>
            <role>identity-assertion</role>
            <name>Default</name>
            <enabled>false</enabled>
        </provider>

        <provider>
            <role>hostmap</role>
            <name>static</name>
            <enabled>true</enabled>
            <param><name>localhost</name><value>sandbox,sandbox.hortonworks.com</value></param>
        </provider>

    </gateway>

    <service>
        <role>HEALTH</role>
    </service>

</topology>

Just as with any Knox service, the gateway providers defined in the topology protect the health monitoring REST service declared below them. In this case, the ShiroProvider handles HTTP Basic authentication against LDAP. Once the user authenticates with LDAP, request processing continues to the Health service, which performs the requested action.

The authentication/federation provider can be swapped out to fit your deployment environment.
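Because the topology mixes enabled and disabled providers, it can help to list what a given health.xml actually turns on before deploying it. The following is a minimal sketch that relies only on the element layout visible in the example topology (`<gateway>/<provider>` with `<role>`, `<name>`, `<enabled>`, and top-level `<service>/<role>`); the function name `summarize_topology` is hypothetical.

```python
import xml.etree.ElementTree as ET


def summarize_topology(xml_text):
    """Return (providers, services) parsed from a topology document.

    providers: list of (role, name, enabled) tuples in document order.
    services:  list of service role strings.
    """
    root = ET.fromstring(xml_text)
    providers = []
    for provider in root.iterfind("./gateway/provider"):
        # findtext() only looks at direct children, so the <name> elements
        # nested inside <param> blocks are not picked up here.
        role = (provider.findtext("role") or "").strip()
        name = (provider.findtext("name") or "").strip()
        enabled = (provider.findtext("enabled") or "").strip().lower() == "true"
        providers.append((role, name, enabled))
    services = [
        (service.findtext("role") or "").strip()
        for service in root.iterfind("./service")
    ]
    return providers, services
```

Run against the example topology, this would report the authentication and hostmap providers as enabled, the authorization and identity-assertion providers as disabled, and a single HEALTH service.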

After creating health.xml with the contents above, copy the file to the KNOX_HOME/conf/topologies directory. If the Knox gateway service is not running, start it with "bin/gateway.sh start". Otherwise, the running service automatically picks up the new 'health' topology. When the gateway registers the new topology, it writes the following messages to log/gateway.log.

2017-08-22 03:44:25,045 INFO  knox.gateway (GatewayServer.java:handleCreateDeployment(677)) - Deploying topology health to /home/joe/knox/knox-0.12.0/bin/../data/deployments/health.topo.15e080a91c0
2017-08-22 03:44:25,045 INFO  knox.gateway (GatewayServer.java:internalDeactivateTopology(596)) - Deactivating topology health
2017-08-22 03:44:25,119 INFO  knox.gateway (DefaultGatewayServices.java:initializeContribution(197)) - Creating credential store for the cluster: health
2017-08-22 03:44:25,142 INFO  knox.gateway (GatewayServer.java:internalActivateTopology(566)) - Activating topology health
2017-08-22 03:44:25,142 INFO  knox.gateway (GatewayServer.java:internalActivateArchive(576)) - Activating topology health archive %2F
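To check for these messages without reading the log by eye, a small scan is enough. The sketch below matches only the "Activating topology <name>" message format seen in the excerpt above; that format is taken from this sample, not from a documented contract, and the helper name `activated_topologies` is hypothetical.

```python
import re

# Matches lines that END with "Activating topology <name>", as in the excerpt.
# "Deactivating topology ..." and "... archive %2F" lines do not match.
ACTIVATED = re.compile(r"\bActivating topology (\S+)$")


def activated_topologies(log_lines):
    """Return topology names that were activated, in first-seen order."""
    names = []
    for line in log_lines:
        match = ACTIVATED.search(line.rstrip())
        if match and match.group(1) not in names:
            names.append(match.group(1))
    return names
```

For example, `"health" in activated_topologies(open("log/gateway.log"))` indicates that the gateway has activated the new topology.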

Verify¶

Once the health service is active, you can verify it with the following curl command. The 'ping' end point reports whether the service is up, so it can be used to monitor the basic health of a Knox service.

$ curl -i -k -u guest:guest-password -X GET 'https://localhost:8445/gateway/health/v1/ping'
HTTP/1.1 200 OK
Date: Tue, 22 Aug 2017 07:09:37 GMT
Set-Cookie: JSESSIONID=1o82bcvoqbhbb1apt7zs8ubybb;Path=/gateway/health;Secure;HttpOnly
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Set-Cookie: rememberMe=deleteMe; Path=/gateway/health; Max-Age=0; Expires=Mon, 21-Aug-2017 07:09:37 GMT
Cache-Control: must-revalidate,no-cache,no-store
Content-Type: text/plain; charset=ISO-8859-1
Content-Length: 3
Server: Jetty(9.2.15.v20160210)

OK
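For automated monitoring, the same check can be scripted. The sketch below mirrors the curl call using only the Python standard library; the function names are hypothetical, the URL and credentials are the sample values from the command above, and `verify_tls=False` plays the role of curl's `-k` flag, so it should be limited to test setups with self-signed certificates.

```python
import base64
import ssl
import urllib.request


def basic_auth_header(user, password):
    """Build the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return "Basic " + token


def is_up(status, body):
    """Interpret a /ping response: HTTP 200 with a plain-text body of OK."""
    return status == 200 and body.strip() == "OK"


def ping(base_url, user, password, verify_tls=True, timeout=5.0):
    """Return True if <base_url>/v1/ping answers 200 OK, False otherwise."""
    context = ssl.create_default_context()
    if not verify_tls:
        # Equivalent to curl -k: skip certificate and hostname verification.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    request = urllib.request.Request(
        base_url.rstrip("/") + "/v1/ping",
        headers={"Authorization": basic_auth_header(user, password)},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout, context=context) as response:
            return is_up(response.status, response.read().decode("iso-8859-1"))
    except OSError:
        # URLError, HTTPError, TLS errors, and timeouts are all OSError subclasses.
        return False
```

A call such as `ping("https://localhost:8445/gateway/health", "guest", "guest-password", verify_tls=False)` corresponds to the curl command above. Treating every connection or HTTP error as "down" is a deliberate simplification; a real monitor may want to distinguish authentication failures from an unreachable gateway.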

To retrieve meaningful metrics for various service calls, you may first need to issue several REST calls, such as the WebHDFS request below. Then execute the metrics REST call; a sample of its output follows. The metrics are returned in JSON format.

$ curl -i -k -u guest:guest-password -X GET 'https://localhost:8445/gateway/sandbox/webhdfs/v1/?op=LISTSTATUS'

$ curl -i -k -u guest:guest-password -X GET 'https://localhost:8445/gateway/health/v1/metrics?pretty=true'
HTTP/1.1 200 OK
Date: Tue, 22 Aug 2017 07:10:44 GMT
Set-Cookie: JSESSIONID=kqntcdaje9uai3pup7ffvfw4;Path=/gateway/health;Secure;HttpOnly
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Set-Cookie: rememberMe=deleteMe; Path=/gateway/health; Max-Age=0; Expires=Mon, 21-Aug-2017 07:10:44 GMT
Content-Type: application/json
Cache-Control: must-revalidate,no-cache,no-store
Transfer-Encoding: chunked
Server: Jetty(9.2.15.v20160210)

{
  "version" : "3.0.0",
  "gauges" : { },
  "counters" : { },
  "histograms" : { },
  "meters" : { },
  "timers" : {
    "client./gateway/health/v1/metrics.GET-requests" : {
      "count" : 5,
      "max" : 0.624587973,
      "mean" : 0.027655743001736188,
      "min" : 0.006145587,
      "p50" : 0.010020548,
      "p75" : 0.010020548,
      "p95" : 0.074454725,
      "p98" : 0.624587973,
      "p99" : 0.624587973,
      "p999" : 0.624587973,
      "stddev" : 0.0929226225229978,
      "m15_rate" : 2.657500857422334E-7,
      "m1_rate" : 5.770087852901534E-89,
      "m5_rate" : 4.769163772973399E-19,
      "mean_rate" : 4.0952378345310894E-4,
      "duration_units" : "seconds",
      "rate_units" : "calls/second"
    },
    "client./gateway/health/v1/ping.GET-requests" : {
      "count" : 1,
      "max" : 0.017257638000000002,
      "mean" : 0.017257638000000002,
      "min" : 0.017257638000000002,
      "p50" : 0.017257638000000002,
      "p75" : 0.017257638000000002,
      "p95" : 0.017257638000000002,
      "p98" : 0.017257638000000002,
      "p99" : 0.017257638000000002,
      "p999" : 0.017257638000000002,
      "stddev" : 0.0,
      "m15_rate" : 0.18710139700632353,
      "m1_rate" : 0.0735758882342885,
      "m5_rate" : 0.1637461506155964,
      "mean_rate" : 0.014990517517814805,
      "duration_units" : "seconds",
      "rate_units" : "calls/second"
    },
    "client./gateway/sandbox/health/v1/.GET-requests" : {
      "count" : 1,
      "max" : 4.01873E-4,
      "mean" : 4.01873E-4,
      "min" : 4.01873E-4,
      "p50" : 4.01873E-4,
      "p75" : 4.01873E-4,
      "p95" : 4.01873E-4,
      "p98" : 4.01873E-4,
      "p99" : 4.01873E-4,
      "p999" : 4.01873E-4,
      "stddev" : 0.0,
      "m15_rate" : 2.536740427767808E-7,
      "m1_rate" : 7.074903404511115E-90,
      "m5_rate" : 4.081014139447941E-19,
      "mean_rate" : 8.179827684854002E-5,
      "duration_units" : "seconds",
      "rate_units" : "calls/second"
    },
    "client./gateway/sandbox/v1/health/.GET-requests" : {
      "count" : 1,
      "max" : 5.470700000000001E-4,
      "mean" : 5.470700000000001E-4,
      "min" : 5.470700000000001E-4,
      "p50" : 5.470700000000001E-4,
      "p75" : 5.470700000000001E-4,
      "p95" : 5.470700000000001E-4,
      "p98" : 5.470700000000001E-4,
      "p99" : 5.470700000000001E-4,
      "p999" : 5.470700000000001E-4,
      "stddev" : 0.0,
      "m15_rate" : 2.413022137213267E-7,
      "m1_rate" : 3.341947732164585E-90,
      "m5_rate" : 3.512561421726287E-19,
      "mean_rate" : 8.149518570285245E-5,
      "duration_units" : "seconds",
      "rate_units" : "calls/second"
    },
    "client./gateway/sandbox/webhdfs/v1/.GET-requests" : {
      "count" : 4,
      "max" : 0.463745401,
      "mean" : 0.024924118143299912,
      "min" : 0.016542244,
      "p50" : 0.024799078000000002,
      "p75" : 0.033933548,
      "p95" : 0.033933548,
      "p98" : 0.033933548,
      "p99" : 0.033933548,
      "p999" : 0.033933548,
      "stddev" : 0.007284773511002474,
      "m15_rate" : 2.120680068580741E-8,
      "m1_rate" : 4.7541228609699333E-91,
      "m5_rate" : 1.5806080232092864E-20,
      "mean_rate" : 2.7314359915623396E-4,
      "duration_units" : "seconds",
      "rate_units" : "calls/second"
    },
    "service./gateway/sandbox/webhdfs/v1/.get-requests" : {
      "count" : 3,
      "max" : 0.014635496000000001,
      "mean" : 0.00342438191233768,
      "min" : 0.0020088890000000002,
      "p50" : 0.0020088890000000002,
      "p75" : 0.005144646,
      "p95" : 0.005144646,
      "p98" : 0.005144646,
      "p99" : 0.005144646,
      "p999" : 0.005144646,
      "stddev" : 0.0015604555820128599,
      "m15_rate" : 1.9913776931949195E-8,
      "m1_rate" : 3.1334281325640874E-91,
      "m5_rate" : 1.055281734633953E-20,
      "mean_rate" : 2.0486339070804923E-4,
      "duration_units" : "seconds",
      "rate_units" : "calls/second"
    }
  }
}
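The timer keys in this output follow a visible pattern: a prefix (`client` or `service`), a request path, and an HTTP method followed by `-requests`. The sketch below splits a key into those parts so timers can be filtered or grouped. The pattern is inferred from the sample output only, not from a documented naming contract, and the prefixes are treated as opaque labels; the name `parse_timer_name` is hypothetical.

```python
import re

# <scope> "." <path starting with "/"> "." <METHOD> "-requests"
TIMER_NAME = re.compile(r"^(client|service)\.(/.*)\.([A-Za-z]+)-requests$")


def parse_timer_name(name):
    """Split a timer key into (scope, path, method), or None if it does not fit.

    The method is upper-cased because the sample shows both "GET" and "get".
    """
    match = TIMER_NAME.match(name)
    if match is None:
        return None
    scope, path, method = match.groups()
    return scope, path, method.upper()
```

For instance, `parse_timer_name("client./gateway/health/v1/ping.GET-requests")` yields `("client", "/gateway/health/v1/ping", "GET")`.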

REST End Points¶

As mentioned above, Knox currently provides only a few monitoring APIs. The list will grow over time to support new use cases.

/ping¶

This end-point can be used to determine whether a Knox gateway service is alive. It is useful for basic health monitoring of the core service. Although most REST calls return JSON, this one (/ping) returns plain text.

Sample response:

OK

/metrics¶

This end-point returns all Knox metrics grouped by individual call type. For example, timer metrics for all webhdfs calls are aggregated into one set of metrics and returned as a separate JSON element. The end-point also supports an option (/metrics?pretty=true) to pretty-print the output.

A sample response with pretty=true is shown below:
{
  "version" : "3.0.0",
  "gauges" : { },
  "counters" : { },
  "histograms" : { },
  "meters" : { },
  "timers" : {
    "client./gateway/health/v1/ping.GET-requests" : {
      "count" : 1,
      "max" : 0.017257638000000002,
      "mean" : 0.017257638000000002,
      "min" : 0.017257638000000002,
      "p50" : 0.017257638000000002,
      "p75" : 0.017257638000000002,
      "p95" : 0.017257638000000002,
      "p98" : 0.017257638000000002,
      "p99" : 0.017257638000000002,
      "p999" : 0.017257638000000002,
      "stddev" : 0.0,
      "m15_rate" : 0.18710139700632353,
      "m1_rate" : 0.0735758882342885,
      "m5_rate" : 0.1637461506155964,
      "mean_rate" : 0.014990517517814805,
      "duration_units" : "seconds",
      "rate_units" : "calls/second"
    },
    "client./gateway/sandbox/v1/health/.GET-requests" : {
      "count" : 1,
      "max" : 5.470700000000001E-4,
      "mean" : 5.470700000000001E-4,
      "min" : 5.470700000000001E-4,
      "p50" : 5.470700000000001E-4,
      "p75" : 5.470700000000001E-4,
      "p95" : 5.470700000000001E-4,
      "p98" : 5.470700000000001E-4,
      "p99" : 5.470700000000001E-4,
      "p999" : 5.470700000000001E-4,
      "stddev" : 0.0,
      "m15_rate" : 2.413022137213267E-7,
      "m1_rate" : 3.341947732164585E-90,
      "m5_rate" : 3.512561421726287E-19,
      "mean_rate" : 8.149518570285245E-5,
      "duration_units" : "seconds",
      "rate_units" : "calls/second"
    },
    "client./gateway/sandbox/webhdfs/v1/.GET-requests" : {
      "count" : 4,
      "max" : 0.463745401,
      "mean" : 0.024924118143299912,
      "min" : 0.016542244,
      "p50" : 0.024799078000000002,
      "p75" : 0.033933548,
      "p95" : 0.033933548,
      "p98" : 0.033933548,
      "p99" : 0.033933548,
      "p999" : 0.033933548,
      "stddev" : 0.007284773511002474,
      "m15_rate" : 2.120680068580741E-8,
      "m1_rate" : 4.7541228609699333E-91,
      "m5_rate" : 1.5806080232092864E-20,
      "mean_rate" : 2.7314359915623396E-4,
      "duration_units" : "seconds",
      "rate_units" : "calls/second"
    }
  }
}
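A response like the one above is easy to post-process. As a minimal sketch, the following ranks timers by mean latency and converts durations to milliseconds. It uses only field names that appear in the sample (`timers`, `count`, `mean`, `p99`, `duration_units`); the function name `slowest_timers` is hypothetical, and the code refuses to guess if a timer reports a unit other than seconds.

```python
import json


def slowest_timers(metrics_json, limit=3):
    """Return up to `limit` rows of (name, count, mean_ms, p99_ms), slowest first."""
    metrics = json.loads(metrics_json)
    rows = []
    for name, timer in metrics.get("timers", {}).items():
        if timer.get("duration_units") != "seconds":
            # The sample reports seconds; anything else needs its own conversion.
            raise ValueError(f"unexpected duration unit for {name!r}")
        rows.append((name, timer["count"], timer["mean"] * 1000.0, timer["p99"] * 1000.0))
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows[:limit]
```

Applied to the sample response, the webhdfs timer (mean of roughly 24.9 ms) would rank first, followed by the ping timer (roughly 17.3 ms).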