WebHDFS

REST API access to HDFS in a Hadoop cluster is provided by WebHDFS or HttpFS. Both services provide the same API. The WebHDFS REST API documentation is available online. WebHDFS must be enabled in the hdfs-site.xml configuration file and exposes the API on each NameNode and DataNode. HttpFS, in contrast, is a separate server that must be configured and started separately. In the sandbox this configuration file is located at /etc/hadoop/conf/hdfs-site.xml. Note the properties shown below, as they relate to configuration required by the gateway. Some of these represent the default values and may not actually be present in hdfs-site.xml.

<property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
</property>
<property>
    <name>dfs.namenode.rpc-address</name>
    <value>sandbox.hortonworks.com:8020</value>
</property>
<property>
    <name>dfs.namenode.http-address</name>
    <value>sandbox.hortonworks.com:50070</value>
</property>
<property>
    <name>dfs.namenode.https-address</name>
    <value>sandbox.hortonworks.com:50470</value>
</property>

The values above need to be reflected in each topology descriptor file deployed to the gateway. The gateway by default includes a sample topology descriptor file {GATEWAY_HOME}/deployments/sandbox.xml. The values in this sample are configured to work with an installed Sandbox VM.

Note also that the NameNode HTTP port changed from 50070 to 9870 in Hadoop 3.0; a Hadoop 3 variant of the WEBHDFS entry is sketched after the snippet below.

<service>
    <role>NAMENODE</role>
    <url>hdfs://localhost:8020</url>
</service>
<service>
    <role>WEBHDFS</role>
    <url>http://localhost:50070/webhdfs</url>
</service>
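
As noted above, Hadoop 3.x clusters expose the NameNode HTTP endpoint on port 9870 instead of 50070, so the corresponding WEBHDFS entry would look like the following sketch (host name is illustrative):

<service>
    <role>WEBHDFS</role>
    <url>http://localhost:9870/webhdfs</url>
</service>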

The URL provided for the NAMENODE role does not result in an endpoint being exposed by the gateway. This information is only required so that other URLs referencing the NameNode's RPC address can be rewritten. This prevents clients from needing to be aware of the internal cluster details.

By default the gateway is configured to use the HTTP endpoint for WebHDFS in the Sandbox. This could alternatively be configured to use the HTTPS endpoint by providing the correct address, as in the sketch below.
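
A minimal sketch, assuming the HTTPS address from the hdfs-site.xml example above (port 50470) and that SSL is enabled for WebHDFS in the cluster:

<service>
    <role>WEBHDFS</role>
    <url>https://localhost:50470/webhdfs</url>
</service>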

HDFS NameNode Federation

NameNode federation introduces some additional complexity when determining to which URL(s) Knox should proxy HDFS-related requests.

The HDFS core-site.xml configuration includes additional properties that describe the available NameNode endpoints.

| Property Name | Description | Example Value |
|---------------|-------------|---------------|
| dfs.internal.nameservices | The list of defined namespaces | ns1,ns2 |

For each value enumerated by dfs.internal.nameservices, another property is defined to specify the associated NameNode identifiers.

| Property Name | Description | Example Value |
|---------------|-------------|---------------|
| dfs.ha.namenodes.ns1 | The NameNode identifiers associated with the ns1 namespace | nn1,nn2 |
| dfs.ha.namenodes.ns2 | The NameNode identifiers associated with the ns2 namespace | nn3,nn4 |

For each NameNode identifier enumerated by these properties, additional properties are defined to specify the associated host addresses.

| Property Name | Description | Example Value |
|---------------|-------------|---------------|
| dfs.namenode.http-address.ns1.nn1 | The HTTP host address of the nn1 NameNode in the ns1 namespace | host1:50070 |
| dfs.namenode.https-address.ns1.nn1 | The HTTPS host address of the nn1 NameNode in the ns1 namespace | host1:50470 |
| dfs.namenode.http-address.ns1.nn2 | The HTTP host address of the nn2 NameNode in the ns1 namespace | host2:50070 |
| dfs.namenode.https-address.ns1.nn2 | The HTTPS host address of the nn2 NameNode in the ns1 namespace | host2:50470 |
| dfs.namenode.http-address.ns2.nn3 | The HTTP host address of the nn3 NameNode in the ns2 namespace | host3:50070 |
| dfs.namenode.https-address.ns2.nn3 | The HTTPS host address of the nn3 NameNode in the ns2 namespace | host3:50470 |
| dfs.namenode.http-address.ns2.nn4 | The HTTP host address of the nn4 NameNode in the ns2 namespace | host4:50070 |
| dfs.namenode.https-address.ns2.nn4 | The HTTPS host address of the nn4 NameNode in the ns2 namespace | host4:50470 |

So, if Knox should proxy the NameNodes associated with ns1, and the configuration does not dictate HTTPS, then the WEBHDFS service must contain URLs based on the values of dfs.namenode.http-address.ns1.nn1 and dfs.namenode.http-address.ns1.nn2. Likewise, if Knox should proxy the NameNodes associated with ns2, the WEBHDFS service must contain URLs based on the values of dfs.namenode.http-address.ns2.nn3 and dfs.namenode.http-address.ns2.nn4. A sketch of the ns1 case follows.
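
For example, proxying the ns1 NameNodes over HTTP using the example addresses from the table above would yield a WEBHDFS service entry like this sketch (with multiple URLs, the HaProvider described in the WebHDFS HA section below would typically also be enabled):

<service>
    <role>WEBHDFS</role>
    <url>http://host1:50070/webhdfs</url>
    <url>http://host2:50070/webhdfs</url>
</service>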

Fortunately, for Ambari-managed clusters, descriptors and service discovery can handle this complexity for administrators. In the descriptor, the service can be declared without any endpoints, and the desired namespace can be specified to disambiguate which endpoint(s) should be proxied by way of a parameter named discovery-nameservice.

"services": [
  {
    "name": "WEBHDFS",
    "params": {
      "discovery-nameservice": "ns2"
    }
  },

If no namespace is specified, then the default namespace will be applied. This default namespace is derived from the value of the property named fs.defaultFS defined in the HDFS core-site.xml configuration.
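
For instance, a core-site.xml like the following sketch (the value is illustrative) would make ns1 the default namespace:

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://ns1</value>
</property>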


WebHDFS URL Mapping

For NameNode URLs, the mapping of gateway-accessible WebHDFS URLs to direct WebHDFS URLs is simple.

| Type | URL |
|------|-----|
| Gateway | https://{gateway-host}:{gateway-port}/{gateway-path}/{cluster-name}/webhdfs |
| Cluster | http://{webhdfs-host}:50070/webhdfs |
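
As a concrete illustration using the Sandbox values shown elsewhere in this section (gateway at https://localhost:8443, gateway path gateway, cluster name sandbox), the mapping would be:

Gateway: https://localhost:8443/gateway/sandbox/webhdfs
Cluster: http://sandbox.hortonworks.com:50070/webhdfs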

However, there is a subtle difference in the URLs returned by WebHDFS in the Location header of many requests. Direct WebHDFS requests may return Location headers that contain the address of a particular DataNode. The gateway will rewrite these URLs to ensure subsequent requests come back through the gateway and internal cluster details are protected.

A WebHDFS request to the NameNode to retrieve a file will return a URL of the form below in the Location header.

http://{datanode-host}:{datanode-port}/webhdfs/v1/{path}?...

Note that this URL contains the network location of a DataNode. The gateway will rewrite this URL to look like the URL below.

https://{gateway-host}:{gateway-port}/{gateway-path}/{cluster-name}/webhdfs/data/v1/{path}?_={encrypted-query-parameters}

The {encrypted-query-parameters} will contain the {datanode-host} and {datanode-port} information. This information, along with the original query parameters, is encrypted so that the internal Hadoop details are protected.

WebHDFS Examples

The examples below upload a file, download the file and list the contents of the directory.

WebHDFS via client DSL

You can use the Groovy example scripts and interpreter provided with the distribution.

java -jar bin/shell.jar samples/ExampleWebHdfsPutGet.groovy
java -jar bin/shell.jar samples/ExampleWebHdfsLs.groovy

You can manually type the client DSL script into the KnoxShell interactive Groovy interpreter provided with the distribution. The command below starts the KnoxShell in interactive mode.

java -jar bin/shell.jar

Each line below could be typed or copied into the interactive shell and executed. This is provided as an example to illustrate the use of the client DSL.

// Import the client DSL and a useful utility for working with JSON.
import org.apache.knox.gateway.shell.Hadoop
import org.apache.knox.gateway.shell.hdfs.Hdfs
import groovy.json.JsonSlurper

// Setup some basic config.
gateway = "https://localhost:8443/gateway/sandbox"
username = "guest"
password = "guest-password"

// Start the session.
session = Hadoop.login( gateway, username, password )

// Clean up anything left over from a previous run.
Hdfs.rm( session ).file( "/user/guest/example" ).recursive().now()

// Upload the README to HDFS.
Hdfs.put( session ).file( "README" ).to( "/user/guest/example/README" ).now()

// Download the README from HDFS.
text = Hdfs.get( session ).from( "/user/guest/example/README" ).now().string
println text

// List the contents of the directory.
text = Hdfs.ls( session ).dir( "/user/guest/example" ).now().string
json = (new JsonSlurper()).parseText( text )
println json.FileStatuses.FileStatus.pathSuffix

// Clean up the directory.
Hdfs.rm( session ).file( "/user/guest/example" ).recursive().now()

// Close the session.
session.shutdown()

WebHDFS via cURL

Users can use cURL to directly invoke the REST APIs via the gateway.

Optionally clean up the sample directory in case a previous example was run without cleaning up.
curl -i -k -u guest:guest-password -X DELETE \
    'https://localhost:8443/gateway/sandbox/webhdfs/v1/user/guest/example?op=DELETE&recursive=true'
Register the name for a sample file README in /user/guest/example. Note the Location header in the response; it is used in the next step.
curl -i -k -u guest:guest-password -X PUT \
    'https://localhost:8443/gateway/sandbox/webhdfs/v1/user/guest/example/README?op=CREATE'
Upload README to /user/guest/example. Use the README in {GATEWAY_HOME}.
curl -i -k -u guest:guest-password -T README -X PUT \
    '{Value of Location header from command above}'
List the contents of the directory /user/guest/example.
curl -i -k -u guest:guest-password -X GET \
    'https://localhost:8443/gateway/sandbox/webhdfs/v1/user/guest/example?op=LISTSTATUS'
Request the content of the README file in /user/guest/example.
curl -i -k -u guest:guest-password -X GET \
    'https://localhost:8443/gateway/sandbox/webhdfs/v1/user/guest/example/README?op=OPEN'
Read the content of the file.
curl -i -k -u guest:guest-password -X GET \
    '{Value of Location header from command above}'
Optionally clean up the example directory.
curl -i -k -u guest:guest-password -X DELETE \
    'https://localhost:8443/gateway/sandbox/webhdfs/v1/user/guest/example?op=DELETE&recursive=true'
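As an additional sketch, the example directory could also be created explicitly with the MKDIRS operation (described in the client DSL reference below); the permission query parameter is optional:
curl -i -k -u guest:guest-password -X PUT \
    'https://localhost:8443/gateway/sandbox/webhdfs/v1/user/guest/example?op=MKDIRS&permission=777'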

WebHDFS client DSL
get() - Get a file from HDFS (OPEN).
  • Request
    • from( String name ) - The full name of the file in HDFS.
    • file( String name ) - The name of a local file to create with the content. If this isn't specified the file content must be read from the response.
  • Response
    • BasicResponse
    • If file parameter specified content will be streamed to file.
  • Example
    • Hdfs.get( session ).from( "/user/guest/example/README" ).now().string
ls() - Query the contents of a directory (LISTSTATUS)
  • Request
    • dir( String name ) - The full name of the directory in HDFS.
  • Response
    • BasicResponse
  • Example
    • Hdfs.ls( session ).dir( "/user/guest/example" ).now().string
mkdir() - Create a directory in HDFS (MKDIRS)
  • Request
    • dir( String name ) - The full name of the directory to create in HDFS.
    • perm( String perm ) - The permissions for the directory (e.g. 644). Optional: default="777"
  • Response
    • EmptyResponse - Implicit close().
  • Example
    • Hdfs.mkdir( session ).dir( "/user/guest/example" ).now()
put() - Write a file into HDFS (CREATE)
  • Request
    • text( String text ) - Text to upload to HDFS. Takes precedence over file if both present.
    • file( String name ) - The name of a local file to upload to HDFS.
    • to( String name ) - The fully qualified name to create in HDFS.
  • Response
    • EmptyResponse - Implicit close().
  • Example
    • Hdfs.put( session ).file( "README" ).to( "/user/guest/example/README" ).now()
rm() - Delete a file or directory (DELETE)
  • Request
    • file( String name ) - The fully qualified file or directory name in HDFS.
    • recursive( Boolean recursive ) - Delete directory and all of its contents if True. Optional: default=False
  • Response
    • BasicResponse - Implicit close().
  • Example
    • Hdfs.rm( session ).file( "/user/guest/example" ).recursive().now()
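
As a further sketch built from the methods documented above, put() can also upload literal text without a local file (the path and text here are illustrative):

Hdfs.put( session ).text( "Hello WebHDFS" ).to( "/user/guest/example/hello.txt" ).now()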

WebHDFS HA

Knox provides basic failover functionality for REST API calls made to WebHDFS when HDFS HA has been configured and enabled.

To enable HA functionality for WebHDFS in Knox, the following configuration must be added to the topology file.

<provider>
   <role>ha</role>
   <name>HaProvider</name>
   <enabled>true</enabled>
   <param>
       <name>WEBHDFS</name>
       <value>maxFailoverAttempts=3;failoverSleep=1000;enabled=true</value>
   </param>
</provider>

The role and name of the provider above must be as shown. The name in the 'param' section must match the role name of the service being configured for HA, and the value in the 'param' section is the configuration for that particular service in HA mode. In this case the name is 'WEBHDFS'.

The various configuration parameters are described below:

  • maxFailoverAttempts - The maximum number of times a failover will be attempted. The current failover strategy is very simplistic: the next URL in the list of URLs provided for the service is tried, and the one that failed is moved to the bottom of the list. If the list is exhausted and the maximum number of attempts has not been reached, the first URL that failed will be tried again (the list starts again from the original top entry).

  • failoverSleep - The amount of time in milliseconds that the process will wait or sleep before attempting to fail over.

  • enabled - Flag to turn HA on or off for the particular service.

For the service configuration itself, the additional URLs should be added to the list. The URL that is active at the time of configuration should ideally be placed at the top of the list.

<service>
    <role>WEBHDFS</role>
    <url>http://{host1}:50070/webhdfs</url>
    <url>http://{host2}:50070/webhdfs</url>
</service>