WebHCat¶
WebHCat (also called Templeton) is a related but separate service from HiveServer2. As such it is installed and configured independently. The WebHCat wiki pages describe this process.
In the Sandbox the configuration file for WebHCat is located at /etc/hadoop/hcatalog/webhcat-site.xml.
Note the properties shown below as they relate to configuration required by the gateway.
<property>
<name>templeton.port</name>
<value>50111</value>
</property>
Also important is the configuration of the JOBTRACKER RPC endpoint. For Hadoop 2 this can be found in the yarn-site.xml file. In the Sandbox this file can be found at /etc/hadoop/conf/yarn-site.xml.
The property yarn.resourcemanager.address within that file is relevant for the gateway's configuration.
<property>
<name>yarn.resourcemanager.address</name>
<value>sandbox.hortonworks.com:8050</value>
</property>
See #[WebHDFS] for details about locating the Hadoop configuration for the NAMENODE endpoint.
The gateway by default includes a sample topology descriptor file {GATEWAY_HOME}/deployments/sandbox.xml. The values in this sample are configured to work with an installed Sandbox VM.
<service>
<role>NAMENODE</role>
<url>hdfs://localhost:8020</url>
</service>
<service>
<role>JOBTRACKER</role>
<url>rpc://localhost:8050</url>
</service>
<service>
<role>WEBHCAT</role>
<url>http://localhost:50111/templeton</url>
</service>
The URLs provided for the NAMENODE and JOBTRACKER roles do not result in endpoints being exposed by the gateway. This information is only required so that other URLs can be rewritten that reference the appropriate RPC address for Hadoop services. This prevents clients from needing to be aware of the internal cluster details. Note that for Hadoop 2 the JOBTRACKER RPC endpoint is provided by the Resource Manager component.
By default the gateway is configured to use the HTTP endpoint for WebHCat in the Sandbox. This could alternatively be configured to use the HTTPS endpoint by providing the correct address.
WebHCat URL Mapping¶
For WebHCat URLs, the mapping of Knox Gateway accessible URLs to direct WebHCat URLs is simple.
| ------- | ------------------------------------------------------------------------------ |
| Gateway | https://{gateway-host}:{gateway-port}/{gateway-path}/{cluster-name}/templeton |
| Cluster | http://{webhcat-host}:{webhcat-port}/templeton |
WebHCat via cURL¶
Users can use cURL to directly invoke the REST APIs via the gateway. For the full list of available REST calls look at the WebHCat documentation. This is a simple curl command to test the connection:
curl -i -k -u guest:guest-password 'https://localhost:8443/gateway/sandbox/templeton/v1/status'
WebHCat Example¶
This example will submit the familiar WordCount Java MapReduce job to the Hadoop cluster via the gateway using the KnoxShell DSL. There are several ways to do this depending upon your preference.
You can use the "embedded" Groovy interpreter provided with the distribution.
java -jar bin/shell.jar samples/ExampleWebHCatJob.groovy
You can manually type the KnoxShell DSL script into the "embedded" Groovy interpreter provided with the distribution.
java -jar bin/shell.jar
Each line from the file samples/ExampleWebHCatJob.groovy would then need to be typed or copied into the interactive shell.
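For orientation, the overall flow of such a script is: open a session against the gateway, stage the job's JAR and input in HDFS, submit the job, and then poll for completion (see queryStatus() below). The following is a minimal sketch of that flow, not the sample script itself; the package names (org.apache.hadoop.gateway.shell.*, which newer Knox releases rename to org.apache.knox.gateway.shell.*) and the HDFS paths are assumptions.
```groovy
import org.apache.hadoop.gateway.shell.Hadoop
import org.apache.hadoop.gateway.shell.hdfs.Hdfs
import org.apache.hadoop.gateway.shell.job.Job

// Open a session against the gateway using the Sandbox defaults.
gateway = "https://localhost:8443/gateway/sandbox"
session = Hadoop.login( gateway, "guest", "guest-password" )

// Stage the example JAR and some input data in HDFS via the gateway.
// The local and remote paths below are illustrative placeholders.
root = "/user/guest/example"
Hdfs.put( session ).file( "samples/hadoop-examples.jar" ).to( root + "/hadoop-examples.jar" ).now()
Hdfs.put( session ).file( "README" ).to( root + "/input/README" ).now()

// Submit the WordCount job and capture the WebHCat job ID.
jobId = Job.submitJava( session ).
    jar( root + "/hadoop-examples.jar" ).
    app( "wordcount" ).
    input( root + "/input" ).
    output( root + "/output" ).
    now().jobId
println "Submitted job " + jobId

session.shutdown()
```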
WebHCat Client DSL¶
submitJava() - Submit a Java MapReduce job.¶
- Request
- jar (String) - The remote file name of the JAR containing the app to execute.
- app (String) - The app name to execute, for example wordcount, not the class name.
- input (String) - The remote directory name to use as input for the job.
- output (String) - The remote directory name to store output from the job.
- Response
- jobId : String - The job ID of the submitted job. Consumes body.
- Example
Job.submitJava(session)
    .jar(remoteJarName)
    .app(appName)
    .input(remoteInputDir)
    .output(remoteOutputDir)
    .now()
    .jobId
submitPig() - Submit a Pig job.¶
- Request
- file (String) - The remote file name of the pig script.
- arg (String) - An argument to pass to the script.
- statusDir (String) - The remote directory to store status output.
- Response
- jobId : String - The job ID of the submitted job. Consumes body.
- Example
Job.submitPig(session).file(remotePigFileName).arg("-v").statusDir(remoteStatusDir).now()
submitHive() - Submit a Hive job.¶
- Request
- file (String) - The remote file name of the hive script.
- arg (String) - An argument to pass to the script.
- statusDir (String) - The remote directory to store status output.
- Response
- jobId : String - The job ID of the submitted job. Consumes body.
- Example
Job.submitHive(session).file(remoteHiveFileName).arg("-v").statusDir(remoteStatusDir).now()
submitSqoop Job API¶
Using the Knox DSL, you can now easily submit and monitor Apache Sqoop jobs. The WebHCat Job class now supports the submitSqoop command.
Job.submitSqoop(session)
.command("import --connect jdbc:mysql://hostname:3306/dbname ... ")
.statusDir(remoteStatusDir)
.now().jobId
The submitSqoop command supports the following arguments:
- command (String) - The sqoop command string to execute.
- files (String) - Comma-separated list of files to be copied to the Templeton controller job.
- optionsfile (String) - The remote file containing the Sqoop command to run.
- libdir (String) - The remote directory containing the JDBC JAR to include with the Sqoop lib.
- statusDir (String) - The remote directory to store status output.
A complete example is available here: https://cwiki.apache.org/confluence/display/KNOX/2016/11/08/Running+SQOOP+job+via+KNOX+Shell+DSL
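As a variation on the example above, the command can also be supplied through a remote options file together with a lib directory holding the JDBC driver. The snippet below is an illustrative sketch built only from the arguments listed above; the HDFS paths are hypothetical placeholders.
```groovy
// Illustrative only: submit a Sqoop job whose command is read from a remote
// options file, with a JDBC driver JAR supplied from a remote lib directory.
// The /user/guest/... paths are hypothetical placeholders.
jobId = Job.submitSqoop( session ).
    optionsfile( "/user/guest/sqoop/import.options" ).
    libdir( "/user/guest/sqoop/lib" ).
    statusDir( remoteStatusDir ).
    now().jobId
```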
queryQueue() - Return a list of all job IDs registered to the user.¶
- Request
- No request parameters.
- Response
- BasicResponse
- Example
Job.queryQueue(session).now().string
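The .string accessor returns the raw JSON from WebHCat. One way to work with it is to parse it with Groovy's JsonSlurper, as in this small sketch; the response is assumed here to be a simple JSON array of job IDs, which may differ across WebHCat versions.
```groovy
import groovy.json.JsonSlurper

// Fetch the raw JSON listing of the caller's jobs and parse it.
// A flat array of job IDs is assumed purely for illustration.
text = Job.queryQueue( session ).now().string
ids = new JsonSlurper().parseText( text )
ids.each { println it }
```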
queryStatus() - Check the status of a job and get related job information given its job ID.¶
- Request
- jobId (String) - The job ID to check. This is the ID received when the job was created.
- Response
- BasicResponse
- Example
Job.queryStatus(session).jobId(jobId).now().string
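In practice queryStatus() is often called in a loop after a job is submitted, waiting until WebHCat reports completion. A minimal sketch follows, assuming the response JSON exposes a status.jobComplete flag as it does in the Knox sample scripts.
```groovy
import groovy.json.JsonSlurper

// Poll the submitted job until WebHCat reports it complete (or we give up).
done = false
count = 0
while( !done && count++ < 60 ) {
  sleep( 1000 )
  json = Job.queryStatus( session ).jobId( jobId ).now().string
  done = new JsonSlurper().parseText( json ).status.jobComplete
}
println "Job complete: " + done
```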
WebHCat HA¶
Please look at #[Default Service HA support].