How to create a Hive external table for Nutch's HBase webpage schema?

To query an HBase table from Hive, an external table must first be created over it. The following statement does this for the webpage table that Nutch writes to HBase:

CREATE EXTERNAL TABLE webpage_hive (
  key string, baseUrl string, status int,
  prevFetchTime bigint, fetchTime bigint, fetchInterval bigint, retriesSinceFetch int,
  reprUrl string, content string, contentType string, protocolStatus string,
  modifiedTime bigint, prevModifiedTime bigint, batchId string,
  title string, text string, parseStatus int, signature string, prevSignature string,
  score int,
  headers map<string,string>, inlinks map<string,string>, outlinks map<string,string>,
  metadata map<string,string>, markers map<string,string>
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,f:bas,f:st,f:pts#b,f:ts#b,f:fi#b,f:rsf,f:rpr,f:cnt,f:typ,f:prot,f:mod#b,f:pmod#b,f:bid,p:t,p:c,p:st,p:sig,p:psig,s:s,h:,il:,ol:,mtdt:,mk:")
TBLPROPERTIES ("hbase.table.name" = "webpage");
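The mapping string pairs each Hive column, in order, with an HBase column. A reduced sketch may make the forms easier to see; the table name webpage_min is hypothetical and the columns are a subset of the ones above. It shows the row key (:key), a plain family:qualifier column, a column stored in binary form (#b), and a whole column family mapped to a Hive map (family name followed by a bare colon).

```sql
-- Hypothetical reduced table over the same HBase "webpage" table.
CREATE EXTERNAL TABLE webpage_min (
  key string,                  -- :key   -> HBase row key
  baseUrl string,              -- f:bas  -> one column, stored as a string
  fetchTime bigint,            -- f:ts#b -> one column, stored in binary form
  outlinks map<string,string>  -- ol:    -> entire column family as a map
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,f:bas,f:ts#b,ol:")
TBLPROPERTIES ("hbase.table.name" = "webpage");
```

The number of entries in the mapping string has to match the number of Hive columns, and the entries should not be separated by whitespace.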

After executing this statement, the columns are created as follows (Hive lowercases the column names):

baseurl string from deserializer
batchid string from deserializer
content string from deserializer
contenttype string from deserializer
fetchinterval bigint from deserializer
fetchtime bigint from deserializer
headers map<string,string> from deserializer
inlinks map<string,string> from deserializer
key string from deserializer
markers map<string,string> from deserializer
metadata map<string,string> from deserializer
modifiedtime bigint from deserializer
outlinks map<string,string> from deserializer
parsestatus int from deserializer
prevfetchtime bigint from deserializer
prevmodifiedtime bigint from deserializer
prevsignature string from deserializer
protocolstatus string from deserializer
reprurl string from deserializer
retriessincefetch int from deserializer
score int from deserializer
signature string from deserializer
status int from deserializer
text string from deserializer
title string from deserializer

Some example queries:

The following query converts a bigint epoch value to a readable date format:
select baseurl,from_unixtime(fetchtime, "[dd/MM/yyyy:HH:mm:ss Z]") AS ft from webpage_hive order by baseurl desc;
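Note that from_unixtime expects seconds since the epoch. Nutch appears to store its timestamps in milliseconds, so if the dates returned look implausibly far in the future, a variant along these lines may help. This is a sketch under that assumption and has not been checked against every Nutch version; the alias prev_ft and the simplified date pattern are illustrative choices.

```sql
-- Assumes fetchtime and prevfetchtime hold epoch milliseconds.
-- Dividing yields a double, so cast back to bigint before formatting.
SELECT baseurl,
       from_unixtime(CAST(fetchtime / 1000 AS BIGINT), 'dd/MM/yyyy HH:mm:ss') AS ft,
       from_unixtime(CAST(prevfetchtime / 1000 AS BIGINT), 'dd/MM/yyyy HH:mm:ss') AS prev_ft
FROM webpage_hive
ORDER BY baseurl DESC;
```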

The following query explodes outlinks in a lateral view and displays them as key/value pairs:
SELECT baseurl, outl_key,outl_value FROM webpage_hive LATERAL VIEW explode(outlinks) olTable AS outl_key,outl_value;
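Because each exploded map entry becomes its own row, the same lateral view can feed an aggregate. As a rough sketch, the following counts outlinks per page; the alias outlink_count and the limit of 20 are arbitrary choices for illustration.

```sql
-- Count outlinks per page; each map entry contributes one row.
SELECT baseurl, count(*) AS outlink_count
FROM webpage_hive
LATERAL VIEW explode(outlinks) olTable AS outl_key, outl_value
GROUP BY baseurl
ORDER BY outlink_count DESC
LIMIT 20;
```

Pages whose outlinks map is empty produce no rows with a plain LATERAL VIEW; LATERAL VIEW OUTER keeps them, with NULL for the key and value.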

