
How to create a Hive external table for Nutch's HBase webpage schema

To query an HBase table from Hive, an external table must first be created over it.

CREATE EXTERNAL TABLE webpage_hive (
  key string,
  baseUrl string,
  status int,
  prevFetchTime bigint,
  fetchTime bigint,
  fetchInterval bigint,
  retriesSinceFetch int,
  reprUrl string,
  content string,
  contentType string,
  protocolStatus string,
  modifiedTime bigint,
  prevModifiedTime bigint,
  batchId string,
  title string,
  text string,
  parseStatus int,
  signature string,
  prevSignature string,
  score int,
  headers map<string,string>,
  inlinks map<string,string>,
  outlinks map<string,string>,
  metadata map<string,string>,
  markers map<string,string>
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
-- One mapping entry per Hive column, in the same order; "#b" marks binary-encoded values.
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,f:bas,f:st,f:pts#b,f:ts#b,f:fi#b,f:rsf,f:rpr,f:cnt,f:typ,f:prot,f:mod#b,f:pmod#b,f:bid,p:t,p:c,p:st,p:sig,p:psig,s:s,h:,il:,ol:,mtdt:,mk:")
-- hbase.table.name points Hive at the existing HBase table.
TBLPROPERTIES ("hbase.table.name" = "webpage");

After executing this statement, Hive creates the following columns (column names are lowercased):

baseurl            string              from deserializer
batchid            string              from deserializer
content            string              from deserializer
contenttype        string              from deserializer
fetchinterval      bigint              from deserializer
fetchtime          bigint              from deserializer
headers            map<string,string>  from deserializer
inlinks            map<string,string>  from deserializer
key                string              from deserializer
markers            map<string,string>  from deserializer
metadata           map<string,string>  from deserializer
modifiedtime       bigint              from deserializer
outlinks           map<string,string>  from deserializer
parsestatus        int                 from deserializer
prevfetchtime      bigint              from deserializer
prevmodifiedtime   bigint              from deserializer
prevsignature      string              from deserializer
protocolstatus     string              from deserializer
reprurl            string              from deserializer
retriessincefetch  int                 from deserializer
score              int                 from deserializer
signature          string              from deserializer
status             int                 from deserializer
text               string              from deserializer
title              string              from deserializer
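
A listing like the one above can be reproduced with DESCRIBE, and a small LIMIT query is a quick way to confirm that Hive can actually read rows through the storage handler. This is only a minimal check, assuming the table was created under the name webpage_hive as in the statement above.

```sql
-- Show the columns Hive registered for the external table.
DESCRIBE webpage_hive;

-- Read a handful of rows to confirm the HBase mapping works.
SELECT key, baseurl, title
FROM webpage_hive
LIMIT 5;
```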

Some example queries:

The following query converts the bigint epoch value to a readable date format:
select baseurl, from_unixtime(fetchtime, "[dd/MM/yyyy:HH:mm:ss Z]") AS ft from webpage_hive order by baseurl desc;
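
One thing to watch for: Hive's from_unixtime expects seconds since the epoch, while Nutch 2.x generally records fetch times in milliseconds. If the dates returned by the query above look implausible (for example, far in the future), a variant along these lines may help. The division by 1000 is an assumption about how the values were written, so it is worth inspecting a few raw fetchtime values first.

```sql
-- Assumption: fetchtime holds epoch milliseconds, so convert to seconds first.
SELECT baseurl,
       from_unixtime(CAST(fetchtime / 1000 AS BIGINT), 'dd/MM/yyyy HH:mm:ss') AS ft
FROM webpage_hive
ORDER BY baseurl DESC;
```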

The following query explodes outlinks in a lateral view and displays them as key/value pairs:
SELECT baseurl, outl_key, outl_value FROM webpage_hive LATERAL VIEW explode(outlinks) olTable AS outl_key, outl_value;
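
Map columns can also be queried without exploding them. The sketch below uses Hive's built-in size() function and the map subscript syntax. The 'Content-Type' header key is a hypothetical example and may not match the keys actually stored in the h: family of your crawl.

```sql
-- Number of outlinks per page, largest first.
SELECT baseurl, size(outlinks) AS outlink_count
FROM webpage_hive
ORDER BY outlink_count DESC
LIMIT 20;

-- Look up a single entry of a map column by key.
-- 'Content-Type' is an assumed key name; adjust it to the headers present in your data.
SELECT baseurl, headers['Content-Type'] AS content_type
FROM webpage_hive
LIMIT 20;
```

A missing key yields NULL rather than an error, so the second query is safe to run even if only some pages carry that header.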

