Integrating Apache Nutch and HBase Using Gora

There are six steps to using HBase as the Gora backend for Nutch.

First, download Nutch 2 from an Apache mirror site and extract it to the directory where you want to install it.

# tar -zxvf apache-nutch-2.X-src.tar.gz

Second, download and install HBase from an Apache mirror site. Gora 0.2 currently supports the HBase 0.90.X branch.

Third, the Gora backend must be specified in nutch-site.xml. All the usual configuration parameters should also be set in nutch-site.xml before compiling Nutch.

<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
<description>Default class for storing data</description>
</property>

Fourth, the gora-hbase dependency must be enabled in ivy/ivy.xml.

<!-- Uncomment this to use HBase as Gora backend. -->
<dependency org="org.apache.gora" name="gora-hbase" rev="0.2" conf="*->default" />

Fifth, HBaseStore must be specified as the default datastore in the gora.properties file.

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
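Because steps three to five touch three different files, it can help to confirm that they agree before compiling. The following is only a minimal sketch: check_gora_hbase_config is a hypothetical helper, not part of Nutch, and it assumes that nutch-site.xml and gora.properties live under conf/ in the Nutch source tree.

```shell
#!/bin/sh
# Sketch: verify the three HBase-related settings from steps three to five.
# check_gora_hbase_config is a hypothetical helper, not part of Nutch.
# Assumption: nutch-site.xml and gora.properties are under conf/.

HBASE_STORE='org.apache.gora.hbase.store.HBaseStore'

check_gora_hbase_config() {
  nutch_home=$1
  status=0

  # Step three: the store class name must appear in nutch-site.xml.
  if ! grep -Fq "$HBASE_STORE" "$nutch_home/conf/nutch-site.xml" 2>/dev/null; then
    echo "nutch-site.xml: HBaseStore is not configured"
    status=1
  fi

  # Step four: the gora-hbase dependency must sit outside an XML comment.
  if ! awk '
      /<!--/ { in_comment = 1 }
      !in_comment && /name="gora-hbase"/ { found = 1 }
      /-->/ { in_comment = 0 }
      END { exit found ? 0 : 1 }
    ' "$nutch_home/ivy/ivy.xml" 2>/dev/null; then
    echo "ivy.xml: gora-hbase dependency is missing or still commented out"
    status=1
  fi

  # Step five: gora.properties must contain the exact default datastore line.
  if ! grep -Fxq "gora.datastore.default=$HBASE_STORE" \
      "$nutch_home/conf/gora.properties" 2>/dev/null; then
    echo "gora.properties: gora.datastore.default is not HBaseStore"
    status=1
  fi

  return $status
}
```

Calling check_gora_hbase_config "$NUTCH_HOME" prints one line per missing setting and returns a non-zero status if any check fails. The checks are plain text matches, so they do not validate the XML itself.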

Sixth, compile Nutch. (Ant must be installed.)

# ant runtime

You should now be able to use Nutch.
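A first run needs a directory of seed URLs to inject. The sketch below prepares one and starts a small crawl in standalone mode. It rests on several assumptions: make_seed_dir is a hypothetical helper, the file name seed.txt and the seed URL are placeholders, HBase is assumed to be running already, and the crawl command with its -depth and -topN flags is taken from the comments below, so it may not exist in every Nutch 2.X release.

```shell
#!/bin/sh
# Sketch: prepare a seed directory and start a small standalone crawl.
# make_seed_dir is a hypothetical helper, not part of Nutch.

make_seed_dir() {
  seed_dir=$1
  shift
  mkdir -p "$seed_dir" || return 1
  # Write the remaining arguments as one URL per line.
  printf '%s\n' "$@" > "$seed_dir/seed.txt"
}

# runtime/local is the standalone build produced by "ant runtime".
NUTCH_LOCAL="${NUTCH_HOME:-.}/runtime/local"

if [ -x "$NUTCH_LOCAL/bin/nutch" ]; then
  cd "$NUTCH_LOCAL" || exit 1
  # Placeholder seed URL; replace it with the sites you want to crawl.
  make_seed_dir urls "http://nutch.apache.org/"
  # Command and flags as quoted in the comments; HBase must already be running.
  bin/nutch crawl urls -depth 3 -topN 5
else
  echo "bin/nutch not found under $NUTCH_LOCAL; run 'ant runtime' first" >&2
fi
```

Keeping the crawl behind the bin/nutch check means the script only reports a missing build instead of failing midway when Nutch has not been compiled yet.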


Comments

  1. Hi, I followed your steps, but when I run the Nutch crawl command I get the exception below:

    Exception in thread "main" java.lang.RuntimeException: job failed: name=generate: null, jobid=job_local1224515128_0002
    at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
    at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:199)
    at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)
    at org.apache.nutch.crawl.Crawler.run(Crawler.java:152)
    at org.apache.nutch.crawl.Crawler.run(Crawler.java:250)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawler.main(Crawler.java:257)

    I use this command. Please let me know your suggestions for fixing this issue.

    Thanks,
    RP

  2. Hi, can you give the full crawl command you ran?
    Second, did you run Nutch in distributed or standalone mode?
    When you run Nutch from the $NUTCH_HOME/runtime/local folder, that is standalone mode.
    When you run Nutch from $NUTCH_HOME/runtime/deploy, that is distributed mode.
    You may have forgotten to specify the urls directory that should be injected for the first run.

  3. Hi! I have the same error.
    I run Nutch from the standalone folder.
    Here is my command:
    bin/nutch crawl urls -depth 3 -topN 5



  4. Hello! I have the same error, but it happens with fetch. Here's my stacktrace:

    Exception in thread "main" java.lang.RuntimeException: job failed: name=fetch, jobid=job_local_0007
    at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
    at org.apache.nutch.fetcher.FetcherJob.run(FetcherJob.java:194)
    at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)
    at org.apache.nutch.crawl.Crawler.run(Crawler.java:161)
    at org.apache.nutch.crawl.Crawler.run(Crawler.java:250)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawler.main(Crawler.java:257)
    I ran the command bin/nutch crawl urls -depth 20 -topN 10000
