There are six steps to using HBase as the Gora backend for Nutch.
First, download Nutch 2 from an Apache mirror site and extract it to the directory where you want to install it.
# tar -zxvf apache-nutch-2.X-src.tar.gz
Second, download and install HBase from an Apache mirror site. Gora 0.2 currently supports the HBase 0.90.X branch.
Third, specify the Gora backend in nutch-site.xml. All the usual configuration parameters should also be set in nutch-site.xml before compiling Nutch.
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
<description>Default class for storing data</description>
</property>
Fourth, the gora-hbase dependency must be enabled in ivy/ivy.xml.
<!-- Uncomment this to use HBase as Gora backend. -->
<dependency org="org.apache.gora" name="gora-hbase" rev="0.2" conf="*->default" />
Fifth, specify HBaseStore as the default datastore in the gora.properties file.
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
Sixth, compile Nutch. (Ant has to be available.)
# ant runtime
You should now be able to use Nutch.
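The three configuration edits above can be checked before running the build. The following is only a minimal sketch: the function names are hypothetical, and it assumes that nutch-site.xml and gora.properties live under conf/ in the Nutch source root. It strips XML comments first, so a gora-hbase dependency that is still commented out is reported as missing. For nutch-site.xml it only looks for the HBaseStore value, not for the property name it belongs to.

```shell
#!/bin/sh
# Hypothetical pre-build check for the three HBase backend settings.
# Paths under conf/ and ivy/ are assumed relative to the Nutch source root.

# Print an XML file with <!-- ... --> comments removed, so that a
# dependency that is still commented out does not count as enabled.
strip_xml_comments() {
    awk '{
        line = $0; out = ""
        while (length(line) > 0) {
            if (in_comment) {
                i = index(line, "-->")
                if (i == 0) { line = "" }
                else { line = substr(line, i + 3); in_comment = 0 }
            } else {
                i = index(line, "<!--")
                if (i == 0) { out = out line; line = "" }
                else { out = out substr(line, 1, i - 1); line = substr(line, i + 4); in_comment = 1 }
            }
        }
        print out
    }' "$1"
}

# check_hbase_backend <nutch source root>
# Returns 0 only if all three settings are found.
check_hbase_backend() {
    root=$1
    status=0
    store=org.apache.gora.hbase.store.HBaseStore

    strip_xml_comments "$root/conf/nutch-site.xml" 2>/dev/null | grep -qF "<value>$store</value>" \
        || { echo "nutch-site.xml: storage.data.store.class is not set to HBaseStore"; status=1; }

    strip_xml_comments "$root/ivy/ivy.xml" 2>/dev/null | grep -qF 'name="gora-hbase"' \
        || { echo "ivy.xml: gora-hbase dependency is missing or still commented out"; status=1; }

    grep -qE "^[[:space:]]*gora\.datastore\.default[[:space:]]*=[[:space:]]*$store" "$root/conf/gora.properties" 2>/dev/null \
        || { echo "gora.properties: gora.datastore.default is not HBaseStore"; status=1; }

    return $status
}
```

After loading these functions into your shell, a call such as check_hbase_backend "$NUTCH_HOME" prints one line per missing setting and returns a non-zero status if anything is missing.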
Hi, I followed your steps, but when I run the nutch crawl command I get the exception below.
Exception in thread "main" java.lang.RuntimeException: job failed: name=generate: null, jobid=job_local1224515128_0002
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:199)
at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)
at org.apache.nutch.crawl.Crawler.run(Crawler.java:152)
at org.apache.nutch.crawl.Crawler.run(Crawler.java:250)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawler.main(Crawler.java:257)
I use this command. Please let me know your suggestions to fix this issue.
Thanks,
RP
Hi, can you give the full crawl command you ran?
Second, did you run Nutch in distributed or standalone mode?
When you run Nutch from the $NUTCH_HOME/runtime/local folder, that's standalone mode.
When you run Nutch from $NUTCH_HOME/runtime/deploy, that's distributed mode.
You may have forgotten to specify the urls directory that should be injected on the first run.
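If the urls directory is what's missing, it can be created with a small helper like the one below. This is only a sketch: make_seed_dir is a hypothetical name, and the file name seed.txt is a convention assumed here. The directory just needs to contain a plain text file with one URL per line.

```shell
#!/bin/sh
# make_seed_dir <dir> <url>...
# Writes the given URLs, one per line, to <dir>/seed.txt.
# The file name seed.txt is only a convention assumed here.
make_seed_dir() {
    dir=$1
    shift
    [ "$#" -gt 0 ] || { echo "usage: make_seed_dir <dir> <url>..." >&2; return 1; }
    mkdir -p "$dir" || return 1
    printf '%s\n' "$@" > "$dir/seed.txt"
}
```

For example, from $NUTCH_HOME/runtime/local you could run make_seed_dir urls http://nutch.apache.org/ and then pass urls as the first argument to the crawl command.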
Hi! I have the same error.
I run Nutch in the standalone folder.
Here is my script:
bin/nutch crawl urls -depth 3 -topN 5
Hello! I have the same error, but it happens with fetch. Here's my stack trace:
Exception in thread "main" java.lang.RuntimeException: job failed: name=fetch, jobid=job_local_0007
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
at org.apache.nutch.fetcher.FetcherJob.run(FetcherJob.java:194)
at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)
at org.apache.nutch.crawl.Crawler.run(Crawler.java:161)
at org.apache.nutch.crawl.Crawler.run(Crawler.java:250)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawler.main(Crawler.java:257)
I ran the command bin/nutch crawl urls -depth 20 -topN 10000
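The "job failed" RuntimeException in these stack traces only reports that a job did not complete. The underlying cause is usually written to the Nutch log. The helper below is a sketch for pulling the most recent error lines out of a log file. show_job_errors is a hypothetical name, and the log location logs/hadoop.log under runtime/local is an assumption based on the default logging configuration, so adjust the path if your setup differs.

```shell
#!/bin/sh
# show_job_errors <logfile> [lines]
# Prints the last matching ERROR / Exception / "Caused by" lines
# (with their line numbers) from the given log file.
show_job_errors() {
    log=$1
    n=${2:-20}
    [ -r "$log" ] || { echo "cannot read $log" >&2; return 1; }
    grep -nE 'ERROR|Exception|Caused by' "$log" | tail -n "$n"
}
```

From $NUTCH_HOME/runtime/local, show_job_errors logs/hadoop.log would list the last 20 matching lines. Posting those lines along with the crawl command makes the failure much easier to diagnose.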