{"id":304,"date":"2018-05-29T13:42:36","date_gmt":"2018-05-29T18:42:36","guid":{"rendered":"https:\/\/cdrhdev.unl.edu\/log\/?p=304"},"modified":"2025-10-06T14:39:36","modified_gmt":"2025-10-06T19:39:36","slug":"experimenting-with-solr-cell","status":"publish","type":"post","link":"https:\/\/cdrhdev.unl.edu\/log\/2018\/experimenting-with-solr-cell\/","title":{"rendered":"Experimenting with Solr Cell"},"content":{"rendered":"<h2>The Death of the Old Site Search<\/h2>\n<p>When it was announced that Google Site Search was <a href=\"https:\/\/enterprise.google.com\/search\/products\/gss.html\">shutting down<\/a>, we were faced with the unfortunate reality that several of our older sites were no longer going to have a commercial free site search unless we worked on an alternative. Since we have used <a href=\"http:\/\/lucene.apache.org\/solr\/\">Solr<\/a> for many of our sites, it seemed logical to look into using it to power a search, but we were left with the problem of how to get entire websites INTO Solr.<\/p>\n<p>Though there are more powerful crawlers out there, I was lured by the built-in capabilities of <a href=\"https:\/\/lucene.apache.org\/solr\/guide\/6_6\/uploading-data-with-solr-cell-using-apache-tika.html\">Solr Cell<\/a>, which ships with the Solr install.\u00a0 While working on indexing our sites, the dev team learned an awful lot of lessons about the parameter types and how to set up our Solr index in order to use the scraped website documents meaningfully.\u00a0 We used Solr\u00a05.5.5 for the purposes of this experiment.<\/p>\n<h2>Getting Set Up<\/h2>\n<p>The first step is to get Solr installed.\u00a0 Once you&#8217;ve got Solr, set up a core named &#8220;site_search&#8221; with the following (location of your install may be different):<\/p>\n<pre class=\"lang:sh decode:true\" title=\"Create Solr Core\">sudo su solr\ncd \/opt\/solr\/bin\n.\/solr create_core -c site_search<\/pre>\n<p>Then we needed to customize the schema.\u00a0 We wanted fields collected by the crawler to be text fields so that they are searchable, and we copied a few &#8220;string&#8221; fields into text fields.<\/p>\n<pre class=\"lang:sh decode:true\" title=\"Configure Schema\">curl -X POST -H 'Content-type:application\/json' --data-binary '{\n  \"add-field\":{\n     \"name\":\"text\",\n     \"type\":\"text_general\",\n     \"stored\":true,\n     \"indexed\":true\n  },\n  \"add-field\":{\n    \"name\":\"body\",\n    \"type\":\"text_general\",\n    \"stored\":false,\n    \"indexed\":true\n  },\n  \"add-copy-field\":{\n    \"source\":\"body\",\n    \"dest\":\"text\"\n  },\n  \"add-field\":{\n     \"name\":\"title\",\n     \"type\":\"text_general\",\n     \"stored\":true,\n     \"indexed\":true\n  },\n  \"add-copy-field\":{\n    \"source\":\"title\",\n    \"dest\":\"text\"\n  },\n  \"add-field\":{\n    \"name\":\"url_text\",\n    \"type\":\"text_general\",\n    \"stored\":true,\n    \"indexed\":true\n  },\n  \"add-copy-field\":{\n    \"source\":\"url\",\n    \"dest\":\"url_text\"\n  }\n}' http:\/\/localhost:8983\/solr\/site_search\/schema<\/pre>\n<h2>Crawling the Site and Some Neat Tricks<\/h2>\n<p>So this is where things get interesting.\u00a0 The documentation from running\u00a0<span class=\"lang:default decode:true crayon-inline\">.\/post -h<\/span>\u00a0is pretty decent except for the part where it doesn&#8217;t explain what parameters you can pass in.\u00a0 For those, you have to consult the <a href=\"https:\/\/lucene.apache.org\/solr\/guide\/6_6\/uploading-data-with-solr-cell-using-apache-tika.html\">Solr 
<h2>Crawling the Site and Some Neat Tricks</h2>
<p>This is where things get interesting. The help text from running <code>./post -h</code> is pretty decent, except that it doesn't explain what parameters you can pass in via <code>-params</code>. For those, you have to consult the <a href="https://lucene.apache.org/solr/guide/6_6/uploading-data-with-solr-cell-using-apache-tika.html">Solr Cell</a> docs. Here's a pretty basic version of the command that would get somebody off to a good start:</p>
<pre class="lang:default decode:true" title="Basic Crawl">./post -c site_search https://cather.unl.edu/ \
    -recursive 2 \
    -delay 1 \
    -params "capture=body&amp;captureAttr=true"</pre>
<p>Above, you're doing the following:</p>
<ul>
<li>posting to the site_search core</li>
<li>with the results of crawling cather.unl.edu</li>
<li>two layers deep</li>
<li>at one page per second</li>
<li>grabbing the &lt;body&gt; tag's contents and "capturing attributes"</li>
</ul>
<p>When we first ran the indexing script, we ran it without the <code>captureAttr</code> parameter. We wanted to grab the human-readable text contents of the <code>&lt;body&gt;</code> tag, not all of the class names and element ids and falderal. It took us a long time to figure out that <code>captureAttr</code> was, in fact, the parameter for us. The docs for <code>captureAttr</code> say that attributes will be indexed into separate fields, but they do not mention that the attributes will NOT be indexed into the main capture field! What a discovery! It was much easier than trying to coax the <code>xpath</code> parameter into selecting only element text and not attributes (a plan I was never able to get working).</p>
<p>For one project I tested, we needed to post to Solr at a custom URL instead of localhost:8983, which Solr Cell attempts by default. This is the URL that I managed to get working by comparing it to the default request:</p>
<pre class="lang:default decode:true" title="Override URL">./post -url https://server_name.unl.edu/solr/site_search/update</pre>
<h2>How It Turned Out</h2>
<p>The great news is that it totally worked, basically out of the box, to create a rudimentary site search. We were able to use our custom <code>url_text</code> field to create facet queries for sections of the site. The bad news is that we wound up with something like 40,000 pages of The Whitman Archive indexed. 40K? Evidently, those Whitman scholars have been busier than we thought! After some investigation, we discovered that relative links were causing the script to index the same page multiple times:</p>
<pre class="lang:js decode:true">"docs": [
  { "id": "https://whitmanarchive.org/about/../about/editorial.html" },
  { "id": "https://whitmanarchive.org/about/../manuscripts/transcriptions/../../about/editorial.html" },
  { "id": "https://whitmanarchive.org/about/../manuscripts/notebooks/../../about/editorial.html" },
  { "id": "https://whitmanarchive.org/about/../manuscripts/transcriptions/../../manuscripts/notebooks/../../about/editorial.html" },
  { "id": "https://whitmanarchive.org/about/follow.html" }
]</pre>
<p>That issue was the largest problem we encountered with Solr Cell. As a crawler, it was not as intelligent about identifying unique pages and links as we had hoped. In terms of the aspects of Solr we are used to, though, Solr Cell did a fine job of pulling in content for highlighting and filtering (with the modifications to the schema, that is).</p>
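<p>One possible after-the-fact cleanup, assuming the duplicate ids all contain a literal <code>/../</code> segment like those above, is a delete-by-query against the <code>id</code> field. This is an untested sketch; the backslashes escape the slashes for the Lucene query parser (and are doubled for JSON):</p>
<pre class="lang:sh decode:true" title="Remove Duplicate Docs (sketch)"># delete every document whose id contains a "/../" segment
curl -X POST -H 'Content-type:application/json' \
    'http://localhost:8983/solr/site_search/update?commit=true' \
    --data-binary '{"delete":{"query":"id:*\\/..\\/*"}}'</pre>
<p>Note that this keeps only the ids without <code>..</code> in the path, so it assumes every affected page was also indexed under a clean, normalized URL.</p>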
<p>Though it appears that we will likely not be moving forward with Solr Cell as a replacement for Google Site Search, we learned a lot through this experience! Hopefully this blog post can help somebody else with the same <code>captureAttr</code> issue that we had.</p>
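<p>For easy reference, here is the whole indexing sequence from this post in one place, using the same core name and example site as above:</p>
<pre class="lang:sh decode:true" title="Full Sequence">sudo su solr
cd /opt/solr/bin

# create the core
./solr create_core -c site_search

# add the schema fields with the curl command shown earlier, then
# crawl and index the site
./post -c site_search https://cather.unl.edu/ \
    -recursive 2 \
    -delay 1 \
    -params "capture=body&amp;captureAttr=true"</pre>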