Experimenting with Solr Cell

The Death of the Old Site Search

When it was announced that Google Site Search was shutting down, we were faced with the unfortunate reality that several of our older sites were no longer going to have a commercial-free site search unless we worked on an alternative. Since we have used Solr for many of our sites, it seemed logical to look into using it to power a site search, but we were left with the problem of how to get entire websites INTO Solr.

Though there are more powerful crawlers out there, I was lured by the built-in capabilities of Solr Cell, which ships with the Solr install.  While working on indexing our sites, the dev team learned an awful lot about its parameters and about how to set up our Solr index so that the scraped website documents could be used meaningfully.  We used Solr 5.5.5 for the purposes of this experiment.


Getting Set Up

The first step is to get Solr installed.  Once you’ve got Solr, set up a core named “site_search” with the following (location of your install may be different):

sudo su solr
cd /opt/solr/bin
./solr create_core -c site_search
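
If you want to double-check that the core came up before going further, the core admin API will report its status (this assumes the default localhost:8983; adjust the host and port if yours differ):

curl "http://localhost:8983/solr/admin/cores?action=STATUS&core=site_search"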

Then we needed to customize the schema.  We wanted fields collected by the crawler to be text fields so that they are searchable, and we copied a few “string” fields into text fields.

curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field":{
     "name":"text",
     "type":"text_general",
     "stored":true,
     "indexed":true
  },
  "add-field":{
    "name":"body",
    "type":"text_general",
    "stored":false,
    "indexed":true
  },
  "add-copy-field":{
    "source":"body",
    "dest":"text"
  },
  "add-field":{
     "name":"title",
     "type":"text_general",
     "stored":true,
     "indexed":true
  },
  "add-copy-field":{
    "source":"title",
    "dest":"text"
  },
  "add-field":{
    "name":"url_text",
    "type":"text_general",
    "stored":true,
    "indexed":true
  },
  "add-copy-field":{
    "source":"url",
    "dest":"url_text"
  }
}' http://localhost:8983/solr/site_search/schema
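
Before crawling anything, you can read those changes back from the Schema API to confirm the new fields and copy fields actually landed:

curl http://localhost:8983/solr/site_search/schema/fields
curl http://localhost:8983/solr/site_search/schema/copyfields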

Crawling the Site and Some Neat Tricks

So this is where things get interesting.  The documentation you get from running ./post -h is pretty decent, except that it doesn't explain what parameters you can pass in.  For those, you have to consult the Solr Cell docs.  Here's a pretty basic version of the command that would get somebody off on the right foot:

./post -c site_search https://cather.unl.edu/  \
    -recursive 2 \
    -delay 1 \
    -params "capture=body&captureAttr=true"

Above, you’re doing the following:

  • posting to the site_search core
  • with the results of crawling cather.unl.edu
  • two layers deep
  • one page per second
  • by grabbing the <body> tag’s contents and “capturing attributes”
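
Once a crawl finishes, a quick catch-all query is an easy way to confirm that documents actually made it into the core (rows=0 just returns the count, nothing else):

curl "http://localhost:8983/solr/site_search/select?q=*:*&rows=0&wt=json"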

When we first ran the indexing script, we ran it without the captureAttr parameter.  We wanted to grab the human-readable text contents of the <body> tag, not all of the class names and element IDs and falderal. It took us a long time to figure out that captureAttr was, in fact, the parameter for us.  The docs for captureAttr say that attributes will be indexed into separate fields, but do not mention that the attributes will NOT be indexed into the main capture field!  What a discovery!  Much easier than trying to coax the xpath parameter into selecting only element text and not attributes (a plan I was never able to get working).

For one project I tested, we needed to use a custom URL to post to Solr instead of the localhost:8983 that Solr Cell attempts by default.  This is the URL that I managed to get working by comparing it to the default request:

./post -url https://server_name.unl.edu/solr/site_search/update
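
As far as we could tell, -url simply takes the place of -c (and the default host and port), so the rest of the crawl arguments stay the same; the full crawl command would look roughly like this:

./post -url https://server_name.unl.edu/solr/site_search/update https://cather.unl.edu/ \
    -recursive 2 \
    -delay 1 \
    -params "capture=body&captureAttr=true"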

How It Turned Out

So the great news is that it totally worked, basically out of the box, to create a rudimentary site search.  We were able to use our custom url_text field to create facet queries for sections of the site.  The bad news is that we wound up with something like 40,000 pages of The Whitman Archive indexed.  40K?  Evidently, those Whitman scholars have been busier than we thought!  After some investigation, we discovered that relative links were causing the script to index the same page multiple times:

"docs": [
      { "id": "https://whitmanarchive.org/about/../about/editorial.html" },
      { "id": "https://whitmanarchive.org/about/../manuscripts/transcriptions/../../about/editorial.html" },
      { "id": "https://whitmanarchive.org/about/../manuscripts/notebooks/../../about/editorial.html" },
      { "id": "https://whitmanarchive.org/about/../manuscripts/transcriptions/../../
                 manuscripts/notebooks/../../about/editorial.html" },
      { "id": "https:/whitmanarchive.org/about/follow.html" }
]

That issue was the largest problem we encountered with Solr Cell.  As a crawler, it did not seem as intelligent as we hoped it would be about identifying unique pages and links.  In terms of incorporating aspects of Solr that we are used to, Solr Cell did a fine job of sucking in content for highlighting and filtering (with the modifications to the schema, that is).
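
Just by way of illustration (the search term and filter value below are placeholders, not real site content), a request that exercises the highlighting and filtering pieces would look something like this, highlighting on the copied text field and filtering by path via url_text:

curl "http://localhost:8983/solr/site_search/select?q=text:prairie&fq=url_text:writings&hl=true&hl.fl=text&wt=json"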

Though it appears that we will likely not be moving forward with Solr Cell to replace Google Site Search, we learned a lot through this experience!  Hopefully this blog post can help somebody else with the same captureAttr issue that we had.