Most Linux users know how easily they can run a Web server on their favorite distros. Unfortunately, serving pages is one thing - finding them is another.

Ht://Dig is more than a simple search script for a Web site. It combines a powerful collection of command-line search utilities with an easy-to-use CGI script. Properly configured, they work together to form a robust, extensible search engine for a domain or intranet. Like Google, ht://Dig can search PDF, PostScript, Microsoft Word, Microsoft Excel, and Microsoft PowerPoint files, in addition to the expected plain text and HTML files. Unlike some search utilities, it maintains its database in plain text files, keeping software dependencies low.

Ht://Dig is available as a set of stable binary packages for all the major distros. Most split the program into two packages: htdig, which contains the command-line utilities, and htdig-web, which contains the CGI script. Download and install both from your favorite repository, or get the binaries and source code from the project’s site. As of this writing, the most recent production version is 3.1.6.

Out of the box, ht://Dig is limited to searching plain text and HTML files. Fortunately, a number of conversion utilities can expand its reach. This tutorial includes instructions for indexing PostScript, PDF, Microsoft Word, Microsoft PowerPoint, and Microsoft Excel files. But first, you need to install the following additional packages:

These conversion utilities act as plug-ins for ht://Dig by converting foreign file types to plain text. All four are available as binary packages for the major distros. Install them from your CDs or favorite repository. (xlHtml is slightly more difficult to find, but good binaries are out there.)

Finally, you need doc2html. This collection of Perl scripts serves as a go-between for ht://Dig and the other utilities. Download it from ht://Dig’s parsers directory. I recommend you unpack the archive to /opt/local/htdig/scripts/. That way, you won’t need to change as many paths when we configure the script.

Open the main script in the doc2html collection in your favorite text editor and scroll down until you see these lines:

    # (comment out those you don't have) #

Below them is a list of variables that call the conversion utilities. Before you can use these utilities, you must specify the full path to where each one is installed in the appropriate variable.

Let’s do the conversion utility for Microsoft Word documents first.

    #version of catdoc for Word6, Word7 & Word97 files:

Go to the second line, and insert the path for your installation of catdoc between the quotation marks. For my installation, the modified line looks like this:

Now activate the rest of the utilities in the same fashion.

For PostScript: my $CATPS = '/usr/bin/ps2ascii'
For PDF: my $PDF2HTML = '/opt/local/htdig/scripts/'
For Microsoft PowerPoint: my $PPT2HTML = '/usr/bin/ppthtml'
For Microsoft Excel: my $XLS2HTML = '/usr/bin/xlhtml'

Double-check the path for each application on your system and correct it if necessary.

The configuration file includes variables for WordPerfect, Flash, Shockwave, and rich text file types as well, but you do not have to install and configure every utility available. Lines that point to missing conversion utilities should be left alone. Pay special attention to this, because the instructions included in the file are incorrect. This line - # (comment out those you don't have) # - is wrong. Do not comment out the variables that point to utilities you don’t have. When you’re finished, save and close the file.

You need to configure one more script from doc2html. PDF files require a little extra help, so open that script in your text editor. Scroll down until you see my $PDFTOTEXT and my $PDFINFO. Change the value in quotations for each of these to the full path for pdftotext and pdfinfo, respectively. On my system, the modified lines look like this:
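
If you are not sure where a conversion utility lives, the shell can tell you before you edit any variables. What follows is only a minimal sketch: it assumes a POSIX shell and that the utilities are installed under the command names used in this tutorial (catdoc, ps2ascii, ppthtml, xlhtml, pdftotext, pdfinfo). The function name locate_utils is a hypothetical helper of mine, not part of ht://Dig or doc2html.

```shell
#!/bin/sh
# Sketch: print the full path of each utility, or "not found" if it is
# not on the PATH. locate_utils is a hypothetical helper name.
locate_utils() {
    for util in "$@"; do
        if util_path=$(command -v "$util" 2>/dev/null); then
            printf '%s: %s\n' "$util" "$util_path"
        else
            printf '%s: not found\n' "$util"
        fi
    done
}

# Command names assumed from the configuration lines in this tutorial.
locate_utils catdoc ps2ascii ppthtml xlhtml pdftotext pdfinfo
```

Copy each reported path into the matching variable. A utility reported as not found is one you have not installed, so leave its variable untouched.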