org.apache.lucene.ant
public class HtmlDocument extends Object
The HtmlDocument class creates a Lucene {@link org.apache.lucene.document.Document} from an HTML document. It does this by using the JTidy package. It can take input from a {@link java.io.File} or a {@link java.io.InputStream}.
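To make the title/body extraction idea concrete, here is a minimal sketch that pulls the same two values out of an HTML string. It is not the HtmlDocument implementation: it assumes the JDK's built-in Swing HTML parser as a stand-in for JTidy, it does not build a Lucene Document, and the class name `HtmlTextSketch` is hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical stand-in for illustration only; uses the JDK parser, not JTidy.
public class HtmlTextSketch {
    private final StringBuilder title = new StringBuilder();
    private final StringBuilder body = new StringBuilder();

    public HtmlTextSketch(Reader html) throws IOException {
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private boolean inTitle;
            private boolean inBody;

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.TITLE) inTitle = true;
                if (tag == HTML.Tag.BODY) inBody = true;
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.TITLE) inTitle = false;
                if (tag == HTML.Tag.BODY) inBody = false;
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (inTitle) {
                    title.append(data);
                } else if (inBody) {
                    // Separate text chunks from different elements with a space.
                    if (body.length() > 0) body.append(' ');
                    body.append(data);
                }
            }
        };
        // Third argument tells the parser to ignore charset declarations in the markup.
        new ParserDelegator().parse(html, callback, true);
    }

    // Mirrors the role of HtmlDocument.getTitle() in this sketch.
    public String getTitle() {
        return title.toString().trim();
    }

    // Mirrors the role of HtmlDocument.getBody() in this sketch.
    public String getBody() {
        return body.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><head><title>Test page</title></head>"
                + "<body><p>Hello world</p></body></html>";
        HtmlTextSketch doc = new HtmlTextSketch(new StringReader(html));
        System.out.println("title: " + doc.getTitle());
        System.out.println("body: " + doc.getBody());
    }
}
```

With the real class, the equivalent values would presumably come from `getTitle()` and `getBody()` on an HtmlDocument built from a File or InputStream, and the static factory methods would wrap them in a Lucene Document.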
Constructor Summary | |
---|---|
HtmlDocument(File file) | Constructs an HtmlDocument from a {@link java.io.File}. |
HtmlDocument(InputStream is) | Constructs an HtmlDocument from an {@link java.io.InputStream}. |

Method Summary | |
---|---|
static Document | Document(File file): Creates a Lucene Document from a {@link java.io.File}. |
String | getBody(): Gets the bodyText attribute of the HtmlDocument object. |
static Document | getDocument(InputStream is): Creates a Lucene Document from an {@link java.io.InputStream}. |
String | getTitle(): Gets the title attribute of the HtmlDocument object. |
static void | main(String[] args): Runs HtmlDocument on the files specified on the command line. |

HtmlDocument(File file)
Constructs an HtmlDocument from a {@link java.io.File}.
Parameters: file - the File containing the HTML to parse
Throws: IOException - if an I/O exception occurs
HtmlDocument(InputStream is)
Constructs an HtmlDocument from an {@link java.io.InputStream}.
Parameters: is - the InputStream containing the HTML
Throws: IOException - if an I/O exception occurs
Document(File file)
Creates a Lucene Document from a {@link java.io.File}.
Parameters: file
Returns: the Lucene Document
Throws: IOException
getBody()
Gets the bodyText attribute of the HtmlDocument object.
Returns: the bodyText value
getDocument(InputStream is)
Creates a Lucene Document from an {@link java.io.InputStream}.
Parameters: is
Returns: the Lucene Document
Throws: IOException
getTitle()
Gets the title attribute of the HtmlDocument object.
Returns: the title value
main(String[] args)
Runs HtmlDocument on the files specified on the command line.
Parameters: args - command line arguments
Throws: Exception