
Re: Converting HTML to DocBook

>>>>> "j" == juro  <xvudpapc@savba.sk> writes:

    j> Hello, I'd like to ask you if there's an easy way (a convertor)
    j> to convert a html file to sgml.  If yes, where can I get it.

If you really mean SGML, the answer is trivial: HTML _is_ an SGML
application, so the identity transformation will do the trick ;) I
suspect, however, that you meant to ask whether we can translate HTML
into even a crude form of DocBook DTD SGML.

I hope to be corrected, but the answer is apparently 'no'.  Undaunted
by the lack of an answer on many mailing lists, I attempted to create
a converter as a one-evening project; my experience was enough to
demonstrate that the problem is non-trivial in the extreme.

For example, even if the HTML is _very_ well-behaved, we have to
regularly interpret constructs such as

    <H1>the first section</H1>
    <P>this is the first section</p>
    <H2>and a subsection</h2>
    <p>with some text</p>
    <H1>the next section</h1>

and turn them into

    <sect1><title>the first section</title>
    <para>this is the first section</para>
    <sect2><title>and a subsection</title>
    <para>with some text</para>
    </sect2>
    </sect1>
    <sect1> ...

There is no equivalent to the sectN tag in HTML: a heading marks where
a section starts but nothing marks where it ends, so the container has
to be inferred.  The fundamental differences only begin with this most
elemental element.  Consider the logic of parsing the <A> tag, which
must be different for NAME and HREF types, or the nightmare of tables
within tables.  Even the <HEAD> does not really map to <ARTHEADER>, and
HTML is full of non-containers which carry their information inside
attributes rather than within container tags (whew).
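
To make the heading problem concrete, here is a minimal sketch in
Python (not part of the stylesheet below, and not DSSSL) of inferring
sectN containers from flat headings.  The class and function names are
made up for illustration, and it assumes very well-behaved input:
headings never skip a level, only H1 to H5 and P appear, and there is
no inline markup to preserve.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape

HEADINGS = ("h1", "h2", "h3", "h4", "h5")  # DocBook stops at sect5


class SectionNester(HTMLParser):
    """Hypothetical helper: wrap flat Hn/P markup in nested sectN."""

    def __init__(self):
        super().__init__()
        self.out = []          # output fragments, joined at the end
        self.open_levels = []  # stack of section depths still open

    def _close_to(self, level):
        # A new heading ends every open section at its depth or deeper.
        while self.open_levels and self.open_levels[-1] >= level:
            self.out.append("</sect%d>" % self.open_levels.pop())

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands us tag names already lowercased.
        if tag in HEADINGS:
            level = int(tag[1])
            self._close_to(level)
            self.open_levels.append(level)
            self.out.append("<sect%d><title>" % level)
        elif tag == "p":
            self.out.append("<para>")

    def handle_endtag(self, tag):
        if tag in HEADINGS:
            self.out.append("</title>")
        elif tag == "p":
            self.out.append("</para>")

    def handle_data(self, data):
        # Skip the whitespace between tags; escape real text.
        if data.strip():
            self.out.append(escape(data.strip()))

    def result(self):
        self._close_to(1)  # close whatever is still open at end of input
        return "".join(self.out)


def html_to_sections(html):
    parser = SectionNester()
    parser.feed(html)
    parser.close()
    return parser.result()
```

Feeding it the five-line HTML fragment above yields the nested form
with every sect2 and sect1 explicitly closed.  The stack is the whole
trick: the end of a section is only known when the next heading of the
same or a higher level arrives, or when the input runs out.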

Ok, my attempt helped me a little, so I am including my stylesheet
with this message.  Be forewarned that it is really sloppy because I
was just toying around with it; it is a composite of several postings
on the DSSSL list spliced together with no concern for consistent case
or aesthetic style.  I estimate it saved me maybe 20% of the total
time to translate the kerneld HOWTO, and given that small gain, I
really wonder if it was worth it.  Still, nothing is ever a complete
failure: it can always be used as a bad example.

   HTML to DocBook transformation by Gary Lawrence Murphy
   with all the ideas from many other people.  It doesn't work
   but it does save some of the grunt work in moving a well-behaved
   HTML file to a format that is DocBook-like.  Don't expect

   This stylesheet was simply spliced together from comments and
   musings by various authors on the dssslist at mulberrytech.com
   I cannot rightly claim copyright and only include the license
   below to clarify the free nature of this work.

   The A tag support is broken horribly, as is ULINK, but the UL
   support alone makes it somewhat useful.

 USE: jade -d html2db.dsl HTMLFILE > SGMLFILE

   you may need to remove any DTD lines from the start of your HTML


 This program is free software; you can redistribute it and/or modify
 it under the terms of the GNU General Public License as published by
 the Free Software Foundation; either version 2 of the License, or
 (at your option) any later version.
 This program is distributed in the hope that it will be useful,
 but WITHOUT ANY WARRANTY; without even the implied warranty of
 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 GNU General Public License for more details.
 You should have received a copy of the GNU General Public License
 along with this program; if not, write to the Free Software
 Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA


<!DOCTYPE style-sheet PUBLIC "-//James Clark//DTD DSSSL Style Sheet//EN">

<style-specification id="html2db">

(declare-flow-object-class element
       "UNREGISTERED::James Clark//Flow Object Class::element")

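;; Return the specified attributes of ND as a list of (name value)
;; pairs, the form expected by the attributes: characteristic.
(define (copy-attributes #!optional (nd (current-node)))
  (let loop ((atts (named-node-list-names (attributes nd))))
    (if (null? atts)
        '()                             ; no attributes left
        (let* ((name (car atts))
               (value (attribute-string name nd)))
          (if value
              (cons (list name value)
                    (loop (cdr atts)))
              (loop (cdr atts)))))))    ; skip unspecified attributes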

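;; Default rule: rename the element where a mapping is known,
;; otherwise copy it through unchanged.
(default
  (let* ((old-gi (gi (current-node)))
         (new-gi
          (case old-gi
            (("HTML") "article")
            (("HEAD") "artheader")
            (("BODY") "sect1")
            ;; NOTE: DocBook "screenshot" wraps graphics; "screen" is
            ;; the verbatim-text element.  Kept as originally posted.
            (("PRE") "screenshot")
            (("UL") "itemizedlist")
            (("I")  "emphasis")
            (("STRONG")  "emphasis")
            (("B")  "emphasis")
            (("TT") "command")
            (("P") "para")
            (("MENU") "itemizedlist")
            (else old-gi))))
    (make element
      gi: new-gi
      attributes: (copy-attributes))))

;; HR has no DocBook counterpart; drop it in a rule of its own, since
;; the case above must return a string for gi:.
(element HR
  (empty-sosofo))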

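(element A
  ;; NAME and HREF need different treatment; build a well-formed
  ;; attribute list of (name value) pairs for whichever is present.
  (let ((attr (cond
               ((attribute-string "NAME")
                (list (list "ID" (attribute-string "NAME"))))
               ((attribute-string "HREF")
                (list (list "ULINK" (attribute-string "HREF"))))
               (else '()))))
    ;; Still emits a non-DocBook "A" element, as in the original.
    (make element gi: "A"
          attributes: attr
          (process-children))))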

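;; List items need a para wrapper in DocBook.
(element LI
  (make element gi: "listitem"
        (make element gi: "para"
              (process-children))))

;; Headings start at sect2 because BODY already maps to sect1.  Each
;; rule wraps only the title; the section content is not gathered in.
(element H1
  (make element gi: "sect2"
        (make element gi: "title"
              (process-children))))

(element H2
  (make element gi: "sect3"
        (make element gi: "title"
              (process-children))))

(element H3
  (make element gi: "sect4"
        (make element gi: "title"
              (process-children))))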
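
For comparison with the A rule in the stylesheet, here is a rough
sketch in Python of the NAME/HREF split.  The function name and its
dict-of-attributes interface are hypothetical, chosen only for
illustration.  It assumes a NAME maps to a DocBook anchor (an empty
element in the SGML DTD, hence no end tag) and an HREF maps to a ulink
with a url attribute; it does not check that the NAME value is a legal
SGML ID.

```python
from xml.sax.saxutils import quoteattr


def convert_anchor(attrs, content):
    """Hypothetical helper: map one HTML <A> element to DocBook.

    attrs   -- dict of attribute values keyed by lowercase name
    content -- the already-converted text inside the element
    """
    name = attrs.get("name")
    href = attrs.get("href")
    out = content
    if href is not None:
        # A hyperlink becomes a ulink wrapping the link text.
        out = "<ulink url=%s>%s</ulink>" % (quoteattr(href), out)
    if name is not None:
        # A named target becomes an empty anchor placed before the text.
        out = "<anchor id=%s>%s" % (quoteattr(name), out)
    return out
```

An element carrying both attributes gets the anchor followed by the
ulink, which is one reason a single element-to-element rename cannot
handle this tag.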


Gary Lawrence Murphy <garym@linux.ca>: office voice/fax: 01 519 4222723
TCI - Business Innovations through Open Source : http://www.teledyn.com
Canadian Co-ordinators for Bynari International : http://ca.bynari.net/
Free Internet for a Free O/S? - http://www.teledyn.com/products/FreeWWW/