Get the source code from an html file

Question

How can I programmatically generate a .cpp/.h file from the following HTML file? Any scripting or programming language is fine, as are editors such as vi or emacs:

<!DOCTYPE html
    PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
<head>
<title>Class</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
</head>
<body link="blue" vlink="purple" bgcolor="#FFFABB" text="black">

<h2><font face="Helvetica">Code Fragment: Class</font></h2>
</center><br><dl><dd><pre>

  <font color=#A000A0>template</font> &lt;<font color=#A000A0>typename</font> G&gt;
  <font color=#A000A0>class</font> Components : <font color=#A000A0>public</font> DFS&lt;G&gt; {            <font color=#0000FF>// count components</font>
  <font color=#A000A0>private</font>:
    <font color=#A000A0>int</font> nComponents;                 <font color=#0000FF>// num of components</font>
  <font color=#A000A0>public</font>:
    <font color=#000000>Components</font>(<font color=#A000A0>const</font> G& g): DFS&lt;G&gt;(g) {}        <font color=#0000FF>// constructor</font>
    <font color=#A000A0>int</font> <font color=#A000A0>operator</font>()();                 <font color=#0000FF>// count components</font>
  };
</dl>

</body>
</html>

It would also be great to know how to go in the other direction, from the C++ source to this kind of HTML. Thanks a lot.

You want a tool to copy the highlighted text in an HTML page?
– Kijewski, Commented Sep 21, 2011 at 22:55
@Keith: not sure why you asked that. I just want to be able to switch between this kind of HTML representation of my C++ code and the plain source, in both directions. I am asking for a programmatic way, or any tools, to do that quickly in batch mode.
– Qiang Li, Commented Sep 21, 2011 at 23:12
@Qiang: Oh, I see what you mean. I didn't see past the HTML tags to notice that the HTML is a representation of C++ code, so I didn't think the idea of translating HTML to C++ made much sense. Never mind.
– Keith Thompson, Commented Sep 22, 2011 at 2:44

jman · Accepted Answer · 2011-09-21 23:08:38Z

8

Does this work for you?

[18:56:44 jaidev@~]$ lynx --dump foo.html
Code Fragment: Class


  template <typename G>
  class Components : public DFS<G> {            // count components
  private:
    int nComponents;                 // num of components
  public:
    Components(const G& g): DFS<G>(g) {}        // constructor
    int operator()();                 // count components
  };
[18:56:49 jaidev@~]$

Edit:

For the reverse direction: if you use vim as your editor, you can enter :TOhtml to generate a syntax-highlighted HTML version of your code in a new buffer. The generated HTML follows your current vim colorscheme. To change the colorscheme, use the :colorscheme <name> command.
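
If vim is not an option, the same idea can be sketched in a few lines of Python using only the standard library. This is a deliberately tiny illustration, not a real C++ highlighter: the keyword list, the colours (copied from the HTML in the question) and the function name highlight are all assumptions, and string literals and /* */ comments are not handled.

```python
import html
import re

# Colours copied from the HTML in the question (an assumption, change freely).
KEYWORD_COLOR = "#A000A0"
COMMENT_COLOR = "#0000FF"
# A small, incomplete keyword list, just enough for the question's snippet.
KEYWORDS = ("template", "typename", "class", "public", "private",
            "const", "int", "operator")

# One pattern scanned left to right: a // comment, or a whole-word keyword.
TOKEN = re.compile(
    r"(?P<comment>//[^\n]*)|(?P<keyword>\b(?:%s)\b)" % "|".join(KEYWORDS)
)


def highlight(source: str) -> str:
    """Return an HTML <pre> fragment with keywords and // comments coloured."""
    out = []
    pos = 0
    for m in TOKEN.finditer(source):
        # Text between tokens is only escaped (&, <, > become entities).
        out.append(html.escape(source[pos:m.start()], quote=False))
        color = COMMENT_COLOR if m.lastgroup == "comment" else KEYWORD_COLOR
        out.append('<font color="%s">%s</font>'
                   % (color, html.escape(m.group(), quote=False)))
        pos = m.end()
    out.append(html.escape(source[pos:], quote=False))
    return "<pre>\n" + "".join(out) + "</pre>"


if __name__ == "__main__":
    print(highlight("template <typename G>\nint nComponents;  // num of components\n"))
```

Because a comment match runs to the end of its line, keywords that appear inside a comment are not coloured a second time.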

edited Sep 21, 2011 at 23:08

answered Sep 21, 2011 at 22:57


5 Comments

Karoly Horvath Over a year ago

@Qiang Li: any PHP, Python or JS syntax highlight plugin will do

jman Over a year ago

Edited answer for the reverse direction.

Qiang Li Over a year ago

@yi_H, can you please be more specific, for example, what is Python's?

Qiang Li Over a year ago

@yi_H: would you mind telling me which formatter you used to get the nice coloring?

Karoly Horvath Over a year ago

*sigh* If you can't figure that out, you should look for another job.

JRL · Accepted Answer · 2011-09-22 01:08:47Z

2

PHP script:

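<?php
// The page in the question is not well-formed, so collect libxml's
// parse warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTMLFile("file.html");
$xpath = new DOMXPath($doc);

// Concatenate every text node below <dl>, i.e. the code inside <pre>.
// The ' ' appended after each node is what causes the extra spaces
// visible in the resulting file.
$str = '';
foreach ($xpath->query("//dl//text()") as $node) {
    $str .= $node->nodeValue . ' ';
}

file_put_contents('file.cpp', $str);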

contents of file.cpp:

   template  < typename  G>
   class  Components :  public  DFS<G> {             // count components 
   private :
     int  nComponents;                  // num of components 
   public :
     Components ( const  G& g): DFS<G>(g) {}         // constructor 
     int   operator ()();                  // count components 
  };

answered Sep 22, 2011 at 1:08


executifs · Accepted Answer · 2011-09-21 23:05:20Z

1

You could use regular expressions to...

...keep only what's in the <body> of the HTML page,
...strip all the HTML tags (everything that matches <[^>]*> should be removed from the file; a greedy <.*> would also swallow the code between the first and the last tag on a line),
...unescape character entities such as &lt;, &gt;, &amp;, etc.

What's left should be the code you're looking for.
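
As a rough illustration of those three steps, here is a minimal Python sketch using only the standard library. The function name html_to_source and the sample string are made up for this example; the regex approach assumes simple generated markup like the page in the question and is not a general HTML parser.

```python
import html
import re


def html_to_source(page: str) -> str:
    """Recover plain text from highlighted HTML using regular expressions."""
    # 1. Keep only what is inside <body>...</body>, if there is one.
    match = re.search(r"<body[^>]*>(.*?)</body>", page,
                      re.DOTALL | re.IGNORECASE)
    body = match.group(1) if match else page
    # 2. Strip the tags. [^>]* stops at the first '>' so the text between
    #    two tags on the same line is preserved.
    text = re.sub(r"<[^>]*>", "", body)
    # 3. Turn entities such as &lt; &gt; &amp; back into characters.
    return html.unescape(text)


sample = (
    "<html><body><pre>"
    "<font color=#A000A0>template</font> "
    "&lt;<font color=#A000A0>typename</font> G&gt;"
    "</pre></body></html>"
)
print(html_to_source(sample))  # prints: template <typename G>
```

Note that everything in the body survives, so for the page in the question the "Code Fragment: Class" heading would still need to be trimmed from the result.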

edited Sep 21, 2011 at 23:05

answered Sep 21, 2011 at 22:59


Matteo Italia · Accepted Answer · 2011-09-21 23:37:32Z

1

Another option for going from HTML to the source code is the html2text utility, which is available on many Linux distributions.

matteo@teomint:~/Desktop$ html2text out.html 
***** Code Fragment: Class *****


        template <typename G>
        class Components : public DFS<G> {            // count components
        private:
          int nComponents;                 // num of components
        public:
          Components(const G& g): DFS<G>(g) {}        // constructor
          int operator()();                 // count components
        };

answered Sep 21, 2011 at 23:37


Lightness Races in Orbit · Accepted Answer · 2011-09-21 22:58:07Z

0

1. Fix the HTML. You're missing some closing tags.
2. Get PHP out:
   - Obtain the pre code block with DOMDocument
   - strip_tags() from the result
3. Profit.
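
The answer suggests PHP; the same "take the pre block, drop the tags" idea can be sketched in Python with the standard library's html.parser, which also tolerates the missing closing tags. The class name PreTextExtractor and the helper extract_pre are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class PreTextExtractor(HTMLParser):
    """Collect the text found inside <pre> elements, ignoring all tags."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode &lt; &gt; &amp; itself.
        super().__init__(convert_charrefs=True)
        self.depth = 0    # number of <pre> elements currently open
        self.chunks = []  # pieces of text seen while inside <pre>

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_pre(page: str) -> str:
    """Return the concatenated text of all <pre> blocks in the page."""
    parser = PreTextExtractor()
    parser.feed(page)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


sample = (
    "<h2>Code Fragment: Class</h2><dl><dd><pre>\n"
    "<font color=#A000A0>int</font> nComponents;  "
    "<font color=#0000FF>// num of components</font>\n"
    "</pre></dd></dl>"
)
print(extract_pre(sample))
```

If the closing pre tag is missing, as in the question's page, this sketch simply keeps collecting text until the end of the input, which there is only whitespace.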

answered Sep 21, 2011 at 22:58


kestrel · Accepted Answer · 2011-09-21 22:58:14Z

0

If you're trying to strip all HTML tags to get back the original, non-highlighted source code, then you have two options that I can think of:

Parse the DOM tree and just grab all relevant text.
Use some regular expressions to remove the tags themselves. For example, maybe "s///" would be a good start?

answered Sep 21, 2011 at 22:58
