HTML DOM parsing

This forum is for all Flare issues related to importing files or projects.
Johanna
Propeller Head
Posts: 25
Joined: Wed Feb 25, 2015 3:08 am

HTML DOM parsing

Post by Johanna »

We use MadCap Flare, but our partners send us documents in various formats:
  • Google Docs
  • Doxygen
  • LaTeX
  • Sphinx
I can save these documents as HTML or MS Word files, but the resulting markup is far from clean.

For example, I want to replace <div class="math notranslate nohighlight">\r\n\\\[(.*(?=(<\/div>)))</div>
with <p class="Equa"><MadCap:equation>$ \1 $</MadCap:equation></p>
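
For illustration, the same replacement could be sketched outside Flare with Python's re module. This is only a sketch under assumptions: the sample HTML is made up, the non-greedy (.*?) stands in for the lookahead in the pattern above, and the closing \] is assumed to be dropped together with the opening \[.

```python
import re

# Hypothetical sample of Sphinx-style HTML; a real export may differ.
html = (
    '<div class="math notranslate nohighlight">\r\n'
    '\\[E = mc^2\\]</div>'
)

# Non-greedy (.*?) stops at the first "\]" that is followed by </div>.
# re.DOTALL lets "." match newlines inside multi-line equations.
pattern = re.compile(
    r'<div class="math notranslate nohighlight">\s*\\\[(.*?)\\\]\s*</div>',
    re.DOTALL,
)


def to_equation(match):
    # A function is used as the replacement so that backslashes in the
    # captured LaTeX are copied literally instead of being read as escapes.
    latex = match.group(1).strip()
    return '<p class="Equa"><MadCap:equation>$ ' + latex + ' $</MadCap:equation></p>'


converted = pattern.sub(to_equation, html)
print(converted)
```

This only works while the equation <div> contains no nested <div>.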

But it takes considerable knowledge of regular expressions to work out how to do this in Find and Replace, and it is not foolproof: the pattern may not always match the intended string.
For example, if one <div> tag is nested inside another, the pattern may match the wrong closing tag.
I read on the internet that regular expressions are not well suited to parsing an XML/HTML document, because they cannot easily pair opening tags with their closing tags.
I also read that an HTML DOM parser is more suitable, but that seems complicated as well.
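
To make the parser idea concrete, here is a minimal sketch using html.parser from the Python standard library. It is an event-based parser rather than a full DOM, but it shows the key point: counting open <div> tags pairs each closing tag with the right opening tag. The class name MathDivRewriter and the sample input are hypothetical, and the "math" class and the target markup are taken from the example above.

```python
from html.parser import HTMLParser


class MathDivRewriter(HTMLParser):
    """Rewrite <div class="math ..."> blocks while tracking <div> nesting."""

    def __init__(self):
        super().__init__(convert_charrefs=False)  # keep &lt; etc. unchanged
        self.out = []           # pieces of the rewritten document
        self.math_buf = []      # text collected inside the open math <div>
        self.depth = 0          # number of currently open <div> elements
        self.math_depth = None  # depth at which the open math <div> started

    def emit(self, text):
        # Inside a math <div>, collect text; otherwise copy it to the output.
        target = self.math_buf if self.math_depth is not None else self.out
        target.append(text)

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1
            classes = (dict(attrs).get("class") or "").split()
            if self.math_depth is None and "math" in classes:
                self.math_depth = self.depth
                self.math_buf = []
                return
        self.emit(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.emit(self.get_starttag_text())  # self-closing tag, e.g. <br/>

    def handle_endtag(self, tag):
        if tag == "div":
            closes_math = self.math_depth is not None and self.depth == self.math_depth
            self.depth -= 1
            if closes_math:
                latex = "".join(self.math_buf).strip()
                if latex.startswith("\\["):
                    latex = latex[2:]
                if latex.endswith("\\]"):
                    latex = latex[:-2]
                self.math_depth = None
                self.out.append(
                    f'<p class="Equa"><MadCap:equation>$ {latex.strip()} $</MadCap:equation></p>'
                )
                return
        self.emit(f"</{tag}>")

    def handle_data(self, data):
        self.emit(data)

    def handle_entityref(self, name):
        self.emit(f"&{name};")

    def handle_charref(self, name):
        self.emit(f"&#{name};")

    def handle_comment(self, data):
        self.emit(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.emit(f"<!{decl}>")


# Hypothetical input: the math <div> sits inside another <div>.
sample = (
    '<div class="section"><div class="math notranslate nohighlight">\r\n'
    '\\[a &lt; b\\]</div><p>text</p></div>'
)
rewriter = MathDivRewriter()
rewriter.feed(sample)
rewriter.close()
result = "".join(rewriter.out)
print(result)
```

The outer <div class="section"> keeps its own closing tag because only the </div> at the recorded depth is rewritten. The sketch drops processing instructions such as an XML declaration, so it would need extending before being run on complete files.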

I would also like to replace a div tag of a certain class with a paragraph tag, but I can't find a way to do this. Again, a regular expression in Find and Replace is not an obvious solution.
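
For the div-to-paragraph case, a tree-based approach might look like the sketch below. It uses xml.etree.ElementTree from the Python standard library and assumes the input is well-formed XHTML, which raw HTML exports often are not. The class name "note" and the sample markup are placeholders.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed XHTML fragment; "note" is a placeholder class name.
source = (
    '<body>'
    '<div class="note">Keep the cable dry.</div>'
    '<div class="other"><div class="note">Nested note.</div></div>'
    '</body>'
)

root = ET.fromstring(source)

# iter("div") visits every <div> at any depth. Only the tag name changes,
# so attributes, text, and child elements stay attached to the element.
for element in root.iter("div"):
    if "note" in element.get("class", "").split():
        element.tag = "p"

result = ET.tostring(root, encoding="unicode")
print(result)
```

Because the parser builds a tree first, the closing tag is renamed together with its opening tag, and the unrelated <div class="other"> is left alone.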

Does anybody here have experience with cleaning up dirty XML?

Best regards,
Colinda

Re: HTML DOM parsing

Post by Johanna »

Please ignore my question: I realized that Flare 2020 does come with a Find and Replace for elements! :D
Finally! That is what I needed.