Getting rid of </span><span+> from word imports

owilkes
Propeller Head
Posts: 68
Joined: Wed Apr 20, 2011 10:01 am
Location: London

Getting rid of </span><span+> from word imports

Post by owilkes »

Hi All,

I assume the problem is with Word (it usually is). I'm starting a big conversion task from Word to MadCap, so I want all my ducks lined up and the process clear beforehand. When I import a docx into a project, the converted output is generally OK, but often there are a bunch of </span><span+>s right in the middle of a paragraph. I have no idea why, or whether they will cause any harm.

I imagine the problem is Word keeping some hidden, pointless formatting bloat that cannot be erased.

Has anyone found any clues or tricks for 'simplifying' a Word docx, so that where the text shows no visible formatting change there really isn't one, and the conversion into MadCap doesn't produce unnecessary <span> formatting?
Many thanks.
Andrew
Propellus Maximus
Posts: 1237
Joined: Fri Feb 10, 2006 5:37 am

Re: Getting rid of </span><span+> from word imports

Post by Andrew »

Beyond cracking open the docx XML files and editing the XML directly, I don't know of a way (and, as you might guess, that is harder than actually removing them in Flare).

Actually, if you are familiar with Regular Expressions (RegEx), you might be able to simply remove all <span></span> tags with no content between them. Flare has a built-in RegEx tool, or you could use a program like FAR (Find And Replace).
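For anyone who wants to script this outside Flare, here is a minimal Python sketch of the empty-span removal described above. It is an illustration under assumptions, not Flare's built-in tool: the function name `remove_empty_spans` is made up, a span counts as empty only when nothing at all sits between the tags, and no attribute value is expected to contain a `>` character.

```python
import re

# A <span ...> opening tag immediately followed by its closing tag.
# Assumption: attribute values never contain a literal ">".
EMPTY_SPAN = re.compile(r"<span\b[^>]*></span>", re.IGNORECASE)


def remove_empty_spans(markup: str) -> str:
    """Delete span pairs with no content, repeating until none are left.

    The loop handles nesting: removing an inner empty span can leave
    the outer span empty as well.
    """
    previous = None
    while previous != markup:
        previous = markup
        markup = EMPTY_SPAN.sub("", markup)
    return markup


if __name__ == "__main__":
    print(remove_empty_spans('<p>Some<span class="x"></span> text</p>'))
```

Spans that contain only whitespace are deliberately left alone, since that space may be the only separator between two words.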
Flare v6.1 | Capture 4.0.0
RamonS
Senior Propellus Maximus
Posts: 4293
Joined: Thu Feb 02, 2006 9:29 am
Location: The Electric City

Re: Getting rid of </span><span+> from word imports

Post by RamonS »

The spans typically come from inline formatting done in Word. Word makes this extremely easy and encourages it, and many Word users just shake their heads at strict style-based formatting. One option would be to clean up the Word document first, then import. You could also do the cleanup after import in Flare. Not sure which way is easier.
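To get a feel for how much run-level formatting a document carries before deciding where to clean up, a rough Python sketch like the one below may help. It relies on a docx being a ZIP archive whose main text lives in `word/document.xml`, with text split into `<w:r>` runs that may carry a `<w:rPr>` run-properties element. The function name `count_runs` is hypothetical, and a run with `<w:rPr>` is not necessarily inline formatting (it may just reference a character style), so treat the numbers as a hint only.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def count_runs(docx):
    """Return (total runs, runs that carry a <w:rPr> element).

    `docx` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    runs = list(root.iter(W + "r"))
    with_properties = [run for run in runs if run.find(W + "rPr") is not None]
    return len(runs), len(with_properties)


if __name__ == "__main__":
    import sys

    total, formatted = count_runs(sys.argv[1])
    print(f"{total} runs, {formatted} with run properties")
```

A paragraph that looks uniform in Word but reports many runs is a likely candidate for stray spans after import.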
Nita Beck
Senior Propellus Maximus
Posts: 3667
Joined: Thu Feb 02, 2006 9:57 am
Location: Pittsford, NY

Re: Getting rid of </span><span+> from word imports

Post by Nita Beck »

Andrew wrote:Actually, if you are familiar with Regular Expressions (RegEx), you might be able to simply remove all <span></span> tags with no content between them. Flare has a built-in RegEx tool, or you could use a program like FAR (Find And Replace).
There is another way, too, if you have Analyzer 4. Use the Markup Suggestions feature, which can find empty tags (among other markup issues) in all the files in your project and prompt if you want to remove them.
Nita
RETIRED, but still fond of all the Flare friends I've made. See you around now and then!
owilkes
Propeller Head
Posts: 68
Joined: Wed Apr 20, 2011 10:01 am
Location: London

Re: Getting rid of </span><span+> from word imports

Post by owilkes »

Thanks for all the suggestions. I'll go through each, to the best of my ability (warning: this could kick off further questions). We are also evaluating a product, 'DataExtractor', which apparently strips out unwanted code. I'll let you know how I get on. I can't believe I'm the only one to have this migration problem.

Thanks again.
Nita Beck
Senior Propellus Maximus
Posts: 3667
Joined: Thu Feb 02, 2006 9:57 am
Location: Pittsford, NY

Re: Getting rid of </span><span+> from word imports

Post by Nita Beck »

owilkes wrote:...can't believe I'm the only one to have this migration problem.
You're not. There is always some kind of code cleanup that has to be done after importing content from elsewhere, whether Word or FrameMaker or RoboHelp. Not everyone sees exactly the same issue you're seeing, but I would bet that everyone sees some kind of issue that has to be cleaned up post migration. And sometimes the trick is to clean up content in the source application before pulling it into Flare.
Nita
rob hollinger
Propellus Maximus
Posts: 661
Joined: Mon Mar 17, 2008 8:40 am

Re: Getting rid of </span><span+> from word imports

Post by rob hollinger »

In Flare 7, we fixed the "Remove inline formatting" tool, which is on the Text Formatting toolbar.
It's the bold, underlined "B".

Open a topic that's full of span tags you want to remove.
Press CTRL + A to select all.
Click the button and they are all removed.
Tip: For really long topics, be sure you're in Layout (Web) to avoid kicking off the pagination engine.
Rob Hollinger
MadCap Software
jasonsmith
Sr. Propeller Head
Posts: 205
Joined: Wed Apr 28, 2010 2:51 am

Re: Getting rid of </span><span+> from word imports

Post by jasonsmith »

Another thing to do is to make sure that "Preserve MS Word Styles" is cleared when you import your Word document...
owilkes
Propeller Head
Posts: 68
Joined: Wed Apr 20, 2011 10:01 am
Location: London

Re: Getting rid of </span><span+> from word imports

Post by owilkes »

Many thanks - the Remove Inline Formatting tool seems to have worked (I will need to do more checking, but it seems to do the trick).

Always nice (and rare) to try something new and have it do exactly what you want, first time!

Thanks again
TheBrittleTechWriter
Propeller Head
Posts: 28
Joined: Wed Feb 15, 2006 12:58 pm
Location: Chicago, IL

Re: Getting rid of </span><span+> from word imports

Post by TheBrittleTechWriter »

The Unformat (Remove Inline Formatting) button works well with one or two instances, but when importing from Word you may get literally hundreds of these pairs. Consider using regex. In a tool like Notepad++, you can use:
Find what: <span class="span_[0-9]">(.*?)</span>
Replace with: \1
or variations. This does a great job of cleaning up the mess.

Why Word does this in the first place is unclear. Most likely the Word user applied inline formatting at some point, and Word doesn't always clean up its internal code properly. You can't see this in Word, and although removing a selection's formatting and then reapplying it works, that's hardly practical with detailed formatting and hundreds of documents. The best you can do is handle it in Flare afterwards. On the other hand, using Notepad++'s regex feature with its multiple-files option, you can clean any number of documents in a few seconds. You can also record a macro to make it even easier.
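The same find-and-replace can also be scripted. Below is a minimal Python sketch, not a tested migration tool. Assumptions: the topics are UTF-8 `.htm` files, the imported spans use classes of the form `span_` plus digits (the pattern allows more than one digit, which is one of the "variations"), and the spans are not nested. The names `unwrap_spans` and `clean_folder` are made up for illustration. Run it on a copy of the project first.

```python
import re
from pathlib import Path

# Same idea as the Notepad++ pattern, but allowing multi-digit class numbers.
# DOTALL lets a span's content run across line breaks.
SPAN_N = re.compile(r'<span class="span_[0-9]+">(.*?)</span>', re.DOTALL)


def unwrap_spans(markup: str) -> str:
    """Replace each matching span with its own content (drop the tags)."""
    return SPAN_N.sub(r"\1", markup)


def clean_folder(folder: str, pattern: str = "*.htm") -> int:
    """Unwrap spans in every matching file under `folder`.

    Returns the number of files that were changed. Files are read and
    written as bytes so existing line endings are preserved.
    """
    changed = 0
    for path in Path(folder).rglob(pattern):
        original = path.read_bytes().decode("utf-8")
        cleaned = unwrap_spans(original)
        if cleaned != original:
            path.write_bytes(cleaned.encode("utf-8"))
            changed += 1
    return changed
```

Because the pattern is non-greedy and knows nothing about nesting, a span that contains another span can be paired with the wrong closing tag, so spot-check a few topics afterwards.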
SKamprowski
Sr. Propeller Head
Posts: 277
Joined: Fri Feb 13, 2015 8:25 am
Location: Germany

Re: Getting rid of </span><span+> from word imports

Post by SKamprowski »

Hi,
Here is an old "trick" for removing inline formatting in Word:
select all the text, then press Ctrl+Space.
Kind regards,
Sabine Kamprowski
DocToHelp MVP (by ComponentOne)