Now, this is all well and good unless you have some HTML in there:
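As a minimal illustration, here is a plain character-count truncate in Ruby. The `naive_truncate` name, the sample string, and the length are all made up for this sketch, chosen so that the cut lands in the middle of a tag.

```ruby
# Hypothetical helper: cut a string at a fixed number of characters.
# It knows nothing about HTML or word boundaries.
def naive_truncate(text, length, omission = '...')
  return text if text.length <= length
  text[0, length] + omission
end

puts naive_truncate('Some <b>bold</b> text', 7)   # cut falls inside the <b> tag
puts naive_truncate('Hello wonderful world', 9)   # cut falls inside a word
```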
Which is no good, since it has split our <b> tag in two. Worse would be if it got the whole <b> opening tag but missed the closing tag; the whole rest of the page would be bolded. You’ll also notice that this could cut words completely in half.
What is needed is a truncate that won’t truncate tags or words, and won’t leave tags unclosed.
What this does is send the string to the Hpricot HTML parser, which strips out all the HTML tags. It then splits the string into words and returns the first however many words were requested. To continue on with our previous example:
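A rough sketch of the same idea follows. It is not Jack’s function: to stay free of dependencies it swaps Hpricot for a crude regular expression, which is an assumption of this example and much less robust than a real parser. The name `strip_and_truncate` and the sample string are hypothetical.

```ruby
# Hypothetical stand-in: strip tags, then keep the first max_words words.
def strip_and_truncate(html, max_words)
  # Replace each tag with a space so adjacent words do not run together.
  # A regex is only an approximation of what an HTML parser would do.
  text = html.gsub(/<[^>]*>/, ' ')
  text.split.first(max_words).join(' ')
end

puts strip_and_truncate('This is <b>bold text</b> in a sentence', 4)
```

The tags are gone and no word is cut in half, but so is all of the formatting.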
So, we solved all of our complaints! No more broken up HTML tags and no more split words. But I wasn’t sure if I liked this result. Where did the HTML tags go? I wanted an HTML truncate that returned the first however many words while maintaining the HTML tags.
Here is my algorithm:
1. Load in the HTML.
2. Traverse the loaded HTML looking for text nodes.
3. When a text node is found, count the number of words it has.
4. Once the limit is reached, remove all nodes that come after it.
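The steps above can be sketched roughly as follows. This is not the original html_truncate: it assumes well-formed markup and uses REXML, which ships with Ruby, in place of Hpricot. The names `prune_after_limit` and `html_truncate_sketch` are made up for this example. Real-world HTML would need a more forgiving parser.

```ruby
require 'rexml/document'

# Walk the children of node in document order, spending the word budget on
# text nodes. Returns how many words are still available afterwards.
def prune_after_limit(node, remaining)
  node.children.each do |child|   # children returns a copy, so removal is safe
    if remaining <= 0
      child.remove                # limit already reached: drop this node
    elsif child.is_a?(REXML::Text)
      words = child.value.split
      if words.length > remaining
        # Keep leading whitespace plus only the words that still fit
        lead = child.value[/\A\s*/]
        child.value = lead + words.first(remaining).join(' ')
        remaining = 0
      else
        remaining -= words.length
      end
    elsif child.is_a?(REXML::Element)
      remaining = prune_after_limit(child, remaining)
    end
  end
  remaining
end

def html_truncate_sketch(html, max_words)
  # Wrap the fragment so the parser sees a single root element
  doc = REXML::Document.new("<root>#{html}</root>")
  prune_after_limit(doc.root, max_words)
  doc.root.children.map(&:to_s).join
end

puts html_truncate_sketch('<p>The <b>quick brown</b> fox jumps over</p><p>the lazy dog</p>', 4)
# => <p>The <b>quick brown</b> fox</p>
```

Because nodes are removed from the parsed tree instead of characters being cut from the string, every tag that survives is still closed when the tree is written back out.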
And finally, here is what my html_truncate function does:
No split words, no broken HTML. Perfect!
Though, I ultimately decided not to use it, and went with Jack’s function. I liked how concise it made the resulting text with no <p>’s or <ul>’s to string it out. I still think my version is useful though.