An even smarter html_truncate tag
As you know, I am using Jekyll to generate this blog. When setting mine up, I read how Jack Moffitt set up his Jekyll installation and thought the idea of an html_truncate filter was pretty cool.
What does a truncate filter do? It just makes a string shorter. Here is an example from the truncate documentation included in Ruby on Rails:
truncate("Once upon a time in a world far far away")
# yields => Once upon a time in a world f...
Now, this is all well and good unless you have some HTML in there:
truncate("Once upon a time in a world <b>far far away</b>")
# yields => Once upon a time in a world <...
Which is no good, since it has split our <b> tag in two. Worse would be if it got the whole <b> opening tag but missed the closing tag; the whole rest of the page would be bolded. You’ll also notice that this could cut words completely in half.
What is needed is a truncate that won’t truncate tags or words, and won’t leave tags unclosed.
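To see those failure modes concretely, here is a minimal sketch of a character-based truncate. `naive_truncate` is a hypothetical helper written for this illustration, not the Rails implementation; it assumes the simplest possible behavior, keeping the first `length` characters and appending the marker.

```ruby
# Hypothetical helper (not the Rails method): keep the first `length`
# characters and append the omission marker.
def naive_truncate(text, length, omission = "...")
  return text if text.length <= length
  text[0, length] + omission
end

story = "Once upon a time in a world <b>far far away</b>"

puts naive_truncate(story, 25)  # cuts a word in half
# prints: Once upon a time in a wor...
puts naive_truncate(story, 29)  # cuts the <b> tag in half
# prints: Once upon a time in a world <...
puts naive_truncate(story, 31)  # keeps <b> but never closes it
# prints: Once upon a time in a world <b>...
```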
Here is Jack Moffitt’s html_truncate filter (from the GitHub commit):
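require 'hpricot'

def html_truncatewords(input, words = 15, truncate_string = "...")
  doc = Hpricot.parse(input)
  # keep only the text nodes; 0...words is an exclusive range,
  # so exactly `words` words are kept
  (doc/:"text()").to_s.split[0...words].join(' ') + truncate_string
end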
What his does is send the string to the Hpricot HTML parser and pull out just the text nodes, which drops all the HTML tags. It then splits that text into words and returns the first however many words were requested. To continue on with our previous example:
html_truncatewords("Once upon a time in a world <b>far far away</b>", 8)
# yields => Once upon a time in a world far...
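The strip-then-count idea can also be sketched without a parser. This is a hypothetical stand-in (`strip_and_truncate_words` is a name made up here) that swaps Hpricot for a crude regex, so it is only suitable for simple markup.

```ruby
# Hypothetical stand-in for the filter above: a regex replaces the HTML parser.
def strip_and_truncate_words(input, words = 15, truncate_string = "...")
  text = input.gsub(/<[^>]*>/, "")   # crude tag removal, an assumption for simple markup
  text.split.first(words).join(" ") + truncate_string
end

puts strip_and_truncate_words("Once upon a time in a world <b>far far away</b>", 8)
# prints: Once upon a time in a world far...
```

Like the filter above, it appends the truncate string even when nothing was cut off.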
So, we solved all of our complaints! No more broken up HTML tags and no more split words. But I wasn’t sure if I liked this result. Where did the HTML tags go? I wanted an HTML truncate that returned the first however many words while maintaining the HTML tags.
Here is my algorithm:
- Load in the HTML
- Traverse the loaded HTML looking for text nodes
- When a text node is found count the number of words it has
- Once the limit is reached, trim that text node down to fit and remove all nodes that come after it.
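Those steps can be roughed out with nothing but the standard library. The sketch below is an illustration under stated assumptions, not a parser-based implementation. It swaps the DOM traversal for a regex tokenizer, and both the `sketch_html_truncate` name and the short void-tag list are made up here. It will misbehave on malformed or unusual HTML.

```ruby
# Stdlib-only sketch; the method name and the void-tag list are assumptions.
def sketch_html_truncate(input, num_words = 15, truncate_string = "...")
  void_tags = %w[br hr img input]   # tags that never get a closing tag
  out = +""
  open_tags = []
  count = 0

  # split the input into tags ("<b>") and the runs of text between them
  input.scan(/<[^>]+>|[^<]+/) do |token|
    if token.start_with?("<")
      name = token[/\A<\/?\s*(\w+)/, 1].to_s.downcase
      if token.start_with?("</")
        open_tags.pop if open_tags.last == name
      elsif !name.empty? && !void_tags.include?(name) && !token.end_with?("/>")
        open_tags.push(name)
      end
      out << token
    else
      words = token.split
      if count + words.length > num_words
        # keep only the words that still fit, then close whatever is open
        kept = words.first(num_words - count).join(" ")
        out << token[/\A\s*/] << kept << truncate_string
        out << open_tags.reverse.map { |t| "</#{t}>" }.join
        return out
      end
      count += words.length
      out << token
    end
  end
  out
end

puts sketch_html_truncate("Once upon a time in a world <b>far far away</b>", 8)
# prints: Once upon a time in a world <b>far...</b>
```

Tracking a stack of open tags stands in for "remove all nodes that come after it": everything after the cut is simply never emitted, and the stack says which tags still need closing.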
Here is my code:
Update: I have learned that there are some errors in this code. GitHub user Eleo has posted a working version. Thanks Eleo!
require 'nokogiri'

def html_truncate(input, num_words = 15, truncate_string = "...")
  # parse as a fragment so nokogiri doesn't add html and body tags,
  # which we don't want
  doc = Nokogiri::HTML::DocumentFragment.parse(input)
  current = doc.children.first
  count = 0

  while current != nil
    # we found a text node
    if current.text?
      count += current.text.split.length
      # we reached our limit, let's get outta here!
      break if count > num_words
    end

    if current.children.length > 0
      # this node has children, can't be a text node,
      # lets descend and look for text nodes
      current = current.children.first
    elsif not current.next.nil?
      # this has no children, but has a sibling, let's check it out
      current = current.next
    else
      # we are the last child, we need to ascend until we are
      # either done or find a sibling to continue on to
      n = current
      while n != doc and n.next.nil?
        n = n.parent
      end
      current = (n == doc) ? nil : n.next
    end
  end

  # current is nil when the input never went over the limit
  unless current.nil?
    words = current.text.split
    # the most confusing part. this last text node could send us way over
    # our limit, so we take the overshoot (count - num_words) off the
    # number of words in this node to find how many of them to keep
    keep = words.length - (count - num_words)
    leading_space = current.text[/\A\s*/]
    current.content = leading_space + words[0...keep].join(' ') + truncate_string

    # remove everything else
    while current != doc
      while not current.next.nil?
        current.next.remove
      end
      current = current.parent
    end
  end

  # now we grab the html and not the text
  doc.to_html
end
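The arithmetic in "the most confusing part" is easier to check in isolation. Here is a small worked example using the running totals from the example sentence. `words_to_keep` is a hypothetical helper named for this illustration only.

```ruby
# Hypothetical helper: how many words of the last text node survive the cut.
def words_to_keep(node_word_count, count, num_words)
  overshoot = count - num_words   # how far past the limit we ended up
  node_word_count - overshoot
end

last_node = "far far away".split
# "Once upon a time in a world " holds 7 words, so after reading
# "far far away" the running count is 10 against a limit of 8.
keep = words_to_keep(last_node.length, 10, 8)   # 3 - (10 - 8) = 1
puts last_node.first(keep).join(" ") + "..."
# prints: far...
```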
I used the Nokogiri HTML parser because I read that it was faster. (Now I am reading that this is no longer the case! Which one will I choose?)
And finally, here is what my html_truncate function does:
html_truncate("Once upon a time in a world <b>far far away</b>", 8)
# yields => Once upon a time in a world <b>far...</b>
No split words, no broken HTML. Perfect!
Though, I ultimately decided not to use it, and went with Jack’s function. I liked how concise it made the resulting text with no <p>’s or <ul>’s to string it out. I still think my version is useful though.