An even smarter html_truncate tag

January 30th, 2009

As you know, I am using Jekyll to generate this blog. When setting mine up, I read how Jack Moffitt set up his Jekyll installation and thought the idea of an html_truncate filter was pretty cool.

What does a truncate filter do? It just makes a string shorter. Here is an example from the truncate documentation included in Ruby on Rails:

truncate("Once upon a time in a world far far away")
# yields => Once upon a time in a world f...

Now, this is all well and good unless you have some HTML in there:

truncate("Once upon a time in a world <b>far far away</b>")
# yields => Once upon a time in a world <...

That is no good, since it has split our <b> tag in two. It would be worse if it kept the whole opening <b> tag but dropped the closing tag: the rest of the page would be bolded. You’ll also notice that this can cut words in half.
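To make the failure concrete, here is a minimal sketch of a character-based truncate. The name naive_truncate is a hypothetical stand-in and this is not the Rails source. The cut-off of 29 characters is an assumption chosen only so that the sketch reproduces the two outputs quoted above.

```ruby
# Hypothetical stand-in for a character-based truncate (not the Rails source).
# `keep` is an assumed cut-off, picked to match the outputs quoted in the post.
def naive_truncate(text, keep = 29, omission = "...")
  return text if text.length <= keep
  # Cut at a fixed character position, with no idea what is a tag or a word.
  text[0, keep] + omission
end

puts naive_truncate("Once upon a time in a world far far away")
# prints: Once upon a time in a world f...
puts naive_truncate("Once upon a time in a world <b>far far away</b>")
# prints: Once upon a time in a world <...
```

Because the cut is a plain character offset, it can land inside a tag or inside a word just as easily as between them.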

What is needed is a truncate that won’t truncate tags or words, and won’t leave tags unclosed.

Here is Jack Moffitt’s html_truncate filter (from the GitHub commit):

def html_truncatewords(input, words = 15, truncate_string = "...")
	doc = Hpricot.parse(input)
	(doc/:"text()").to_s.split[0..words].join(' ') + truncate_string
end

His filter sends the string to the Hpricot HTML parser and keeps only the text, which strips out all the HTML tags. It then splits the text into words and returns the first however many words were requested, followed by the truncate string. To continue with our previous example:
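One detail of the word-slicing step is easy to check on a plain string. This sketch does not involve Hpricot at all, and the two helper names are hypothetical. It only shows that an inclusive range 0..n selects n + 1 elements, which is worth keeping in mind when choosing the words argument.

```ruby
# Hypothetical helpers working on an already tag-free string.
def first_words_inclusive(text, n)
  text.split[0..n]        # inclusive range: indices 0 through n
end

def first_words_exact(text, n)
  text.split.first(n)     # exactly n words (or fewer if the text is short)
end

text = "Once upon a time in a world far far away"
puts first_words_inclusive(text, 8).length   # prints: 9
puts first_words_exact(text, 8).join(' ')    # prints: Once upon a time in a world far
```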

html_truncatewords("Once upon a time in a world <b>far far away</b>", 8)
# yields => Once upon a time in a world far...

So, we solved all of our complaints! No more broken-up HTML tags and no more split words. But I wasn’t sure I liked this result. Where did the HTML tags go? I wanted an HTML truncate that returned the first however many words while keeping the HTML tags.

Here is my algorithm:

1. Load in the HTML.
2. Traverse the loaded HTML looking for text nodes.
3. When a text node is found, count the number of words it has.
4. Once the limit is reached, remove all nodes that come after it.
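The steps above can be sketched on a toy tree without any HTML parser. Everything here is an assumption made for illustration: Element, truncate_node, truncate_tree and render are hypothetical names, the tree is built by hand instead of being parsed, and the walk is recursive rather than the iterative traversal used in the real filter.

```ruby
# Toy stand-in for a parsed document (hypothetical, not Nokogiri):
# an Element has a tag name and children; a text node is a plain String.
Element = Struct.new(:tag, :children)

# Walks the tree in document order, spending a word budget on text nodes.
# Returns the node to keep, or nil when the node comes after the cut.
def truncate_node(node, state, omission)
  return nil if state[:done]

  if node.is_a?(String)
    words = node.split
    if words.length <= state[:left]
      state[:left] -= words.length   # the whole text node fits
      return node
    end
    state[:done] = true              # this text node crosses the limit
    return words.first(state[:left]).join(' ') + omission
  end

  kept = node.children.map { |child| truncate_node(child, state, omission) }.compact
  Element.new(node.tag, kept)
end

def truncate_tree(root, num_words = 15, omission = "...")
  truncate_node(root, { left: num_words, done: false }, omission)
end

# Serializes the toy tree back to markup; a nil tag means "no wrapper".
def render(node)
  return node if node.is_a?(String)
  inner = node.children.map { |child| render(child) }.join
  node.tag ? "<#{node.tag}>#{inner}</#{node.tag}>" : inner
end

tree = Element.new(nil, [
  "Once upon a time in a world ",
  Element.new("b", ["far far away"]),
  Element.new("i", ["and then some"])
])
puts render(truncate_tree(tree, 8))
# prints: Once upon a time in a world <b>far...</b>
```

The enclosing <b> element survives because only the text node inside it is shortened, while the <i> element that comes after the cut is dropped entirely.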

Here is my code:

Update: I have learned that there are some errors in this code. GitHub user Eleo has posted a working version. Thanks Eleo!

require 'nokogiri'

def html_truncate(input, num_words = 15, truncate_string = "...")
	doc = Nokogiri::HTML(input)

	current = doc.children.first
	count = 0

	while current != nil
		# we found a text node
		if current.class == Nokogiri::XML::Text
			count += current.text.split.length
			# we reached our limit, let's get outta here!
			break if count > num_words
		end

		if current.children.length > 0
			# this node has children, can't be a text node,
			# let's descend and look for text nodes
			current = current.children.first
		elsif not current.next.nil?
			# this has no children, but has a sibling, let's check it out
			current = current.next
		else
			# we are the last child, we need to ascend until we are
			# either done or find a sibling to continue on to.
			# check n != doc first: the document itself has no parent to climb to
			n = current
			while n != doc and n.next.nil?
				n = n.parent
			end

			if n == doc
				current = nil
			else
				current = n.next
			end
		end
	end

	if current and count > num_words
		new_content = current.text.split

		# the most confusing part. we want to grab just the first [num_words]
		# number of words, but this last text node could send us way over
		# our limit.  So, we need to find the difference between the number
		# of words we wanted and the number of words total we found (count - num_words)
		# to find how many we need to take off of this last text node
		# so we subtract from the number of words in this text node.
		# Finally we use an exclusive range (...) so exactly that many words are kept.
		new_content = new_content[0...(new_content.length-(count-num_words))]

		current.content = new_content.join(' ') + truncate_string

		#remove everything else
		while current != doc
			while not current.next.nil?
				current.next.remove
			end
			current = current.parent
		end
	end

	# now we grab the html and not the text.
	# nokogiri adds html and body tags (and may put a doctype node first)
	# which we don't want, so we ask for the body directly
	doc.at_css('body').inner_html
end
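A worked check of the arithmetic for the last text node may help. This is a sketch on plain arrays, and words_to_keep is a hypothetical helper, not part of the filter. With 8 words requested and a running count of 10 after a three-word text node, the node is 2 words over the limit, so 1 of its words should survive.

```ruby
# Hypothetical helper: which words of the last text node survive.
# node_words - the words of the text node that crossed the limit
# count      - running word total, including this node
# num_words  - the number of words requested
def words_to_keep(node_words, count, num_words)
  excess = count - num_words                 # how far over the limit we are
  node_words.first(node_words.length - excess)
end

p words_to_keep(%w[far far away], 10, 8)     # prints: ["far"]
```

Written as a range, the same selection is 0...(length - excess), with three dots so that the end index is excluded.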

I used the Nokogiri HTML parser because I read that it was faster. (Now I am reading that this is no longer the case! Which one will I choose?)

And finally, here is what my html_truncate function does:

html_truncate("Once upon a time in a world <b>far far away</b>", 8)
# yields => Once upon a time in a world <b>far...</b>

No split words, no broken HTML. Perfect!

In the end, though, I decided not to use it and went with Jack’s function. I liked how concise it made the resulting text, with no <p>’s or <ul>’s to string it out. I still think my version is useful, though.