Django urlize HTML safe

Django's built-in urlize filter is not HTML-safe. As the docs say:
Note that if urlize is applied to text that already contains HTML markup, things won't work as expected. Apply this filter only to plain text.
The easy way to solve this problem is to use an HTML parser and make sure the filter is applied only to plain text.
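To see why markup-unaware linkification goes wrong, here is a minimal sketch in plain Python 3. It does not call Django: `naive_linkify` and `URL_RE` are hypothetical stand-ins that only mimic the general idea of urlize (wrap anything that looks like a URL in an anchor), so the exact output of the real filter will differ.

```python
import re

# Hypothetical stand-in for urlize: a deliberately simple URL pattern.
URL_RE = re.compile(r'https?://[^\s<>"]+')


def naive_linkify(text):
    """Wrap every URL-looking substring in an <a> tag, ignoring any markup."""
    return URL_RE.sub(
        lambda m: '<a href="%s">%s</a>' % (m.group(0), m.group(0)), text
    )


plain = 'visit http://od-eon.com today'
html = '<p>foo <a href="http://od-eon.com">http://od-eon.com</a> bar</p>'

# Plain text: the URL becomes a single link, as intended.
print(naive_linkify(plain))

# Existing markup: the URL inside href="..." and the URL inside the anchor
# text are both wrapped again, leaving nested, broken anchors.
print(naive_linkify(html))
```

On the plain string we get one clean link. On the HTML string the existing anchor ends up with a second anchor inside its href attribute and a third one inside its text, which is the kind of breakage the docs warn about.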
BeautifulSoup functions
To get all the HTML tags in a piece of text, .findAll() is the way to go, but if we also need the plain text that sits between the tags, .contents is what we need: it returns the top-level nodes, tags and text alike. A simple example shows the difference:
- >>> from BeautifulSoup import BeautifulSoup
- >>> html = '<p>foo <a href="http://od-eon.com">http://od-eon.com</a> bar</p> foobar'
- >>> soup = BeautifulSoup(html)
- >>> soup.findAll()
- [<p>foo <a href="http://od-eon.com">http://od-eon.com</a> bar</p>, <a href="http://od-eon.com">http://od-eon.com</a>]
- >>> soup.contents
- [<p>foo <a href="http://od-eon.com">http://od-eon.com</a> bar</p>, u' foobar']
From this simple example, the flow for a safe urlize is clear. We need to apply the urlize function to the text found in the .contents of each tag:
Step 1. Iterate over .contents and, if an item is not a tag, do the replacement.
Step 2. Iterate over the tags and create a new BeautifulSoup instance from each one.
Step 3. If the new instance's .findAll() returns more tags, repeat the process for each of them; if it doesn't, apply the urlize function and replace the tag in the main soup object.
There is only one minor check to add: what kind of tag we are converting. The http://od-eon.com inside the anchor tag should not be converted to a new link, because it already is one. So at Step 2 we need to check that tag.name is one we accept, that is, not in a list of ignored tags such as a, code and pre.
Here's what the code looks like:
- from django.template.defaultfilters import stringfilter
- from django.template import Library
- from django.utils.html import urlize
- from django.utils.safestring import mark_safe
- register = Library()
- def html_urlize(value, autoescape=None):
-     """Converts URLs in plain text into clickable links."""
-     from BeautifulSoup import BeautifulSoup
-     # text inside these tags is left untouched
-     ignored_tags = ['a', 'code', 'pre']
-     soup = BeautifulSoup(value)
-     tags = soup.findAll(True)
-     text_all = soup.contents
-     # Step 1: urlize the top-level nodes that are not tags
-     for text in text_all:
-         if text not in tags:
-             parsed_text = urlize(text, nofollow=True, autoescape=autoescape)
-             text.replaceWith(parsed_text)
-     # Steps 2 and 3: handle each accepted tag
-     for tag in tags:
-         if tag.name not in ignored_tags:
-             soup_text = BeautifulSoup(str(tag))
-             if len(soup_text.findAll()) > 1:
-                 # the tag has nested tags: repeat the process for each child
-                 for child_tag in tag.contents:
-                     child_tag.replaceWith(html_urlize(str(child_tag)))
-             elif len(soup_text.findAll()) > 0:
-                 # the tag holds only text: urlize it and swap it into the main soup
-                 text_list = soup_text.findAll(text=True)
-                 for text in text_list:
-                     parsed_text = urlize(text, nofollow=True, autoescape=autoescape)
-                     text.replaceWith(parsed_text)
-                 try:
-                     tag.replaceWith(str(soup_text))
-                 except Exception:
-                     # best effort: skip a tag that fails to be replaced
-                     pass
-     return mark_safe(str(soup))
- html_urlize.is_safe = True
- html_urlize.needs_autoescape = True
- html_urlize = stringfilter(html_urlize)
- register.filter(html_urlize)
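For anyone who wants to experiment with the idea without Django or BeautifulSoup installed, here is a minimal sketch using only the Python 3 standard library. It is not the filter above: `linkify_text` is a hypothetical stand-in for urlize, `SafeLinkifier` and `html_urlize_sketch` are names made up for this example, and a streaming `html.parser.HTMLParser` takes the place of the BeautifulSoup tree. The principle is the same, though: only text outside the ignored tags gets linkified.

```python
import re
from html import escape
from html.parser import HTMLParser

# Assumption: a deliberately simple URL pattern, much cruder than urlize.
URL_RE = re.compile(r'https?://[^\s<>"]+')
IGNORED_TAGS = {'a', 'code', 'pre'}  # same list as in the filter


def linkify_text(text):
    """Hypothetical stand-in for urlize: escape the text and wrap URLs."""
    parts, last = [], 0
    for match in URL_RE.finditer(text):
        parts.append(escape(text[last:match.start()], quote=False))
        url = match.group(0)
        parts.append('<a href="%s" rel="nofollow">%s</a>'
                     % (escape(url), escape(url, quote=False)))
        last = match.end()
    parts.append(escape(text[last:], quote=False))
    return ''.join(parts)


class SafeLinkifier(HTMLParser):
    """Re-emit the markup as-is, linkifying text outside ignored tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.ignored_depth = 0  # > 0 while inside <a>, <code> or <pre>

    def handle_starttag(self, tag, attrs):
        if tag in IGNORED_TAGS:
            self.ignored_depth += 1
        self.out.append(self.get_starttag_text())  # original tag source

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. <br/>

    def handle_endtag(self, tag):
        if tag in IGNORED_TAGS and self.ignored_depth:
            self.ignored_depth -= 1
        self.out.append('</%s>' % tag)

    def handle_comment(self, data):
        self.out.append('<!--%s-->' % data)

    def handle_data(self, data):
        if self.ignored_depth:
            self.out.append(escape(data, quote=False))  # leave untouched
        else:
            self.out.append(linkify_text(data))


def html_urlize_sketch(value):
    parser = SafeLinkifier()
    parser.feed(value)
    parser.close()
    return ''.join(parser.out)


sample = ('<p>foo <a href="http://od-eon.com">http://od-eon.com</a> bar</p>'
          ' see http://example.com')
print(html_urlize_sketch(sample))
```

The existing anchor passes through unchanged, while the bare URL after the paragraph is wrapped in a new link. The sketch drops doctype declarations and does not special-case script or style content, so treat it as an illustration of the approach rather than a replacement for the filter.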
Category: Django


