How do I remove all JavaScript text from a block of text in Python?
For example:
Removal of any HTML tags: any text between a < character and a > character can be assumed to be an HTML tag and needs to be removed.
Removal of JavaScript code: before you remove the HTML tags above, you will also need to remove any text that is between the <script> and </script> tags. Note that a <script> or </script> tag can have any amount of whitespace or other text between the "" character and the valid script tag that must be removed.
I'm also not allowed to import any module from the Python library.
Question: How do I remove all JavaScript text from a block of text in Python?
Sol:
Removing JavaScript from a Python string is a common operation when you have crawled a web page. We can remove all the JavaScript text from a block of text in Python by using regular expressions.
For example:
Create a text that contains JavaScript code:
text = '''
this is a script test.
<Script type = "text/javascript">
alert( 'test' )
</script>
test is end.
'''
Build a regular expression to remove the JavaScript code:
import re
re_script = re.compile(r'<\s*script[^>]*>.*?<\s*/\s*script\s*>', re.S | re.I)
A regular expression is a special sequence of characters that helps you match or find other strings or sets of strings, using a specialized syntax held in a pattern. The Python module re provides full support for Perl-like regular expressions.
A pattern such as <\s*script[^>]*>[^<]*<\s*/\s*script\s*> should not use [^<]* for the script body, because JavaScript code can itself contain a < character (for example in a comparison), and the match would then fail. Reserve negated classes such as [^>]* for matching the inside of the tags themselves, and use the non-greedy quantifier, written *?, for the body so that the match stops at the first closing script tag.
Some characters, like '|' or '(', are special. Special characters either stand for classes of ordinary characters, or affect how the regular expressions around them are interpreted. The re module raises the exception re.error if an error occurs while compiling or using a regular expression.
re.S makes a period (dot) match any character, including a newline. re.I performs case-insensitive matching.
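To make the difference between the three quantifier choices concrete, here is a small sketch. The sample string and the names no_lt, greedy and lazy are made up for illustration; only the patterns come from the discussion above.

```python
import re

# Hypothetical sample: the first script body contains a '<' character.
sample = "a<script>if (x < 1) { go(); }</script>b<script>y()</script>c"

flags = re.S | re.I
# [^<]* cannot cross the '<' inside the first script body, so that block is missed.
no_lt = re.compile(r'<\s*script[^>]*>[^<]*<\s*/\s*script\s*>', flags)
# Greedy .* runs from the first opening tag to the last closing tag.
greedy = re.compile(r'<\s*script[^>]*>.*<\s*/\s*script\s*>', flags)
# Non-greedy .*? stops at the nearest closing tag, one block at a time.
lazy = re.compile(r'<\s*script[^>]*>.*?<\s*/\s*script\s*>', flags)

print(no_lt.sub('', sample))   # a<script>if (x < 1) { go(); }</script>bc
print(greedy.sub('', sample))  # ac
print(lazy.sub('', sample))    # abc
```

Only the non-greedy version removes both script blocks while keeping the text between them.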
Remove the JavaScript code:
text = re_script.sub(' ', text)
re_script.sub() replaces every match of the pattern with the given string, which removes the script block. Run this Python script and you will find that the script has been removed. The result is:
Output:
this is a script test.

test is end.
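The question says that no module may be imported, and re has to be imported before it can be used. If that restriction applies, the same idea can be written with built-in string methods only. The following is a minimal sketch, not a definitive solution: the function names is_script_tag, remove_scripts and remove_tags are hypothetical, any tag whose name starts with "script" is treated as a script tag, and a script block that is never closed is assumed to run to the end of the text.

```python
def is_script_tag(inner, closing):
    """Return True if the text between '<' and '>' names a script tag.

    Allows whitespace around the name and, for a closing tag, around
    the '/'. Letter case is ignored.
    """
    name = inner.strip().lower()
    if closing:
        if not name.startswith('/'):
            return False
        name = name[1:].strip()
    return name.startswith('script')


def remove_scripts(text):
    """Drop everything from each opening script tag through its closing tag."""
    kept = []          # pieces of text that survive
    pos = 0            # current scan position
    in_script = False  # True while between an opening and a closing script tag
    while True:
        start = text.find('<', pos)
        end = text.find('>', start) if start != -1 else -1
        if end == -1:
            # No complete tag left: keep the tail unless a script is still open
            # (assumption: an unclosed script swallows the rest of the text).
            if not in_script:
                kept.append(text[pos:])
            break
        inner = text[start + 1:end]
        if is_script_tag(inner, closing=in_script):
            if not in_script:
                kept.append(text[pos:start])
            in_script = not in_script
            pos = end + 1
        else:
            # Not the tag we are looking for: step past this '<' and rescan,
            # so a '<' inside JavaScript code cannot hide the closing tag.
            if not in_script:
                kept.append(text[pos:start + 1])
            pos = start + 1
    return ''.join(kept)


def remove_tags(text):
    """Drop any text between a '<' and the next '>'."""
    kept = []
    pos = 0
    while True:
        start = text.find('<', pos)
        end = text.find('>', start) if start != -1 else -1
        if end == -1:
            kept.append(text[pos:])
            break
        kept.append(text[pos:start])
        pos = end + 1
    return ''.join(kept)


# Hypothetical input: scripts are removed first, then the remaining tags.
html = "<p>Hi</p><script>if (a < b) { c(); }</script><b>ok</b>"
print(remove_tags(remove_scripts(html)))  # Hiok
```

Calling remove_scripts before remove_tags follows the order given in the question: if the tags were stripped first, the JavaScript code between them would be left behind as plain text.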