A CDATA section lets us put text containing markup characters such as < and & into an XML document without escaping them. But CDATA is still limited to the characters that XML allows. Any character outside the permitted range causes the parser to fail, usually with an exception whose message reads “An invalid XML character (Unicode: 0x1a) was found in the CDATA section”.
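
To see the failure for yourself, here is a minimal sketch that parses one document with a well-formed CDATA section and one that contains the 0x1A character. The class name InvalidCdataDemo and the helper parses are made up for this illustration, and the exact wording of the error message depends on the parser implementation bundled with your JDK.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original post.
public class InvalidCdataDemo {

    // Returns true if the SAX parser accepts the document and false if it
    // rejects it with a SAXParseException.
    public static boolean parses(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXParseException e) {
            // The message text is implementation specific.
            System.out.println("Rejected: " + e.getMessage());
            return false;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Markup characters are fine inside CDATA.
        String good = "<root><![CDATA[a < b && c > d]]></root>";
        // The control character 0x1A is not a legal XML 1.0 character, even in CDATA.
        String bad = "<root><![CDATA[before" + (char) 0x1A + "after]]></root>";

        System.out.println("good parses: " + parses(good));
        System.out.println("bad parses:  " + parses(bad));
    }
}
```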

Ideally, one should find out why such data ends up in the XML in the first place, because XML is not designed to carry raw binary characters, even in a CDATA section. Such data should be encoded with a scheme like Base64 before it is written. Still, there are cases where you just want to strip these characters before parsing. Whether you use a SAX or a DOM parser, you need to feed it XML that is free of them.
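
If you control the code that produces the XML, the Base64 route could look roughly like the sketch below. It uses java.util.Base64 from the JDK (Java 8 and later). The class name, the payload element and its encoding attribute are only assumptions for the example, not something the XML format prescribes.

```java
import java.util.Arrays;
import java.util.Base64;

// Hypothetical demo class, not part of the original post.
public class Base64PayloadDemo {

    // Encodes raw bytes so that they can be embedded safely as element text.
    // The Base64 alphabet (A-Z, a-z, 0-9, +, / and =) contains only legal XML characters.
    public static String encode(byte[] payload) {
        return Base64.getEncoder().encodeToString(payload);
    }

    // Decodes the element text back to the original bytes on the reading side.
    public static byte[] decode(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        // Bytes that would be illegal if written into the XML as characters.
        byte[] raw = {0x1A, 0x00, 0x41};

        String xml = "<payload encoding=\"base64\">" + encode(raw) + "</payload>";
        System.out.println(xml);

        // The round trip restores the original bytes.
        System.out.println(Arrays.equals(raw, decode(encode(raw))));
    }
}
```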

Neither SAX nor DOM parsers can skip these characters on their own, so you need some mechanism to strip them off first. The following Java code shows how to remove the invalid characters from an XML file and then pass the cleaned string to a parser:


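import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public static void main(String[] args) throws ParserConfigurationException, SAXException, IOException {

	// SAX callbacks; the bodies are elided here.
	DefaultHandler handler = new DefaultHandler() {

		@Override
		public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
			// ......
		}

		@Override
		public void endElement(String uri, String localName, String qName) throws SAXException {
			// ......
		}

		@Override
		public void characters(char[] ch, int start, int length) throws SAXException {
			// ......
		}
	};

	// Create the SAX parser that will consume the cleaned content.
	SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();

	// Read the whole file into memory. FileReader decodes with the platform default charset.
	File f = new File("c:\\jexp.xml");
	StringBuilder contents = new StringBuilder();
	try (BufferedReader br = new BufferedReader(new FileReader(f))) {
		String line;
		while ((line = br.readLine()) != null) {
			contents.append(line).append('\n');
		}
	}

	// Strip the invalid characters, then hand the result to the parser as a character stream.
	String strippedcontents = stripNonValidXMLCharacters(contents.toString());
	InputSource is = new InputSource(new StringReader(strippedcontents));
	saxParser.parse(is, handler);
}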

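public static String stripNonValidXMLCharacters(String in) {
	if (in == null || in.isEmpty()) return "";

	StringBuilder out = new StringBuilder(in.length());
	// Walk the string by code point rather than by char, so that characters
	// above U+FFFF (stored as surrogate pairs) are range-checked as a whole.
	for (int i = 0; i < in.length(); ) {
		int current = in.codePointAt(i);
		i += Character.charCount(current);
		// Keep only the characters allowed by the XML 1.0 Char production.
		if ((current == 0x9) ||
			(current == 0xA) ||
			(current == 0xD) ||
			((current >= 0x20) && (current <= 0xD7FF)) ||
			((current >= 0xE000) && (current <= 0xFFFD)) ||
			((current >= 0x10000) && (current <= 0x10FFFF))) {
			out.appendCodePoint(current);
		}
	}
	return out.toString();
}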
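
As a side note, the same range check can also be written as a single regular expression. This is just an alternative sketch, not the approach used in the code above. The class name XmlCharFilter is made up for the example. It relies on java.util.regex matching by code point and on the \x{...} escape, which needs Java 7 or later.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original post.
public class XmlCharFilter {

    // Matches every code point that is NOT allowed by the XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    private static final Pattern INVALID = Pattern.compile(
            "[^\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    // Removes all invalid characters and returns the cleaned string.
    public static String strip(String in) {
        if (in == null || in.isEmpty()) {
            return "";
        }
        return INVALID.matcher(in).replaceAll("");
    }

    public static void main(String[] args) {
        String dirty = "<a><![CDATA[before" + (char) 0x1A + "after]]></a>";
        // The 0x1A character is dropped, everything else is kept.
        System.out.println(strip(dirty));
    }
}
```

The compiled pattern is kept in a static field so that it is built only once, which matters if you clean many documents.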

In the above code, we first read the file and drop every invalid XML character, collecting the valid ones in a String. The parse methods of the parsers do not accept XML content as a String; the overload that takes a String treats it as the URI of the document to load. So we need to either write the string back to a file or create a stream from the String. The above code wraps a StringReader in an InputSource and passes that to the parse method of the SAX parser.
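
The same trick works for a DOM parser. Here is a small sketch, assuming the string has already been cleaned. The class name DomFromStringDemo and the helper rootName are hypothetical and only there to keep the example short.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class, not part of the original post.
public class DomFromStringDemo {

    // Parses XML held in a String by wrapping it in a StringReader and an InputSource.
    // Passing the String directly to parse(String) would treat it as a URI instead.
    public static Document parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Convenience helper: returns the tag name of the document's root element.
    public static String rootName(String xml) {
        return parse(xml).getDocumentElement().getTagName();
    }

    public static void main(String[] args) {
        String cleaned = "<root><item>x</item></root>";
        System.out.println(rootName(cleaned));
    }
}
```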
