Thursday, December 3, 2009

Reading a PDF file content using iTextSharp

Hi there,
iTextSharp has made reading a PDF so easy that a single simple function is enough to read the content of a PDF file. We just need to reference the iTextSharp DLL, which can be easily downloaded.


Using the following function we can read the PDF file by specifying the page numbers. Here is the code.



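using System;
using System.Windows.Forms;
using iTextSharp.text.pdf;

// Open the file once to find out how many pages it has.
string filename = "FileName";
PdfReader reader = new PdfReader(filename);
int intPageCount = reader.NumberOfPages;
reader.Close();

// Read every page, from the first to the last.
string strContent = ParsePdfText(filename, 1, intPageCount);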



public string ParsePdfText(string sourcePDF, int fromPageNum, int toPageNum)
{
    System.Text.StringBuilder sb = new System.Text.StringBuilder();
    PdfReader reader = null;
    try
    {
        reader = new PdfReader(sourcePDF);
        for (int i = fromPageNum; i <= toPageNum; i++)
        {
            // Raw content stream of page i (page numbers start at 1).
            byte[] pageBytes = reader.GetPageContent(i);
            if (pageBytes == null)
            {
                continue;
            }

            PRTokeniser token = new PRTokeniser(pageBytes);
            while (token.NextToken())
            {
                int tknType = token.TokenType;
                string tknValue = token.StringValue;
                if (tknType == PRTokeniser.TK_STRING)
                {
                    // A text string shown on the page.
                    sb.Append(tknValue);
                }
                else if (tknType == 1 && tknValue == "-600")
                {
                    // Type 1 is a number token; -600 is treated here as a word gap.
                    sb.Append(" ");
                }
                else if (tknType == 10 && tknValue == "TJ")
                {
                    // Type 10 is an operator token; TJ closes a show-text array.
                    sb.Append(" ");
                }
            }
        }
    }
    catch (Exception ex)
    {
        MessageBox.Show("Exception occurred. " + ex.Message);
        return string.Empty;
    }
    finally
    {
        // Release the file handle whether or not parsing succeeded.
        if (reader != null)
        {
            reader.Close();
        }
    }
    return sb.ToString();
}

The variable strContent will contain all the text present in the PDF file. Using the start page number and the end page number we can specify the range of pages from which we want to read the content.
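
If the page numbers come from user input, it may help to check them against the page count before calling ParsePdfText. The following is only a minimal sketch: PageRange and Clamp are hypothetical names of my own and are not part of iTextSharp, and the choice to clamp rather than reject an oversized range is an assumption.

```csharp
using System;

public static class PageRange
{
    // Hypothetical helper, not part of iTextSharp.
    // Clamps a requested 1-based page range to [1, pageCount] and
    // returns it as a two-element array: { firstPage, lastPage }.
    public static int[] Clamp(int fromPage, int toPage, int pageCount)
    {
        if (pageCount < 1)
        {
            throw new ArgumentOutOfRangeException("pageCount", "The document has no pages.");
        }

        int first = Math.Max(1, fromPage);
        int last = Math.Min(pageCount, toPage);

        if (first > last)
        {
            throw new ArgumentException("The requested range contains no pages.");
        }

        return new int[] { first, last };
    }
}
```

For a 10-page file, PageRange.Clamp(8, 15, 10) gives pages 8 and 10, and the two values could then be passed on as ParsePdfText(filename, range[0], range[1]).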

Isn't that easy? Waiting for all your comments.

Thanks
Anil Kumar Pandey
System Architect, MVP
Mumbai, Maharashtra
