
Thursday, December 3, 2009

Reading a PDF file content using iTextSharp

Hi there,
iTextSharp makes reading PDF files so easy that a single simple function is enough to read the content of a PDF file. We just need to reference the iTextSharp DLL, which can be easily downloaded from the following sites.


Using the following function, we can read the PDF file by specifying the page numbers. Here is the code.



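using System;
using System.Windows.Forms;
using iTextSharp.text.pdf;


string filename = "FileName";
// Open the file once to find out how many pages it has.
PdfReader reader = new PdfReader(filename);
int intPageCount = reader.NumberOfPages;
reader.Close();
// Read every page, from page 1 to the last page.
string strContent = ParsePdfText(filename, 1, intPageCount);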



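// Written against the int TK_* constants of PRTokeniser. Newer iTextSharp
// builds appear to expose a PRTokeniser.TokType enum instead.
public string ParsePdfText(string sourcePDF, int fromPageNum, int toPageNum)
{
    System.Text.StringBuilder sb = new System.Text.StringBuilder();
    PdfReader reader = null;
    try
    {
        reader = new PdfReader(sourcePDF);
        for (int i = fromPageNum; i <= toPageNum; i++)
        {
            // Raw content stream of page i (pages are numbered from 1).
            byte[] pageBytes = reader.GetPageContent(i);
            if (pageBytes == null)
            {
                continue;
            }
            PRTokeniser token = new PRTokeniser(pageBytes);
            while (token.NextToken())
            {
                int tknType = token.TokenType;
                string tknValue = token.StringValue;
                if (tknType == PRTokeniser.TK_STRING)
                {
                    // Literal text shown on the page.
                    sb.Append(tknValue);
                }
                else if (tknType == PRTokeniser.TK_NUMBER && tknValue == "-600")
                {
                    // A -600 adjustment is treated as a gap between words.
                    sb.Append(" ");
                }
                else if (tknType == PRTokeniser.TK_OTHER && tknValue == "TJ")
                {
                    // End of a TJ operator: separate it from the next run of text.
                    sb.Append(" ");
                }
            }
        }
    }
    catch (Exception ex)
    {
        MessageBox.Show("Exception occurred. " + ex.Message);
        return string.Empty;
    }
    finally
    {
        // Release the file whether or not parsing succeeded.
        if (reader != null)
        {
            reader.Close();
        }
    }
    return sb.ToString();
}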
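
To see the three token rules in ParsePdfText in isolation, here is a minimal sketch that does not use iTextSharp at all. The names TokenKind, Token and TokenTextSketch are made up for this illustration; they only mimic the string, number and operator tokens that the tokeniser hands back.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-ins for the token types used in ParsePdfText.
public enum TokenKind { Number, String, Other }

public sealed class Token
{
    public readonly TokenKind Kind;
    public readonly string Value;

    public Token(TokenKind kind, string value)
    {
        Kind = kind;
        Value = value;
    }
}

public static class TokenTextSketch
{
    // Applies the same three rules as the loop in ParsePdfText.
    public static string JoinTokens(IEnumerable<Token> tokens)
    {
        StringBuilder sb = new StringBuilder();
        foreach (Token t in tokens)
        {
            if (t.Kind == TokenKind.String)
            {
                sb.Append(t.Value);   // literal text
            }
            else if (t.Kind == TokenKind.Number && t.Value == "-600")
            {
                sb.Append(" ");       // adjustment treated as a word gap
            }
            else if (t.Kind == TokenKind.Other && t.Value == "TJ")
            {
                sb.Append(" ");       // end of a TJ operator
            }
            // Every other token (brackets, other operators) is ignored.
        }
        return sb.ToString();
    }
}
```

For the tokens of a content stream fragment such as [(Hello) -600 (World)] TJ, JoinTokens returns "Hello World " (with a trailing space). Any other spacing value, for example -250, adds nothing, which may be one reason the extracted text sometimes runs words together.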

The variable strContent will contain all the text present in the PDF file. Using the starting page number and the end page number, we can specify the range of pages whose content we want to read.
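
Since the page numbers come from the caller, it can help to keep them inside the document before reading. The following is only a sketch under that assumption; PageRangeSketch and TryClamp are hypothetical names and are not part of iTextSharp.

```csharp
using System;

public static class PageRangeSketch
{
    // Clamps the requested range to 1..pageCount.
    // Returns false when no page is left to read.
    public static bool TryClamp(int fromPage, int toPage, int pageCount,
                                out int first, out int last)
    {
        first = Math.Max(1, fromPage);       // pages are numbered from 1
        last = Math.Min(pageCount, toPage);  // never read past the last page
        return first <= last;
    }
}
```

With a five page file, a request for pages 0 to 99 is narrowed to pages 1 to 5, and the narrowed values can then be passed to ParsePdfText.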

Isn't that easy? Waiting for all your comments.

Thanks
Anil Kumar Pandey
System Architect, MVP
Mumbai, Maharashtra

11 comments:

  1. Hi, I am getting the error
    "ParsePdfText does not exist in the current context", even though I included the namespace.

  2. The code is hidden, I guess. Let me paste it again.

  3. Thank you, but I am still getting 2 errors:
    1. token.TokenType (a type-casting error; I cast TokenType to int, so that problem is solved)
    2. PRTokeniser.TK_STRING

    (PRTokeniser does not contain TK_STRING)

    Please let me know the solution for these errors.

    Thank you

  4. Thank you, but I am still getting 2 errors:
    1. token.TokenType (a type-casting error; I cast TokenType to int, so that problem is solved)
    2. PRTokeniser.TK_STRING

    (PRTokeniser does not contain TK_STRING)

    If I remove those if conditions, some extra text gets added before each new line.

    Please let me know the solution for these errors.

    Thank you

  5. Hi,

    Your code works fine.

    But the output I am getting is not readable.

  6. You might be passing the wrong number of pages.
    Please debug the solution and check!

  7. Dear Anil Kumar,
    Please help with Harikrishna's error. I am getting the same error even after several attempts to debug. Thank you.

  8. You might not be using the correct namespace, as there is no issue in the code. Please try adding the namespace correctly.

  9. I use Aspose.PDF for .NET for reading PDF files and also for managing my PDF files. I have not had any issues with this library, and I am very satisfied with its results.

    Replies
    1. Yes, Aspose is one of the commonly used tools for PDF management, apart from Adobe itself.

