.Net Helper: Reading a PDF file content using iTextSharp

Thursday, December 3, 2009

Reading a PDF file content using iTextSharp

Hi there,

iTextSharp makes reading a PDF so easy that a single simple function can read the content of a PDF file. We just need to reference the iTextSharp DLL, which can be downloaded from the following sites.

SourceForge

WebScript

Using the following function, we can read the PDF file by specifying the page numbers. Here is the code.

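using System;
using System.Windows.Forms; // MessageBox
using iTextSharp.text.pdf;

string filename = "FileName"; // path to the PDF file
PdfReader reader = new PdfReader(filename);
int intPageCount = reader.NumberOfPages; // total pages in the document
reader.Close(); // ParsePdfText opens its own reader
string strContent = ParsePdfText(filename, 1, intPageCount); // read pages 1..last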

public string ParsePdfText(string sourcePDF, int fromPageNum, int toPageNum)
{
    System.Text.StringBuilder sb = new System.Text.StringBuilder();
    PdfReader reader = null;
    try
    {
        reader = new PdfReader(sourcePDF);
        for (int i = fromPageNum; i <= toPageNum; i++)
        {
            // Raw content stream of page i (page numbers start at 1).
            byte[] pageBytes = reader.GetPageContent(i);
            if (pageBytes == null)
            {
                continue;
            }

            PRTokeniser token = new PRTokeniser(pageBytes);
            while (token.NextToken())
            {
                int tknType = token.TokenType;
                string tknValue = token.StringValue;
                if (tknType == PRTokeniser.TK_STRING)
                {
                    // String tokens hold the text that is drawn on the page.
                    sb.Append(tknValue);
                }
                else if (tknType == PRTokeniser.TK_NUMBER && tknValue == "-600")
                {
                    // Heuristic: a -600 adjustment between strings is treated as a word gap.
                    sb.Append(" ");
                }
                else if (tknType == PRTokeniser.TK_OTHER && tknValue == "TJ")
                {
                    // End of a TJ (show text) operator: separate it from the next run.
                    sb.Append(" ");
                }
            }
        }
    }
    catch (Exception ex)
    {
        MessageBox.Show("Exception occurred. " + ex.Message);
        return string.Empty;
    }
    finally
    {
        // Release the file handle even if parsing failed.
        if (reader != null) reader.Close();
    }
    return sb.ToString();
}
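
A note for readers on newer builds: the function above uses the integer token constants (such as PRTokeniser.TK_STRING) from the iTextSharp release that was current when this post was written. As far as I know, the 5.x releases changed the token type to an enum, so those constants may not compile there. Here is a minimal sketch of an alternative, assuming a recent iTextSharp 5.x release that ships the iTextSharp.text.pdf.parser namespace. The class name PdfTextSketch and the method name ExtractText are made up for illustration.

```csharp
using System;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class PdfTextSketch
{
    // Hypothetical helper; assumes iTextSharp 5.x is referenced by the project.
    public static string ExtractText(string sourcePdf, int fromPageNum, int toPageNum)
    {
        StringBuilder sb = new StringBuilder();
        PdfReader reader = new PdfReader(sourcePdf);
        try
        {
            for (int page = fromPageNum; page <= toPageNum; page++)
            {
                // Let the library's own text extractor interpret the page content.
                sb.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page));
            }
        }
        finally
        {
            // Release the file handle even if extraction failed.
            reader.Close();
        }
        return sb.ToString();
    }
}
```

It would be called the same way as before, for example PdfTextSketch.ExtractText(filename, 1, intPageCount). Unlike ParsePdfText, this sketch lets exceptions reach the caller instead of showing a message box.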

In the calling code above, the variable strContent will contain all the text extracted from the PDF file. The starting and ending page numbers let us specify the range of pages whose content we want to read.
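
The page range is worth checking before the call, since page numbers start at 1 and cannot exceed the page count. Here is a small sketch with a hypothetical helper named ClampRange, which is not part of iTextSharp:

```csharp
using System;

public static class PageRangeSketch
{
    // Hypothetical helper: clamps a requested range to the pages that exist.
    public static void ClampRange(int pageCount, ref int firstPage, ref int lastPage)
    {
        firstPage = Math.Max(1, firstPage);
        lastPage = Math.Min(pageCount, lastPage);
    }

    public static void Main()
    {
        int firstPage = 0;
        int lastPage = 12;
        // Pretend the document has 10 pages.
        ClampRange(10, ref firstPage, ref lastPage);
        Console.WriteLine(firstPage + "-" + lastPage); // prints "1-10"
    }
}
```

The clamped values can then be passed to ParsePdfText as fromPageNum and toPageNum.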

Isn't that easy? I look forward to your comments.

Thanks
Anil Kumar Pandey
System Architect, MVP
Mumbai, Maharashtra

11 comments:

Anonymous, April 8, 2010 at 2:44 PM
Hi, I am getting the error "parsepdftext does not contain in the current context", even though I included the namespace.

Unknown, April 8, 2010 at 2:51 PM
The code was hidden, I guess. Let me paste it again.

Harikrishna, April 8, 2010 at 3:26 PM
Thank you, but I am still getting two errors:
1. token.TokenType (a type-casting error; I cast TokenType to int, which solved the problem)
2. PRTokeniser.TK_STRING (PRTokeniser does not contain TK_STRING)

Please let me know the solution for these errors.

Thank you

Harikrishna, April 8, 2010 at 3:36 PM
Thank you, but I am still getting two errors:
1. token.TokenType (a type-casting error; I cast TokenType to int, which solved the problem)
2. PRTokeniser.TK_STRING (PRTokeniser does not contain TK_STRING)

If I remove those if conditions, some text gets added before each new line.

Please let me know the solution for these errors.

Thank you

Muthukumar, September 16, 2010 at 10:56 PM
Hi,

Your code works fine.

But the output I am getting is not readable.

Unknown, September 17, 2010 at 8:11 PM
You might be passing the wrong number of pages.
Please debug the solution and check!

confused, December 27, 2012 at 2:21 PM
Dear Anil Kumar,
Please help with Harikrishna's error. I am getting the same error even after several attempts to debug. Thank you.

Unknown, January 15, 2013 at 11:42 PM
You might not be using the correct namespace, as there is no issue in the code. Please try adding the namespace correctly.

Zarfishan, October 25, 2013 at 3:24 PM
I use Aspose.PDF for .NET for reading PDF files and also for managing my PDF files. I have not had any issues with this library, and I am very satisfied with its results.
