Saturday, July 20, 2013

iTextSharp: read a PDF file

Referred URL - http://www.codeproject.com/Questions/341142/itextsharp-read-pdf-file

What do you mean by "read the PDF file"? I'm not kidding by asking this, because it's important to understand that a PDF file isn't a structured document. A PDF stores instructions for drawing text and graphics at positions on a page, not paragraphs or headings, so you can't retrieve a paragraph, for instance, just by reading some strings. Also, do you want to include image data as well? About the best that you can do is extract the raw text page by page, something like this:
using System.IO;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public string ParsePdf(string fileName)
{
  if (!File.Exists(fileName))
    throw new FileNotFoundException("PDF file not found.", fileName);

  using (PdfReader reader = new PdfReader(fileName))
  {
    StringBuilder sb = new StringBuilder();

    // iTextSharp page numbers are 1-based.
    for (int page = 1; page <= reader.NumberOfPages; page++)
    {
      // Create a fresh strategy for each page; a reused
      // SimpleTextExtractionStrategy keeps the text of earlier pages.
      ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
      string text = PdfTextExtractor.GetTextFromPage(reader, page, strategy);
      if (!string.IsNullOrWhiteSpace(text))
      {
        // The extracted text is already a .NET string, so no encoding
        // round trip is needed. Each page ends with a line break.
        sb.AppendLine(text);
      }
    }

    return sb.ToString();
  }
}
