• 0 Posts
  • 276 Comments
Joined 2 years ago
Cake day: July 9th, 2023


  • This depends on what you are actually looking for, and how you are looking for it.

    Do you really need pattern matching, or are you only looking for fixed strings? If so, other tools may be faster.

    If you need a case-insensitive search on a data set with mixed upper- and lowercase, make a copy that is all uppercase or all lowercase, and search there.

    If you only search in certain columns, make a copy that only includes these.

    Or import the data into a database.
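
A minimal sketch of those suggestions in Python, assuming the data is a UTF-8 CSV file with a header row. The function names, the file layout, and the table name `data` are made up for illustration, not taken from the original question.

```python
import csv
import sqlite3


def make_search_copy(src_path, dst_path, keep_columns):
    """Write a lowercased copy of src_path that contains only keep_columns."""
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=keep_columns, lineterminator="\n")
        writer.writeheader()
        for row in reader:
            writer.writerow({col: row[col].lower() for col in keep_columns})


def find_fixed(path, needle):
    """Plain substring search on the prepared copy; no regex engine involved."""
    needle = needle.lower()
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f if needle in line]


def import_to_sqlite(csv_path, db_path, table="data"):
    """Load the CSV into an SQLite table whose columns are named after the header."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        cols = ", ".join(f'"{name}"' for name in header)
        marks = ", ".join("?" for _ in header)
        con = sqlite3.connect(db_path)
        with con:  # commits on success, rolls back on error
            con.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
            con.executemany(f'INSERT INTO "{table}" VALUES ({marks})', reader)
    return con
```

The copy is made once, so the cost of lowercasing and dropping columns is not paid again on every search. Whether the database route pays off depends on how often you query and whether you add indexes.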










  • The Iranians would have been terminally stupid if they hadn’t moved out anything that’s not bolted down (and even some that is) from the known locations in the days before the attacks. The IDF was openly demanding that the US bomb those sites, so they knew they were in the crosshairs. Even if they only wrapped the stuff up and buried it in the sand somewhere.

    The US might have damaged the locations, but the idea that they damaged the program in any significant way is doubtful. On the contrary, Iran now has irrefutable proof that the US does not even care about its own intelligence services' report that Iran had given up (or at least was not actively working on) the bomb. Now they have the incentive to actually build it so they can use it as a deterrent and, if needed, in self-defence.









  • The problem lies in the PDFs themselves. They contain objects that represent lines of glyphs, if you are lucky. A conversion tool can guess which of those lines belong together and produce the text.

    It cannot know any intentions behind it, though. Take a numbered list. The first line is two line objects: the number plus the . or the ), and the first line of text. The conversion tool can now guess. As the line blocks with the numbers are all left of the line blocks with text, this could be a numbered list. Or it could be a table with two columns. Nothing in the PDF is giving any hints.

    And that is the easy part. This assumes that the document either uses default fonts, or keeps its embedded fonts untouched. If it uses embedded fonts and was run through a PDF optimizer that embeds only the used characters and renumbers them, any copy or conversion tool is bound to fail.

    Same with protected PDFs, where you cannot copy the text in the first place.

    And then there are PDFs that just consist of scanned pages. Here you would need OCR software to get something readable out of them.

    PDF is an archival output format, the end of a process. Not something to work from.

    Always preserve the original file. Keep it safe. If you change tools, make sure you have a conversion path into something editable. The PDF is for giving away, nothing else.
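
To make the guessing for the numbered list concrete, here is a rough sketch in Python. It does not read a real PDF: `LineObject` is a hypothetical stand-in for one positioned run of text, and the marker pattern and the tolerance value are assumptions chosen for illustration only.

```python
import re
from dataclasses import dataclass


@dataclass
class LineObject:
    """Hypothetical stand-in for one positioned line of glyphs on a page."""
    x: float   # left edge
    y: float   # baseline; PDF coordinates grow upwards
    text: str


# Assumed marker shapes: "1." "2)" "a." and so on.
MARKER = re.compile(r"^(\d+|[a-zA-Z])[.)]$")


def group_rows(objects, y_tolerance=2.0):
    """Group objects with nearly equal baselines into rows, top to bottom."""
    rows = []
    for obj in sorted(objects, key=lambda o: (-o.y, o.x)):
        if rows and abs(rows[-1][0].y - obj.y) <= y_tolerance:
            rows[-1].append(obj)
        else:
            rows.append([obj])
    return [sorted(row, key=lambda o: o.x) for row in rows]


def guess_structure(objects):
    """Guess what rows made of exactly two objects are meant to be."""
    rows = [row for row in group_rows(objects) if len(row) == 2]
    if not rows:
        return "plain text"
    if all(MARKER.match(row[0].text.strip()) for row in rows):
        return "numbered list (guess)"
    return "two-column table (guess)"


page = [
    LineObject(72, 700, "1."), LineObject(90, 700, "First item"),
    LineObject(72, 686, "2."), LineObject(90, 686, "Second item"),
]
print(guess_structure(page))
```

A table whose first column happens to contain "1." and "2." gets exactly the same answer as a real list, which is the point: the positions are all the tool has to go on.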