How is your unstructured data processed?
DATA INGESTION
Unstructured documents come in a vast range of formats and layouts. The files are commonly in .pdf, .docx, .xlsx, .pptx, .las and .segy formats.
The files are ingested through a sequential pipeline of workflows that use machine learning techniques.
The workflow for automatically extracting information from the documents starts with a set of heuristic algorithms that identify blocks (segments) within a document. Supervised machine learning is then used to classify each segment as either text or non-text.
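The two steps above can be illustrated with a minimal sketch. It assumes a page is available as a binary grid (1 = ink, 0 = background), uses a blank-row heuristic to find blocks, and uses a nearest-centroid classifier on ink density as a stand-in for the supervised model. The function names, the single feature, and the training values are all hypothetical and are not taken from the actual pipeline.

```python
from typing import Dict, List, Tuple

Page = List[List[int]]  # binary page image: 1 = ink, 0 = background


def segment_rows(page: Page, min_gap: int = 2) -> List[Tuple[int, int]]:
    """Heuristic segmentation: split a page into horizontal blocks that are
    separated by at least `min_gap` blank rows.
    Returns (start_row, end_row_exclusive) pairs."""
    blocks = []
    start = None  # first row of the block being built, if any
    gap = 0       # consecutive blank rows seen since the last inked row
    for y, row in enumerate(page):
        if any(row):
            if start is None:
                start = y
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                blocks.append((start, y - gap + 1))
                start, gap = None, 0
    if start is not None:
        blocks.append((start, len(page) - gap))
    return blocks


def ink_density(page: Page, block: Tuple[int, int]) -> float:
    """Fraction of inked pixels in a block (the only feature in this sketch)."""
    rows = page[block[0]:block[1]]
    return sum(map(sum, rows)) / sum(len(r) for r in rows)


def train_centroids(samples: List[Tuple[float, str]]) -> Dict[str, float]:
    """Supervised step: average the feature value per label."""
    sums: Dict[str, Tuple[float, int]] = {}
    for feature, label in samples:
        total, count = sums.get(label, (0.0, 0))
        sums[label] = (total + feature, count + 1)
    return {label: total / count for label, (total, count) in sums.items()}


def classify(feature: float, centroids: Dict[str, float]) -> str:
    """Assign the label whose centroid is closest to the feature value."""
    return min(centroids, key=lambda label: abs(feature - centroids[label]))


# Made-up training data: text blocks are sparse, image blocks are dense.
centroids = train_centroids(
    [(0.2, "text"), (0.3, "text"), (0.8, "non-text"), (0.9, "non-text")]
)

# Toy page: two sparse rows, two blank rows, two dense rows.
page = [
    [1, 0, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0],
    [0] * 8,
    [0] * 8,
    [1] * 8,
    [1] * 8,
]
labels = [classify(ink_density(page, b), centroids) for b in segment_rows(page)]
```

A production system would use richer layout features and a trained model. The sketch only shows the order of operations: segment first, then classify each segment.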
Text
Optical Character Recognition (OCR) is applied to the text segments to convert them into editable text. Named-Entity Recognition (NER) and Pattern-Based Recognition (PBR) techniques are then applied to the OCR results to extract metadata. For a well report, for example, this includes the well name, kelly bushing, spud date, and contractors.
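As a rough illustration of the pattern-based step, the sketch below runs regular expressions over OCR output to pull out the fields named above. The field labels, the patterns, the date and unit formats, and the sample text are all assumptions made for this example. Real well reports vary widely, and the NER step is not shown.

```python
import re
from typing import Dict, Optional

# Hypothetical patterns; each captures the value that follows a field label.
PATTERNS = {
    "well_name": re.compile(r"Well\s*Name\s*[:\-]\s*(.+)", re.IGNORECASE),
    "kelly_bushing_m": re.compile(
        r"\bKB(?:\s*Elevation)?\s*[:\-]\s*([\d.]+)\s*m\b", re.IGNORECASE
    ),
    "spud_date": re.compile(
        r"Spud\s*Date\s*[:\-]\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE
    ),
    "contractor": re.compile(r"Contractor\s*[:\-]\s*(.+)", re.IGNORECASE),
}


def extract_metadata(ocr_text: str) -> Dict[str, Optional[str]]:
    """Return the first match for each field, or None when nothing matches."""
    result: Dict[str, Optional[str]] = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(ocr_text)
        result[field] = match.group(1).strip() if match else None
    return result


# Made-up OCR output, for illustration only.
sample = """WELL NAME: Example-1
KB Elevation: 25.3 m
Spud Date: 1998-04-17
Drilling Contractor: Example Drilling Ltd
"""
metadata = extract_metadata(sample)
```

Fields that do not match come back as `None`, which makes it easy to hand gaps to a different technique, such as NER, or to manual review.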
Images and Tables
On a separate data pipeline, non-text components such as images and tables are tagged. Deep convolutional neural networks (DCNN) learn to classify the different image types automatically, including seismic images, stratigraphic charts, maps, cores, drawings, and tables, so that the images can be aggregated per type.
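The aggregation step can be sketched as follows, assuming the classifier has already produced one predicted label per image. The image identifiers and labels below are invented for the example, and the DCNN itself is not shown.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def aggregate_by_type(
    predictions: Iterable[Tuple[str, str]]
) -> Dict[str, List[str]]:
    """Group image identifiers by predicted type, preserving input order."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for image_id, label in predictions:
        groups[label].append(image_id)
    return dict(groups)


# Hypothetical classifier output: (image identifier, predicted type).
predictions = [
    ("report_a_p3_img1", "seismic image"),
    ("report_a_p7_img1", "map"),
    ("report_b_p2_img1", "seismic image"),
    ("report_b_p9_img1", "table"),
]
by_type = aggregate_by_type(predictions)
```

With this grouping, all images of one type, for example every seismic image across a set of reports, can be retrieved with a single lookup.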
