Structured data vs. unstructured data

Any explanation of structured data quickly leads to its counterpart, unstructured data. Examples of unstructured data include analogue or digital text documents, audio files, videos, and images. Such content contains relevant data such as personal names, locations, or quantities, but in a ‘free’, unspecified form.

The challenge with such data is that it is hard to organise or process further. Only structured data can be managed and used efficiently. This is especially true for electronic data processing and Internet applications. Online shops, news portals, weather services and sports sites process tremendous amounts of information. Such applications typically rely on data presented in a defined form, most commonly as tables with columns and rows.

SQL (structured query language)

While an Excel spreadsheet is enough for small, manageable datasets, databases are needed to organise large amounts of information. The database language SQL has established itself for the administration of structured data. SQL makes it possible to store, search, add, update, and delete data in datasets of any size.

The syntax of the database language is comparatively intuitive, and query commands like ‘SELECT’, ‘FROM’, and ‘ORDER BY’ are taken from the English language. SQL can also be used from within other programming languages, including C, C++, COBOL, Ada, Java, and C#.
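The operations named above can be sketched in a few lines. This is a minimal illustration that uses Python's built-in sqlite3 module as the host language, which is only one of many ways to run SQL. The products table and its rows are invented for the example.

```python
import sqlite3

# Hypothetical product table for a small online shop; all names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL)")

# Store and add rows.
conn.executemany(
    "INSERT INTO products (name, price) VALUES (?, ?)",
    [("Umbrella", 12.5), ("Raincoat", 49.0), ("Boots", 35.0)],
)

# Update one row and delete another.
conn.execute("UPDATE products SET price = ? WHERE name = ?", (45.0, "Raincoat"))
conn.execute("DELETE FROM products WHERE name = ?", ("Boots",))

# Search: SELECT ... FROM ... ORDER BY reads close to plain English.
rows = conn.execute("SELECT name, price FROM products ORDER BY price").fetchall()
print(rows)
conn.close()
```

The `?` placeholders pass values separately from the SQL text, which is the usual way to avoid building queries by string concatenation.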

Converting unstructured data

It is estimated that 85-90% of data available online is unstructured. Many common formats of data on the Internet, such as PDF, MP4, JPEG, DOCX, and HTML, cannot easily be stored in a database in a queryable form.

Before data from such unstructured formats can be used, it must first be extracted from the content and stored in a table. This is done, for example, with semantic analysis methods. Algorithms scan content, such as articles from an online news portal, recognise the relevant entities and facts, and summarise them in machine-readable tables.
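As a rough illustration of the extraction step, the sketch below turns two invented sentences into table rows. It uses a simple regular expression rather than real semantic analysis, so it only works for text that follows the assumed sentence pattern. The sample text, the pattern, and the column names are all hypothetical.

```python
import re

# Invented sample text standing in for an article from a news portal.
article = (
    "Berlin reported 14 mm of rain on Monday. "
    "Hamburg reported 9 mm of rain on Tuesday."
)

# Hypothetical pattern: a capitalised place name, a quantity in mm, and a weekday.
pattern = re.compile(r"([A-Z][a-z]+) reported (\d+) mm of rain on ([A-Z][a-z]+)")

# Each match becomes one row with named columns.
table = [
    {"location": loc, "rain_mm": int(mm), "day": day}
    for loc, mm, day in pattern.findall(article)
]

for row in table:
    print(row)
```

Production systems replace the fixed pattern with language models or entity recognisers, but the result has the same shape: one row per fact, one column per attribute.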

Structured data in natural language generation

Structured data is the basis for creating new content through text automation. Natural Language Generation (NLG) applications make it possible to import large amounts of data into a system via direct upload or API.

Data in structured form is a prerequisite for an NLG system to be able to generate content. Another prerequisite is a set of text modules, or gap texts, which are sentences with placeholders. The application replaces the placeholders with information from the file containing the structured data. In practice, such processes are used to create weather reports, sports reports, real estate descriptions, or financial reports.
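The placeholder replacement described above can be sketched with Python's standard string.Template class. This is a deliberately simple stand-in for an NLG system, not how any particular product works. The gap text, field names, and weather values are invented for the example.

```python
from string import Template

# Hypothetical gap text: the $-prefixed names are the placeholders.
gap_text = Template(
    "In $city, expect $condition with a high of $high_c degrees Celsius."
)

# Structured input: one dict per table row, as it might arrive via upload or API.
rows = [
    {"city": "Berlin", "condition": "light rain", "high_c": 14},
    {"city": "Munich", "condition": "sunshine", "high_c": 21},
]

# Fill the gap text once per row to produce one report per record.
reports = [gap_text.substitute(row) for row in rows]
for report in reports:
    print(report)
```

Because `substitute` raises a `KeyError` when a placeholder has no matching field, incomplete rows are caught instead of producing a sentence with a hole in it.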
