This paper presents the results of a study on intelligent support for automatic data extraction from text documents. The approach made it possible to build a corpus of documents from large volumes of semi-structured text without reprocessing or adaptation, and without extensive manual effort to identify the relevant working curricula for each subject. The subject of the study is the content of working curricula (syllabuses), defined as the set of data that characterizes the learning outcomes and the content of a subject. As a result, the authors created a text corpus from the working curricula for the subjects of the “Information Systems” educational program. In future work, the authors plan to compute a matrix of cosine distances to identify working curricula with similar educational content.
Keywords: data mining, document corpus, natural language processing, unstructured data, educational content.
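
The planned cosine-distance matrix can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace tokenizer, the raw term-count weighting, the function names, and the three sample strings are all assumptions made for illustration, and a real pipeline would likely use a proper tokenizer and a weighting scheme such as TF-IDF.

```python
import math
from collections import Counter


def term_counts(text):
    """Bag-of-words vector: lowercase, split on whitespace (assumed tokenizer)."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        # Convention chosen here: an empty document is maximally distant.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def distance_matrix(docs):
    """Pairwise cosine distances between all documents."""
    vectors = [term_counts(d) for d in docs]
    return [[cosine_distance(u, v) for v in vectors] for u in vectors]


# Hypothetical stand-ins for syllabus texts, not data from the study.
syllabuses = [
    "databases sql relational model",
    "relational databases sql queries",
    "computer networks routing protocols",
]
matrix = distance_matrix(syllabuses)
```

In such a matrix, values close to 0 indicate documents with largely overlapping vocabulary and values close to 1 indicate documents with little or no shared vocabulary, so candidate pairs of similar curricula would be those with the smallest off-diagonal entries.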