Selected publications

A Framework for Search as Learning Experiments: Design, Implementation, and Usability Insights

Authors: Joel H. N. de O. Silva, Alfredo Neto, Breno Rosado, Marcelo Machado, Jairo F. de Souza, Sean W. M. Siqueira
Published in: XIV Congresso Brasileiro de Informática na Educação (CBIE 2025)
Abstract: Search as Learning (SAL) explores how users engage with search systems to acquire knowledge and develop understanding. Despite advances in SAL, the lack of general-purpose tools hinders reproducibility and standardization in experimental studies. This paper presents a framework to support researchers in designing SAL experiments, encompassing task creation, data collection, and learning assessment. To evaluate the proposal, we conducted a usability study with 12 participants, which yielded a score of 83.07, indicating excellent usability. Participant feedback also provided suggestions for improvement, guiding future development. This work contributes to strengthening methodological practices and fostering reproducibility in SAL research.

An evolutionary approach for the automatic generation of word list fluency assessment items

Authors: Rômulo C. de Mello, Gustavo Silva, Patrick C. de Carvalho, Rafaela Lopes, Maria Clara C. Carneiro, Jairo F. de Souza
Published in: XIV Congresso Brasileiro de Informática na Educação (CBIE 2025)
Abstract: This paper presents a Genetic Algorithm (GA) to automate the generation of reading fluency assessment items, reducing manual effort while meeting pedagogical constraints. Candidate solutions are sequences of words optimized by a multi-objective function that penalizes constraint violations and repetitions. Constraints include canonicity, syllabic variety, grapheme presence, and prosodic continuity. Experiments show that the GA effectively produces valid word lists, with larger populations yielding faster and more stable convergence. A 5% mutation rate was sufficient to preserve diversity. The method is flexible, scalable, and aligned with educational standards.

Cognitive Biases in Search as Learning: Bridging Conceptual Foundations and Empirical Research

Authors: Marcelo de Oliveira Costa Machado, Jairo Francisco de Souza, Sean W. M. Siqueira
Published in: Anais Estendidos do XIV Congresso Brasileiro de Informática na Educação (2025)
Abstract: Online search is a key part of how people learn, yet it is not a neutral process. Cognitive biases shape how users search for, select, and interpret information. While the Search as Learning (SAL) field studies how people learn throughout the search process, it has not yet integrated research on cognitive biases. This paper bridges that gap by proposing a conceptual model and an experimental framework that connect SAL with cognitive bias research. We conducted a real-world experiment on confirmation bias, involving learners searching for “The use of AI in education”. It showed how prior beliefs influence search behaviors. This work advances both the theoretical understanding and empirical study of biases in SAL, supporting the development of fairer, more transparent, and educationally effective search technologies.

Finite-State Transducers for Oral Spelling Detection

Authors: Gabriel J. R. Soares, José E. C. Silva, Jairo F. de Souza
Published in: XIV Congresso Brasileiro de Informática na Educação (CBIE 2025)
Abstract: Reading fluency assessment plays a central role in early education systems worldwide. Countries such as the United States and Brazil administer large-scale oral reading assessments to monitor educational outcomes and guide intervention. However, most automatic assessments are coarse in granularity. As a result, they are poorly equipped to handle children who do not yet decode words fluently and instead rely on spelling out individual letters or syllables. We show that finite-state transducers can be used to detect spelling to improve oral reading assessments. We demonstrate the effectiveness of our method on a corpus of annotated child speech, showing that it provides insight into early decoding strategies.

Improving automated literacy assessments through a multiple output grapheme-to-phoneme approach

Authors: Rômulo C. de Mello, Patrick C. de Carvalho, Gustavo Silva, Maria Clara C. Carneiro, Rafaela Lopes, Jairo F. de Souza
Published in: XIV Congresso Brasileiro de Informática na Educação (CBIE 2025)
Abstract: Fluency assessments are essential for monitoring literacy development, but automatic systems still struggle with the phonetic diversity of Brazilian Portuguese and the specific characteristics of children’s reading. We propose a rule-based grapheme-to-phoneme converter that generates multiple acceptable transcriptions per word, accounting for regional variations and child speech. Validated on children’s reading data, the module reduces errors, increases accuracy from 89% to 95% on the PARC-2019 dataset, and improves performance among inconsistent readers. Flexible G2Ps make assessments fairer and more reliable.

A hereditary attentive question answering framework for knowledge bases

Authors: Rômulo C. de Mello, Jorão Gomes Jr., Jairo Francisco de Souza, Victor Ströele
Published in: Journal on Interactive Systems (2025)
Abstract: Background. The rapid growth of online data has made retrieving relevant information a challenging task, prompting the rise of Knowledge Base Question Answering (KBQA) systems that handle complex, multi-hop queries. Purpose. This extended work refines our previous pipeline by introducing structured dummy templates, a Hereditary Tree-LSTM (HTL) for classification, and more comprehensive analyses of entity recognition, property extraction, and SPARQL assembly. Methods. We enhanced the LC-QUAD 2.1 dataset with standardized templates and evaluated a flexible pipeline that integrates DeepPavlov, Falcon, SpaCy, qualifiers constraints, and reverse lookups. Results. Our experiments reveal that multi-tool entity recognition outperforms single-tool methods, while property extraction benefits from extended property sets and refined ranking strategies. Overall SPARQL correctness reaches up to 70–80% in mid-complex queries but remains lower in domain-specific subsets. Conclusion. The proposed synergy of NLP tools and refined dummy templates increases coverage for complex KBQA, though further improvements in morphological handling and specialized embeddings may be needed to address challenging multi-hop or niche queries comprehensively.

KIF-QA: Using Off-the-shelf LLMs to Answer Simple Questions over Heterogeneous Knowledge Bases

Authors: Marcelo Machado, João Pedro Porto Campos, Guilherme Lima, Viviane Torres da Silva
Published in: Joint Proc. 5th Wikidata Workshop for the Scientific Wikidata Community co-located with the 24th International Semantic Web Conference (ISWC 2025)
Abstract: We present KIF-QA, a semantic parsing-based approach for answering simple questions over heterogeneous knowledge bases. KIF-QA uses off-the-shelf pre-trained large language models (LLMs) and in-context (few-shot) learning to transform questions into interpretable logical forms (queries) without requiring any fine-tuning. Because it uses KIF (the knowledge integration framework) to mediate all access to the underlying knowledge base, KIF-QA can be easily adapted to target any base accessible through KIF (which out-of-the-box includes Wikidata, DBpedia, PubChem, and others). We evaluate KIF-QA over the Wikidata and DBpedia versions of the SimpleQuestions benchmark using Llama 3.3, Llama 4 Maverick, and Mistral Medium 3. The results show competitive performance to comparable state-of-the-art methods. KIF-QA implementation is made available under an open-source license.

Improving learning material repositories using student profiles

Authors: Natalie Ferraz Silva Bravo, André Ferreira Martins, Thales Brito de Souza Fonseca Rodrigues, Marcelo Machado, Heder Soares Bernardino, Alex Borges Vieira, Hélio José Corrêa Barbosa, Jairo Francisco de Souza
Published in: Soft Computing (2025)
Abstract: The adaptive learning community seeks to provide solutions to customize and enhance students’ learning experiences when accessing web-based learning systems. The adaptation usually occurs from the use of learning materials and user information data, which makes the adaptation process highly dependent on the quality of the repositories. Then, the best adaptation a system may offer might still not satisfy the users’ needs. In this work, we propose an approach to assist teachers and stakeholders in understanding repositories’ characteristics and their gaps according to students’ needs. Our approach first selects the best sequence of learning materials for each student, which is a well-known problem called Adaptive Curriculum Sequencing (ACS). Then, based on the selected sequences, we use optimization approaches, such as GRASP and Simulated Annealing, to generate new learning material possibilities that can improve ACS recommendations. This way, our new approach assists teachers in assembling their learning materials. We have evaluated our approach by comparing it to a traditional approach using a real dataset, and the results are promising. In fact, it is possible to design customized materials using a combination of GRASP and brute force algorithms on the characteristics of the learning materials.

TV 3.0 Audience Measurement Management: Architecture, Lifecycle, and APIs

Authors: Marcelo Moreno, Eduardo Barrére
Published in: SET International Journal of Broadcast Engineering (2025)
Abstract: TV 3.0 AMM represents a comprehensive, regulation-aligned, and technically robust audience measurement solution. It supports both declarative and procedural control models, accommodates flexible delivery strategies, and ensures that all data collection is traceable, verifiable, and appropriately scoped. Compared to our previous presentation at this congress, which introduced the initial design rationale and conceptual framework, the present work delivers the complete specification, including the finalized state machine, report schema, digital trust infrastructure, and API specs, thereby advancing the proposal from theoretical blueprint to deployable standard.

TV 3.0 Privacy Management: Signalling, Enforcement and Rights Control

Authors: Marcelo Moreno, Eduardo Barrére, Débora Muchaluat-Saade
Published in: SET International Journal of Broadcast Engineering (2025)

Exploring the solution space for adaptive curriculum sequencing: Study of a multi-objective approach

Authors: João Vítor de Castro Martins Ferreira Nogueira, Heder Soares Bernardino, Jairo Francisco de Souza, Luciana Brugiolo Gonçalves, Stênio Sã Rosário Furtado Soares
Published in: Internet of Things (2024)
Abstract: Adaptive Curriculum Sequencing (ACS) is an important issue in personalized learning. In ACS problems, one desires the best sequence of learning materials that meet the profile of a given student. To do so, multiple features of the students and the materials used are necessary to generate good solutions. In fact, understanding the students’ goals, motivation, and preferences is not an easy task and, consequently, different Internet of Things (IoT) approaches to gather this information during the learning process have been proposed. Some works from the literature consider five objectives and, in this case, one has a many-objective optimization problem. Instead of solving the optimization problem considering the multiple objectives individually, the usual approach is to obtain solutions for a weighted sum of the objective values using search approaches for mono-objective optimization problems. However, this kind of approach may bias the search and limit the capacity of finding good results. Here, we solve the multi-objective ACS problem considering five objective functions. NSGA-II, a well-known Genetic Algorithm for multi-objective optimization problems, was used. In addition, aggregation trees were employed to reduce the number of objectives to two and three due to the large number of objectives in the original problem. ACS problems from the literature were used to comparatively evaluate the proposed methods and the results obtained were compared to those found by the traditional approach of summing the objective values. According to these results, the best curriculum sequences were reached when using the proposal.

Identifying Confirmation Bias in a Search as Learning Task: A Study on The Use of Artificial Intelligence in Education

Authors: Marcelo Machado, Jairo Francisco de Souza, Sean W. M. Siqueira
Published in: Anais do XXXV Simpósio Brasileiro de Informática na Educação (SBIE 2024)
Abstract: Confirmation bias, the tendency to favor information that supports existing beliefs, can hinder information-seeking, especially in learning contexts where it can perpetuate a one-sided perspective. This paper examines how confirmation bias affects search behaviors among 84 participants learning about AI in education. Participants were divided into Neutral and Biased groups based on their prior attitudes, with the Biased group receiving reinforcing information beforehand. Participants’ interactions with the search system were logged, and we analyzed the data for behavioral differences. Results showed that biased participants often completed searches quickly, spending less time engaging with and selecting search results, and issued longer queries. However, other variables showed no statistical difference. Some results contradict other studies on confirmation bias in search, highlighting the complexity of search dynamics in learning contexts and suggesting the need for specialized research into cognitive biases in search as a learning process.

Formalizing and validating Wikidata’s property constraints using SHACL and SPARQL

Authors: Nicolas Ferranti, Jairo Francisco De Souza, Shqiponja Ahmetaj, Axel Polleres
Published in: Semantic Web (2024)
Abstract: In this paper, we delve into the crucial role of constraints in maintaining data integrity in knowledge graphs with a specific focus on Wikidata, one of the most extensive collaboratively maintained open data knowledge graphs on the Web. The World Wide Web Consortium (W3C) recommends the Shapes Constraint Language (SHACL) as the constraint language for validating Knowledge Graphs, which comes in two different levels of expressivity, SHACL-Core, as well as SHACL-SPARQL. Despite the availability of SHACL, Wikidata currently represents its property constraints through its own RDF data model, which relies on Wikidata’s specific reification mechanism based on authoritative namespaces, and – partially ambiguous – natural language definitions. In the present paper, we investigate whether and how the semantics of Wikidata property constraints can be formalized using SHACL-Core, SHACL-SPARQL, as well as directly as SPARQL queries. While the expressivity of SHACL-Core turns out to be insufficient for expressing all Wikidata property constraint types, we present SPARQL queries to identify violations for all 32 current Wikidata constraint types. We compare the semantics of this unambiguous SPARQL formalization with Wikidata’s violation reporting system and discuss limitations in terms of evaluation via Wikidata’s public SPARQL query endpoint, due to its current scalability. Our study, on the one hand, sheds light on the unique characteristics of constraints defined by the Wikidata community, in order to improve the quality and accuracy of data in this collaborative knowledge graph. On the other hand, as a “byproduct”, our formalization extends existing benchmarks for both SHACL and SPARQL with a challenging, large-scale real-world use case.

LLM Store: Leveraging Large Language Models as Sources of Wikidata-Structured Knowledge

Authors: Marcelo de Oliveira Costa Machado, João Marcello Bessa Rodrigues, Guilherme Lima, Sandro Rama Fiorini, Viviane Torres da Silva
Published in: CEUR Workshop Proceedings (2024)
Abstract: Knowledge Integration Framework (KIF) is a Wikidata-based framework for integrating heterogeneous knowledge sources. These can be SPARQL endpoints, SQL endpoints, RDF files, CSV files, etc., and are represented in KIF as knowledge “stores”. A KIF store exposes a Wikidata view of the underlying knowledge source by interpreting its content as a set of Wikidata-like statements and allowing it to be queried through a simple but expressive pattern-matching interface. In this paper, we present LLM Store, a KIF store implementation that uses large language models (LLMs) as knowledge sources. Instead of consulting a static knowledge base, when queried, the LLM Store uses the underlying LLM to synthesize Wikidata-like statements on-the-fly. The knowledge completion pipeline used by LLM Store can be fully customized and supports strategies that range from simple zero-shot prompts to retrieval-augmented generation (RAG). This paper discusses the design and implementation of LLM Store and presents an evaluation using the test and validation datasets of LM-KBC Challenge @ ISWC 2024. We analyze the results of the evaluation in light of the results obtained by our submission to the same challenge, which was based on LLM Store and achieved a macro averaged F1-score of 91%. LLM Store is released as open-source and its code is available at https://github.com/IBM/kif-llm-store.

Automatic Classification of Learning Material Styles

Authors: Bernadete Aquino, Jairo Francisco de Souza, Eduardo Barrére
Published in: Revista Brasileira de Informática na Educação (2023)
Abstract: Although video lessons are often used in diverse areas, the lack of a common approach to defining and classifying their styles results in using many different models for these purposes. There is a need to build a framework through which these styles can be defined and classified. Much has been done to investigate the effects of these styles on student engagement and learning outcomes. These studies suggest that video lesson styles affect academic performance and that students learn better through a certain video lesson style. Based on this, we propose a unified model for classifying video lesson styles based on the nomenclatures and definitions used in the literature. Furthermore, we present an approach for automatically classifying four popular video lesson styles. The automatic classification is useful for recommendation systems to suggest materials more consistent with student preferences and their intended learning outcomes.

A study of approaches to answering complex questions over knowledge bases

Authors: Rômulo C. de Mello, Jorão Gomes Jr., Jairo Francisco de Souza, Victor Ströele
Published in: Knowledge and Information Systems (2022)

A hereditary attentive template-based approach for complex Knowledge Base Question Answering systems

Authors: Rômulo C. de Mello, Jorão Gomes Jr., Jairo Francisco de Souza, Victor Ströele
Published in: Expert Systems with Applications (2022)
Abstract: Knowledge Base Question Answering systems (KBQA) aim to find answers to natural language questions over a knowledge base. This work presents a template matching approach for Complex KBQA systems (C-KBQA) using the combination of Semantic Parsing and Neural Networks techniques to classify natural language questions into answer templates. An attention mechanism was created to assist a Tree-LSTM in selecting the most important information. The approach was evaluated on the LC-Quad 1, LC-Quad 2, ComplexWebQuestion, and WebQuestionsSP datasets, and the results show that our approach outperforms other approaches on three datasets.

Supporting Targeted Advertising in Integrated Broadcast-Broadband Systems With Automatic Media Content Preparation

Authors: Marina Ivanov P. Josué, Marcelo Moreno, Débora Muchaluat-Saade
Published in: IEEE Access (2022)
Abstract: Integrated Broadcast-Broadband (IBB) Systems enable a range of new services for viewers, including content personalization, such as targeted advertising. Custom advertising is delivered using the broadband connection synchronized with the broadcast content, demanding high-control synchronization among both communication channels. As the broadband channel is shared among different applications, delays may occur during the transmission of targeted advertisements, which can cause synchronization faults. Aiming at reducing these faults and truly supporting targeted advertising in IBB systems, this work proposes the automatic preparation of media content before its presentation. In order to provide automatic preparation, a preparation plan to orchestrate content loading in advance is also proposed. Our proposal is implemented in the Ginga-NCL middleware, which is a standard for the Brazilian Digital TV System. Our performance evaluation shows that the average switching time from broadcast to broadband is around 40 ms, if automatic content preparation is used, which is an outstanding result.

LApiC Research Lab