People
Marelie obtained her undergraduate degrees (Computer Science & Mathematics) from Stellenbosch University, receiving the Dean’s medal as the best student in the Faculty of Science at the end of her Honours degree. Prior to joining NWU, Marelie was a principal researcher and research group leader at the CSIR in South Africa, involved in technology-oriented research and development; her research group focussed on speech technology development in under-resourced environments. In 2005, she received her PhD from the University of Pretoria (UP), with a thesis on bootstrapping pronunciation models, at the time one of the core ‘missing’ components when developing speech recognition for South African languages.
In 2011, Marelie joined NWU, becoming the Director of MuST in 2014. MuST is a focussed research environment with an emphasis on postgraduate training and delivering on externally-focussed projects. Recent projects include the development of an automatic speech transcription platform for the South African government, the development of a new multilingual text-to-speech corpus in collaboration with Google, and participation in the winning consortium of the BABEL project: a five-year international collaborative challenge aimed at solving the spoken term detection task for under-resourced languages.
Over the past few years, Marelie has supervised 23 postgraduate students, all producing research related to the theory and applications of machine learning. She frequently serves on scientific committees, both nationally and internationally (AAAI, IJCAI, Interspeech, SLT, MediaEval, ICASSP, SLTU), is the NWU group representative at the national Centre for Artificial Intelligence Research (CAIR), and is an NRF-rated researcher. Since 2003, she has published 100 peer-reviewed papers related to machine learning; she has an h-index of 21 and an i10-index of 37.
Latest Research Publications:
We propose a new framework to improve automatic speech recognition (ASR) systems in resource-scarce environments using a generative adversarial network (GAN) operating on acoustic input features. The GAN is used to enhance the features of mismatched data prior to decoding, or can optionally be used to fine-tune the acoustic model. We achieve improvements that are comparable to multi-style training (MTR), but at a lower computational cost. With less than one hour of data, an ASR system trained on good quality data, and evaluated on mismatched audio is improved by between 11.5% and 19.7% relative word error rate (WER). Experiments demonstrate that the framework can be very useful in under-resourced environments where training data and computational resources are limited. The GAN does not require parallel training data, because it utilises a baseline acoustic model to provide an additional loss term that guides the generator to create acoustic features that are better classified by the baseline.
@article{492, author = {Walter Heymans, Marelie Davel, Charl Van Heerden}, title = {Efficient acoustic feature transformation in mismatched environments using a Guided-GAN}, abstract = {We propose a new framework to improve automatic speech recognition (ASR) systems in resource-scarce environments using a generative adversarial network (GAN) operating on acoustic input features. The GAN is used to enhance the features of mismatched data prior to decoding, or can optionally be used to fine-tune the acoustic model. We achieve improvements that are comparable to multi-style training (MTR), but at a lower computational cost. With less than one hour of data, an ASR system trained on good quality data, and evaluated on mismatched audio is improved by between 11.5% and 19.7% relative word error rate (WER). Experiments demonstrate that the framework can be very useful in under-resourced environments where training data and computational resources are limited. The GAN does not require parallel training data, because it utilises a baseline acoustic model to provide an additional loss term that guides the generator to create acoustic features that are better classified by the baseline.}, year = {2022}, journal = {Speech Communication}, volume = {143}, pages = {10 - 20}, month = {09/2022}, doi = {https://doi.org/10.1016/j.specom.2022.07.002}, }
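To make the guidance mechanism above concrete, here is a minimal PyTorch sketch of one generator update, not the authors' implementation: the feature dimension, network sizes, and the availability of senone labels from forced alignment are all assumptions.

```python
import torch
import torch.nn as nn

FEAT_DIM = 40  # assumed acoustic feature dimension (e.g. filterbanks)

# Illustrative generator (feature enhancer) and discriminator.
G = nn.Sequential(nn.Linear(FEAT_DIM, 256), nn.ReLU(), nn.Linear(256, FEAT_DIM))
D = nn.Sequential(nn.Linear(FEAT_DIM, 256), nn.ReLU(), nn.Linear(256, 1))
bce, ce = nn.BCEWithLogitsLoss(), nn.CrossEntropyLoss()

def generator_step(baseline_am, mismatched, senone_labels, lam=1.0):
    # baseline_am: pretrained acoustic model with frozen parameters
    # (requires_grad=False); gradients still flow through `enhanced`,
    # which is why no parallel clean/mismatched data is needed.
    enhanced = G(mismatched)
    adv = bce(D(enhanced), torch.ones(len(enhanced), 1))  # adversarial term
    guide = ce(baseline_am(enhanced), senone_labels)      # guidance term
    return adv + lam * guide                              # combined loss
```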
The accurate estimation of channel state information (CSI) is an important aspect of wireless communications. In this paper, a multi-layer perceptron (MLP) is developed as a CSI estimator in long-term evolution (LTE) transmission conditions. The representation of the CSI data is investigated in conjunction with batch normalisation and the representational ability of MLPs. It is found that discontinuities in the representational feature space can cripple an MLP’s ability to accurately predict CSI when noise is present. Different ways in which to mitigate this effect are analysed and a solution developed, initially in the context of channels that are only affected by additive white Gaussian noise. The developed architecture is then applied to more complex channels with various delay profiles and Doppler spread. The performance of the proposed MLP is shown to be comparable with LTE minimum mean squared error (MMSE), and to outperform least square (LS) estimation over a range of channel conditions.
@inproceedings{491, author = {Andrew Oosthuizen, Marelie Davel, Albert Helberg}, title = {Multi-Layer Perceptron for Channel State Information Estimation: Design Considerations}, abstract = {The accurate estimation of channel state information (CSI) is an important aspect of wireless communications. In this paper, a multi-layer perceptron (MLP) is developed as a CSI estimator in long-term evolution (LTE) transmission conditions. The representation of the CSI data is investigated in conjunction with batch normalisation and the representational ability of MLPs. It is found that discontinuities in the representational feature space can cripple an MLP’s ability to accurately predict CSI when noise is present. Different ways in which to mitigate this effect are analysed and a solution developed, initially in the context of channels that are only affected by additive white Gaussian noise. The developed architecture is then applied to more complex channels with various delay profiles and Doppler spread. The performance of the proposed MLP is shown to be comparable with LTE minimum mean squared error (MMSE), and to outperform least square (LS) estimation over a range of channel conditions.}, year = {2022}, journal = {Southern Africa Telecommunication Networks and Applications Conference (SATNAC)}, pages = {94 - 99}, month = {08/2022}, address = {Fancourt, George}, }
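As a rough sketch of the design consideration highlighted above, the hypothetical PyTorch MLP below takes complex pilot observations as separate real and imaginary parts, a representation that avoids the phase-wrapping discontinuity the paper warns about, and uses batch normalisation between layers; the layer sizes and pilot count are invented.

```python
import torch
import torch.nn as nn

N_PILOTS = 72  # assumed number of pilot resource elements

class CSIEstimator(nn.Module):
    """MLP mapping received pilots to channel estimates (illustrative)."""
    def __init__(self, n_pilots=N_PILOTS, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * n_pilots, hidden),  # real + imaginary parts stacked
            nn.BatchNorm1d(hidden), nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.BatchNorm1d(hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * n_pilots),  # real + imaginary CSI estimate
        )

    def forward(self, received_pilots):  # shape: (batch, 2 * n_pilots)
        return self.net(received_pilots)
```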
We report on the development of two reference corpora for the analysis of Sepedi-English code-switched speech in the context of automatic speech recognition. For the first corpus, possible English events were obtained from an existing corpus of transcribed Sepedi-English speech. The second corpus is based on the analysis of radio broadcasts: actual instances of code switching were transcribed and reproduced by a number of native Sepedi speakers. We describe the process to develop and verify both corpora and perform an initial analysis of the newly produced data sets. We find that, in naturally occurring speech, the frequency of code switching is unexpectedly high for this language pair, and that the continuum of code switching (from unmodified embedded words to loanwords absorbed into the matrix language) makes this a particularly challenging task for speech recognition systems.
@article{483, author = {Thipe Modipa, Marelie Davel}, title = {Two Sepedi-English code-switched speech corpora}, abstract = {We report on the development of two reference corpora for the analysis of Sepedi-English code-switched speech in the context of automatic speech recognition. For the first corpus, possible English events were obtained from an existing corpus of transcribed Sepedi-English speech. The second corpus is based on the analysis of radio broadcasts: actual instances of code switching were transcribed and reproduced by a number of native Sepedi speakers. We describe the process to develop and verify both corpora and perform an initial analysis of the newly produced data sets. We find that, in naturally occurring speech, the frequency of code switching is unexpectedly high for this language pair, and that the continuum of code switching (from unmodified embedded words to loanwords absorbed into the matrix language) makes this a particularly challenging task for speech recognition systems.}, year = {2022}, journal = {Language Resources and Evaluation}, volume = {56}, publisher = {Springer}, address = {South Africa}, url = {https://rdcu.be/cO6lD}, doi = {https://doi.org/10.1007/s10579-022-09592-6}, }
Mismatched data is a challenging problem for automatic speech recognition (ASR) systems. One of the most common techniques used to address mismatched data is multi-style training (MTR), a form of data augmentation that attempts to transform the training data to be more representative of the testing data, and to learn robust representations applicable to different conditions. This task can be very challenging if the test conditions are unknown. We explore the impact of different MTR styles on system performance when testing conditions are different from training conditions in the context of deep neural network hidden Markov model (DNN-HMM) ASR systems. A controlled environment is created using the LibriSpeech corpus, where we isolate the effect of different MTR styles on final system performance. We evaluate our findings on a South African call centre dataset that contains noisy, WAV49-encoded audio.
@article{480, author = {Walter Heymans, Marelie Davel, Charl Van Heerden}, title = {Multi-style Training for South African Call Centre Audio}, abstract = {Mismatched data is a challenging problem for automatic speech recognition (ASR) systems. One of the most common techniques used to address mismatched data is multi-style training (MTR), a form of data augmentation that attempts to transform the training data to be more representative of the testing data, and to learn robust representations applicable to different conditions. This task can be very challenging if the test conditions are unknown. We explore the impact of different MTR styles on system performance when testing conditions are different from training conditions in the context of deep neural network hidden Markov model (DNN-HMM) ASR systems. A controlled environment is created using the LibriSpeech corpus, where we isolate the effect of different MTR styles on final system performance. We evaluate our findings on a South African call centre dataset that contains noisy, WAV49-encoded audio.}, year = {2022}, journal = {Communications in Computer and Information Science}, volume = {1551}, pages = {111 - 124}, publisher = {Southern African Conference for Artificial Intelligence Research}, address = {South Africa}, doi = {https://doi.org/10.1007/978-3-030-95070-5_8}, }
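As a small illustration of what one MTR "style" looks like in practice, the numpy sketch below corrupts clean audio with additive noise at a chosen signal-to-noise ratio; the SNR grid and noise source are assumptions, and real MTR recipes also vary codecs, reverberation and speed.

```python
import numpy as np

def add_noise_at_snr(clean, noise, snr_db):
    """Mix noise into clean audio so the result has the requested SNR (dB)."""
    noise = np.resize(noise, clean.shape)        # loop/trim noise to length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Example: create three training styles at 5, 10 and 20 dB SNR.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)  # stand-in for 1 s of 16 kHz audio
noise = rng.standard_normal(16000)
styles = [add_noise_at_snr(clean, noise, snr) for snr in (5, 10, 20)]
```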
While deep neural networks (DNNs) have become a standard architecture for many machine learning tasks, their internal decision-making process and general interpretability is still poorly understood. Conversely, common decision trees are easily interpretable and theoretically well understood. We show that by encoding the discrete sample activation values of nodes as a binary representation, we are able to extract a decision tree explaining the classification procedure of each layer in a ReLU-activated multilayer perceptron (MLP). We then combine these decision trees with existing feature attribution techniques in order to produce an interpretation of each layer of a model. Finally, we provide an analysis of the generated interpretations, the behaviour of the binary encodings and how these relate to sample groupings created during the training process of the neural network.
@article{479, author = {Coenraad Mouton, Marelie Davel}, title = {Exploring layerwise decision making in DNNs}, abstract = {While deep neural networks (DNNs) have become a standard architecture for many machine learning tasks, their internal decision-making process and general interpretability is still poorly understood. Conversely, common decision trees are easily interpretable and theoretically well understood. We show that by encoding the discrete sample activation values of nodes as a binary representation, we are able to extract a decision tree explaining the classification procedure of each layer in a ReLU-activated multilayer perceptron (MLP). We then combine these decision trees with existing feature attribution techniques in order to produce an interpretation of each layer of a model. Finally, we provide an analysis of the generated interpretations, the behaviour of the binary encodings and how these relate to sample groupings created during the training process of the neural network.}, year = {2022}, journal = {Communications in Computer and Information Science}, volume = {1551}, pages = {140 - 155}, publisher = {Artificial Intelligence Research (SACAIR 2021)}, doi = {https://doi.org/10.1007/978-3-030-95070-5_10}, }
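The following scikit-learn sketch illustrates the core mechanism described in the abstract, binarising each layer's ReLU activations and fitting a per-layer decision tree to the network's own predictions; the toy data, network sizes and tree depth are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500,
                    random_state=0).fit(X, y)

# Manual forward pass to capture each hidden layer's binary activation code.
codes, h = [], X
for W, b in zip(mlp.coefs_[:-1], mlp.intercepts_[:-1]):
    z = h @ W + b
    h = np.maximum(z, 0)               # ReLU
    codes.append((z > 0).astype(int))  # node active (1) or inactive (0)

preds = mlp.predict(X)  # explain the model's decisions, not the true labels
for i, code in enumerate(codes):
    tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(code, preds)
    print(f"layer {i}: tree fidelity = {tree.score(code, preds):.3f}")
```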
Amongst her other qualifications, she holds a PhD in Engineering Science from North-West University, South Africa.
Latest Research Publications:
Many posterior distributions take intractable forms and thus require variational inference where analytical solutions cannot be found. Variational inference and Markov chain Monte Carlo (MCMC) are established mechanisms for approximating these intractable values. An alternative approach to sampling and optimisation for approximation is a direct mapping between the data and the posterior distribution, made possible by recent advances in deep learning methods. Latent Dirichlet Allocation (LDA) is a model which offers an intractable posterior of this nature. In LDA, latent topics are learnt over unlabelled documents to soft-cluster the documents. This paper assesses the viability of learning latent topics by leveraging an autoencoder (in the form of autoencoding variational Bayes, AEVB) and compares the mimicked posterior distributions to those achieved by VI. Across various experiments, the proposed AEVB delivers inadequate performance: comparable conclusions are achieved only under utopian conditions that are generally unattainable. Further, model specification becomes increasingly complex and deeply circumstantially dependent, which is in itself not a deterrent but does warrant consideration. In a recent study, these concerns were highlighted and discussed theoretically. We confirm the argument empirically by dissecting the autoencoder’s iterative process: performance degrades as models grow in dimensionality, and visualisation of the autoencoder reveals a bias towards the initial randomised topics.
@inproceedings{254, author = {Zach Wolpe, Alta de Waal}, title = {Autoencoding variational Bayes for latent Dirichlet allocation}, abstract = {Many posterior distributions take intractable forms and thus require variational inference where analytical solutions cannot be found. Variational inference and Markov chain Monte Carlo (MCMC) are established mechanisms for approximating these intractable values. An alternative approach to sampling and optimisation for approximation is a direct mapping between the data and the posterior distribution, made possible by recent advances in deep learning methods. Latent Dirichlet Allocation (LDA) is a model which offers an intractable posterior of this nature. In LDA, latent topics are learnt over unlabelled documents to soft-cluster the documents. This paper assesses the viability of learning latent topics by leveraging an autoencoder (in the form of autoencoding variational Bayes, AEVB) and compares the mimicked posterior distributions to those achieved by VI. Across various experiments, the proposed AEVB delivers inadequate performance: comparable conclusions are achieved only under utopian conditions that are generally unattainable. Further, model specification becomes increasingly complex and deeply circumstantially dependent, which is in itself not a deterrent but does warrant consideration. In a recent study, these concerns were highlighted and discussed theoretically. We confirm the argument empirically by dissecting the autoencoder’s iterative process: performance degrades as models grow in dimensionality, and visualisation of the autoencoder reveals a bias towards the initial randomised topics.}, year = {2019}, journal = {Proceedings of the South African Forum for Artificial Intelligence Research}, pages = {25-36}, month = {12/09}, publisher = {CEUR Workshop Proceedings}, isbn = {1613-0073}, url = {http://ceur-ws.org/Vol-2540/FAIR2019_paper_33.pdf}, }
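A compact PyTorch sketch of the AEVB setup under discussion appears below. It uses the common softmax-of-Gaussian relaxation for topic proportions and a plain standard-normal prior, so it is a simplification of the model family the paper evaluates; the vocabulary size and topic count are invented.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

V, K = 2000, 20  # assumed vocabulary size and number of topics

class AEVBTopicModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(V, 256), nn.Softplus())
        self.mu, self.logvar = nn.Linear(256, K), nn.Linear(256, K)
        self.beta = nn.Linear(K, V)  # topic-to-word decoder

    def forward(self, bow):  # bow: (batch, V) bag-of-words counts
        h = self.enc(bow)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterise
        theta = F.softmax(z, dim=-1)          # document-topic proportions
        log_probs = F.log_softmax(self.beta(theta), dim=-1)
        nll = -(bow * log_probs).sum(-1)      # multinomial reconstruction loss
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1)
        return (nll + kl).mean()              # negative ELBO to minimise
```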
Environmental information is acquired and assessed during the environmental impact assessment process for surface‐strip coal mine approval. However, integrating these data and quantifying rehabilitation risk using a holistic multidisciplinary approach is seldom undertaken. We present a rehabilitation risk assessment integrated network (R2AIN™) framework that can be applied using Bayesian networks (BNs) to integrate and quantify such rehabilitation risks. Our framework has 7 steps, including key integration of rehabilitation risk sources and the quantification of undesired rehabilitation risk events to the final application of mitigation. We demonstrate the framework using a soil compaction BN case study in the Witbank Coalfield, South Africa and the Bowen Basin, Australia. Our approach allows for a probabilistic assessment of rehabilitation risk associated with multiple disciplines to be integrated and quantified. Using this method, a site's rehabilitation risk profile can be determined before mining activities commence, and the effects of manipulating management actions during later mine phases to reduce risk can be gauged, to aid decision making.
@article{253, author = {Vanessa Weyer, Alta de Waal, Alex Lechner, Corinne Unger, Tim O'Connor, Thomas Baumgartl, Roland Schulze, Wayne Truter}, title = {Quantifying rehabilitation risks for surface‐strip coal mines using a soil compaction Bayesian network in South Africa and Australia: To demonstrate the R2AIN Framework}, abstract = {Environmental information is acquired and assessed during the environmental impact assessment process for surface‐strip coal mine approval. However, integrating these data and quantifying rehabilitation risk using a holistic multidisciplinary approach is seldom undertaken. We present a rehabilitation risk assessment integrated network (R2AIN™) framework that can be applied using Bayesian networks (BNs) to integrate and quantify such rehabilitation risks. Our framework has 7 steps, including key integration of rehabilitation risk sources and the quantification of undesired rehabilitation risk events to the final application of mitigation. We demonstrate the framework using a soil compaction BN case study in the Witbank Coalfield, South Africa and the Bowen Basin, Australia. Our approach allows for a probabilistic assessment of rehabilitation risk associated with multiple disciplines to be integrated and quantified. Using this method, a site's rehabilitation risk profile can be determined before mining activities commence, and the effects of manipulating management actions during later mine phases to reduce risk can be gauged, to aid decision making.}, year = {2019}, journal = {Integrated Environmental Assessment and Management}, volume = {15}, pages = {190-208}, issue = {2}, publisher = {Wiley Online}, doi = {10.1002/ieam.4128}, }
Bayesian networks in fusion systems often contain latent variables. They play an important role in fusion systems as they provide context that leads to better choices of data sources to fuse. Latent variables in Bayesian networks are mostly constructed by means of expert knowledge modelling. We propose using theory-driven structural equation modelling (SEM) to identify and structure latent variables in a Bayesian network. The linking of SEM and Bayesian networks is motivated by the fact that both methods can be shown to be causal models. We compare this approach to a data-driven approach where latent factors are induced by means of unsupervised learning. We identify appropriate metrics for URREF ontology criteria for both approaches.
@inproceedings{204, author = {Alta de Waal, Keunyoung Yoo}, title = {Latent Variable Bayesian Networks Constructed Using Structural Equation Modelling}, abstract = {Bayesian networks in fusion systems often contain latent variables. They play an important role in fusion systems as they provide context that leads to better choices of data sources to fuse. Latent variables in Bayesian networks are mostly constructed by means of expert knowledge modelling. We propose using theory-driven structural equation modelling (SEM) to identify and structure latent variables in a Bayesian network. The linking of SEM and Bayesian networks is motivated by the fact that both methods can be shown to be causal models. We compare this approach to a data-driven approach where latent factors are induced by means of unsupervised learning. We identify appropriate metrics for URREF ontology criteria for both approaches.}, year = {2018}, journal = {2018 21st International Conference on Information Fusion (FUSION)}, pages = {688-695}, month = {10/07-13/07}, publisher = {IEEE}, isbn = {978-0-9964527-6-2}, url = {https://ieeexplore.ieee.org/abstract/document/8455240}, }
A significant challenge in ecological modelling is the lack of complete sets of high-quality data. This is especially true in the rhino poaching problem where data is incomplete. Although there are many poaching attacks, they can be spread over a vast surface area such as in the case of the Kruger National Park in South Africa, which is roughly the size of Israel. Bayesian networks are useful reasoning tools and can utilise expert knowledge when data is insufficient or sparse. Bayesian networks allow the modeller to incorporate data, expert knowledge, or any combination of the two. This flexibility of Bayesian networks makes them ideal for modelling complex ecological problems. In this paper an expert-driven model of the rhino poaching problem is presented. The development as well as the evaluation of the model is performed from an expert perspective. Independent expert evaluation is performed in the form of queries that test different scenarios. Structuring the rhino poaching problem as a causal network yields a framework that can be used to reason about the problem, as well as inform the modeller of the type of data that has to be gathered.
@article{191, author = {Alta de Waal, Hildegarde Koen, J.P de Villiers, Henk Roodt}, title = {An expert-driven causal model of the rhino poaching problem}, abstract = {A significant challenge in ecological modelling is the lack of complete sets of high-quality data. This is especially true in the rhino poaching problem where data is incomplete. Although there are many poaching attacks, they can be spread over a vast surface area such as in the case of the Kruger National Park in South Africa, which is roughly the size of Israel. Bayesian networks are useful reasoning tools and can utilise expert knowledge when data is insufficient or sparse. Bayesian networks allow the modeller to incorporate data, expert knowledge, or any combination of the two. This flexibility of Bayesian networks makes them ideal for modelling complex ecological problems. In this paper an expert-driven model of the rhino poaching problem is presented. The development as well as the evaluation of the model is performed from an expert perspective. Independent expert evaluation is performed in the form of queries that test different scenarios. Structuring the rhino poaching problem as a causal network yields a framework that can be used to reason about the problem, as well as inform the modeller of the type of data that has to be gathered.}, year = {2017}, journal = {Ecological Modelling}, volume = {347}, pages = {29-39}, publisher = {Elsevier}, isbn = {0304-3800}, url = {https://www.sciencedirect.com/science/article/pii/S0304380016307621}, }
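To show how such an expert-driven network supports the scenario queries mentioned above, here is a toy pgmpy example; the variables and every probability in it are invented for illustration and bear no relation to the paper's actual model.

```python
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

model = BayesianNetwork([("RangerPatrols", "PoachingRisk"),
                         ("MoonPhase", "PoachingRisk")])
model.add_cpds(
    TabularCPD("RangerPatrols", 2, [[0.7], [0.3]]),  # 0 = high, 1 = low
    TabularCPD("MoonPhase", 2, [[0.5], [0.5]]),      # 0 = dark, 1 = bright
    TabularCPD("PoachingRisk", 2,
               # P(risk | patrols, moon): invented stand-in for expert elicitation
               [[0.9, 0.6, 0.7, 0.2],   # 0 = low risk
                [0.1, 0.4, 0.3, 0.8]],  # 1 = high risk
               evidence=["RangerPatrols", "MoonPhase"], evidence_card=[2, 2]))
assert model.check_model()

# Scenario query: poaching risk given low patrol coverage on a dark night.
infer = VariableElimination(model)
print(infer.query(["PoachingRisk"],
                  evidence={"RangerPatrols": 1, "MoonPhase": 0}))
```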
Latest Research Publications:
ConceptCloud is a flexible interactive tool for exploring, visualising, and analysing semi-structured data sets. It uses a combination of an intuitive tag cloud visualisation with an underlying concept lattice to provide a formal structure for navigation through a data set. ConceptCloud 2.0 extends the tool with an integrated map view to exploit the geolocation aspect of data. The tool’s implementation of exploratory search does not require prior knowledge of the structure of the data or compromise on scalability, and provides seamless navigation through the tag cloud and the map viewer.
@misc{227, author = {Tiaan Du Toit, Joshua Berndt, Katarina Britz, Bernd Fischer}, title = {ConceptCloud 2.0: Visualisation and exploration of geolocation-rich semi-structured data sets}, abstract = {ConceptCloud is a flexible interactive tool for exploring, visualising, and analysing semi-structured data sets. It uses a combination of an intuitive tag cloud visualisation with an underlying concept lattice to provide a formal structure for navigation through a data set. ConceptCloud 2.0 extends the tool with an integrated map view to exploit the geolocation aspect of data. The tool’s implementation of exploratory search does not require prior knowledge of the structure of the data or compromise on scalability, and provides seamless navigation through the tag cloud and the map viewer.}, year = {2019}, journal = {ICFCA 2019 Conference and Workshops}, month = {06/2019}, publisher = {CEUR-WS}, isbn = {1613-0073}, url = {http://ceur-ws.org/Vol-2378/}, }
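For readers unfamiliar with the concept lattices that underpin ConceptCloud, the short from-scratch sketch below enumerates the formal concepts of a tiny invented object/tag context; the real tool is, of course, incremental and far more scalable.

```python
from itertools import combinations

context = {  # object -> set of attribute tags (invented example data)
    "rec1": {"2019", "Cape Town"},
    "rec2": {"2019", "Stellenbosch"},
    "rec3": {"2018", "Stellenbosch"},
}
attributes = set().union(*context.values())

def extent(attrs):  # objects carrying every attribute in attrs
    return {o for o, tags in context.items() if attrs <= tags}

def intent(objs):   # attributes shared by every object in objs
    return set.intersection(*(context[o] for o in objs)) if objs else set(attributes)

# A formal concept is an (extent, intent) pair closed under the two maps.
concepts = set()
for r in range(len(attributes) + 1):
    for attrs in combinations(sorted(attributes), r):
        e = extent(set(attrs))
        concepts.add((frozenset(e), frozenset(intent(e))))

for e, i in sorted(concepts, key=lambda c: -len(c[0])):
    print(sorted(e), "<->", sorted(i))
```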
Semi-structured data sets such as product reviews or event log data are simultaneously becoming more widely used and growing ever larger. This paper describes ConceptCloud, a flexible interactive browser for semi-structured datasets, with a focus on the recent trend of implementing server-based architectures to accommodate ever growing datasets. ConceptCloud makes use of an intuitive tag cloud visualization viewer in combination with an underlying concept lattice to provide a formal structure for navigation through datasets without prior knowledge of the structure of the data or compromising scalability. This is achieved by implementing architectural changes to increase the system’s resource efficiency.
@{185, author = {Joshua Berndt, Bernd Fischer, Katarina Britz}, title = {Scaling the ConceptCloud browser to large semi-structured data sets}, abstract = {Semi-structured data sets such as product reviews or event log data are simultaneously becoming more widely used and growing ever larger. This paper describes ConceptCloud, a flexible interactive browser for semi-structured datasets, with a focus on the recent trend of implementing server-based architectures to accommodate ever growing datasets. ConceptCloud makes use of an intuitive tag cloud visualization viewer in combination with an underlying concept lattice to provide a formal structure for navigation through datasets without prior knowledge of the structure of the data or compromising scalability. This is achieved by implementing architectural changes to increase the system’s resource efficiency.}, year = {2018}, journal = {14th African Conference on Research in Computer Science and Applied Mathematics, Stellenbosch, South Africa, Proceedings}, pages = {276- 283}, month = {14/10-16/10}, publisher = {HAL archives-ouvertes}, url = {https://hal.inria.fr/hal-01881376}, }
Context: version control repositories contain a wealth of implicit information that can be used to answer many questions about a project’s development process. However, this information is not directly accessible in the repositories and must be extracted and visualized.
Objective: the main objective of this work is to develop a flexible and generic interactive visualization engine called ConceptCloud that supports exploratory search in version control repositories.
Method: ConceptCloud is a flexible, interactive browser for SVN and Git repositories. Its main novelty is the combination of an intuitive tag cloud visualization with an underlying concept lattice that provides a formal structure for navigation. ConceptCloud supports concurrent navigation in multiple linked but individually customizable tag clouds, which allows for multi-faceted repository browsing, and scriptable construction of unique visualizations.
Results: we describe the mathematical foundations and implementation of our approach and use ConceptCloud to quickly gain insight into the team structure and development process of three projects. We perform a user study to determine the usability of ConceptCloud. We show that untrained participants are able to answer historical questions about a software project better using ConceptCloud than using a linear list of commits.
Conclusion: ConceptCloud can be used to answer many difficult questions such as “What has happened in this project while I was away?” and “Which developers collaborate?”. Tag clouds generated from our approach provide a visualization in which version control data can be aggregated and explored interactively.
@article{174, author = {Bernd Fischer, M. Esterhuizen, G.J. Greene}, title = {Visualizing and Exploring Software Version Control Repositories using Interactive Tag Clouds over Formal Concept Lattices}, abstract = {Context: version control repositories contain a wealth of implicit information that can be used to answer many questions about a project’s development process. However, this information is not directly accessible in the repositories and must be extracted and visualized. Objective: the main objective of this work is to develop a flexible and generic interactive visualization engine called ConceptCloud that supports exploratory search in version control repositories. Method: ConceptCloud is a flexible, interactive browser for SVN and Git repositories. Its main novelty is the combination of an intuitive tag cloud visualization with an underlying concept lattice that provides a formal structure for navigation. ConceptCloud supports concurrent navigation in multiple linked but individually customizable tag clouds, which allows for multi-faceted repository browsing, and scriptable construction of unique visualizations. Results: we describe the mathematical foundations and implementation of our approach and use ConceptCloud to quickly gain insight into the team structure and development process of three projects. We perform a user study to determine the usability of ConceptCloud. We show that untrained participants are able to answer historical questions about a software project better using ConceptCloud than using a linear list of commits. Conclusion: ConceptCloud can be used to answer many difficult questions such as “What has happened in this project while I was away?” and “Which developers collaborate?”. Tag clouds generated from our approach provide a visualization in which version control data can be aggregated and explored interactively.}, year = {2017}, journal = {Information and Software Technology}, volume = {87}, pages = {223-241}, issue = {2017}, publisher = {Elsevier}, url = {https://www.sciencedirect.com/science/article/pii/S0950584916304050?via%3Dihub}, }
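As a flavour of the extraction step such a browser needs, the sketch below turns plain `git log` output into a commit-to-tags table and simple tag-cloud weights; the tag scheme (author and month tags only) is an assumption.

```python
import subprocess
from collections import Counter

# %H = commit hash, %an = author name, %ad = author date (year-month here).
log = subprocess.run(
    ["git", "log", "--pretty=format:%H|%an|%ad", "--date=format:%Y-%m"],
    capture_output=True, text=True, check=True).stdout

table = {}  # commit hash -> set of tags
for line in log.splitlines():
    commit, author, date = line.split("|")
    table[commit] = {f"author:{author}", f"date:{date}"}

# A tag's weight in the cloud is simply its frequency across commits.
weights = Counter(tag for tags in table.values() for tag in tags)
print(weights.most_common(10))
```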
Acquiring an overview of an unfamiliar discipline and exploring relevant papers and journals is often a laborious task for researchers. In this paper we show how exploratory search can be supported on a large collection of academic papers to allow users to answer complex scientometric questions which traditional retrieval approaches do not support optimally. We use our ConceptCloud browser, which makes use of a combination of concept lattices and tag clouds, to visually present academic publication data (specifically, the ACM Digital Library) in a browsable format that facilitates exploratory search. We augment this dataset with semantic categories, obtained through automatic keyphrase extraction from papers’ titles and abstracts, in order to provide the user with uniform keyphrases of the underlying data collection. We use the citations and references of papers to provide additional mechanisms for exploring relevant research by presenting aggregated reference and citation data not only for a single paper but also across topics, authors and journals, which is novel in our approach. We conduct a user study to evaluate our approach in which we asked 34 participants, from different academic backgrounds with varying degrees of research experience, to answer a variety of scientometric questions using our ConceptCloud browser. Participants were able to answer complex scientometric questions using our ConceptCloud browser with a mean correctness of 73%, with the user’s prior research experience having no statistically significant effect on the results.
@article{173, author = {Bernd Fischer, M. Dunaiski, G.J. Greene}, title = {Exploratory Search of Academic Publication and Citation Data using Interactive Tag Cloud Visualizations}, abstract = {Acquiring an overview of an unfamiliar discipline and exploring relevant papers and journals is often a laborious task for researchers. In this paper we show how exploratory search can be supported on a large collection of academic papers to allow users to answer complex scientometric questions which traditional retrieval approaches do not support optimally. We use our ConceptCloud browser, which makes use of a combination of concept lattices and tag clouds, to visually present academic publication data (specifically, the ACM Digital Library) in a browsable format that facilitates exploratory search. We augment this dataset with semantic categories, obtained through automatic keyphrase extraction from papers’ titles and abstracts, in order to provide the user with uniform keyphrases of the underlying data collection. We use the citations and references of papers to provide additional mechanisms for exploring relevant research by presenting aggregated reference and citation data not only for a single paper but also across topics, authors and journals, which is novel in our approach. We conduct a user study to evaluate our approach in which we asked 34 participants, from different academic backgrounds with varying degrees of research experience, to answer a variety of scientometric questions using our ConceptCloud browser. Participants were able to answer complex scientometric questions using our ConceptCloud browser with a mean correctness of 73%, with the user’s prior research experience having no statistically significant effect on the results.}, year = {2017}, journal = {Scientometrics (Springer)}, volume = {110}, pages = {1539-1571}, issue = {3}, address = {Netherlands}, isbn = {0138-9130}, url = {https://link.springer.com/article/10.1007%2Fs11192-016-2236-3}, }
No Abstract
@inproceedings{141, author = {G.J. Greene, Bernd Fischer}, title = {CVExplorer: Identifying Candidate Developers by Mining and Exploring Their Open Source Contributions}, abstract = {No Abstract}, year = {2016}, journal = {Automated Software Engineering}, pages = {804-809}, month = {03/09-07/09}, isbn = {978-1-4503-3845-5}, }