Mining Reliable Information from Crowdsourced Data
- Jing Gao, University at Buffalo
- Yaqing Wang. PhD Student.
- Tianqi Wang. PhD Student.
- Rui Li. PhD Student.
- Qi Li. Assistant Professor, Iowa State University.
- Yaliang Li. Senior Engineer, Alibaba.
- Houping Xiao. Assistant Professor, Georgia State University.
- Fenglong Ma. Assistant Professor, Penn State University.
- Wendy Shi. Graduate Student, University of Illinois.
This website is based upon work supported by the National Science Foundation under Grant No. IIS-1553411. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
With the proliferation of mobile devices and social media platforms, any person can publicize observations about any activity, event or object anywhere and at any time. The confluence of this enormous volume of crowdsourced data can contribute to an inexpensive, sustainable and large-scale decision system that has never been possible before. Such a system could vastly improve the efficiency and reduce the cost of transportation, healthcare, and many other applications. The main obstacle in building such a system lies in the problem of information veracity, i.e., individual users might provide unreliable or even misleading information. This project identifies important research questions in the task of mining reliable information from noisy and unreliable crowdsourced data, and pursues an integrated research and education plan to address these questions. Through integrating data from various sources, this project addresses information veracity, which will benefit the many applications where crowdsourced data are ubiquitous but veracity can be suspect.
In particular, this project develops novel methods to mine reliable information by taking into consideration various properties of crowdsourcing: 1) Crowdsourcing platforms collect users' observations about certain objects. Other valuable information sources, such as spatial-temporal, user influence, and textual data, are leveraged to effectively detect reliable information from these observations. 2) Effective privacy protection and budget allocation mechanisms are designed to better motivate active crowdsourcing. These investigations are integrated with the exploration of both theoretical and practical aspects of the proposed methods. From the theoretical perspective, fundamental questions regarding the confidence in the estimated reliability and the convergence of the proposed methods are explored. From the practical perspective, the proposed methods are adapted to tackle challenging problems in various applications such as transportation, healthcare and education to enable new insights into these domains. In addition to the research advances, this project contributes to educational innovation, as the proposed methods are applied to educational methodologies such as peer assessment and question answering.
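As a rough illustration of what "mining reliable information" from conflicting observations can look like, the sketch below alternates between estimating the truth for each object and estimating a reliability weight for each user. It is a generic, simplified iterative weighting scheme written for this page as an example, not the specific method of any publication listed here; the function name `truth_discovery`, the negative-log weight update, and the toy data are all assumptions chosen for illustration.

```python
import math

def truth_discovery(claims, n_iters=20, eps=1e-9):
    """Illustrative sketch only (hypothetical helper, not the project's code).

    claims: dict mapping source -> dict mapping object -> numeric observation.
    Returns (truths, weights).
    """
    sources = list(claims)
    objects = sorted({obj for obs in claims.values() for obj in obs})
    weights = {s: 1.0 for s in sources}  # start by trusting every source equally
    truths = {}
    for _ in range(n_iters):
        # Step 1: estimate each object's truth as the weighted mean of its claims.
        for obj in objects:
            num = sum(weights[s] * claims[s][obj] for s in sources if obj in claims[s])
            den = sum(weights[s] for s in sources if obj in claims[s])
            truths[obj] = num / den
        # Step 2: re-estimate each source's weight from its squared error.
        errors = {
            s: sum((v - truths[obj]) ** 2 for obj, v in claims[s].items()) + eps
            for s in sources
        }
        total = sum(errors.values())
        # Assumed update rule: a larger share of the total error gives a smaller weight.
        weights = {s: -math.log(errors[s] / total) + eps for s in sources}
    return truths, weights

# Toy data (made up): two roughly consistent users and one misleading user.
claims = {
    "alice": {"temp": 20.0, "speed": 50.0},
    "bob": {"temp": 21.0, "speed": 52.0},
    "mallory": {"temp": 35.0, "speed": 90.0},
}
truths, weights = truth_discovery(claims)
print(truths)
print(weights)
```

On this toy input the misleading user ends up with the smallest weight, so the estimates stay close to the two consistent users instead of drifting toward the plain average. Properties such as spatial-temporal context, user influence, text, and privacy, which the project studies, are deliberately left out of this sketch.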
- [ICDCS20] Towards Differentially Private Truth Discovery for Crowd Sensing Systems. International Conference on Distributed Computing Systems, Singapore, June 2020, to appear.
- [SDM20] Rare Disease Prediction by Generating Quality-Assured Electronic Health Records. SIAM Conference on Data Mining, Cincinnati, OH, May 2020, to appear.
- [INS] Multi-source Data Repairing Powered by Integrity Constraints and Source Reliability. Information Sciences, Vol. 507, pp. 386-403, January 2020.
- [BigData19] IProWA: A Novel Probabilistic Graphical Model for Crowdsourcing Aggregation. IEEE International Conference on Big Data, Los Angeles, CA, December 2019, 677-682.
- [BigData19] Online Federated Multitask Learning. IEEE International Conference on Big Data, Los Angeles, CA, December 2019, 215-220.
- [IJCAI19] Metric Learning on Healthcare Data with Incomplete Modalities. International Joint Conference on Artificial Intelligence, Macao, China, August 2019, 3534-3540.
- [EDM19] Deep Hierarchical Knowledge Tracing. International Conference on Educational Data Mining, Montreal, Canada, July 2019, 671-674.
- [EDM19] Improving Peer Assessment Accuracy by Incorporating Relative Peer Grades. International Conference on Educational Data Mining, Montreal, Canada, July 2019, 450-455.
- [KBS] PatternFinder: Pattern discovery for truth discovery. Knowledge-Based Systems, Vol. 176, pp. 97-109, July 2019.
- [SDM19] DTEC: Distance Transformation Based Early Time Series Classification. SIAM Conference on Data Mining, Calgary, Canada, May 2019, 486-494.
- [WWW19] MCVAE: Margin-based Conditional Variational Autoencoder for Relation Classification and Pattern Generation. The Web Conference, San Francisco, CA, May 2019, 3041-3048.
- [TKDE] Towards Confidence Interval Estimation in Truth Discovery. IEEE Transactions on Knowledge and Data Engineering, 31(3): 575-588, March 2019.
- [CIKM18] KAME: Knowledge-based Attention Model for Diagnosis Prediction in Healthcare. ACM International Conference on Information and Knowledge Management, Turin, Italy, October 2018, 743-752.
- [ASONAM18] Leveraging the Power of Informative Users for Local Event Detection. IEEE/ACM International Conference on Advances in Social Network Analysis and Mining, Barcelona, Spain, August 2018, 429-436.
- [KDD18] An Efficient Two-Layer Mechanism for Privacy-Preserving Truth Discovery. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, London, UK, August 2018, 1705-1714.
- [KDD18] Risk Prediction on Electronic Health Records with Prior Medical Knowledge. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, London, UK, August 2018, 1910-1919.
- [KDD18] TextTruth: An Unsupervised Approach to Discover Trustworthy Information from Multi-Sourced Text Data. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, London, UK, August 2018, 2729-2737.
- [SDM18] Uncorrelated Patient Similarity Learning. SIAM Conference on Data Mining, San Diego, CA, May 2018, 270-278.
- [WWW18] Attack under Disguise: An Intelligent Data Poisoning Attack Mechanism in Crowdsourcing. The Web Conference, Lyon, France, April 2018, 13-22.
- [BigData17] Travel Purpose Inference with GPS Trajectories, POIs, and Geotagged Social Media Data. IEEE International Conference on Big Data, Boston, MA, December 2017, 1319-1324.
- [SIGSPATIAL17] City-wide Traffic Volume Inference with Loop Detector Data and Taxi Trajectories. ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, Redondo Beach, CA, November 2017, 1:1-1:10.
- [KDD17] Unsupervised Discovery of Drug Side-Effects from Heterogeneous Data Sources. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, Canada, August 2017, 967-976.
- [KDD17] Dipole: Diagnosis Prediction in Healthcare via Attention-based Bidirectional Recurrent Neural Networks. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, Canada, August 2017, 1903-1911.
- [WSDM17] Reliable Medical Diagnosis from Crowdsourcing: Discover Trustworthy Answers from Non-Experts. ACM International Conference on Web Search and Data Mining, Cambridge, UK, February 2017, 253-261.
- [CIKM16] Influence-Aware Truth Discovery. ACM International Conference on Information and Knowledge Management, Indianapolis, IN, October 2016, 851-860.
- [TBD] Extracting Medical Knowledge from Crowdsourced Question Answering Website. IEEE Transactions on Big Data, accepted, September 2016.
- [KDD16] A Truth Discovery Approach with Theoretical Guarantee. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, August 2016, 1925-1934.
- [KDD16] Towards Confidence in the Truth: A Bootstrapping based Truth Discovery Approach. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, August 2016, 1935-1944.
- UB CSE 469: Introduction to Data Mining
- UB CSE 601: Data Mining and Bioinformatics
- UB CSE 706: Selected Topics in Data Mining
- [KDD19] Optimize the Wisdom of the Crowd: Inference, Learning, and Teaching. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Anchorage, AK, August 2019.
- OFMTL: "Online Federated Multitask Learning" in [BigData19]
- Two-Layer Perturbation Approach: "Privacy Preserving Truth Discovery" in [KDD18]
- Intelligent Attack: "Data Poisoning Attack against Crowdsourcing Aggregation" in [WWW18]
Last updated: May 2020.