Better Bridges Between Model and Real World

To build better machine learning solutions, we need not only better models but also better bridges, through their inputs and outputs, to the real-world challenges they aim to solve. On the input side, these bridges are the tools one has to work with the available data. On the output side, they are the tools to evaluate and ensure models will generalize and work as intended. This project aims to improve our understanding and resources on both sides, with a particular focus on social good applications. So far, progress has been made in six contexts, such as countering misinformation and human trafficking. Future work aims to both build on these specific contexts and leverage their interconnections to produce insights and tools with wide applicability.


Introduction
Machine learning has transformed science and society, from everyday uses like powering smartphone apps or web search to cutting-edge research in almost all fields. However, there are still many unsolved societal challenges and many cases where machine learning systems fail to deliver the desired results. For instance, recent generative language models are prone to confident but false statements. In order to efficiently create better solutions for social good applications like mitigating misinformation, we need not only better models but also stronger ties between models and the real-world problems they aim to solve. First, on the input side, we need better methods to leverage the full breadth of available data, which could come in many types, from many different sources, and with varying quality. This will enable us to maximize the data we are able to use and find the data that will lead to the best generalization, even while adapting to practical constraints like varied availability of different data types and qualities. Second, on the output side, we need better evaluation and improved reliability. This will ensure that success in laboratory conditions will generalize and translate to success in the real world, and that our systems will not suffer unexpected flaws that limit effectiveness or outright cause harm. Overall, by focusing on these key but often less explored areas where models connect to their real-world tasks, this project aims to deliver practical tools and insights.
The project aims to deliver progress here in two phases. In the first, which has been underway since late 2020, the project targets specific applications and domains. These include, so far, misinformation, political polarization and party prediction, temporal graphs, human trafficking, and the robustness of Go-playing systems. The goal here is both to produce meaningful advancements in these particular areas and to learn patterns that might generalize across domains. The latter feeds into the second phase of the project, which is to produce general insights that can improve the ties between a broad array of ML systems and their real-world goals. Work here is in the early stages. The first preliminary conclusion is the importance of testing simple models and baselines, which can provide surprisingly strong performance. The next step will be to create and leverage tools to work with signed graphs, which will benefit multiple applications.

Phase 1: Specific Applications and Domains
In this section, we discuss research completed and in progress, and how it fits into this overall project. Note that some of the work here is led by coauthors advancing other research projects, but it nonetheless contributes to this one as well.
In [1], we examined the use of standard language models (e.g., BERT [2] and RoBERTa [3]) for misinformation detection, finding that they yielded surprisingly strong performance, competitive with or even exceeding state-of-the-art domain-specific models on known conspiracies. This is an important real-world task (alongside detecting misinformation on new, unseen topics) and the most common evaluation setting, but in prior work the state-of-the-art models had generally not been compared with such language models. Thus, there was a hole in the evaluation process. These results also have implications for the usage of different data types, showing that text content generalizes effectively to new examples of known misinformation, while other modalities are likely needed to generalize to new misinformation. Furthermore, we found flaws in commonly used datasets, with their collection process creating unintentional differences between examples of different classes, which made distinguishing the classes unrealistically easy. For instance, in the Twitter16 dataset [4,5], classifying the date alone can give over 90% F1. But in real applications, the date alone is never sufficient to identify misinformation, and in many cases may not be relevant at all. In work in progress, we are expanding this search for flaws in datasets and other practices, searching for and hopefully solving more issues in order to shrink the gap between research and practical, real-world misinformation mitigation.
Another area where comprehensive evaluation was missing is party prediction for social media users. This is a fundamental task for a great deal of research, such as studying political polarization, the impact of bots, and the spread of misinformation. In all these examples, to understand interactions and conflicts between different groups, one has to first identify the groups. Despite the development of a substantial number of approaches for this task, comparative evaluation of their performance was limited or in many cases nonexistent. [6] aims to rectify this, comparing strong existing approaches (e.g., [7,8]) and simple but effective new ones. The comparison includes not only accuracy but also other key practical considerations such as data availability and computation speed. Some of the new methods also provide the ability to use new data types that give performance comparable to previously used ones, expanding the options available to future researchers. Work in progress aims to expand this research with a broader scope of data (e.g., countries with more complex political party systems), transfer (e.g., learning on a country with high data availability and predicting on another where data is more limited), and new tools and understanding to combine text and graph data.
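To illustrate the flavor of a simple but effective approach that such a comparison should include, one cheap heuristic is to label a user by majority vote over the politicians they retweet. This sketch is purely illustrative and not the method of [6]; it assumes a hypothetical seed set of accounts with known party labels:

```python
from collections import Counter

def predict_party(retweeted_accounts, seed_labels):
    """Majority-vote party prediction from retweet behavior.

    retweeted_accounts: iterable of account ids the user has retweeted.
    seed_labels: dict mapping known accounts (e.g., politicians) to a party.
    Returns the majority party, or None if no labeled account was retweeted.
    """
    votes = Counter(seed_labels[a] for a in retweeted_accounts if a in seed_labels)
    return votes.most_common(1)[0][0] if votes else None
```

Despite its simplicity, a baseline like this is cheap to run, needs no training, and gives later comparisons a meaningful floor.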
Link prediction on temporal graphs also had severe limitations in evaluation. Here the standard evaluation framework led to multiple approaches achieving nearly perfect performance. This made it very challenging to determine the best approach or whether a new one was better than existing ones. Moreover, such an optimistic performance assessment was often unrealistic, considering that even for many static graphs link prediction is not yet solved. [9] provided key tools for more informative and realistic evaluation. These included new datasets with greater diversity, more challenging negative sampling techniques that can also better match real-world tasks, new exploratory tools, and a new simple baseline with competitive performance.
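One way to make negative sampling more challenging can be sketched as drawing negatives from edges that appeared earlier in the stream but are absent at the current step; these are much harder to dismiss than random node pairs. A minimal illustration, not the exact procedure of [9]:

```python
import random

def historical_negatives(past_edges, current_edges, k, rng=random):
    """Sample up to k 'hard' negative edges for temporal link prediction.

    past_edges: set of (src, dst) pairs observed earlier in the stream.
    current_edges: set of (src, dst) pairs present at the current step.
    Returns up to k past edges that are absent now; these look plausible
    to a memorization-heavy model, so they probe real generalization.
    """
    pool = list(past_edges - current_edges)
    rng.shuffle(pool)
    return pool[:k]
```

When the candidate pool is small, fewer than k negatives are returned, and random negatives can fill the remainder.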
Measuring political polarization presents a different evaluation challenge: the lack of a definitive ground truth. There is no single, concrete measure to compare with. In fact, there is no single definition either, and in turn no predetermined input data. We have approached this problem from two angles. First, focusing on text data, [10] proposed a method to investigate specific topics (here COVID-19, and especially discussion of lockdowns, vaccines, and masks). By incorporating geographical data, we were able to investigate individual US states (and, in work in progress, Canadian regions) and validate our measure by comparing with external COVID-19 measures like case, vaccination, and death rates. Second, focusing on network interactions, work in progress proposes a method to combine multiple interaction types to measure polarization and unusual activity around the 2020 US election. We validate this measure by comparing with key real-world events, survey data, and news data. Both studies provide new methods to handle complex, varied inputs for this task, and ways to combine multiple types of evaluation to validate results in a setting with no ground truth available.
Another line of work aims to extract named entities from online escort advertisements. Finding patterns in entities across multiple ads can help detect human trafficking. However, unlike typical named entity recognition, the data here is noisy and may even be intentionally obfuscated to avoid detection. In addition, the data is often sensitive and private, making it challenging to obtain large amounts of labeled data. To overcome these challenges, [11] combined regex rules, a dictionary extractor, and a disambiguation module to extract person names. Validation was done starting from crowdsourced labels, with manual corrections for some ambiguous cases, using both standard and domain-specific metrics. The new method outperformed the existing state-of-the-art for this domain on both types of metrics. Work in progress aims to extend this to other types of entities (locations and social media links) and to approach the problem in a multitask framework to leverage the connections in the output data (e.g., person names might be part of social media links too, but are unlikely to be part of a location). We also aim to improve the connections between the different components using a weak supervision framework, particularly Skweak [12].
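The rule-plus-dictionary split can be sketched in a few lines: a contextual regex catches names introduced by cue phrases, while a dictionary lookup catches known names appearing without context. The cue phrases and dictionary below are illustrative stand-ins, not the actual rules of [11]:

```python
import re

# Hypothetical cue phrases that often precede a person name in ad text.
NAME_CUE = re.compile(r"\b(?:my name is|call me|ask for)\s+([A-Za-z]+)",
                      re.IGNORECASE)

def extract_names(text, name_dictionary):
    """Union of a contextual regex rule and a dictionary extractor.

    name_dictionary: set of known first names, lowercased.
    Returns the set of candidate name tokens found in the text; a separate
    disambiguation step would then filter false positives.
    """
    candidates = {m.group(1) for m in NAME_CUE.finditer(text)}
    candidates |= {tok for tok in re.findall(r"[A-Za-z]+", text)
                   if tok.lower() in name_dictionary}
    return candidates
```

The two extractors are complementary: the regex recovers names the dictionary has never seen, and the dictionary recovers names that appear without any cue phrase.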
Finally, the last application studied so far is the robustness of systems that play the board game Go. These systems are strongly superhuman, are trained in an adversarial self-play process (typically based on [13]), and use considerable amounts of Monte Carlo Tree Search (MCTS) to simulate the likely future progression of the game. One might hope that these factors would lead to very reliable performance. However, in [14], we show that these systems still harbor severe weaknesses and can make game-losing mistakes that much weaker humans would not. These mistakes can also be induced by humans without algorithmic assistance, and they transfer in a zero-shot setting to different Go systems. This study highlights how thorough testing and other tools to ensure robustness and reliability are crucial to tie machine learning systems to the real world without unexpected and potentially catastrophic failures.

Phase 2: General Tools and Insights
The work above illustrates challenges and solutions, across varied contexts, to better tie ML systems to the real-world tasks they are meant to solve. To go beyond specific contexts and provide more widely applicable solutions, the second phase of this project aims to find and build on patterns and connections between the works above.
One such connection is the importance of comparison with generic, simple models and baselines. In three works [1,6,9], we found that such approaches can sometimes even outperform far more complex, domain-specific state-of-the-art models. The primary cause is simply a lack of comparisons, either in general or with the strongest available baselines. For instance, in [1], we found that many studies compared their model with text-only baselines. But these were typically older ones, such as Naive Bayes or SVM-based classifiers, which did not match the performance of more recent transformer-based approaches. On the other hand, sometimes there may be no standard baselines, but a simple, "naive" approach can provide one. For example, for temporal graphs, [9] showed that just memorizing every edge seen provides a valuable and surprisingly strong baseline. Overall, more thorough comparison with simple models will lead to more conclusive improvements in complex ones, as well as sometimes directly providing more efficient approaches and other insights.
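The memorization baseline for temporal graphs fits in a handful of lines, which is part of why omitting it from comparisons is hard to justify. A minimal sketch of the idea (a simplified illustration, not the exact baseline of [9]):

```python
class EdgeMemory:
    """Pure-memorization baseline for temporal link prediction:
    predict that an edge will (re)occur iff it has been observed before."""

    def __init__(self):
        self.seen = set()

    def update(self, src, dst):
        """Record an observed interaction between src and dst."""
        self.seen.add((src, dst))

    def predict(self, src, dst):
        """Return 1.0 if this edge has ever been seen, else 0.0."""
        return 1.0 if (src, dst) in self.seen else 0.0
```

Because many real temporal graphs are dominated by recurring edges, even this trivial model can rival learned approaches under lenient evaluation, which is exactly why it makes a revealing baseline.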
Another connection is the potential of signed graphs. In social media data, it is relatively easy to tell when a user interacts with another user (e.g., replying) or discusses a topic or particular entity (for example, using topic modeling or keywords). However, analyzing the quality and intent of that interaction is much more challenging. For example, when two users interact, do they agree? When a user mentions a political issue or figure, are they indicating support or opposition? These examples correspond to the task of predicting signs for edges in a graph. Although there has been research aimed at answering these questions, there is no definitive solution. But such a solution would enable much deeper insights into social media interactions and discussion. For instance, knowing exactly when users agree or disagree would lead to much more accurate and informative party prediction and polarization measurements. Many misinformation datasets like [15] collect posts about fact-checked false topics, but cannot differentiate posts that endorse and spread the misinformation from ones that oppose it. If the signs of interactions were known, these datasets could be refined and lead to much more accurate misinformation detection methods. We could also build a much better understanding of discussion around contested political issues like abortion and climate change, and of how that discussion evolves over time.
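As a concrete example of edge sign prediction, a classic heuristic from structural balance theory infers an unknown sign from triads through shared neighbors ("the enemy of my enemy is my friend"). This sketch illustrates the task, not the approaches of the works cited below:

```python
def predict_sign(u, v, signed_edges):
    """Structural-balance heuristic for an unobserved edge (u, v).

    signed_edges: dict mapping frozenset({a, b}) -> +1 (agreement) or -1
    (disagreement). Each shared neighbor w casts a vote sign(u,w) * sign(w,v),
    since a balanced triad implies that product equals sign(u,v).
    Ties and isolated pairs default to +1.
    """
    def neighbors(x):
        return {n for edge in signed_edges for n in edge if x in edge} - {x}

    vote = sum(signed_edges[frozenset({u, w})] * signed_edges[frozenset({w, v})]
               for w in neighbors(u) & neighbors(v))
    return 1 if vote >= 0 else -1
```

Learned methods go well beyond this heuristic, but it shows how even sparse sign observations propagate information across a network.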
By combining existing approaches such as [16,17,18], we aim to create a strong method to predict the signs of different interactions. Then we will use this method in the applications discussed above. In addition, signed temporal graph datasets and methods are very limited. This work should produce at least one such dataset (when examining the evolution of discussions over time), and will hopefully produce a new method for using signed temporal graph data that outperforms existing ones.

Conclusion
In order to produce transformative and reliable ML systems with positive social impact, we need to strengthen the connections between our models and the real world. These connections are formed on both the input and output sides. This project aims to improve both, first by working to solve specific but varied challenges, then by synthesizing broadly applicable tools and insights from that work. So far, progress has been made in six specific areas, with a particular focus on social good applications. This has in turn led to one general insight from repeated observations across several areas (the importance of simple but thorough baselines) and to a concrete new direction (signed graphs) that has the potential to advance a number of these areas simultaneously. In the upcoming months, we aim to realize the work planned and underway, as well as to find new connections and general tools. Overall, the project aims to provide specific solutions to critical societal challenges of today, like misinformation and human trafficking, as well as generally applicable approaches that will help solve the challenges of tomorrow.