A First-Timer's Impressions of AAAI 2024 (w/ Friends)

research

March 03, 2024 10-minute read

Overview

AAAI 2024 🇨🇦 boasted 12,100 main-track submissions. Of these, 9,862 were reviewed and 2,342 (23.75%) were accepted. Judging by recent submission counts, AAAI is one of the largest artificial intelligence and machine learning conferences, alongside NeurIPS and CVPR.

At AAAI 2024, what did researchers more senior (and smarter) than I care about? If one topic dominated the conference, it was whether LLMs can reason.

There were many exciting trends in recent work. Inspired by Uri's ICASSP 2023 Top 10 Papers, I made a subjective list of 10 cool papers, which is substantially biased by my current interests.

I also had some suggestions to improve the conference experience.

Some of my senior undergraduate friends also attended AAAI 2024, and I asked them to share their thoughts and experiences from the conference. Eric Enouen, Ryan Schuerkamp, and I interned together at CMU RISS 2023. Furong Jia and I are both seniors at USC Viterbi.

LLMs, World Modeling, Planning, and Social Impacts

LLMs' roles in world modeling, planning, and social impact were discussed in three sessions: (Tutorial) LMs Meet World Models, (Tutorial) LLMs in Planning, and (Panel) Implications of LLMs.

To compare perspectives, I jotted down the positions that speakers expressed during these sessions:

| Speaker | World Modeling with LLMs | Utility of Symbolic Reasoning | Reasoning in the Future |
| --- | --- | --- | --- |
| Manning | -- | -- | LLMs will get immensely better, but semantics will become more contextual |
| Mcilraith | -- | Useful for epistemics vs. beliefs | Plurality of LLM and symbolic reasoning methods across tasks |
| Sutton | -- | Declining utility | Hybrid of LLM and symbolic reasoning methods |
| Kambhampati | Yes, if HITL | Useful as a content or style critic | LLM-Modulo: LLM as a plan generator and a critic to filter out plans |
| Hu | Yes | -- | LLM as the backend of an agent model: LLM-approximated rewards, beliefs, or world models |

Table 1: Summary of positions from the tutorials and the panel.

Consensus on LLMs as System-I Thinking

As expected, speakers generally agreed that LLMs resemble System-I (intuitive, reflexive) thinking more than System-II (analytic, deliberative) thinking. Symbolic reasoning and formal logic are more similar to System-II thinking and might help form hybrid systems with LLMs (e.g., Kambhampati's LLM-Modulo framework that was brought up so frequently I lost count).

Disagreement on Symbolic Reasoning's Future Role

Surprisingly, speakers disagreed on symbolic reasoning's future role.

Manning noted that LLMs are improving at a tremendous rate. Given that trajectory, he conceded that the crisp semantics of symbolic reasoning may no longer be the default in the future.

Sutton suggested that hybrid methods may be beneficial, though he admitted that he frequently changes his mind on the topic. He shared a reason for his hesitation:

Formal logic has had a two thousand year history of overpromising and underdelivering ... think to Aristotle. Syllogisms were introduced as a normative method of human reasoning. And yet no one in the history of human ideas has convinced anyone of anything using a syllogism.

Mcilraith was optimistic about classical knowledge representation methods, which she believes complement LLM-based systems in safety-critical applications. She added that these methods will be helpful for agents tasked with differentiating between epistemics and beliefs.

Kambhampati emphatically disagreed that LLMs can reason, arguing that human or formal-logic critics ought to be in the loop. In his tutorial, he contended that apparent LLM reasoning abilities are conflated with pattern matching due to confirmation bias, and that when humans are in the loop there is a strong possibility of the Clever Hans effect. In fact, he repeatedly declared:

In the eternal words of Sam Altman: 'For some definition of reasoning, it can do some kind of reasoning.'

Social Impacts: Privacy, the Environment, Trust and Misinformation, Labor, and Alignment

Each panelist was asked about a particular sphere of social impact.

Mcilraith expressed her concern about data privacy harms and environmental harms from LLM development. She also emphasized the need to protect public trust in electoral and banking systems, elaborating that, as stewards of AI technology, we are responsible for considering and deterring these negative impacts.

Kambhampati commented on misinformation, and he suggested that in the post-LLM age the messengers would be more important than the message.

Sutton speculated on the impacts of LLMs on the labor market. He suggested that LLM-guided automation has positive effects, drawing an analogy to how code compilers increased the prevalence of software engineering jobs. Considering job safety, Sutton reasoned that LLMs would disrupt white-collar work more than blue-collar work, so the safest job might be one like plumbing.

When asked whom to align to in value alignment, Manning commented that alignment already has a long history in the form of goal-oriented regex matching for dialogue systems. He raised doubts about the recently popular effective altruism groups, who strike him as sophists who prefer complex arguments and counterfactual worlds to caring about other human beings. Instead, he argued, the correct approach is to involve the broader spectrum of humanity as stakeholders in our technology.

There's been this sort of enormous growth of this field of AI alignment, which has been dominated by this futuristic, science-fiction, fantasy, effective altruist thinking. And I don't really see that as the right path forward.

Inspiration: 10 Cool Papers

Unfortunately, I did not attend all 2,342 paper presentations. But among those I did attend, a number of papers did a good job prodding at deep questions and open challenges; here are 10 of them.

Wow, I am just blown away. This work by UW, AI2, ++ considers pluralistic human values, rights, and duties across 31K human-written situations. The researchers reconcile simulated value generation / explanation / relevance / valence with the judgements of a diverse set of 600+ human annotators. They show that a small model (3B parameters) fine-tuned on the dataset can outperform GPT-4 on the ETHICS benchmark. They also show how human values are sensitive to subtle context changes as well as political alignment!

This great work by IBM Research explores a list of 93 social stigmas by constructing a QA benchmark. As opposed to prior work, their benchmark includes a broader set of stigmas and evaluates bias as a QA task. Interestingly, the work highlights that the most extreme biases center "around sex (e.g., sex offender, having sex for money, genital herpes, HIV) and drug consumption (e.g., drug dealing, cocaine use recreationally)," suggesting that more social biases ought to be explored!

Neural theory-of-mind (ToM) research is on the rise! This work from East China Normal University constructs yet another video ToM benchmark. Their benchmark contains 3K+ videos, an order of magnitude above the 100+ videos in the earlier MMToM-QA benchmark. Unfortunately, this larger scale of data seems to come at the cost of writing richer QA pairs.

This massive collaboration, led by Peking University, explores building an LLM agent for the collaborative multi-agent Overcooked-AI benchmark. Their agent framework comprises planning, verifying, memory, and belief-correction modules. Interestingly, most information appears to be represented in natural (or programmatic) language, and the modules are built from LLMs!

This work by the University of Technology Sydney incorporates human feedback to improve the morality of RL agents tasked with completing text-based games in the Jiminy Cricket benchmark. Their agent can act morally while completing the games successfully, objectives that are sometimes at odds. Cool work!

Are there downsides to multimodal alignment? This work by HKUST thinks so! The researchers propose the multimodal alignment tax, which occurs when raw human-written annotations (which can be straightforward or brief) are used for instruction tuning (e.g., in InstructBLIP). To counter the multimodal alignment tax, they train a multimodal style transfer model for increasing politeness in text annotations. Their model is impressive at image reasoning, and polite, too!

This work by Scale AI presents an image captioning dataset with dense accurate / inaccurate / subjective annotations to reduce hallucinations in entity descriptions and relationships. A reward model trained on their dataset can be used for best-of-n rejection sampling, which decreases hallucinations by 55%. This is the first paper I've seen come out of Scale AI -- very cool!
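If you haven't run into best-of-n rejection sampling before, it looks roughly like this in code. This is a minimal sketch, not the paper's implementation; `generate_caption` and `reward_model` are hypothetical stand-ins for a captioning model and a learned reward model.

```python
# Minimal sketch of best-of-n rejection sampling against a reward model.
# `generate_caption` and `reward_model` are hypothetical stand-ins, not
# the paper's actual API.
from typing import Callable, List


def best_of_n_caption(
    image,
    generate_caption: Callable[..., str],  # samples one candidate caption
    reward_model: Callable[..., float],    # scores an (image, caption) pair
    n: int = 16,
) -> str:
    """Sample n candidate captions and keep the one the reward model likes best."""
    candidates: List[str] = [generate_caption(image) for _ in range(n)]
    scores = [reward_model(image, c) for c in candidates]
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best]
```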

Great work showing how to create useful synthetic data for training! Researchers from École des Ponts explore automatically generating synthetic triplets (two videos and one modification-instruction text) for video retrieval. They train a vision-language model that describes the modification from one video to another, enabling them to collect a dataset of 1.6M triplets. Fine-tuning with their dataset improves image retrieval ability.
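Roughly, the data pipeline looks like the sketch below. It's my own loose paraphrase under assumptions: `find_similar_caption_pairs` and `describe_modification` are placeholder functions, not the authors' code.

```python
# Rough sketch of mining synthetic (source video, modification text, target video)
# triplets from a captioned video corpus. `find_similar_caption_pairs` and
# `describe_modification` are hypothetical placeholders, not the paper's code.

def build_triplets(corpus, find_similar_caption_pairs, describe_modification):
    triplets = []
    # Pair videos whose captions differ by a small edit (e.g., one changed word).
    for video_a, video_b in find_similar_caption_pairs(corpus):
        # A generative model writes the modification text, e.g.
        # "show the dog on the beach instead of on the grass".
        instruction = describe_modification(video_a.caption, video_b.caption)
        triplets.append((video_a, instruction, video_b))
    return triplets
```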

LLMs still lack rationality ... in a game theory context! This work by Shanghai Jiao Tong University examines LLMs in the GPT family. Based on dictator, rock-paper-scissors, and ring-network games, they find that LLMs lack rationality (they "struggle to build desires based on uncommon preferences, fail to refine belief from many simple patterns, and may overlook or modify refined belief when taking actions").

This work by the University of Electronic Science and Technology of China introduces an object localization task using human gestures and natural language descriptions in 3D point cloud environments. The intuition is that human gestures often disambiguate objects of interest, a good assumption to make in the real world!

Criticisms

Of course, it is a massive challenge to organize a conference of this scale, so enormous kudos to the organizers. But AAAI has been around for ~40 years, and I'd hope that some of these issues can be solved before we've achieved AGI.

Oral and Poster Sessions

  • Sessions were too broad (e.g., "Machine Learning 1, 2, ...").
  • Poster placements were grouped by submission order rather than by topic.
  • (Eric) Poster sessions were scheduled too late in the day.

Since I normally go to bed early on Eastern time anyway, it was very difficult to try to stay awake all day for the late poster sessions (7-9 pm).

  • Technical session information was spread too thin across four platforms, rather than on a centralized platform.
  • The lighting in some rooms obscured the slides.

Quality of Life

  • QR codes on the registration badges should have been much larger (why the empty space?).
  • For events that run from 8am-9pm, I really do need bottomless coffee.

Undergraduate Questionnaire

Ever wondered what undergraduates might think of a conference trip? Here you go.

Disclaimer: coarse statistics and human error courtesy of false recall. For privacy, survey participants are de-identified and shuffled.

A. Oral Presentation Quality: How would you rate the quality of the oral presentations on a scale of 1-10? (1 being poor, 10 being excellent)

B. Poster Presentation Quality: How would you rate the quality of the poster presentations on a scale of 1-10? (1 being poor, 10 being excellent)

C. Community: Approximately, how many new people did you meet?

Acknowledgements

Thanks to Eric Enouen, Ryan Schuerkamp, and Furong Jia for their contributions to the questionnaire.

Thanks to Maxwell Forbes for the inspiration to start a blog. Over time, I'll probably transition toward garden/evergreen-notes style content.

Thanks to Uri Nieto for the inspiration on conference impressions.


Footnotes

Current interests include language grounding in vision, social inference and reasoning, and learning from synthetic data.

(Tutorial) LMs Meet World Models. Presenter: Zhiting Hu. Hu suggested that humans reason with world models (state-transition mappings) inside agent models (planning systems that follow a goal, using a world model and beliefs to approximate the world state and other agents' beliefs). One theme spanned the tutorial: LMs should serve as the backend of Agent and World models (LAW), and the LAW framework could improve language, embodied, and social reasoning tasks. To improve on the LAW framework, Hu expressed support for active learning from embodied and social experiences, leveraging multimodality, reasoning in the latent space, and tool use.
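To make the "LM as backend" idea concrete, here is a hypothetical sketch of a single LAW-style agent step, where one LLM is prompted to act as belief tracker, planner, and approximate world model. The prompts and helper functions are my own placeholders, not Hu's implementation.

```python
import re

# Hypothetical sketch of a single LM-backed agent step in the spirit of LAW.
# `llm` is any text-completion callable; all prompts and helpers below are my
# own placeholders, not Hu's implementation.

def _progress_score(text: str) -> int:
    """Pull the last integer out of the LM's rating text (placeholder parser)."""
    digits = re.findall(r"\d+", text)
    return int(digits[-1]) if digits else 0

def law_agent_step(llm, goal, belief, observation):
    # Belief update: the LM revises its belief about the current world state.
    belief = llm(f"Goal: {goal}\nBelief: {belief}\nObservation: {observation}\n"
                 "Update the belief about the current world state.")
    # Planning: the LM proposes candidate next actions toward the goal.
    raw = llm(f"Goal: {goal}\nBelief: {belief}\n"
              "List three candidate next actions, one per line.")
    candidates = [line.strip() for line in raw.splitlines() if line.strip()]
    # World modeling: the LM acts as an approximate state-transition function,
    # imagining each candidate's outcome and rating progress toward the goal.
    def imagined_progress(action):
        return _progress_score(
            llm(f"Belief: {belief}\nAction: {action}\n"
                "Predict the next state, then rate goal progress from 0 to 10."))
    action = max(candidates, key=imagined_progress) if candidates else None
    return action, belief
```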

(Tutorial) LLMs in Planning. Presenter: Subbarao Kambhampati. Kambhampati emphatically posits that LLMs do not reason. While they are deceptively great at some tasks, these abilities can be attributed to data leakage or pattern matching. There is also confirmation bias when hallucinations align with our beliefs. However, LLMs are not useless for reasoning: instead, Kambhampati advocates using LLMs in LLM-Modulo frameworks. In LLM-Modulo frameworks, LLMs serve as approximate knowledge sources by generating candidate plans. An external critic verifies the plans and either presents its criticism as a backprompt to the LLM or, optionally, builds a synthetic fine-tuning dataset to improve the LLM.
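In code, that generate-and-critique loop might look something like this. It's a minimal sketch of my own, assuming a hypothetical `llm` text generator and an external `verify_plan` critic (e.g., a formal plan validator), not Kambhampati's reference implementation.

```python
# Minimal sketch of an LLM-Modulo style generate-and-critique loop.
# `llm` (text generator) and `verify_plan` (external, sound critic such as a
# plan validator) are hypothetical placeholders, not a reference implementation.

def llm_modulo_plan(llm, verify_plan, problem, max_rounds=10):
    prompt = f"Propose a step-by-step plan for: {problem}"
    for _ in range(max_rounds):
        plan = llm(prompt)
        ok, criticism = verify_plan(problem, plan)  # the critic, not the LLM, judges soundness
        if ok:
            return plan  # return only plans the external critic accepts
        # Backprompt: feed the critique back to the LLM and ask for a revision.
        prompt = (f"Problem: {problem}\nPrevious plan: {plan}\n"
                  f"Critique: {criticism}\nRevise the plan.")
    return None  # no verified plan within the budget
```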

(Panel) Implications of LLMs. Panelists: Christopher Manning, Sheila Mcilraith, Charles Sutton, Subbarao Kambhampati. The panel opened with a discussion on interplays between symbolic reasoning and LLMs. Then, panelists commented on the spheres of society where LLMs might have the largest impact: environmental harms, misinformation, labor, and alignment.

In BDIQA, a QA pair looks like, "What are her intentions?": "(a) Fetch food, cook food. (b) Switch off TV." versus MMToM-QA's QA pair, "If Elizabeth has been trying to get a bottle of wine, which one of the following statements is more likely to be true?": "(a) Elizabeth thinks that there is a bottle of wine inside the fridge. (b) Elizabeth thinks that there isn't a bottle of wine inside the fridge."

Or Yann LeCun's suggested acronym that he pitched in his keynote, AMI (Advanced Machine Intelligence).

The Whova App had technical session titles and times, but not the individual presentation titles. One PDF had information on oral presentations, and three PDFs each had information about the respective poster presentation day.

After the first few days, I avoided Ballroom D at all costs. At most times of the day, sunlight hit the projection screen directly and obscured the slides.