The place to explore the latest technical reports and new application fields
The KDD conference is known to be more applied than other A-tier ML conferences.
For this reason, it is a great place to catch up on the latest technical trends, but also to discover emerging topics.
In this post, I'll present the top ideas that interested me, surprised me, or at least felt novel.
📝 — Detecting the AI pen
This workshop was about detecting text generated by AI.
There are many motivations for doing so, but a major one is that AI makes it much easier to create misinformation, which we want to prevent.
The main takeaways:
- Soft watermarks may be the most effective way to detect AI writing; zero-shot methods come second.
- But you need long chunks of text. On short ones, even human-written text can be flagged as AI.
Some examples
Zero-shot detection — DetectGPT
We can do a lot based on the probability distribution of the generated text. Humans usually don't produce the highest-probability text, and perturbed variants of their text also behave differently: rewording AI-generated text drops its likelihood much more sharply than rewording human text.
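To make this concrete, here is a minimal, hypothetical sketch of a DetectGPT-style score with GPT-2, using random word dropping as a crude stand-in for the T5 mask-filling perturbations used in the actual paper:

```python
# Hypothetical DetectGPT-style sketch: machine text tends to sit at a
# local maximum of the model's log-likelihood, so perturbing it lowers
# the likelihood more than perturbing human text does.
import random

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def log_likelihood(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the model returns the mean token NLL.
        loss = model(ids, labels=ids).loss
    return -loss.item()

def perturb(text: str, drop_rate: float = 0.15) -> str:
    # Crude perturbation: randomly drop words (the paper uses T5 mask-filling).
    words = [w for w in text.split() if random.random() > drop_rate]
    return " ".join(words) or text

def detectgpt_score(text: str, n_perturbations: int = 20) -> float:
    gap = log_likelihood(text) - sum(
        log_likelihood(perturb(text)) for _ in range(n_perturbations)
    ) / n_perturbations
    return gap  # large positive gap => likely AI-generated

print(detectgpt_score("Paste a paragraph here to score it."))
```

A high score means the text sits on a likelihood peak of the model, which is characteristic of sampled model output rather than human writing.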
Soft Watermark — The green / red algorithm
- Split the whole vocabulary into 2 groups, green and red.
- At inference, generation is softly biased towards only 1 of the 2 sets.
- Detection looks at the probability of drawing words from only 1 of the 2 sets.
- Hypothesis testing is used to decide whether a text with suspiciously few red words is AI-generated (see the sketch after this list).
There can be different green/red lists that give better results.
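As an illustration, here is a minimal detection sketch under toy assumptions: word-level splits keyed by the previous word through a stable hash, whereas the real scheme operates on the model's token vocabulary.

```python
# Toy green/red watermark detector: count green words and run a z-test
# against the null hypothesis of unwatermarked (human) text.
import hashlib
import math

GREEN_FRACTION = 0.5  # gamma: expected share of green words in human text

def is_green(prev_word: str, word: str) -> bool:
    # The previous word keys the green/red split, as in the published
    # scheme where the preceding token seeds a PRNG over the vocabulary.
    digest = hashlib.sha256(f"{prev_word}|{word}".encode()).digest()
    return digest[0] < 256 * GREEN_FRACTION

def watermark_z_score(text: str) -> float:
    words = text.split()
    pairs = list(zip(words, words[1:]))
    if not pairs:
        return 0.0
    greens = sum(is_green(p, w) for p, w in pairs)
    expected = GREEN_FRACTION * len(pairs)
    std = math.sqrt(len(pairs) * GREEN_FRACTION * (1 - GREEN_FRACTION))
    return (greens - expected) / std

# Human text hovers near z = 0; watermarked generation drifts high.
# Short texts cannot reach a significant z-score, which is exactly
# why long chunks are needed.
print(watermark_z_score("some text to test"))
```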
Why is it called “soft”?
It is not always possible to force everything into the green list: a word like “Obama” has no synonym, so a hard constraint would break the text. The soft version only biases the sampling towards green words.
Efficiency is only reached on longer texts. In that regime, it becomes very hard for attackers to remove all traces of AI-ness.
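A minimal sketch of that soft bias, assuming a boolean green-token mask over the vocabulary:

```python
# Sketch of the "soft" part: instead of forbidding red words, add a
# small bias delta to green-token logits before sampling, so a red
# token like "Obama" can still be emitted when the model is confident.
import torch

def soft_watermark_logits(logits: torch.Tensor, green_mask: torch.Tensor,
                          delta: float = 2.0) -> torch.Tensor:
    # green_mask: boolean tensor over the vocabulary, True for green tokens
    return logits + delta * green_mask.float()

logits = torch.tensor([2.0, 0.5, -1.0])
green_mask = torch.tensor([True, False, True])
print(soft_watermark_logits(logits, green_mask))  # tensor([4.0, 0.5, 1.0])
```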
🕴 — AI impact on the job market, talent management and recruiting
This workshop explored how new AI capabilities could shift the global job market. The topics discussed were in fact very wide-ranging.
>> Human learning direction
The first lecture, from Professor Hui Xiong, highlighted that one area of knowledge work could quickly lose much of its added value: describable knowledge.
Non-describable knowledge is easier to illustrate by example: people management was the first one given.
>> LLM as a novice qualitative research assistant, a talk from the Talent Management Research team at Amazon
Talent management research is the use of science and data to equip employees with resources to best navigate their careers.
Talent management works on either core research (what do promotion, a good employee, etc. look like), product development, or metrics and evaluation.
Their base material is an interview dataset. They don't reveal their internal dataset but used a public one composed of 8 transcripts of 1-hour interviews.
A RAG-based model achieves performance close to human.
The key takeaway is that they rate the RAG LLM at the level of a junior qualitative researcher.
But more realistically, the tool will mainly boost the human's work rather than replace it: they mentioned that the LLM is no replacement for handling bias and making deductions.
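Since the internal system wasn't detailed, here is a minimal, hypothetical sketch of what a RAG pipeline over interview transcripts can look like, with TF-IDF retrieval standing in for a proper embedding index and the final LLM call left to the reader:

```python
# Hypothetical RAG sketch over interview transcripts: retrieve the most
# relevant excerpts, then assemble a prompt for an LLM of your choice.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

transcripts = [
    "Interviewee A: I left because there was no clear promotion path...",
    "Interviewee B: My manager gave regular feedback, which kept me engaged...",
    "Interviewee C: Training budgets were cut, so I stopped learning...",
]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(transcripts)

def retrieve(question: str, k: int = 2) -> list:
    # Rank transcripts by cosine similarity to the question.
    q_vec = vectorizer.transform([question])
    scores = cosine_similarity(q_vec, doc_matrix)[0]
    return [transcripts[i] for i in scores.argsort()[::-1][:k]]

def build_prompt(question: str) -> str:
    context = "\n".join(retrieve(question))
    return (
        "You are a junior qualitative researcher. Using only the excerpts "
        f"below, answer the question.\n\nExcerpts:\n{context}\n\n"
        f"Question: {question}"
    )

# The assembled prompt would then be sent to the LLM.
print(build_prompt("What drives employees to leave?"))
```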
🏭 — Preprocessing large multimodal dataset
Another great discovery of the first day of this conference is the data-juicer package.
It is aimed at preprocessing very large amounts of multimodal data for LLM training. The maintainer explained the key differences that pushed them to develop a tool distinct from, say, Spark:
- A model is an operator like any other one (see the conceptual sketch after this list).
- The data is AI-native, meaning it is intended primarily for AI. An example could be partially filtering a video.
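To illustrate the first point, here is a conceptual sketch (not Data-Juicer's actual API) of a pipeline where rule-based filters and model-backed scorers share the same operator interface:

```python
# Conceptual sketch: "a model is an operator like any other". Cheap
# rule-based filters and expensive model-backed scorers share one
# interface, so a pipeline can mix them freely.
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

Sample = dict  # e.g. {"text": ..., "video_frames": ...}

@dataclass
class Operator:
    name: str
    fn: Callable[[Sample], Optional[Sample]]  # None means "drop sample"

    def __call__(self, sample: Sample) -> Optional[Sample]:
        return self.fn(sample)

def min_length_filter(sample: Sample) -> Optional[Sample]:
    # Rule-based operator: drop very short texts.
    return sample if len(sample.get("text", "")) >= 20 else None

def fake_quality_model(sample: Sample) -> Optional[Sample]:
    # Stand-in for a learned scorer: a real pipeline would call a model
    # here, through exactly the same interface as rule-based operators.
    sample["quality"] = min(1.0, len(sample["text"]) / 100)
    return sample if sample["quality"] > 0.3 else None

PIPELINE = [
    Operator("min_length", min_length_filter),
    Operator("quality_model", fake_quality_model),
]

def run(samples: Iterable[Sample]) -> list:
    kept = []
    for sample in samples:
        for op in PIPELINE:
            sample = op(sample)
            if sample is None:
                break  # an operator dropped the sample
        else:
            kept.append(sample)
    return kept

print(run([{"text": "too short"}, {"text": "a reasonably long paragraph " * 3}]))
```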
I recommend reading the excellent blog post from HuggingFace on how they built the FineWeb dataset. Many of the custom operators mentioned in their post are present in Data-Juicer.
The scale and cost of this kind of preprocessing feel new to me: until recently, most companies were mostly limited to text.
💡 — Conclusion
This first day was full of discoveries. I would recommend attending several different sessions, as you can often stumble on topics that end up interesting you.