Analytics Used to Detect Online Harassment - InformationWeek

Commentary
2/7/2017
03:30 PM
Jessica Davis

Analytics Used to Detect Online Harassment

In conjunction with Internet Safety Day, the Wikimedia Foundation has released two new public data sets of online harassment in Wikipedia edits. The Foundation leveraged machine learning to detect harassment.

The internet can spread ideas, connect communities, and serve as the foundation for thriving businesses. But so many people who use the internet or social media know that it is also home to trolls, harassers, "doxers," and others with less noble intentions, as we are reminded today on Internet Safety Day.

For example, here's one of the things an anonymous poster said to a woman editor on Wikipedia in March 2015: "What you need to understand as you are doing the ironing is that Wikipedia is no place for a woman."

That post was left on one of Wikipedia's "talk pages," which are pages attached to every Wikipedia article and user page on the platform. It demonstrates that these discussions are not always good-faith collaboration and exchange of ideas.

(Image: Pixabay)


As part of efforts to call attention to this problem and build defenses against it, the Wikimedia Foundation has released two large data sets to the public. The first is a collection of more than one million annotations of Wikipedia talk page edits, produced by 4,000 crowd workers who judged whether each edit was a personal attack and who the target of the attack was. Each edit was rated by 10 judges, whose opinions were aggregated and used to train a machine learning model. The Wikimedia Foundation said it believes this is the largest public annotated data set of personal attacks available today.
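The Foundation's statement does not spell out how the 10 judgments per edit were combined. As a rough illustration only, the Python sketch below assumes a simple majority vote: it computes the share of judges who flagged an edit and labels the edit an attack when that share passes a cutoff. The function name, the 0.5 threshold, and the toy judgments are all hypothetical and are not taken from the released data.

```python
from collections import defaultdict


def aggregate_annotations(annotations, threshold=0.5):
    """Combine per-judge opinions into one label per edit.

    annotations: iterable of (edit_id, is_attack) pairs, one per judgment.
    Returns {edit_id: (attack_fraction, label)}, where label is True when
    the fraction of judges who saw an attack exceeds `threshold`.
    The majority-vote rule is an assumption, not the Foundation's method.
    """
    votes = defaultdict(list)
    for edit_id, is_attack in annotations:
        votes[edit_id].append(1 if is_attack else 0)

    result = {}
    for edit_id, edit_votes in votes.items():
        fraction = sum(edit_votes) / len(edit_votes)
        result[edit_id] = (fraction, fraction > threshold)
    return result


# Hypothetical toy data: two edits, ten judgments each.
toy_annotations = (
    [("edit-1", True)] * 7 + [("edit-1", False)] * 3
    + [("edit-2", True)] * 2 + [("edit-2", False)] * 8
)
labels = aggregate_annotations(toy_annotations)
for edit_id, (fraction, is_attack) in sorted(labels.items()):
    print(edit_id, fraction, is_attack)
```

Keeping the fraction alongside the yes/no label preserves how strongly the judges agreed, which could matter for edits where opinion was split.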

The second data set contains all 95 million user and article talk page comments made between 2001 and 2015. Both data sets are publicly available to support further research.

The Wikimedia Foundation said the model was inspired by research at Yahoo on detecting abusive language. It takes fragments of text from Wikipedia edits and feeds them into a simple machine learning algorithm, logistic regression.
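To make the idea concrete, here is a minimal Python sketch of logistic regression over text fragments. It is not the Foundation's code. It assumes the "fragments" are character trigrams, trains with plain stochastic gradient descent, and uses a handful of made-up sentences. The function names, learning rate, and number of epochs are illustrative choices.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in lowercased, padded text."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def sigmoid(z):
    # Clamp z so math.exp cannot overflow on extreme scores.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, text):
    """Estimated probability that `text` is a personal attack."""
    score = bias + sum(weights.get(gram, 0.0) * count
                       for gram, count in char_ngrams(text).items())
    return sigmoid(score)


def train(examples, epochs=200, lr=0.1):
    """Fit weights by stochastic gradient descent on the log loss."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for text, label in examples:
            error = predict_proba(weights, bias, text) - label
            bias -= lr * error
            for gram, count in char_ngrams(text).items():
                weights[gram] = weights.get(gram, 0.0) - lr * error * count
    return weights, bias


# Hypothetical toy training set: 1 = personal attack, 0 = not an attack.
examples = [
    ("you are an idiot", 1),
    ("you are a stupid fool", 1),
    ("thanks for fixing the citation", 0),
    ("i added a source to the article", 0),
]
weights, bias = train(examples)
for text, label in examples:
    print(f"{predict_proba(weights, bias, text):.2f}  label={label}  {text}")
```

A real system would train on far more data and hold out a test set. The point here is only that the model's output is a probability between 0 and 1 rather than a yes-or-no verdict.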

"The model this produces gives a probability estimate of whether an edit is a personal attack," the Wikimedia Foundation said in a statement announcing the data set availability. "What surprised us was the effectiveness of this model: a fully trained model achieves better performance than the combined average of three human crowd workers."

The research also revealed the following insights about online harassment of Wikipedia editors:

  • Only 18% of attacks that the algorithm discovered were followed by a warning or block of the offending user.
  • While anonymous users are responsible for a disproportionate number of attacks, registered users still account for almost 67% of the attacks on Wikipedia.
  • While half of all attacks come from editors who make fewer than five edits per year, a full one-third of attacks come from registered users with more than 100 edits per year.

There's still plenty of work to be done. Although the researchers now understand more about this kind of behavior, they have yet to learn the best ways to mitigate it. The data is also currently only in English, so the model only understands English. And the Wikimedia Foundation acknowledges that the model is still not very good at identifying threats.

The Wikimedia Foundation worked with Jigsaw, Alphabet's technology incubator, on this research, and invites others to join the future research efforts by getting in touch via the project's wiki page.
