Anti-piracy Organization Takes Down AI Training Dataset 'Books3' Used by Meta's Large Models


The copyright lawsuit brought by The New York Times and the Daily News against OpenAI has taken an unexpected turn: an OpenAI engineer inadvertently deleted virtual machine search data that could have been key evidence. According to a letter submitted to the U.S. District Court for the Southern District of New York on Wednesday night, lawyers and technical experts for the two media companies had spent more than 150 hours searching OpenAI's AI training dataset before the data stored on the virtual machine was accidentally deleted on November 14.
LAION has launched Re-LAION-5B, described as the world's first AI training dataset to fully remove links to Child Sexual Abuse Material (CSAM). The dataset is a significantly improved version of LAION-5B and comes in two versions: Re-LAION-5B Research and Re-LAION-5B Research-Safe. A total of 2,236 CSAM links have been removed, including 1,008 from child protection organizations' lists. The dataset contains 5.5 billion pairs of text and images, designed to help
The 'Diting' seismic wave large model, jointly developed by the National Supercomputing Center in Chengdu, the Institute of Geophysics of the China Earthquake Administration, and Tsinghua University, has been officially released in Chengdu, Sichuan. It is the first seismic wave large model in China to reach 100 million parameters, marking a significant step in the integration of seismology research and artificial intelligence technology in the country.
Google DeepMind CEO Hassabis predicts that Artificial General Intelligence (AGI) could appear as early as 2029-2030, with key breakthroughs potentially achieved within the next three years. He points out that increased investment by tech companies is accelerating the maturity of core technologies such as multimodal understanding, autonomous decision-making, and AI agents.
The Variable Robot team released the world's first embodied intelligence world model based on event-level prediction, WALL-WM, breaking the limitations of traditional time-frame-based learning. It shifts the prediction unit to semantic events, enhancing the robot's ability to understand and perform tasks, marking a new stage in the industry.