Despite my own promises, here is a hybrid follow-up to both the BOORU_CHARS datasets
and the safebooru-centric composite rips.
This time the main source was danbooru (safe+questionable, ID interval 6640000…8200000 = 31.08.2023…24.09.2024),
with the best of furry-related e621 and loli-enabled gelbooru for the same interval used as an add-on.
Similar to the rips:
- images initially filtered: Mpixels>=0.48, shorter_side>=600 px, file size>=60000 bytes, no animations
  strip-like images dropped or adjusted to aspect ratio 0.4…2.1
- PNG/WEBP/AVIF converted to JPG using cjpegli at 96% quality (2000000 bytes limit)
  modest downsampling done to a typical longer side of 2560 px (landscape), 1920 px (1x1), 2480 px (portrait)
- verbose file naming used: "%website% - %id% - %up_to_3_copyrights% ~ %up_to_5_characters% (%up_to_2_artists%).%ext%"
  files uniquely identified by “%website%+%id%”
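The thresholds and the naming rule above can be sketched in Python roughly like this. It is only an illustration, not the script actually used for the release: all function names are made up, aspect ratio is assumed to mean width/height, and the file name in the usage note is hypothetical.

```python
import re

# Thresholds quoted from the list above.
MIN_PIXELS = 480_000        # Mpixels >= 0.48
MIN_SHORT_SIDE = 600        # px
MIN_FILE_BYTES = 60_000
ASPECT_MIN, ASPECT_MAX = 0.4, 2.1   # assumption: ratio = width / height


def passes_initial_filter(width, height, size_bytes):
    """True if the image meets the pixel-count, shorter-side and file-size limits."""
    return (width * height >= MIN_PIXELS
            and min(width, height) >= MIN_SHORT_SIDE
            and size_bytes >= MIN_FILE_BYTES)


def is_stripe(width, height):
    """True if the aspect ratio falls outside 0.4…2.1 (candidate to drop or adjust)."""
    ratio = width / height
    return ratio < ASPECT_MIN or ratio > ASPECT_MAX


# Only the leading "%website% - %id% - " part is parsed: it is the unique key.
# The website token is assumed to contain no spaces.
KEY_RE = re.compile(r"^(?P<website>\S+) - (?P<id>\d+) - ")


def file_key(fname):
    """Return the (website, id) pair that uniquely identifies a file."""
    m = KEY_RE.match(fname)
    if m is None:
        raise ValueError("unexpected file name: %r" % fname)
    return m.group("website"), int(m.group("id"))
```

For example, file_key("danbooru - 7000001 - sousou no frieren ~ frieren (someartist).jpg") gives ("danbooru", 7000001). The rest of the name is deliberately left unparsed, because the separator between multiple copyrights, characters or artists is not stated here.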
Similar to the BOORU CHARS datasets, extensive processing was done and used for content sorting:
- some general image statistics obtained with ExifTool and ImageMagick
- content analysis mostly the same as for BC2023, with up-to-date software and models
- imageboard tags arranged and partially embedded in the image EXIF info
- clustering implemented in two ways:
  - by aspect ratio { 7x10 +/-4% ; 3x4 +/-10% ; 1x1 +/-20% ; 3x2 +/-40% ; 2x3 +/-40% }
  - by head count { 0 heads = letter A, 2 heads = B, 3-5 heads = C, 6+ heads = D, 1 head = letter E }
- sorting inside a cluster based on an “attractiveness score function” == “colorful and textless”
- a balanced folder/zip typically contains ~1000-2600 files
- the lowest-rated images tend to be manga-like and were manually reviewed
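A minimal sketch of the two clustering rules, in Python. The tolerance ranges overlap (0.70 also fits 3x4 +/-10%), so this sketch assumes the clusters are tested in the listed order and the first match wins. It also assumes the ratio is width/height and the tolerance is relative to the target ratio. The names are illustrative and this is not the actual release code.

```python
# (cluster name, target width/height ratio, relative tolerance), in listed order
ASPECT_CLUSTERS = [
    ("7x10", 7 / 10, 0.04),
    ("3x4", 3 / 4, 0.10),
    ("1x1", 1.0, 0.20),
    ("3x2", 3 / 2, 0.40),
    ("2x3", 2 / 3, 0.40),
]


def aspect_cluster(width, height):
    """Return the first cluster whose tolerance band contains the ratio, else None."""
    ratio = width / height
    for name, target, tol in ASPECT_CLUSTERS:
        if abs(ratio - target) <= tol * target:
            return name
    return None


def head_letter(heads):
    """Map a detected head count to the cluster letter from the list above."""
    if heads == 0:
        return "A"
    if heads == 1:
        return "E"
    if heads == 2:
        return "B"
    if heads <= 5:
        return "C"
    return "D"
```

Under these assumptions the outer edges of the widest bands, 2/3 x 0.6 = 0.4 and 3/2 x 1.4 = 2.1, coincide with the 0.4…2.1 aspect ratio limit applied at filtering time.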
Content is a little less processed and a little more NSFW compared to its predecessors; nevertheless:
- real-life photos, character-free landscapes, food and macro shots thrown away
- most comics and N-koma, text-heavy images and line art filtered out
- too “questionable” images (uncensored nipples or vulva, obvious adult actions) excluded
  a separate BOORU BOOBS is planned
- some background crops, gamma correction, rotation, denoising and other nontrivial improvements applied
Images deduplicated with AntiDupl.NET up to 2% similarity, together with BOORU CHARS 2023 and 2022.
Besides the images, the release contains tab-separated text files:
- BC_2024.tsv: file/image related metadata, 1.260.629 rows
- BC_2024_tags.tsv: tags list with Danbooru enrichment, 49.041.220 rows
- BC_2024_yolo.tsv: detailed results for torso components detection, 4.431.887 rows
and also a dedicated “readme” describing their structure.
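One possible way to get such a file into a local database for the queries shown further down is a generic loader like this Python sketch. It assumes that the first row of the file holds the column names and that the encoding is UTF-8. Neither is stated here, so check the readme first. The function name load_tsv is made up.

```python
import csv
import sqlite3


def load_tsv(conn, table, path):
    """Load a tab-separated file into a new SQLite table with TEXT columns."""
    with open(path, newline="", encoding="utf-8") as fh:
        # QUOTE_NONE: treat quote characters as ordinary data, split on tabs only
        reader = csv.reader(fh, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader)  # assumption: first row contains column names
        cols = ", ".join('"%s" TEXT' % name.replace('"', '""') for name in header)
        conn.execute('CREATE TABLE "%s" (%s)' % (table, cols))
        marks = ", ".join("?" * len(header))
        # Streams the remaining rows; a row with a wrong field count raises an error
        conn.executemany('INSERT INTO "%s" VALUES (%s)' % (table, marks), reader)
    conn.commit()
    return header
```

Usage would look like load_tsv(sqlite3.connect("bc.sqlite"), "bc", "BC_2024.tsv"), where the database file name is again just an example. For the 49-million-row tags file, adding an index after loading is probably worth it.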
Keep in mind that this release is, first of all,
a dataset of character-centric art in an efficient local format suited for batch processing
and then
a representative catalog of anime/game/cartoon copyrights, characters and artists for visual estimation
but not
a complete and maximum quality rip.
Some tips on use cases:
@REM -- explore torrent
for %%F in ("d:\torr\BOORU_CHARS_2024\2024-3x4\*.zip") do 7z x -r -o"C:\TEMP" "%%F" *sousou*frieren*stark*
@REM -- much more effective if unzipped
xcopy /s "A:\BCA\*sousou*frieren*stark*" C:\TEMP\
-- and it gets more sophisticated with a database
select 'xcopy "'||bc.fpath||'\'||bc.fname||'" C:\TEMP\' xcpy
from bc
join bc_dt d on d.booru=bc.booru and d.fid=bc.fid
where bc.fname like '%dungeon%meshi%senshi%' and d.tag='pantyshot' -- brutal dwarf fanservice