UNIVERSITY PARK, Pa. — Machine learning programs that can classify leaves and place them into biological families may reveal new clues about the evolution of plant life, but only if scientists understand what computers are seeing. A team led by Penn State scientists combined a machine learning approach and traditional botanical language to find and describe new features for fossil identification.
“You have the computer that says, ‘look here, it’s important,’ but there has to be someone who can translate the results into human terms,” said Edward Spagnuolo, a recent Penn State graduate with a bachelor’s degree in geobiology who conducted the research. “So that’s really what we did. It’s really a first step in merging artificial intelligence with botany and paleobotany.”
The team took heatmaps produced by machine learning programs – images of leaves covered in small red boxes that highlight areas the computer identified as important for identification – and developed a system of manual notation to analyze these regions across different plant families.
“We basically found that each family had a unique suite of characteristics that were highlighted by the heatmaps,” Spagnuolo said. “And all of these features provide new leads for identifying fossil leaves. You can’t pull them out and directly identify the fossils yet, but it’s a first step. For some families, these are the only leads we have.”
Leaves are the most common non-microscopic plant part found today and in the fossil record, but they are also the most difficult to identify. The variation in leaf shape and venation — the pattern of veins in a leaf blade — is too complex to be captured by botanical terminology, the scientists said.
This is especially difficult for paleobotanists, who most often find isolated fossil leaves without seeds, fruits, or flowers that could help identify plants. To further complicate the challenge, many individual fossils represent extinct plants.
“Evolutionary history and the fossil record are very poorly understood, even for some of the most important and diverse plant families alive today, and that is the impetus for this study,” said Spagnuolo. “There are millions and millions of leaf fossils stored in museum collections around the world that cannot be identified because we simply don’t have well-defined leaf structures to place them into proper groups.”
Describing a single leaf can take a skilled researcher hours, but computer programs can learn to spot the differences and sort the leaves into taxonomic families quickly and accurately, the scientists said.
Peter Wilf, a Penn State geosciences professor and adviser to Spagnuolo, and Thomas Serre, a computer science professor at Brown University, conducted a preliminary machine-learning study of more than 7,500 images of cleared leaves, which are specimens that have been chemically bleached, stained and mounted on slides to reveal venation patterns. The program placed the leaves into families with 72% accuracy and produced the heatmaps that scientists can use to find out what the computer considers important for identification.
“This approach is different from most botanical and paleobotanical leaf studies, which look at large-scale leaf characteristics: the number of veins, the shape of the leaf,” Spagnuolo said. “These hotspots are very small crops of the images. And moving forward, we need a way to combine the larger-scale botanical features that we’ve been using for centuries with those smaller-scale features that have been missed because they are so hard to see without the help of an AI algorithm.”
Spagnuolo analyzed more than 3,000 heatmaps showing leaves from 930 genera in 14 families of angiosperms, or flowering plants. For each heatmap, he noted the top five hotspot regions as well as the single top hotspot and used traditional botanical language to describe their locations on the leaves.
“We attempted to decode the machine learning algorithm’s family-level identification of cleared leaves by mapping the locations of the hottest hotspots,” Spagnuolo said. “This is, to our knowledge, the first attempt to back-translate and interpret computer vision heatmaps into botanical language.”
They recently reported their findings in the American Journal of Botany.
Some families, like the Rosaceae – which includes plants that produce apples, strawberries, plums, cherries, peaches and almonds – have distinctive characteristics that botanists and paleobotanists can easily identify, such as narrow teeth. The hotspots in these families appear to echo traditional observations, the scientists said.
Other families like the Rubiaceae, or the coffee family, lack distinguishing features and are largely unidentified in the fossil record. On these non-toothed leaves, the computer pointed to the microcurvature of the little-studied leaf margins.
“These new features can lead to further studies to hopefully delineate new fossil identification traits,” Spagnuolo said. “It could one day help unlock the huge amount of evolutionary dark data that we just haven’t tapped into yet.”
Wilf and Serre contributed to this work.
The National Science Foundation and a Penn State Erickson Discovery Grant provided funding.