Document Type: Research Paper
Authors
GIS Department. Geomatics Engineering Faculty, K. N. Toosi University of Technology, Tehran, Iran
Abstract
Introduction: Place names, a common form of geographic information embedded in natural language texts, appear in resources such as social media, news stories, historical archives, and property listings. They take different forms, including business addresses, hashtags, and plain text. These textual resources are valuable for geospatial analyses because they provide up-to-date data, carry human experience and cognition, and contain kinds of geospatial information that are available only in text. Mapping place names to their footprints is therefore an essential task. One solution is a digital gazetteer, a dictionary of place names. Gazetteers enable Geographic Information Retrieval (GIR) systems to detect place names (geotagging) and to convert the detected candidates to geographic coordinates (geocoding). To meet ever-increasing geospatial demands, especially in GIR and location-based services (LBSs), digital gazetteers should be enriched.
Materials and Methods: This paper presents a three-tier framework to extract urban geographic information from geotagged housing listings. The first tier harvests main street and neighborhood place names, which advertisers usually write without any linguistic clue because these places are well known. A random forest model, based on a set of spatial measures computed for each n-gram extracted from the textual content of real estate advertisements, identifies the main streets and neighborhoods. The first tier begins with the extraction of n-grams from the saved advertisements. After the n-gram set is cleaned and standardized, spatial clustering is applied, since each spatial n-gram can refer to multiple regions of the city. The defined spatial predictors are then computed for each unclustered n-gram and for each n-gram split from its generic cluster. Subsequently, a random forest model identifies the neighborhood and main street n-grams. In the second tier, we developed a rule-based model to extract all urban place names, and in the third tier, a linguistic pattern-based model to extract spatial relationships. This research focused on the Persian language, with the Iranian metropolises of Tehran, Mashhad, Isfahan, and Shiraz as study regions.
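The n-gram extraction and spatial-measure steps of the first tier can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not list the actual spatial predictors, so the count, mean center, and standard distance used here are assumed stand-ins, and the function names, the whitespace tokenizer, and the planar (projected) coordinates are likewise hypothetical.

```python
from collections import defaultdict
from math import sqrt


def extract_ngrams(text, max_n=3):
    """Return the set of word n-grams (n = 1..max_n) in one listing text."""
    tokens = text.split()  # assumption: simple whitespace tokenization
    return {
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def spatial_measures(listings, max_n=3):
    """Compute illustrative spatial measures for every n-gram.

    `listings` is an iterable of (text, x, y) tuples, where x and y are
    assumed to be planar coordinates of the geotagged advertisement.
    """
    points = defaultdict(list)
    for text, x, y in listings:
        for gram in extract_ngrams(text, max_n):
            points[gram].append((x, y))

    measures = {}
    for gram, pts in points.items():
        cx = sum(p[0] for p in pts) / len(pts)
        cy = sum(p[1] for p in pts) / len(pts)
        # Standard distance: root mean squared distance to the mean center.
        sd = sqrt(sum((px - cx) ** 2 + (py - cy) ** 2 for px, py in pts) / len(pts))
        measures[gram] = {"count": len(pts), "center": (cx, cy), "std_distance": sd}
    return measures
```

Under these assumptions, an n-gram naming a neighborhood would tend to show a small standard distance relative to a generic word such as "apartment", which spreads across the whole city. Feature vectors of this kind could then be passed to a random forest classifier.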
Results and Discussion: The results for the first tier are encouraging: recall and precision reached approximately 0.8 and 0.7, respectively, when predicting the main streets and neighborhoods of another metropolis. However, differences in population levels and urban development patterns reduced performance in distinguishing neighborhoods from main streets. For the second tier, precision and recall are both near 0.7. Although these results are notable compared with the performance of named entity recognition models in extracting urban place names, which are often fine-grained, errors in this tier reduced precision and recall in the third tier, spatial relation extraction.
Conclusion: Gazetteers are important geospatial resources in GIR tasks, especially in geoparsing. This paper presented a framework for extracting urban geographic information from online property listings. This geographic information comprises place names and the spatial relationships between them, and it can be used to enrich current gazetteers. Because main streets and neighborhoods are well-known place names, people mainly use them on property listing websites without any linguistic clue. These place names can be harvested with a machine learning-based model. The next step is extracting all place names written in the property advertisement posts. To that end, we developed a rule-based model that extracts potential place names from the posts located within the convex hull of a neighborhood or main street place name and removes wrongly identified cases. In the third step, we extracted spatial relationships between the place names found in each post text, based on linguistic patterns. The framework provided good results in harvesting main streets and neighborhoods and in extracting place names. Extracting spatial relationships between the place names needs further work.
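The convex-hull restriction mentioned for the second step can be sketched as follows. This is an illustrative example only: the abstract does not say how the hull is computed, so Andrew's monotone chain algorithm is an assumed choice, and the function names and planar coordinates are hypothetical.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Return hull vertices in counter-clockwise order (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        # Drop the last vertex while it does not make a left turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def in_hull(hull, p):
    """True if point p lies inside or on a counter-clockwise convex hull."""
    if len(hull) < 3:
        return False  # degenerate hull: treated as containing nothing
    return all(
        cross(hull[i], hull[(i + 1) % len(hull)], p) >= 0
        for i in range(len(hull))
    )
```

Here the hull would be built from the locations of posts that mention a given neighborhood or main street, and only posts for which `in_hull` returns true would be passed to the rule-based place name extractor.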
Keywords