How can I get a value from XML in Python?

Question

I have this sitemap file in XML. How can I get every <loc>?

<?xml version="1.0" encoding="UTF-8"?>
<urlset
      xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
            http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
<!-- created with Free Online Sitemap Generator www.xml-sitemaps.com -->


<url>
  <loc>https://www.nsnam.org/wiki/Main_Page</loc>
  <lastmod>2018-10-24T03:03:05+00:00</lastmod>
  <priority>1.00</priority>
</url>
<url>
  <loc>https://www.nsnam.org/wiki/Current_Development</loc>
  <lastmod>2018-10-24T03:03:05+00:00</lastmod>
  <priority>0.80</priority>
</url>
<url>
  <loc>https://www.nsnam.org/wiki/Developer_FAQ</loc>
  <lastmod>2018-10-24T03:03:05+00:00</lastmod>
  <priority>0.80</priority>
</url>

The program looks like this.

import os.path
import xml.etree.ElementTree
import requests
from subprocess import call

def creatingListOfBrokenLinks():
    if (os.path.isfile('sitemap.xml')):
        e = xml.etree.ElementTree.parse('sitemap.xml').getroot()
        file = open("all_broken_links.txt", "w")

        for atype in e.findall('url'):
            r = requests.get(atype.find('loc').text)
            print(atype)
            if (r.status_code == 404):
                file.write(atype)

        file.close()


if __name__ == "__main__":
    creatingListOfBrokenLinks()

krax1337 Oct 25, 2018 at 16:01

What does your Python program look like at the moment?
frankenapps Oct 25, 2018 at 13:15

@frankenapps I've updated the question
krax1337 Oct 25, 2018 at 13:22

Tags:

python

xml

elementtree

2 answers

Your code worked fine on my end. All you needed to do was add {http://www.sitemaps.org/schemas/sitemap/0.9} before url and loc.

Here:

import os.path
import xml.etree.ElementTree
import requests

def creatingListOfBrokenLinks():
    if os.path.isfile('sitemap.xml'):
        e = xml.etree.ElementTree.parse('sitemap.xml').getroot()

        # The with-block closes the output file even if a request fails.
        with open("all_broken_links.txt", "w") as file:
            # Every tag name has to carry the sitemap namespace in braces.
            for atype in e.findall('{http://www.sitemaps.org/schemas/sitemap/0.9}url'):
                link = atype.find('{http://www.sitemaps.org/schemas/sitemap/0.9}loc').text
                r = requests.get(link)
                print(link)
                if r.status_code == 404:
                    # write() expects a string, so store the URL rather than the Element.
                    file.write(link + "\n")


if __name__ == "__main__":
    creatingListOfBrokenLinks()
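
As a variation, find() and findall() also accept a prefix-to-URI mapping, which avoids repeating the full URI in every lookup. A minimal sketch, assuming an inline sample in place of sitemap.xml; the prefix sm and the names SAMPLE and NS are arbitrary choices made for illustration:

```python
import xml.etree.ElementTree as ET

# Inline sample standing in for sitemap.xml (illustrative only).
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.nsnam.org/wiki/Main_Page</loc></url>
  <url><loc>https://www.nsnam.org/wiki/Developer_FAQ</loc></url>
</urlset>"""

# "sm" is an arbitrary prefix chosen for this sketch; only the URI has to match.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

root = ET.fromstring(SAMPLE)

# Collect the text of each <loc> that sits directly under a <url>.
locs = [url.find("sm:loc", NS).text for url in root.findall("sm:url", NS)]
print(locs)
```

The prefix exists only inside the Python code; the XML file itself does not need to declare it.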

Thaer A Oct 25, 2018 at 10:52

codeape · Accepted Answer · 2018-10-25T10-53-00.000Z

I suggest using the ElementTree package from the standard library:

from xml.etree import ElementTree as ET

SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset
      xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
            http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
    <!-- created with Free Online Sitemap Generator www.xml-sitemaps.com -->
    ...
    ...
</urlset>"""

urlset = ET.fromstring(SITEMAP)
loc_elements = urlset.iter("{http://www.sitemaps.org/schemas/sitemap/0.9}loc")
for loc_element in loc_elements:
    print(loc_element.text)

Update:

What your code gets wrong is the handling of the XML namespace.
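
To make the namespace issue concrete, here is a small sketch that inspects the tag names ElementTree actually stores. The inline SAMPLE string is a shortened stand-in for the real sitemap, and the variable names are illustrative only:

```python
from xml.etree import ElementTree as ET

# Shortened stand-in for the sitemap (illustrative only).
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.nsnam.org/wiki/Main_Page</loc></url>
</urlset>"""

root = ET.fromstring(SAMPLE)

# ElementTree stores each tag as "{namespace-uri}localname".
child_tags = [child.tag for child in root]
print(root.tag)
print(child_tags)

# A bare name therefore matches nothing, while the qualified name does.
bare = root.findall("url")
qualified = root.findall("{http://www.sitemaps.org/schemas/sitemap/0.9}url")
print(len(bare), len(qualified))
```
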
Also, my example uses .iter() instead of .findall()/.find() to get the loc elements directly. Whether that is appropriate depends on the structure of the XML and on your use case.
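
The difference between the two approaches shows up when loc elements occur at more than one depth. The following sketch assumes a made-up document: the <extra> wrapper and the example.org URLs are invented for illustration and are not part of the sitemap schema.

```python
from xml.etree import ElementTree as ET

NS_URI = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Hypothetical document; <extra> is invented to create a deeper <loc>.
DOC = f"""<urlset xmlns="{NS_URI}">
  <url><loc>https://example.org/a</loc></url>
  <extra><url><loc>https://example.org/nested</loc></url></extra>
</urlset>"""

root = ET.fromstring(DOC)

# .iter() walks the whole tree and yields every matching element at any depth.
all_locs = [el.text for el in root.iter(f"{{{NS_URI}}}loc")]

# A findall() path only follows the exact route root -> url -> loc.
direct_locs = [el.text for el in root.findall(f"{{{NS_URI}}}url/{{{NS_URI}}}loc")]

print(all_locs)
print(direct_locs)
```

For a flat sitemap like the one in the question both approaches should give the same list; the path-based lookup is the stricter choice if the file might contain loc elements elsewhere.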